Saturday, November 19, 2022

How to Remove Duplicates in a DataFrame Using PySpark

 The following methods remove duplicate rows from a DataFrame in PySpark:

  1. distinct
    1. df.distinct()
    2. df.select("Column1","Column2").distinct()  (distinct() takes no arguments, so select the columns first)
  2. dropDuplicates
    1. df.dropDuplicates()
    2. df.dropDuplicates(["Column1","Column2"])
