Knowledge Base: How to Remove Duplicates from a DataFrame Using PySpark

Saturday, November 19, 2022

The following methods remove duplicate rows from a DataFrame in PySpark:

distinct

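distinct() takes no arguments and compares entire rows. To deduplicate on specific columns, select those columns first; the result then contains only the selected columns.

df.distinct()
df.select("Column1","Column2").distinct()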

dropDuplicates

df.dropDuplicates()
df.dropDuplicates(["Column1","Column2"])
