Knowledge Base
Saturday, November 19, 2022
How to Remove Duplicates from a DataFrame Using PySpark
The following methods can be used to remove duplicate rows from a DataFrame in PySpark:
distinct
df.distinct()
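df.select("Column1","Column2").distinct()  # distinct() takes no arguments, so select the columns first; the result contains only these columns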
dropDuplicates
df.dropDuplicates()
df.dropDuplicates(["Column1","Column2"])