Saturday, November 19, 2022

How to remove Duplicates in DataFrame using PySpark

 The following methods remove duplicate rows from a DataFrame in PySpark:

  1. distinct
    1. df.distinct()
    2. df.select("Column1","Column2").distinct()  (distinct() takes no arguments, so select the columns first)
  2. dropDuplicates
    1. df.dropDuplicates()
    2. df.dropDuplicates(["Column1","Column2"])

Friday, November 18, 2022

How to Sort a Dataframe in PySpark

 A DataFrame can be sorted with sort or orderBy. orderBy is an alias for sort, so both behave identically:

  1. sort
    1. df.sort(df.ColumnName)
    2. df.sort(df.ColumnName.desc())
    3. df.sort(df.ColumnName1.desc(), df.ColumnName2)
    4. df.sort(col("ColumnName"))  (requires: from pyspark.sql.functions import col)
  2. orderBy
    1. df.orderBy(df.ColumnName)
    2. df.orderBy(df.ColumnName.desc())
    3. df.orderBy(df.ColumnName1.desc(), df.ColumnName2)
    4. df.orderBy(col("ColumnName"))

How to filter a DataFrame using PySpark

 The filter method accepts several kinds of conditions:

  1. filter
    1. df.filter(df.ColumnName == VALUE)
    2. df.filter(col("ColumnName") == VALUE)
    3. df.filter((col("ColumnName1") == VALUE) | (col("ColumnName2") == VALUE))
    4. df.filter((col("ColumnName1") == VALUE) & (col("ColumnName2") == VALUE))
    5. df.filter(col("ColumnName") != VALUE)

How to Add New Columns in DataFrame using PySpark

 The following approaches add new columns to a DataFrame in PySpark:

  1. withColumn and lit
    1. df.withColumn("NewColumnName", lit("default value for new column"))
  2. withColumn and col (Derived column)
    1. df.withColumn("NewColumnName", col("Column1") * col("Column2"))
  3. select
    1. df.select(lit("default column value").alias("NewColumnName"), col("Column1"), col("Column2"))

How to Rename columns in DataFrame using PySpark

 There are multiple ways to rename columns in a DataFrame using PySpark:

  1. withColumnRenamed
    1. df = df.withColumnRenamed("Old_ColumnName1", "New_ColumnName1").withColumnRenamed("Old_ColumnName2", "New_ColumnName2")
  2. selectExpr
    1. df = df.selectExpr("Old_ColumnName1 AS NewColumnName1","Old_ColumnName2 AS NewColumnName2")
  3. select(col().alias(), col())
    1. df2 = df.select(col("Old_ColumnName1").alias("NewColumnName1"), col("Old_ColumnName2"))