How to save a DataFrame to PostgreSQL in pyspark


Recipe Objective: How to save a DataFrame to PostgreSQL in pyspark?

In most big data scenarios, data merging and aggregation are an essential part of the day-to-day activities on big data platforms. In this scenario, we will save a dataframe to a PostgreSQL database table.

System requirements :

  • Install Ubuntu in a virtual machine
  • Set up a single-node Hadoop machine
  • Install PySpark/Spark on Ubuntu
  • The code below can be run in a Jupyter notebook or any Python console.


Step 1: Import the modules

In this step, we import the PySpark and PySpark SQL modules and create a Spark session. Note that spark.jars must point to the PostgreSQL JDBC driver jar on the local filesystem:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder \
    .config("spark.jars", "/usr/local/postgresql-42.2.5.jar") \
    .master("local").appName("PySpark_Postgres_test").getOrCreate()

The output of the code:

[Screenshot: bigdata_1.jpg]

Step 2: Create Dataframe to store in Postgres

Here we create a dataframe to save to a Postgres table. The Row class lives in the pyspark.sql module, which is why we imported it in Step 1.

studentDf = spark.createDataFrame([
    Row(id=1, name='vijay', marks=67),
    Row(id=2, name='Ajay', marks=88),
    Row(id=3, name='jay', marks=79),
    Row(id=4, name='vinay', marks=67),
])


The output of the code:

[Screenshot: bigdata_2.jpg]

Step 3: View the Data in the Dataframe

Here we view the top 5 rows of the dataframe as shown below (since the dataframe contains only four rows, all of them are displayed).

studentDf.show(5)

The output of the code:

[Screenshot: bigdata_3.jpg]

Step 4: Save the Dataframe to a Postgres Table

Here we save the dataframe to a Postgres table named students; the JDBC writer creates the table if it does not already exist. To save, we chain the write and save methods as shown in the code below.

studentDf.select("id", "name", "marks").write.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/dezyre_new") \
    .option("driver", "org.postgresql.Driver").option("dbtable", "students") \
    .option("user", "hduser").option("password", "bigdata").save()
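The connection options above quickly get repetitive if you write to more than one table. As a small convenience, a helper function can assemble the JDBC URL and option dictionary in one place. This is a hypothetical helper of our own, not part of the PySpark API; the host, port, and database names below mirror the example above:

```python
def postgres_jdbc_options(database, table, user, password,
                          host="localhost", port=5432):
    """Build the option dict for a PostgreSQL JDBC read/write.

    Hypothetical convenience helper -- the function name and defaults
    are our own, not part of PySpark.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }
```

The resulting dict can be splatted into the writer, e.g. studentDf.write.format("jdbc").options(**postgres_jdbc_options("dezyre_new", "students", "hduser", "bigdata")).save().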

The output of the code:

[Screenshot: bigdata_4.jpg]

To verify the saved dataframe, log in to the Postgres database and query the students table.
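Alternatively, instead of logging in to Postgres, the table can be read back through the same JDBC connector. A minimal sketch, assuming the same running Spark session, database, and credentials as above:

```python
# Sketch: read the students table back to verify the write.
# Assumes the SparkSession from Step 1 and a running Postgres instance.
studentDf_check = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/dezyre_new") \
    .option("driver", "org.postgresql.Driver").option("dbtable", "students") \
    .option("user", "hduser").option("password", "bigdata").load()

studentDf_check.show()
```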

The output of the saved dataframe:

[Screenshot: bigdata_5.jpg]

As shown in the image above, writing the dataframe has created and populated the students table in Postgres.

Conclusion

In this recipe, we learned how to save a DataFrame to a PostgreSQL table in PySpark using the JDBC connector.

