Connect Spark to Oracle using JDBC and PySpark
To use the Java Database Connectivity (JDBC) API, we'll need to provide our application with the correct drivers. The drivers can be found in Oracle's JDBC driver repository. You'll need to know the version of the database you want to connect to; you can check it with this SQL statement:

```sql
SELECT * FROM V$VERSION
```
To provide the downloaded driver to your application, pass it on the command line when invoking the application.
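One common way to do this is with `spark-submit`'s `--jars` flag. This is a sketch: the jar name (`ojdbc8.jar`), its path, and the script name are assumptions — substitute the driver file you actually downloaded:

```shell
# Sketch: make the Oracle JDBC driver jar available to the application at launch.
# The jar name, path, and script name below are placeholders, not real files.
spark-submit --jars /path/to/ojdbc8.jar your_app.py
```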
2. Reading using JDBC
To fill a Spark DataFrame directly from a database using JDBC, we will need a DataFrameReader object, which we get from the SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataFrameReader = spark.read
```
and to define the JDBC-related values we will pass as options when reading from the database:

```python
driver = "oracle.jdbc.OracleDriver"
url = "jdbc:oracle:thin:@//<HOST>:<PORT>/<SERVICE_NAME>"
user = "your_user"
password = "your_pass"
```
Note: the JDBC URL comes in several formats, and you have probably seen it written in different ways. In my experience, the format shown above is the only one that works reliably with PySpark.
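As a concrete sketch, the placeholders in the URL can be filled in with Python string formatting. The host, port, and service name below are made-up values for illustration:

```python
# Build the thin-driver JDBC URL from its parts (hypothetical values).
host = "dbhost.example.com"
port = 1521
service_name = "ORCLPDB1"

url = f"jdbc:oracle:thin:@//{host}:{port}/{service_name}"
print(url)  # jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1
```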
There's still one more option to define: `dbtable`. It is used as the source of the SELECT statement Spark runs under the hood, so it can be the name of a table or any valid subquery we want to execute over that table:
```python
dbtable = "your_table"
dbtable_2 = "(SELECT * FROM your_table WHERE time > 10) subset"
```
Finally, to fill your DataFrame:

```python
df = dataFrameReader.format("jdbc") \
    .option("url", url) \
    .option("driver", driver) \
    .option("dbtable", dbtable) \
    .option("user", user) \
    .option("password", password) \
    .load()
```
df is now a Spark DataFrame filled directly from an Oracle database using JDBC!