Basic usage

The blipdataforge library includes several classes for interacting with objects on Blip's Data Platform, such as Data Lakes and Data Contracts, but to keep usage simple, only a few of these classes are exposed to the user.

The DataPlatform class is the main API of the blipdataforge library. It is the entry point for all the functionality an engineer needs to build ETL pipelines or interact with Blip's Data Platform. To use this class, import it, create an instance, and then call the desired method.

In the example below, we use the write() method to write the data from the df Spark DataFrame into a Data Lake table called "clients_sandbox.product_analysis.sales":

from blipdataforge import DataPlatform
from pyspark.sql import SparkSession
from datetime import date
from pyspark.sql import Row

spark = SparkSession.builder.getOrCreate()
dp = DataPlatform()
data = [
    Row(id=1, value=28.3, date=date(2021, 1, 1)),
    Row(id=2, value=15.8, date=date(2021, 1, 1)),
    Row(id=3, value=20.1, date=date(2021, 1, 2)),
    Row(id=4, value=12.6, date=date(2021, 1, 3)),
]

df = spark.createDataFrame(data)
# Write DataFrame to table at reference "clients_sandbox.product_analysis.sales"
dp.write(
    df,
    catalog="clients_sandbox",
    database="product_analysis",
    table="sales",
    write_mode="overwrite"
)

Tip

If you are running your code in the dev environment, use the sandbox catalog of your domain; if you are running it in the prd environment, use the trustedzone catalog of your domain instead. If you're not familiar with this policy, please read the About catalogs and environments section. Rather than hard-coding the catalog, you can parameterize it, as in the sketch below.
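The following is a minimal sketch of that idea. It assumes two things that are not part of blipdataforge: that the current environment is exposed through a hypothetical DATA_ENV environment variable, and that the prd catalog for this domain is named "clients_trustedzone" (inferred from the naming pattern of "clients_sandbox"). Adapt both to how your pipelines actually detect the environment and name catalogs:

import os

# Hypothetical convention: DATA_ENV is "dev" or "prd".
# Adjust to however your pipelines expose the current environment.
environment = os.getenv("DATA_ENV", "dev")

# Pick the catalog for the "clients" domain following the policy above.
# "clients_trustedzone" is an assumed name; confirm it for your domain.
catalog = "clients_sandbox" if environment == "dev" else "clients_trustedzone"

dp.write(
    df,
    catalog=catalog,
    database="product_analysis",
    table="sales",
    write_mode="overwrite",
)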

After you execute the write() method, a new table called clients_sandbox.product_analysis.sales is available in the Data Lake, and you can use it further in your code, either by running a SELECT * FROM SQL statement over it or by accessing it with the spark.table() method, as in the example below:

df = spark.table("clients_sandbox.product_analysis.sales")
df.show(n=5)
+---+-----+----------+
| id|value|      date|
+---+-----+----------+
|  1| 28.3|2021-01-01|
|  2| 15.8|2021-01-01|
|  3| 20.1|2021-01-02|
|  4| 12.6|2021-01-03|
+---+-----+----------+
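
For completeness, the SQL route mentioned above can be taken with Spark's spark.sql() method, which returns the same DataFrame:

df = spark.sql("SELECT * FROM clients_sandbox.product_analysis.sales")
df.show(n=5)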

For more information about the methods available in the DataPlatform class, see the Facades section of the Library reference.