Skip to content

How to delete data from the Data Lake?

In order to remove/delete some data from the Data Lake, we need to perform a delete operation. The blipdataforge library provide the delete() method of DataPlatform class to do this task.

How the the delete() method works?

As an example, let's first create an actual table to demonstrate how we can delete some data from it. We create this example table by using the write() method that we described at Writing a Spark DataFrame into the Data Lake.

from pyspark.sql import SparkSession
from blipdataforge.facades import DataPlatform
spark = SparkSession.builder.getOrCreate()
dp = DataPlatform()

data = [
  (12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
  (13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
  (10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
  (12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
  (10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
  (11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
]

columns = [
  'StudentID', 'Name', 'Age', 'Height', 'Score1',
  'Score2', 'Score3', 'Score4', 'Course', 'Department'
]

students = spark.createDataFrame(data, columns)
dp.write(
    students,
    catalog="dageneral_sandbox",
    database="blipdataforge",
    table="students",
    write_mode="overwrite"
)

We can see the actual data of this table by using a simple SELECT * FROM SQL statement, like this:

SELECT *
FROM dageneral_sandbox.blipdataforge.students
StudentID Name Age Height Score1 Score2 Score3 Score4 Course Department
12114 Anne 21 1.56 8 9 10 9 Economics SC
13007 Adrian 23 1.82 6 6 8 7 Economics SC
10045 George 29 1.77 10 9 10 7 Law SC
12459 Adeline 26 1.61 8 6 7 7 Law SC
10190 Mayla 22 1.67 7 7 7 9 Design AR
11552 Daniel 24 1.75 9 9 10 9 Design AR

You can see in the output above, that we have in this table data about six different students of some hypotetical university. What if... I don't know, the students Anne and Daniel quitted this university, and, also, filled a lawsuit to the university asking it to delete every personal data that they might have about these students.

We can easily delete the data about these students from this students table, by using the delete() method. The delete() method of DataPlatform class have a catalog, database and a table arguments, just like the write() method. With these arguments you can specify which table you want to delete data from. In the example below, I am referring to the table dageneral_sandbox.blipdataforge.students that we just created in the code example above.

Now, the predicate argument is the key component in the delete() method. In this argument, you specify the predicate (or the delete clause if you prefer to call it this way) that will be used in the delete operation.

With this predicate, you can specify which observations exactly from the students table will be deleted. You can see in the example below, that I am using the predicate Name IN ("Anne", "Daniel"). This is essentially a SQL clause, or a SQL predicate that you would use in a WHERE SQL statement.

So, the predicate argument in the delete() method accepts as input, a string which contains predicates (or SQL clauses) that you would use in a WHERE SQL statement, to identify which exact observations from the table you want to delete.

students_to_remove = ["Anne", "Daniel"]
students_to_remove = [f'"{student}"' for student in students_to_remove]
students_to_remove = ", ".join(students_to_remove)

predicate = f"Name IN ({students_to_remove})"
dp.delete(
    catalog="dageneral_sandbox",
    database="blipdataforge",
    table="students",
    predicate=predicate
)

After we run the code above with delete() method, we can take a new look at the data present in the dageneral_sandbox.blipdataforge.students table. You can see in the result below, that the students Anne and Daniel were removed from the table succesfully.

SELECT *
FROM dageneral_sandbox.blipdataforge.students
StudentID Name Age Height Score1 Score2 Score3 Score4 Course Department
13007 Adrian 23 1.82 6 6 8 7 Economics SC
10190 Mayla 22 1.67 7 7 7 9 Design AR
12459 Adeline 26 1.61 8 6 7 7 Law SC
10045 George 29 1.77 10 9 10 7 Law SC

Differences between a DROP and a DELETE operation

Although they do resemble each other, a drop operation is a completly different operation than a delete operation. A drop operation is meant to completly erase an entire table (or database). You can see a drop operation as a complete destruction. It destroys the entire object (table or database) that you are aiming for.

In contrast, a delete operation deletes only the data that you ask it to delete. In other words, a delete operation is meant to destroy specific ranges/sections of data from a table. It destroys only a small part of the data present in a table.

If you want to, you can delete all data from a specific table using a delete operation. You just need to use a delete clause that covers all data present in the table. But you cannot delete the table itself with a delete operation (this is a job for a drop operation).