How to delete data from the Data Lake?
In order to remove/delete some data from the Data Lake, we need
to perform a delete operation. The blipdataforge library provide
the delete() method of DataPlatform class to do this task.
How the the delete() method works?
As an example, let's first create an actual table to demonstrate how we can delete
some data from it. We create this example table by using the write() method that we
described at Writing a Spark DataFrame into the Data Lake.
from pyspark.sql import SparkSession
from blipdataforge.facades import DataPlatform
spark = SparkSession.builder.getOrCreate()
dp = DataPlatform()
data = [
(12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
(13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
(10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
(12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
(10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
(11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
]
columns = [
'StudentID', 'Name', 'Age', 'Height', 'Score1',
'Score2', 'Score3', 'Score4', 'Course', 'Department'
]
students = spark.createDataFrame(data, columns)
dp.write(
students,
catalog="dageneral_sandbox",
database="blipdataforge",
table="students",
write_mode="overwrite"
)
We can see the actual data of this table by using a simple SELECT * FROM SQL statement,
like this:
| StudentID | Name | Age | Height | Score1 | Score2 | Score3 | Score4 | Course | Department |
|---|---|---|---|---|---|---|---|---|---|
| 12114 | Anne | 21 | 1.56 | 8 | 9 | 10 | 9 | Economics | SC |
| 13007 | Adrian | 23 | 1.82 | 6 | 6 | 8 | 7 | Economics | SC |
| 10045 | George | 29 | 1.77 | 10 | 9 | 10 | 7 | Law | SC |
| 12459 | Adeline | 26 | 1.61 | 8 | 6 | 7 | 7 | Law | SC |
| 10190 | Mayla | 22 | 1.67 | 7 | 7 | 7 | 9 | Design | AR |
| 11552 | Daniel | 24 | 1.75 | 9 | 9 | 10 | 9 | Design | AR |
You can see in the output above, that we have in this table data about six different students of some hypotetical university. What if... I don't know, the students Anne and Daniel quitted this university, and, also, filled a lawsuit to the university asking it to delete every personal data that they might have about these students.
We can easily delete the data about these students from this students table,
by using the delete() method. The delete() method of DataPlatform class have
a catalog, database and a table arguments, just like the write() method.
With these arguments you can specify which table you want to delete data from.
In the example below, I am referring to the table dageneral_sandbox.blipdataforge.students that we just created
in the code example above.
Now, the predicate argument is the key component in the delete() method. In
this argument, you specify the predicate (or the delete clause if you prefer to call it this way)
that will be used in the delete operation.
With this predicate, you can specify which observations exactly from the students table
will be deleted. You can see in the example below, that I am using the predicate
Name IN ("Anne", "Daniel"). This is essentially a SQL clause, or a SQL predicate
that you would use in a WHERE SQL statement.
So, the predicate argument in the delete() method accepts as input, a string
which contains predicates (or SQL clauses) that you would use in a WHERE SQL
statement, to identify which exact observations from the table you want to delete.
students_to_remove = ["Anne", "Daniel"]
students_to_remove = [f'"{student}"' for student in students_to_remove]
students_to_remove = ", ".join(students_to_remove)
predicate = f"Name IN ({students_to_remove})"
dp.delete(
catalog="dageneral_sandbox",
database="blipdataforge",
table="students",
predicate=predicate
)
After we run the code above with delete() method, we can take a new look at
the data present in the dageneral_sandbox.blipdataforge.students table.
You can see in the result below, that the students Anne and Daniel were removed
from the table succesfully.
| StudentID | Name | Age | Height | Score1 | Score2 | Score3 | Score4 | Course | Department |
|---|---|---|---|---|---|---|---|---|---|
| 13007 | Adrian | 23 | 1.82 | 6 | 6 | 8 | 7 | Economics | SC |
| 10190 | Mayla | 22 | 1.67 | 7 | 7 | 7 | 9 | Design | AR |
| 12459 | Adeline | 26 | 1.61 | 8 | 6 | 7 | 7 | Law | SC |
| 10045 | George | 29 | 1.77 | 10 | 9 | 10 | 7 | Law | SC |
Differences between a DROP and a DELETE operation
Although they do resemble each other, a drop operation is a completly different operation than a delete operation. A drop operation is meant to completly erase an entire table (or database). You can see a drop operation as a complete destruction. It destroys the entire object (table or database) that you are aiming for.
In contrast, a delete operation deletes only the data that you ask it to delete. In other words, a delete operation is meant to destroy specific ranges/sections of data from a table. It destroys only a small part of the data present in a table.
If you want to, you can delete all data from a specific table using a delete operation. You just need to use a delete clause that covers all data present in the table. But you cannot delete the table itself with a delete operation (this is a job for a drop operation).