Changelog

v1.25.16 (Current)

Exposed a new parameter on generate_datalake_sas_link to allow using another access key to sign the links.
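
A hypothetical usage sketch; the parameter name, its value and the import path below are assumptions, not the actual signature:

# Illustrative only: the import path and the 'signing_access_key' parameter
# name are assumptions about this release, not the library's real signature.
from blipdataforge.takespark.send.blob import generate_datalake_sas_link

link = generate_datalake_sas_link(
    "landingzone/exports/report.csv",                   # illustrative data lake path
    signing_access_key="secondary-storage-access-key",  # hypothetical new parameter
)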

v1.25.15 (12.11.2025)

Removed databricks-connect==16.4.8 and databricks-dlt==0.3.0 as dependencies to fix an INIT_SCRIPT failure.

v1.25.14 (30.10.2025)

Added databricks-connect==16.4.8 as a dependency to fix an INIT_SCRIPT failure.

v1.25.13 (29.10.2025)

Fixed the json import in the blipdataforge.takespark.send.blob module.

v1.25.12 (28.08.2025)

Fixed an import in the blipdataforge.takespark.send.blob module.

v1.25.11 (27.08.2025)

Fixed the extraction in write_dataframe_as_json() so the data is extracted as a list of dicts.

v1.25.10 (27.08.2025)

Added new fields to _collect_scan_results to improve the Soda quality checks message.

v1.25.9 (05.08.2025)

Removed the delta-spark>0.4 library from pyproject.toml; that library installs pyspark.

v1.25.8 (04.08.2025)

Added type control to the write_dataframe_as_json() function.

v1.25.7 (04.08.2025)

Added paramiko==3.5.1 as a dependency to fix the pysftp package.

v1.25.6 (01.08.2025)

Removed libraries from pyproject.toml that are already available in the Databricks cluster: ["boto3==1.24.28", "azure-storage-blob==12.19.0", "azure-storage-file-datalake==12.14.0", "ipython", "mypy-extensions==1.0.0", "mypy==1.3.0", "types-pytz==2023.3.1.1"]

v1.25.5 (31.07.2025)

Removed the pyspark library from pyproject.toml; it is already available in the Databricks cluster.

v1.25.4 (28.07.2025)

Fixed the Takespark function write_dataframe_as_json().

v1.25.3 (24.07.2025)

Changed duration of SAS signed links from 7 to 3 days.

v1.25.2 (12/06/2025)

Removed databricks-connect installation due to compatibility issues that prevent clusters from starting properly. Local clusters must use Databricks Runtime 16.3.

v1.25.1 (22/05/2025)

Fix the order of the parameters table_comment and columns_commens of the write function. This was necessary because some notebooks use the write function without explicitly naming the arguments, so when the new parameters were added in the middle of the existing ones, those notebooks were affected.

v1.25.0 (15/05/2025)

Adds parameters table_comment and columns_commens to the write function to specify table metadata.
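
A minimal sketch of how the new parameters might be passed; only table_comment and columns_commens come from this release, while the remaining argument names, the dict shape and the import path are assumptions:

# Illustrative sketch: argument names other than table_comment and
# columns_commens, the dict shape and the import path are assumptions.
from blipdataforge import DataPlatform

dp = DataPlatform()
dp.write(
    df,                                     # a Spark DataFrame built earlier
    table_name="sales_daily",               # illustrative
    table_comment="Daily snapshot of confirmed sales",
    columns_commens={
        "sale_id": "Unique sale identifier",
        "amount": "Sale amount in BRL",
    },
)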

v1.24.1 (14/05/2025)

This version makes the BlipDataForge library compatible with DBR 16.4 LTS. This was done by upgrading the dependency azure-eventhub to version 5.15.0.

v1.24.0 (10/04/2025)

Adds the governance rules exception to build the paths for usbliplayer and eubliplayer.

v1.23.4 (28/03/2025)

This release adds a small fix over the Takespark function write_dataframe_as_parquet().

v1.23.3 (19/03/2025)

This release adds a small fix over the Takespark function sftp_put_file(), which now uses the landzone_volume to access the provided file.

v1.23.2 (19/03/2025)

This release adds a small change to specific log messages that are produced by write_to_opensearch(). The idea is to identify these log messages more easily.

v1.23.1 (18/03/2025)

Adds parameters buffer_size and max_block_ms to the write_eventhub() function.
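
An illustrative call using the new tuning parameters; only buffer_size and max_block_ms come from this release, and the remaining argument names, the example values and the facade import are assumptions:

# Illustrative sketch: only buffer_size and max_block_ms come from this
# release; the other argument names, values and import path are assumptions.
from blipdataforge import DataPlatform

dp = DataPlatform()
dp.write_eventhub(
    df,                          # Spark DataFrame with the events to publish
    topic="my-topic",            # illustrative
    namespace="my-eventhub-ns",  # illustrative
    buffer_size=500_000,         # new parameter: producer batch/buffer size in bytes (example value)
    max_block_ms=60_000,         # new parameter: max time to block on send, in ms (example value)
)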

v1.23.0 (17/03/2025)

Adds the write_elasticsearch() function to the DataPlatform facade, enabling users to send Spark DataFrames to Elasticsearch databases.

v1.22.2 (17/03/2025)

A new argument was added to the facade write_to_opensearch(). Now, the user can select the Spark write mode they want to use when writing the data to the Opensearch index. Check out the user guide for the facade to learn more about this argument.

This release also removes the option opensearch.mapping.exclude from the facade, which means that the ID column provided to the id_column argument of the facade is also included as a component of the document schema itself.

v1.22.1 (11/03/2025)

This release adds a fix over the path used by write_to_s3_bucket() Takespark function.

v1.22.0 (10/03/2025)

A new strategy was added to the write_to_opensearch() function. Now, this function can also use a native Spark writer class to write the data to an Opensearch server.

v1.21.9 (27/02/2025)

This release fixes a bug inside the Takespark dbfs_zip_folder() function. This fix is related to how the blob object was being used inside the function.

v1.21.8 (27/02/2025)

This release changes the way that the landzone_volume is created in each domain.

v1.21.7 (26/02/2025)

Fix a bug in the sftp_put_file() and write_to_gcp_bucket() Takespark functions.

v1.21.6 (24/02/2025)

Add a fix over the write_to_s3_bucket() Takespark function. The function was using the file object given as input incorrectly when uploading it to the S3 bucket.

v1.21.5 (18/02/2025)

Add a fix over the write_to_s3_bucket() Takespark function. Now, this function uses a Databricks volume to access the input files. Also fix a small error in send_email_general().

v1.21.4 (18/02/2025)

Adds a fix over the auxiliary Takespark function read_from_blob(), which now uses a Databricks volume to access the files in the landzone container. Also, fix a wrong call to the startswith() method inside the write_to_s3_bucket() Takespark function.

v1.21.3 (17/02/2025)

Add a fix to the write_dataframe_as_excel() Takespark function by using Databricks volumes to write the necessary files.

v1.21.2 (14/02/2025)

This release tries to fix a problem that is happening across multiple functions from Takespark, by changing the way they access the files written.

v1.21.1 (13/02/2025)

Adds a better docstring and exception handling for the write_mongodb function. The exception messages and docstring now instruct users to use Single User clusters and to add the MongoDB Maven dependency to the job.

v1.21.0 (13/02/2025)

Adds a new Gaia Connector for MongoDB. Now it is possible to write a PySpark DataFrame into MongoDB through the function write_mongodb() from the DataPlatform facade.

v1.20.0 (13/02/2025)

New Gaia Connector available! Now, users can write a Spark DataFrame into an Opensearch database using the write_to_opensearch() function from the DataPlatform facade.

v1.19.9 (11/02/2025)

Move the output target of write_dataframe_as_csv() back to the landzone container, due to constraints imposed by old pipelines at Azure Data Factory.

v1.19.8 (10/02/2025)

This release is a fix for the previous release. The write_dataframe_as_csv() function was still producing erroneous paths in specific situations.

v1.19.7 (07/02/2025)

This release tries to fix the way that write_dataframe_as_csv() builds the path to the file saved in the landingzone container.

v1.19.6 (06/02/2025)

This release adds a new strategy to copy the file inside the Takespark function sftp_put_file().

v1.19.5 (06/02/2025)

The function write_dataframe_as_csv() was always trying to write the files into the clients domain landzone container. However, this function should try to write the files into the landzone container that is connected to the current domain where the user is. This release fixes that.

v1.19.4 (03/02/2025)

Don't use host keys in the SFTP connection created inside the Takespark function sftp_put_file().

v1.19.3 (03/02/2025)

This release adds a fix over the Takespark function sftp_put_file(). Now, this function creates a Databricks volume under the hood to access the file provided at local_file_path argument. This avoids the need to rewrite the provided file in the local disk of the worker before being sent to the SFTP server.

v1.19.2 (31/01/2025)

Fix blipdataforge.takespark.refining.support.aux_accountcontacts function, which had an error extracting data from the Extras field when it was in JSON format.

v1.19.1 (29/01/2025)

Fix the DataPlatform facade that was crashing due to dlt import even when using a cluster that does not support it.

v1.19.0 (28/01/2025)

Adds support for streaming read, write and processing using Delta Live Tables. Now the DataPlatform implements the get_streaming_engine() function, which returns an interface that allows users to create streaming pipelines. Currently, only streaming flows using Delta Live Tables are supported.

v1.18.10 (23/01/2025)

  • Fix the DataPlatform.write() that was not setting schema evolution when using the write mode upsert. Now either schema_mode or merge_schema (deprecated) enables schema evolution in upsert mode.
  • Adds an exception rule to build the storage account name for the dataengtechnology domain. Without this fix, the users of this domain cannot write to the data lake.

v1.18.9 (09/01/2025)

This version changes how Firehose HTTP requests are made. Now it uses requests.Session as a context manager, and sets HTTP retries in case of DNS lookup failures, socket connection failures or timeouts.

This version also adds more info to Firehose error messages, containing the payload sent to Firehose.

v1.18.8 (07/01/2025)

Adds an error log call to the send_data() and ingest_data() functions in case Firehose responds with an error or warning HTTP status, to improve observability in Grafana.

v1.18.7 (23/12/2024)

Adds a log call to the share_files() containing the number of files shared and the total size in bytes. This log is intended to be used in Grafana to track connector metrics.

v1.18.6 (18/12/2024)

Adds a log call to the write_eventhub() after the write conclusion containing the dataframe count, topic and namespace. This log is intended to be used in Grafana to track connector metrics.

v1.18.5 (04/12/2024)

This release adds a new parameter to the facade share_files(), to allow the user to specify the title of the email that will be sent by the service.

v1.18.4 (29/11/2024)

Exposes the linger.ms configuration for the eventhub connector, and sets a new default of 2 ms for this config. For more information, please refer to the Kafka linger.ms documentation.

v1.18.3 (19/11/2024)

Raises the EventHub batch size to 500KB, which significantly reduces the total write duration. Also, exposes the batch size and request timeout configuration as parameters of the write_eventhub() function. This gives the user more flexibility to tune EventHub writes.

v1.18.2 (18/11/2024)

Raises the request timeout of the kafka backend in the write_eventhub() function to 5 minutes. Now, the client will wait longer to receive a response after a send request for each batch sent to the EventHub topic.

v1.18.1 (12/11/2024)

Turns the KafkaConnector into EventHubConnector, narrowing the context of the connector for the sake of a friendly user experience.

v1.18.0 (30/10/2024)

This release of blipdataforge introduces the KafkaConnector class, which can be used to write to or read from a Kafka topic. Check out the User Guide at dataforgedocs.blip.tools for more details.

v1.17.4 (29/10/2024)

blipdataforge automatically builds (at runtime) paths for any data lake that it needs to access for any reason. However, the library was building the wrong paths for the sales and marketing environments at Databricks. This release fixes this issue.

v1.17.3 (22/10/2024)

This release fixes the persist_files() facade. Previously, it was possible to use this facade to copy files to other containers that are outside of the "landingzone" container of your current domain. Now, this facade will raise an exception if you try to copy your files to a volume that is connected to a container that is not the "landingzone" container of your current domain.

v1.17.2 (15/10/2024)

This release add the following fixes:

  • Fix the lack of the item lifecycleTimeInDays in the Data Contract API responses, by marking it as an optional parameter.
  • Add logger calls to the Data Contract API, so that we can have minimal visibility of its use.

v1.17.1 (14/10/2024)

The wrong function argument was being used in the integration tests for the Data Contract API. This release fixes the issue.

v1.17.0 (14/10/2024)

Adds the function share_files, which allows the user to share data lake files through download links sent via email.
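
A hypothetical usage sketch; the argument names and the import path are assumptions about the facade:

# Illustrative sketch: the argument names and import path are assumptions.
from blipdataforge import DataPlatform

dp = DataPlatform()
dp.share_files(
    files=["landingzone/exports/report_2024.csv"],  # data lake files to share
    emails=["analyst@example.com"],                 # recipients of the download links
)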

v1.16.1 (26/09/2024)

Add user guide for the new persist_files() function to the documentation.

v1.16.0 (26/09/2024)

Added an index argument to the ingest_google_sheet() facade. Now, users can select a specific region/range of the Google Sheet that they want to ingest/read with this index argument. Read the user guide for more details.
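
An illustrative call; only the index argument comes from this release, and the other argument names and the import path are assumptions:

# Illustrative sketch: only the 'index' argument comes from this release;
# the other argument names and import path are assumptions.
from blipdataforge import DataPlatform

dp = DataPlatform()
dp.ingest_google_sheet(
    sheet_id="1AbCdEfGhIjKlMnOpQrStUv",           # illustrative Google Sheet ID
    destination_table="catalog.schema.my_sheet",  # illustrative
    index=0,                                      # new argument: region/range of the sheet to ingest
)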

v1.15.0 (25/09/2024)

Adds the persist_files function that enables saving local files into Databricks Volumes.

Fix the write function that failed to build the path for the dataplatform's sandbox due to infrastructure naming: the Dataplatform DEV storage account ends with 002 instead of 001.

v1.14.3 (12/09/2024)

Add a small note to the function documentation of ingest_google_sheet(), to remind users to share their Google Sheet with the Data Routing service.

v1.14.2 (09/09/2024)

Fix a bug in the ingest_data and send_data functions that caused a KeyError while trying to retrieve the Firehose API URL from the configuration file.

v1.14.1 (09/09/2024)

A small check was added to the ingest_google_sheet() facade, with the objective of preventing the user from providing a sandbox catalog in a PRD environment, or vice versa.

v1.14.0 (05/09/2024)

Adds the function ingest_google_sheet to the DataPlatform facade, enabling users to run Google Sheets ingestion through the Gaia Data Routing service.

This version also removes the enforcement of allowed domains to use the ingest_google_sheet function.

v1.13.1 (05/09/2024)

Fix typo in TakeSpark get_adf_widgets() function.

v1.13.0 (05/09/2024)

Adds a new way to retrieve configurations from the environment. Now, all the configurations in the config.BlipDataForgeConfigs class primarily come from environment variables but, when not set, the configs are retrieved from a configuration file.

Also, some classes were refactored to not use the BlipDataForgeConfigs object directly. Now the caller classes and methods pass the parameters explicitly.
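
A generic illustration of the lookup order described above (environment variable first, configuration file as fallback); the key name and file format are placeholders, not the library's actual code:

# Generic illustration of the env-var-first lookup order; the key name and
# config file shown here are placeholders, not the library's actual code.
import json
import os

def get_config(key: str, config_file: str = "config.json") -> str:
    value = os.environ.get(key)      # 1) prefer the environment variable
    if value is not None:
        return value
    with open(config_file) as f:     # 2) fall back to the configuration file
        return json.load(f)[key]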

v1.12.6 (02/09/2024)

Bug fixes over the TakeSpark function get_adf_widgets(): this function was using table references from the legacy Databricks environment. Those table references do not exist in the new environment and needed to be changed. The bullet points below provide the mapping from the previous table references to the new ones:

  • dslab.take_deals_bots_hs_info -> dageneral_trustedzone.blip.take_deals_bots_hs_info
  • blipcs.application -> bliplayer.rawsup.portal_applications
  • genericConsumerZone.bliprawsupdata_tenants -> bliplayer.rawsup.portal_tenants

There was a fourth legacy table reference used by get_adf_widgets(): growth.hubspot_bots_segment_info. However, this table reference has no equivalent table in the new environment, so we could not map it to one.

This growth.hubspot_bots_segment_info table was used when the widget use_client_bot was set to S. Now, the table dslab.take_deals_bots_hs_info is used instead in such case.

v1.12.5 (23/08/2024)

  • Rollback changes made in the pyspark and delta-spark dependencies.

v1.12.4 (22/08/2024)

  • Move delta-spark package to tests dependency.

v1.12.3 (22/08/2024)

  • Move pyspark package to tests dependency.

v1.12.2 (21/08/2024)

  • Fix the docstring of the get_latest_record_date function.
  • Removes the pinned version 3.4.0 of the pyspark dependency to reduce cluster initialization time. This change saved approximately 00:03:25, which represents a 45% reduction in cluster initialization time.

v1.12.1 (14/08/2024)

Increase the max file size transfer for the write_to_s3_bucket() function from 78GB to 100GB.

v1.12.0 (14/08/2024)

  • Adds DataRoutingAPIClient class to support integration with Data Routing API.
  • Adds DataRoutingAPIClient.ingest_google_sheet() function to ingest google sheets.

The Google Sheets ingestion is in the testing phase, so it has not yet been exposed on the DataPlatform facade and it works only in the clients domain.

v1.11.2 (12/08/2024)

Fix the sftp_put_file() function error triggered when the SSH private key comes from a secret stored in a Databricks secret scope.

v1.11.1 (01/08/2024)

Revert changes made in version v1.11.0 to the function save_tables_alt(). Now, the function has the same behaviour as in versions prior to v1.11.0.

v1.11.0 (30/07/2024)

Add overwrite schema mode to the function save_tables_alt() from TakeSpark, so that the behaviour of save_tables_alt() in the new environment matches its behaviour in the old environment.

v1.10.0 (18/07/2024)

Add the optional parameter private_key to the blipdataforge.takespark.send.sftp.sftp_put_file function. Now this function accepts a private key to establish a connection with the target SFTP server. Also adds a log call to this function to keep track of its usage.
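
An illustrative call; the module path and the private_key parameter come from this entry, while the remaining argument names and values are assumptions:

# Illustrative sketch: the module path and private_key parameter come from
# this entry; the other argument names and values are assumptions.
from blipdataforge.takespark.send.sftp import sftp_put_file

sftp_put_file(
    local_file_path="/dbfs/tmp/report.csv",                        # illustrative
    host="sftp.partner.example.com",                               # illustrative
    username="blip-integration",                                   # illustrative
    private_key=dbutils.secrets.get("scope", "sftp-private-key"),  # new optional parameter (dbutils is a Databricks notebook global)
)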

v1.9.6 (17/07/2024)

Fix the blipdataforge.takespark.refining.eventtracks.eventtracks_session function that was breaking due to an uninitialized variable.

v1.9.5 (12/07/2024)

Add a small fix over the function dbfs_create_zip_folder() from TakeSpark. This function was not properly compressing the provided data. This problem was fixed by using the enum value zipfile.ZIP_DEFLATED.
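
For reference, a minimal illustration of the fix in terms of the standard library (file names are illustrative): ZipFile stores entries uncompressed by default, so passing zipfile.ZIP_DEFLATED is what makes it actually compress them.

# Minimal illustration of the fix: without compression=zipfile.ZIP_DEFLATED,
# ZipFile defaults to ZIP_STORED and the entries are not compressed at all.
import zipfile

with zipfile.ZipFile("output.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("data/report.csv")  # illustrative file; written compressed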

v1.9.4 (08/07/2024)

Add fix over the logger call that happens inside the version of check_delta_update that is inside TakeSpark (deprecated version).

v1.9.3 (08/07/2024)

Patch version. This version adds a small fix over logger calls made inside functions from TakeSpark:

  • write_dataframe_as_csv();
  • email_files_generator();
  • send_mail();
  • send_email_general();

v1.9.2 (28/06/2024)

Adds a logger call to the following function from TakeSpark:

  • check_delta_update() (this is the version of the function that is embedded into TakeSpark, not the one that is in the DataPlatform class);

v1.9.1 (28/06/2024)

Adding logger calls inside the following functions from TakeSpark, with the objective of increasing visibility and monitoring capabilities over them.

  • write_dataframe_as_csv();
  • email_files_generator();
  • send_email_general();

v1.9.0 (27/06/2024)

Add check_delta_update function to DataPlatform class to enable DSLAB migration.

v1.8.5 (25/06/2024)

Add a small fix over the format of the Soda Quality Checks logs sent to EventHub.

v1.8.4 (19/06/2024)

Introduced a new version of the documentation of the blipdataforge library.

v1.8.3 (03/06/2024)

Fixes:

  • Fix over the write_to_s3_bucket() function. Now, this function receives as input the keys to the secrets in Azure Key Vault that contain the access key and access key ID necessary to connect to the AWS S3 bucket.

So, if you want to connect to a client's AWS S3 bucket, and you registered the access key and access key ID of this bucket in the Azure Key Vault under the keys "example_key_access_key" and "example_key_access_key_id", then you use the write_to_s3_bucket() function like this:

from blipdataforge.takespark.send.bucket_write import write_to_s3_bucket

write_to_s3_bucket(
    [file],                                          # list of files to upload
    aws_access_key_id="example_key_access_key_id",   # Key Vault key holding the access key ID
    aws_secret_access_key="example_key_access_key",  # Key Vault key holding the access key
    bucket_name="example_of_bucket_name",
    write_base_path="/incoming-files/"
)

The write_to_s3_bucket() function will do the job of collecting the actual values of the access key and access key ID from the Azure Key Vault for you, and use them to connect to the AWS S3 bucket.

New EventHubs were created to receive the logs produced by BlipDataForge. In this version we added new connection logic to connect the lib to these new EventHubs.

v1.8.2 (03/06/2024)

New features:

  • In this version, we added a small change to the Soda Quality Check functions. Now, they use the logger to send the Soda Quality Check logs to EventHub, to incorporate them into all DataForge logs in general.

v1.8.1 (09/05/2024)

Fixes:

  • Fix usage of the escape_char parameter when the dataframe is split into multiple files.

v1.8.0 (25/04/2024)

New features:

  • Adds the escape_char parameter on the email_files_generator and write_dataframe_as_csv functions of Takespark to set the escape character that will be used to escape quotes (see the sketch below).
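
An illustrative call; only the escape_char parameter comes from this release, and the other argument names, values and the import path are assumptions:

# Illustrative sketch: only the escape_char parameter comes from this
# release; the other argument names, values and import path are assumptions.
from blipdataforge.takespark.send.blob import write_dataframe_as_csv

write_dataframe_as_csv(
    df,                              # Spark DataFrame to export
    file_path="exports/report.csv",  # illustrative
    sep=";",                         # illustrative separator
    escape_char="\\",                # new parameter: character used to escape quotes
)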

v1.7.42 (25/04/2024)

New features:

  • The logs sent to EventHub now include two new items which record the Job and Run IDs in Databricks.

v1.7.3 (27/03/2024)

  • Refactor takespark write_dataframe_as_csv function to not use pandas lib due to OOM errors and inconsistencies in the output file.

v1.7.2 (25/03/2024)

  • Fix takespark function write_dataframe_as_parquet to use random uuid in temp directory name due to race condition errors.

v1.7.1 (21/03/2024)

  • Refactor takespark write_dataframe_as_parquet function to not use pandas lib due to OOM errors.

v1.7.0 (20/03/2024)

  • Adds new abstraction of Data Routing layer to enable ingesting and sending data through Data Routing.
  • Adds function ingest_data on DataPlatform facade that enables ingesting data from various sources through the Data Routing.
  • Refactor send_data function of DataPlatform facade to use the new abstraction of Data Routing.

v1.6.6 (05/03/2024)

  • Fix send_mult_csv_mail function to convert files argument to a list if a string was passed.

v1.6.5 (26/01/2024)

  • Refactor lib logger. Now any class or function can use the logger through dependency injection.

v1.6.4 (18/01/2024)

  • Fix function email_files_generator to use the separator character passed in sep parameter.
  • Changes default value of sep parameter to ;.

v1.6.3 (16/01/2024)

  • Fix function tickets_bots_filter to add the catalog prefix clients_trustedzone if the two dots nomenclature is passed in source_table and/or destination_table.

v1.6.2 (10/01/2024)

  • Adds authentication on DataContractRepository to communicate with Data Contract API.
  • Refactors data quality automated tests to use authentication token through DATA_CONTRACT_API_TOKEN environment variable.
  • Replaces some print calls with logger calls.

v1.6.1 (04/01/2024)

  • Fix broken SFTP send code by checking whether the upload folder exists before creating it.
  • Remove the prefix folder from the storage account write functions.

v1.6.0 (03/01/2024)

  • Adds integration with Data Routing to send data to SFTP.
  • Adds integration with Data Routing to send data to Elasticsearch.
  • Adds integration with Data Routing to send events to an Event Hub or Kafka topic.
  • Adds send_data function on DataPlatform facade to work with Data Routing API.
  • Adds documentation for data outputs

v1.5.3 (02/01/2024)

  • Add dbfs_create_zip_folder function that creates a zip folder from files already in the datalake.
  • Fix all write_dataframe_as_ functions by removing a string refining step that was altering the file path sent by the user.

v1.5.2 (28/12/2023)

  • Fix rule to get data from clients_trustedzone
  • Add smalltalks functions
  • Add save_tables_alt and initial_info functions
  • Add a list of all functions dataforge absorbs from takespark to the docs page

v1.5.1 (18/12/2023)

  • Fix googlesheets module by importing databricks.sdk
  • Fix how the blob url (write functions) is built

v1.5.0 (15/12/2023)

Absorb the following functions from takespark:

  • write to azure blob functions (write_dataframe_as_csv)
  • send email, write to S3, GCP and SFTP
  • refining functions (messages_base, eventtracks_base etc)
  • import data from google spreadsheet

v1.4.0 (12/12/2023)

  • Fix version number to 1.4.0.
  • Change automated quality checks scan name to include catalog, database and table names.
  • Fix log message before quality checks execution.

v1.3.7 (30/11/2023)

  • Adds automated quality checks before write operation based on data contract information.
  • Adds quality checks results persistence on Data Lake.
  • Adds support for sending quality checks results to Soda Cloud.
  • Adds run_quality_checks function that allows users to run custom quality checks over spark dataframes and data lake tables.
  • Adds get_data_contract function on DataPlatform facade to request data contracts from Data Contract API.
  • Removes commented code

v1.3.6 (21/11/2023)

  • Removes enforcement to use sandbox catalog when running on development workspaces. (DSLAB)
  • Exposes path parameter on DataWriter. (DSLAB)

v1.3.5 (20/11/2023)

  • Adds get_latest_record_date function that allows getting the date of the latest record in a table. This function is an alternative to the old initial_info function of the TakeSpark library.

v1.3.4 (20/11/2023)

  • Adds delete function that allows deleting records from domain catalog tables.

v1.3.3 (09/11/2023)

  • Update integration tests to run on Databricks workspace dbw-dageneral-dev-brazilsouth.
  • Adds release dates to changelog documentation.
  • Fix regex of workspace_id property from DatabricksContext.
  • Adds versioning for init script used to install latest development version on dedicated development cluster.

v1.3.2 (06/11/2023)

  • Adds support for running quality checks with Soda library

v1.3.1 (27/10/2023)

  • Adds communication interface with Data Contract API
  • Adds support for handling Data Contracts

v1.3.0 (26/09/2023)

  • Adds support for write mode upsert.

v1.2.0 (12/09/2023)

  • Adds support for schema mode overwriteSchema on write in delta.
  • Set parameter merge_schema of write operation as DEPRECATED.

v1.1.1 (05/09/2023)

  • Adds error log in case of failure during write operation.
  • Fix bug that breaks execution if a log message contains \n.

v1.1.0 (29/08/2023)

  • Adds contexts.py module
  • Removes constants.py module
  • Refactors unit tests to use contexts instead of constants.
  • Refactors data_governance module to use static methods on CatalogPolicyEnforcer class.

v1.0.0 (23/08/2023)

  • Improves library usability by renaming all classes and methods that use Loader or load to Writer and write.
  • Creates new folders for better organization of docs.

v0.4.0 (22/08/2023)

  • Adds default business logic for creating data lake paths according to the new domain structure.

v0.3.1 (18/08/2023)

  • Gets domain and environment information from cluster tags.
  • Adds domain info to logs.
  • Fix data_loader tests that send logs to eventhub.

v0.3.0 (16/08/2023)

  • Adds a facade class DataPlatform that encapsulates most used ETL functions.
  • Adds capability of log shipping to Azure EventHub.
  • Adds black==23.7.0 package in test dependencies

v0.2.0 (28/07/2023)

  • Adds delta write on external location
  • Adds creation of external table
  • Removes creation of managed tables
  • Adds business logic to support delta writes on legacy lake structure
  • Adds default behavior for delta write if a new catalog is specified
  • Breaks pipeline execution if any unit test fails

v0.1.0 (25/07/2023)

  • Adds delta write on data lake (still using managed tables)
  • Creates database if not exists
  • Creates table if not exists
  • Uses scianalytics_sandbox catalog if job is running on development workspace
  • Adds support for all write modes accepted by spark dataframe writer
  • Adds support for schema evolution
  • Adds support for table partitioning
  • Adds unit tests in CI/CD pipeline
  • Publishes unit test results on the CI/CD pipeline