Changelog
v1.25.16 (Current)
Exposed a new parameter on generate_datalake_sas_link to allow signing the links with another access key.
v1.25.15 (12.11.2025)
Removed databricks-connect==16.4.8 and databricks-dlt==0.3.0 as dependencies to fix INIT_SCRIPT failure.
v1.25.14 (30.10.2025)
Added databricks-connect==16.4.8 as dependency to fix INIT_SCRIPT failure.
v1.25.13 (29.10.2025)
Fixed the json import in module blipdataforge.takespark.send.blob.
v1.25.12 (28.08.2025)
Fixed an import in module blipdataforge.takespark.send.blob.
v1.25.11 (27.08.2025)
Fixed the extraction of data as a list of dicts in the write_dataframe_as_json() function.
v1.25.10 (27.08.2025)
Added new fields to _collect_scan_results to improve the Soda quality checks message.
v1.25.9 (05.08.2025)
Removed the delta-spark>0.4 library from pyproject.toml, since it installs pyspark.
v1.25.8 (04.08.2025)
Added a type check in the write_dataframe_as_json() function.
v1.25.7 (04.08.2025)
Added paramiko==3.5.1 as a dependency to fix the pysftp package.
v1.25.6 (01.08.2025)
Removed libraries from pyproject.toml that are already available in the Databricks cluster: boto3==1.24.28, azure-storage-blob==12.19.0, azure-storage-file-datalake==12.14.0, ipython, mypy-extensions==1.0.0, mypy==1.3.0, types-pytz==2023.3.1.1.
v1.25.5 (31.07.2025)
Removed the pyspark library from pyproject.toml; it is already available in the Databricks cluster.
v1.25.4 (28.07.2025)
Fixed the Takespark function write_dataframe_as_json().
v1.25.3 (24.07.2025)
Changed duration of SAS signed links from 7 to 3 days.
v1.25.2 (12/06/2025)
Removed databricks-connect installation due to compatibility issues that prevent clusters from starting properly. Local clusters must use Databricks Runtime 16.3.
v1.25.1 (22/05/2025)
Fix the order of the parameters table_comment and columns_commens of the write function. This was necessary because some notebooks use the write function without explicitly naming the arguments, so when the new parameters were added in the middle of the existing ones, those notebooks were affected.
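For callers, passing these arguments by name avoids this class of breakage entirely. A minimal sketch, assuming the DataPlatform facade and an existing dataframe df; the import path and the table-name argument are illustrative assumptions, not the exact write() signature:
```python
# Illustrative sketch: argument names other than table_comment and
# columns_commens are assumptions, not the exact write() signature.
from blipdataforge import DataPlatform  # import path assumed

platform = DataPlatform()
platform.write(
    df,                                      # a Spark DataFrame prepared earlier
    table_name="my_table",                   # assumed parameter name
    table_comment="Daily sales snapshot",    # table metadata added in v1.25.0
    columns_commens={"id": "primary key"},   # column metadata added in v1.25.0
)
```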
v1.25.0 (15/05/2025)
Adds parameters table_comment and columns_commens to the write function to specify table metadata.
v1.24.1 (14/05/2025)
This version makes the BlipDataForge library compatible with DBR 16.4 LTS. This was done by upgrading the dependency azure-eventhub to version 5.15.0.
v1.24.0 (10/04/2025)
Adds the governance rules exception to build the paths for usbliplayer and eubliplayer.
v1.23.4 (28/03/2025)
This release adds a small fix over the Takespark function write_dataframe_as_parquet().
v1.23.3 (19/03/2025)
This release adds a small fix over the Takespark function sftp_put_file(), which now uses
the landzone_volume to access the provided file.
v1.23.2 (19/03/2025)
This release adds a small change to specific log messages that are produced by write_to_opensearch().
The idea is to identify these log messages more easily.
v1.23.1 (18/03/2025)
Adds parameters buffer_size and max_block_ms to the write_eventhub() function.
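As a rough illustration of the new tuning knobs, a hedged sketch follows; only buffer_size and max_block_ms come from this release, while the remaining argument names (dataframe, topic, namespace) are assumptions:
```python
# Sketch only: argument names other than buffer_size and max_block_ms are
# assumptions; consult the write_eventhub() user guide for the real signature.
platform.write_eventhub(
    df,                                 # Spark DataFrame to publish
    namespace="my-eventhub-namespace",  # assumed argument name
    topic="my-topic",                   # assumed argument name
    buffer_size=32 * 1024 * 1024,       # producer buffer size in bytes (illustrative value)
    max_block_ms=60_000,                # how long a send may block when the buffer is full
)
```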
v1.23.0 (17/03/2025)
Adds the write_elasticsearch() function to the DataPlatform facade, which enables users to send Spark dataframes to Elasticsearch databases.
v1.22.2 (17/03/2025)
A new argument was added to the facade write_to_opensearch(). Now, the user can select the Spark write mode
they want to use when writing the data to the Opensearch index. Check out the user guide for the
facade to know more about this argument.
This release also removes the option opensearch.mapping.exclude from the facade, which means that the ID
column provided to the id_column argument of the facade is also included as a component of the document
schema itself.
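A hedged sketch of how the new argument might look alongside id_column; the argument name mode and the values shown are assumptions drawn from this description, not the verified signature:
```python
# Sketch under assumptions: the write-mode argument is called `mode` here for
# illustration; the index name and ID column are placeholders.
platform.write_to_opensearch(
    df,                       # Spark DataFrame to index
    index="customer-events",  # target Opensearch index (placeholder)
    id_column="event_id",     # document ID column, now also kept in the document schema
    mode="append",            # Spark write mode selected by the user
)
```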
v1.22.1 (11/03/2025)
This release adds a fix over the path used by write_to_s3_bucket() Takespark function.
v1.22.0 (10/03/2025)
A new strategy was added to the write_to_opensearch() function. Now, this function can also use a native Spark writer
class to write the data to an Opensearch server.
v1.21.9 (27/02/2025)
This release fixes a bug inside the Takespark dbfs_zip_folder() function. This fix is related to
how the blob object was being used inside the function.
v1.21.8 (27/02/2025)
This release changes the way that the landzone_volume is created in each domain.
v1.21.7 (26/02/2025)
Fix a bug in the sftp_put_file() and write_to_gcp_bucket() Takespark functions.
v1.21.6 (24/02/2025)
Add a fix over the write_to_s3_bucket() Takespark function. The function was using the file object
given as input incorrectly when uploading it to the S3 bucket.
v1.21.5 (18/02/2025)
Add a fix over the write_to_s3_bucket() Takespark function. Now, this function uses a Databricks volume
to access the input files. Also fix a small error in send_email_general().
v1.21.4 (18/02/2025)
Adds a fix over the auxiliary Takespark function read_from_blob(), which now uses a Databricks volume to access the files in
the landzone container. Also, fix a wrong call to the startswith() method inside the write_to_s3_bucket() Takespark function.
v1.21.3 (17/02/2025)
Add a fix to the write_dataframe_as_excel() Takespark function by using Databricks volumes to write the necessary files.
v1.21.2 (14/02/2025)
This release tries to fix a problem that is happening across multiple functions from Takespark, by changing the way they access the files written.
v1.21.1 (13/02/2025)
Adds a better docstring and exception handling for the write_mongodb function. The exception messages and docstring now instruct users to use Single User clusters and add the mongodb maven dependency to the job.
v1.21.0 (13/02/2025)
Adds a new Gaia Connector for MongoDB. Now it is possible to write a PySpark dataframe into MongoDB through the write_mongodb() function from the DataPlatform facade.
v1.20.0 (13/02/2025)
New Gaia Connector available! Now, users can write a Spark DataFrame into an Opensearch database,
using the write_to_opensearch() function from the DataPlatform() facade.
v1.19.9 (11/02/2025)
Move the output target of write_dataframe_as_csv() back to the landzone container, due to constraints imposed by old pipelines at Azure Data Factory.
v1.19.8 (10/02/2025)
This release is a fix for the previous release. The write_dataframe_as_csv() function was still producing erroneous paths in specific situations.
v1.19.7 (07/02/2025)
This release tries to fix the way that write_dataframe_as_csv() builds the path to the file saved in the landingzone container.
v1.19.6 (06/02/2025)
This release adds a new strategy to copy the file inside the Takespark function sftp_put_file().
v1.19.5 (06/02/2025)
The function write_dataframe_as_csv() was always trying to write the files into the clients domain landzone container.
However, this function should try to write the files into the landzone container that is connected to the current domain
where the user is. This release fixes that.
v1.19.4 (03/02/2025)
Do not use host keys in the SFTP connection created inside the Takespark function sftp_put_file().
v1.19.3 (03/02/2025)
This release adds a fix over the Takespark function sftp_put_file(). Now, this function creates a Databricks volume
under the hood to access the file provided at local_file_path argument. This avoids the need to rewrite the provided
file in the local disk of the worker before being sent to the SFTP server.
v1.19.2 (31/01/2025)
Fix blipdataforge.takespark.refining.support.aux_accountcontacts function, which had an error extracting data from the Extras field when it was in JSON format.
v1.19.1 (29/01/2025)
Fix the DataPlatform facade that was crashing due to dlt import even when using a cluster that does not support it.
v1.19.0 (28/01/2025)
Adds support for streaming read, write and processing using Delta Live Tables. Now the DataPlatform implements the get_streaming_engine() function, which returns an interface that allows users to create streaming pipelines. Currently, only streaming flows using Delta Live Tables are supported.
v1.18.10 (23/01/2025)
- Fix the DataPlatform.write() that was not setting the schema evolution when using the write mode upsert. Now the schema_mode or the merge_schema (deprecated) enables the schema evolution in upsert mode (see the sketch below).
- Adds an exception rule to build the storage account name for the dataengtechnology domain. Without this fix, the users of this domain cannot write to the data lake.
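A minimal sketch of the fixed behaviour, assuming the same DataPlatform.write() facade; the table argument name and the schema_mode value are illustrative assumptions:
```python
# Sketch only: the write mode and schema_mode come from this entry; "merge" as
# the schema_mode value and the other argument names are assumptions.
platform.write(
    df,                      # Spark DataFrame with new or changed columns
    table_name="my_table",   # assumed parameter name
    mode="upsert",           # write mode whose schema evolution was fixed here
    schema_mode="merge",     # enables schema evolution (replaces the deprecated merge_schema)
)
```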
v1.18.9 (09/01/2025)
This version changes how Firehose HTTP requests are made. Now it uses requests.Session as a context manager, and sets HTTP retries in case of DNS lookup failures, socket connection failures or timeouts.
This version also adds more info to Firehose error messages, containing the payload sent to Firehose.
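The retry wiring is standard requests/urllib3 usage; a generic sketch of the pattern described above, with a placeholder endpoint, payload and retry counts rather than the library's actual values:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Generic sketch of the pattern: a Session used as a context manager, with an
# HTTPAdapter that retries connection failures (including DNS errors) and timeouts.
retries = Retry(total=3, connect=3, read=3, backoff_factor=1,
                status_forcelist=[502, 503, 504])

with requests.Session() as session:
    session.mount("https://", HTTPAdapter(max_retries=retries))
    response = session.post(
        "https://firehose.example.com/ingest",  # placeholder endpoint
        json={"records": []},                   # placeholder payload
        timeout=30,
    )
    response.raise_for_status()
```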
v1.18.8 (07/01/2025)
Adds an error log call to the send_data() and ingest_data() functions in case that Firehose responds with an error or warning HTTP status, to improve observability in Grafana.
v1.18.7 (23/12/2024)
Adds a log call to the share_files() containing the number of files shared and the total size in bytes. This log is intended to be used in Grafana to track connector metrics.
v1.18.6 (18/12/2024)
Adds a log call to the write_eventhub() after the write conclusion containing the dataframe count, topic and namespace. This log is intended to be used in Grafana to track connector metrics.
v1.18.5 (04/12/2024)
This release adds a new parameter to the facade share_files(), to allow the user to specify the title of
the email that will be sent by the service.
v1.18.4 (29/11/2024)
Exposes the linger.ms configuration for the eventhub connector, and set a new default of 2 ms for this config. For more information, please refer to Kafka linger.ms documentation.
v1.18.3 (19/11/2024)
Raises the EventHub batch size to 500KB, which significantly reduces the total write duration. Also, exposes the batch size and request timeout configuration as parameters of the write_eventhub() function. This will give the user more flexibility to tune the eventhub writes.
v1.18.2 (18/11/2024)
Raises the request timeout of the kafka backend in the write_eventhub() function to 5 minutes. Now, the client will wait longer to receive a response after a send request for each batch sent to the EventHub topic.
v1.18.1 (12/11/2024)
Turns the KafkaConnector into EventHubConnector, narrowing the context of the connector for the sake of a friendly user experience.
v1.18.0 (30/10/2024)
This release of blipdataforge introduces the KafkaConnector class, which can be used
to write to or read from a Kafka topic. Check out the User Guide at dataforgedocs.blip.tools
for more details.
v1.17.4 (29/10/2024)
blipdataforge automatically builds (at runtime) paths for any data lake that it needs to
access for any reason. However, the library was building the wrong paths for the sales
and marketing environments at Databricks. This release fixes this issue.
v1.17.3 (22/10/2024)
This release fixes the persist_files() facade. Previously, it was possible to use
this facade to copy files to other containers that are outside of the "landingzone" container
of your current domain. Now, this facade will raise an exception if you try to copy your
files to a volume that is connected to a container that is not the "landingzone" container
of your current domain.
v1.17.2 (15/10/2024)
This release adds the following fixes:
- Fix the lack of the item lifecycleTimeInDays in the Data Contract API responses, by marking it as an optional parameter.
- Add logger calls to the Data Contract API, so that we have minimum visibility of its use.
v1.17.1 (14/10/2024)
The wrong function argument was being used on the integration tests for the Data Contract API. This release fixes this issue.
v1.17.0 (14/10/2024)
Adds the function share_files, which allows the user to share data lake files through download links sent via email.
v1.16.1 (26/09/2024)
Add user guide for the new persist_files() function to the documentation.
v1.16.0 (26/09/2024)
Added an index argument to the ingest_google_sheet() facade. Now, users can
select a specific region/range of the Google Sheet that they want to ingest/read
with this index argument. Read the user guide for more details.
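A hedged usage sketch; the range syntax and the argument names besides index are assumptions based on this description, not the exact signature:
```python
# Sketch only: argument names other than `index` are assumptions; see the
# ingest_google_sheet() user guide for the real signature.
platform.ingest_google_sheet(
    sheet_id="<google-sheet-id>",             # placeholder Google Sheet identifier
    index="Sheet1!A1:D100",                   # region/range of the sheet to ingest (illustrative)
    destination_table="catalog.db.my_table",  # assumed argument name
)
```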
v1.15.0 (25/09/2024)
Adds the persist_files function that enables saving local files into Databricks Volumes.
Fix the write function that fails to build the path for the dataplatform's sandbox due to infrastructure naming. Dataplatform DEV storage account ends with 002 instead of 001.
v1.14.3 (12/09/2024)
Add a small note in the function documentation of ingest_google_sheet(), to remind users
to share their Google Sheet with the Data Routing service.
v1.14.2 (09/09/2024)
Fix bug in ingest_data and send_data function that caused a KeyError while trying to retrieve the Firehose API URL from the configuration file.
v1.14.1 (09/09/2024)
A small check was added to the ingest_google_sheet() facade, with the objective of preventing the
user from providing a sandbox catalog in a PRD environment, or vice versa.
v1.14.0 (05/09/2024)
Adds function ingest_google_sheet in DataPlatform facade, enabling the users to execute google sheets ingestion through Gaia Data Routing service.
This version also removes the enforcement of allowed domains to use the ingest_google_sheet function.
v1.13.1 (05/09/2024)
Fix typo in TakeSpark get_adf_widgets() function.
v1.13.0 (05/09/2024)
Adds a new way to retrieve configurations from the environment. Now, all the configurations in the config.BlipDataForgeConfigs class primarily come from environment variables but, when not set, the configs are retrieved from a configuration file.
Also, some classes were refactored to not use the BlipDataForgeConfigs object directly. Now the caller classes and methods pass the parameters explicitly.
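A generic sketch of the lookup order described above (environment variable first, configuration file as fallback); the key name and file path are illustrative, not BlipDataForgeConfigs internals:
```python
import json
import os

# Generic illustration of "environment variable first, config file as fallback".
def get_config(key: str, config_file: str = "config.json") -> str:
    value = os.environ.get(key)
    if value is not None:
        return value
    with open(config_file, encoding="utf-8") as fh:
        return json.load(fh)[key]

firehose_url = get_config("FIREHOSE_API_URL")  # hypothetical key name
```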
v1.12.6 (02/09/2024)
Bug fixes over the TakeSpark function get_adf_widgets(): this function was using table references from the legacy Databricks environment. Those table references do not exist in the new environment and needed to be changed. The bullet points below provide the mapping from the previous table references to the new ones:
- dslab.take_deals_bots_hs_info -> dageneral_trustedzone.blip.take_deals_bots_hs_info
- blipcs.application -> bliplayer.rawsup.portal_applications
- genericConsumerZone.bliprawsupdata_tenants -> bliplayer.rawsup.portal_tenants
There was a fourth legacy table reference used by get_adf_widgets(): growth.hubspot_bots_segment_info.
This table reference has no "compatible brother table" in the new environment, so we couldn't match it with a table reference in the new environment.
The growth.hubspot_bots_segment_info table was used when the widget use_client_bot was set to S. Now, the table dslab.take_deals_bots_hs_info is used instead in that case.
v1.12.5 (23/08/2024)
- Rollback changes made in the pyspark and delta-spark dependencies.
v1.12.4 (22/08/2024)
- Move delta-spark package to tests dependency.
v1.12.3 (22/08/2024)
- Move pyspark package to tests dependency.
v1.12.2 (21/08/2024)
- Fix the docstring of the get_latest_record_date function.
- Removes the fixed version 3.4.0 of the pyspark dependency to reduce cluster initialization time. This change reduced it by approximately 00:03:25, which represents a reduction of 45% in the cluster initialization time.
v1.12.1 (14/08/2024)
Increase the max file size transfer for the write_to_s3_bucket() function from 78GB to 100GB.
v1.12.0 (14/08/2024)
- Adds DataRoutingAPIClient class to support integration with Data Routing API.
- Adds DataRoutingAPIClient.ingest_google_sheet() function to ingest google sheets.
The google sheets ingestion is in the testing phase, so it has not yet been exposed on the DataPlatform facade and it works only in the clients domain.
v1.11.2 (12/08/2024)
Fix the sftp_put_file() function error triggered when the SSH private key comes from a secret stored in a Databricks secret scope.
v1.11.1 (01/08/2024)
Revert changes made in version v1.11.0 to the function save_tables_alt().
Now, the function has the same behaviour as in versions prior to v1.11.0.
v1.11.0 (30/07/2024)
Add overwrite schema mode to the function save_tables_alt() from TakeSpark,
so that the behaviour of save_tables_alt() in the new
environment matches its behaviour in the old environment.
v1.10.0 (18/07/2024)
Add the optional parameter private_key to the blipdataforge.takespark.send.sftp.sftp_put_file function. Now this function accepts a private key to establish a connection with the target SFTP server. Also adds a log call to this function to keep track of its usage.
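A hedged sketch of the new parameter; only private_key and the module path come from this entry, local_file_path is mentioned elsewhere in this changelog, and the remaining arguments are placeholders:
```python
from blipdataforge.takespark.send.sftp import sftp_put_file

# Sketch only: host/username argument names are assumptions; the private key is
# read from a Databricks secret scope purely for illustration (dbutils is
# available in Databricks notebooks).
private_key = dbutils.secrets.get(scope="sftp", key="ssh-private-key")

sftp_put_file(
    local_file_path="/Volumes/landingzone/exports/report.csv",  # placeholder path
    host="sftp.client.example.com",                             # assumed argument name
    username="ingest-user",                                     # assumed argument name
    private_key=private_key,                                    # new optional parameter
)
```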
v1.9.6 (17/07/2024)
Fix the blipdataforge.takespark.refining.eventtracks.eventtracks_session function that was breaking due to an uninitialized variable.
v1.9.5 (12/07/2024)
Add a small fix over the function dbfs_create_zip_folder() from TakeSpark. This function
was not properly compressing the provided data. This problem was fixed by using the
zipfile.ZIP_DEFLATED constant.
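For reference, the behaviour behind this fix is standard Python zipfile usage: the default ZIP_STORED only archives files, while ZIP_DEFLATED actually compresses them. A generic stdlib illustration with placeholder paths:
```python
import os
import zipfile

# Generic stdlib illustration: passing zipfile.ZIP_DEFLATED makes the archive
# compressed instead of merely stored.
def zip_folder(folder: str, zip_path: str) -> None:
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                zf.write(full_path, arcname=os.path.relpath(full_path, folder))

zip_folder("/tmp/report_files", "/tmp/report_files.zip")  # placeholder paths
```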
v1.9.4 (08/07/2024)
Add fix over the logger call that happens inside the version of check_delta_update that is inside TakeSpark (deprecated version).
v1.9.3 (08/07/2024)
Patch version. This version adds a small fix over logger calls made inside functions from TakeSpark:
- write_dataframe_as_csv()
- email_files_generator()
- send_mail()
- send_email_general()
v1.9.2 (28/06/2024)
Adds a logger call to the following function from TakeSpark:
- check_delta_update() (this is the version of the function that is embedded into TakeSpark, not the one in the DataPlatform class)
v1.9.1 (28/06/2024)
Adding logger calls inside 4 functions from TakeSpark, with the objective of increasing visibility and monitoring capabilities over these functions from TakeSpark.
- write_dataframe_as_csv()
- email_files_generator()
- send_email_general()
v1.9.0 (27/06/2024)
Add check_delta_update function to DataPlatform class to enable DSLAB migration.
v1.8.5 (25/06/2024)
Add a small fix over the format of the Soda Quality Checks logs sent to EventHub.
v1.8.4 (19/06/2024)
Introduced a new version of the documentation of the blipdataforge library.
v1.8.3 (03/06/2024)
Fixes:
- Fix over the write_to_s3_bucket() function. Now, this function receives as input the keys to the secrets in Azure Key Vault that contain the access key and access key ID that are necessary to connect with the AWS S3 bucket.
So, if you want to connect to a client's AWS S3 bucket, and you register the access key and access key ID
of this bucket inside the Azure Key Vault under the keys "example_key_access_key" and "example_key_access_key_id",
then you use the write_to_s3_bucket() function like this:
from blipdataforge.takespark.send.bucket_write import write_to_s3_bucket
write_to_s3_bucket(
    [file],
    aws_access_key_id="example_key_access_key_id",
    aws_secret_access_key="example_key_access_key",
    bucket_name="example_of_bucket_name",
    write_base_path="/incoming-files/"
)
The write_to_s3_bucket() function will do the job of collecting the actual values of
the access key and access key id from the Azure Key Vault for you, and use them to
connect with the AWS S3 Bucket.
New EventHubs were created to receive the logs created by BlipDataForge. In this version, we add new connection logic to connect the library to these new EventHubs.
v1.8.2 (03/06/2024)
New features:
- In this version, we added a small change to the Soda Quality Check functions. Now, they use the logger to send the Soda Quality Check logs to EventHub, incorporating them into the DataForge logs in general.
v1.8.1 (09/05/2024)
Fixes:
- Fix usage of the escape_char parameter when the dataframe is split into multiple files.
v1.8.0 (25/04/2024)
New features:
- Adds escape_char parameter on email_files_generator and write_dataframe_as_csv functions of Takespark to set the escape character that will be used to escape quotes (usage sketch below).
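A hedged sketch of the new parameter on write_dataframe_as_csv(); only escape_char (this release) and sep (default changed in v1.6.4, further down) come from the changelog, the other arguments are placeholders:
```python
# Sketch only: argument names other than escape_char and sep are assumptions
# about the Takespark signature.
write_dataframe_as_csv(
    df,                              # Spark DataFrame to export
    file_path="exports/report.csv",  # placeholder
    sep=";",                         # default separator since v1.6.4
    escape_char="\\",                # character used to escape quotes
)
```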
v1.7.42 (25/04/2024)
New features:
- The logs sent to EventHub now include two new items that record the Job and Run IDs in Databricks.
v1.7.3 (27/03/2024)
- Refactor takespark write_dataframe_as_csv function to not use pandas lib due to OOM errors and inconsistencies in the output file.
v1.7.2 (25/03/2024)
- Fix takespark function write_dataframe_as_parquet to use random uuid in temp directory name due to race condition errors.
v1.7.1 (21/03/2024)
- Refactor takespark write_dataframe_as_parquet function to not use pandas lib due to OOM errors.
v1.7.0 (20/03/2024)
- Adds new abstraction of Data Routing layer to enable ingesting and sending data through Data Routing.
- Adds function ingest_data on DataPlatform facade that enables ingesting data from various sources through the Data Routing.
- Refactor send_data function of DataPlatform facade to use the new abstraction of Data Routing.
v1.6.6 (05/03/2024)
- Fix send_mult_csv_mail function to convert files argument to a list if a string was passed.
v1.6.5 (26/01/2024)
- Refactor lib logger. Now any class or function can use the logger through dependency injection.
v1.6.4 (18/01/2024)
- Fix function email_files_generator to use the separator character passed in sep parameter.
- Changes default value of sep parameter to ;.
v1.6.3 (16/01/2024)
- Fix function tickets_bots_filter to add the catalog prefix clients_trustedzone if the two dots nomenclature is passed in source_table and/or destination_table.
v1.6.2 (10/01/2024)
- Adds authentication on DataContractRepository to communicate with Data Contract API.
- Refactors data quality automated tests to use authentication token through DATA_CONTRACT_API_TOKEN environment variable.
- Switch some print calls with logger calls.
v1.6.1 (04/01/2024)
- Fix broken send-SFTP code by checking if the upload folder exists before creating it.
- Remove the prefix folder from the storage account write functions.
v1.6.0 (03/01/2024)
- Adds integration with Data Routing to send data to SFTP.
- Adds integration with Data Routing to send data to Elasticsearch.
- Adds integration with Data Routing to send events to an Event Hub or Kafka topic.
- Adds send_data function on DataPlatform facade to work with Data Routing API.
- Adds documentation for data outputs.
v1.5.3 (02/01/2024)
- Add dbfs_create_zip_folder function that creates a zip folder from files already in the datalake.
- Fix all write_dataframe_as_ functions, removing a string refining that was altering the file path sent by the user.
v1.5.2 (28/12/2023)
- Fix rule to get data from clients_trustedzone
- Add smalltalks functions
- Add save_tables_alt and initial_info functions
- Add a list of all functions dataforge absorbed from takespark to the docs page.
v1.5.1 (18/12/2023)
- Fix googlesheets module by importing databricks.sdk
- Fix how the blob url (write functions) is built
v1.5.0 (15/12/2023)
Absorb the following functions from takespark:
- write to azure blob functions (write_dataframe_as_csv)
- send email, write to s3, gcp and sftp
- refining functions (messages_base, eventtracks_base etc)
- import data from google spreadsheet
v1.4.0 (12/12/2023)
- Fix version number to 1.4.0.
- Change automated quality checks scan name to include catalog, database and table names.
- Fix log message before quality checks execution.
v1.3.7 (30/11/2023)
- Adds automated quality checks before write operation based on data contract information.
- Adds quality checks results persistence on Data Lake.
- Adds support for sending quality checks results to Soda Cloud.
- Adds run_quality_checks function that allows users to run custom quality checks over spark dataframes and data lake tables.
- Adds get_data_contract function on DataPlatform facade to request data contracts from Data Contract API.
- Removes commented code.
v1.3.6 (21/11/2023)
- Removes enforcement to use sandbox catalog when running on development workspaces. (DSLAB)
- Exposes path parameter on DataWriter. (DSLAB)
v1.3.5 (20/11/2023)
- Adds get_latest_record_date function that allows getting the date of the latest record in a table. This function is an alternative to the old initial_info function of the TakeSpark library (see the sketch below).
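A hedged sketch of the incremental-load pattern this enables; the call style, table name and date column are illustrative assumptions:
```python
# Sketch only: the argument and the filtering column are assumptions;
# get_latest_record_date() is the function added in this release.
latest_date = platform.get_latest_record_date("catalog.db.my_table")  # assumed call style

# Keep only records newer than what already exists in the target table.
increment = source_df.filter(source_df.record_date > latest_date)     # illustrative column
```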
v1.3.4 (20/11/2023)
- Adds delete function that allows deleting records from domain catalog tables.
v1.3.3 (09/11/2023)
- Update integration tests to run on Databricks workspace dbw-dageneral-dev-brazilsouth.
- Adds release dates to changelog documentation.
- Fix regex of workpace_id property from DatabricksContext.
- Adds versioning for init script used to install latest development version on dedicated development cluster.
v1.3.2 (06/11/2023)
- Adds support for running quality checks with Soda library
v1.3.1 (27/10/2023)
- Adds communication interface with Data Contract API
- Adds support for handling Data Contracts
v1.3.0 (26/09/2023)
- Adds support for write mode upsert.
v1.2.0 (12/09/2023)
- Adds support for schema mode overwriteSchema on write in delta.
- Set parameter merge_schema of write operation as DEPRECATED.
v1.1.1 (05/09/2023)
- Adds error log in case of failure during write operation.
- Fix bug that breaks execution if a log message contains \n.
v1.1.0 (29/08/2023)
- Adds contexts.py module
- Removes constants.py module
- Refactors unit tests to use contexts instead of constants.
- Refactors data_governance module to use static methods on CatalogPolicyEnforcer class.
v1.0.0 (23/08/2023)
- Improves library usability by renaming all classes and methods that use Loader or load to Writer and write.
- Creates new folders for better organization of docs.
v0.4.0 (22/08/2023)
- Adds default business logic for creating data lake paths according to the new domain structure.
v0.3.1 (18/08/2023)
- Gets domain and environment information from cluster tags.
- Adds domain info to logs.
- Fix data_loader tests that send logs to eventhub.
v0.3.0 (16/08/2023)
- Adds a facade class DataPlatform that encapsulates most used ETL functions.
- Adds capability of log shipping to Azure EventHub.
- Adds black==23.7.0 package in test dependencies
v0.2.0 (28/07/2023)
- Adds delta write on external location
- Adds creation of external table
- Removes creation of managed tables
- Adds business logic to support delta writes on legacy lake structure
- Adds default behavior for delta write if a new catalog is specified
- Breaks pipeline execution if any unit test fails
v0.1.0 (25/07/2023)
- Adds delta write on data lake (still using managed tables)
- Creates database if not exists
- Creates table if not exists
- Uses scianalytics_sandbox catalog if job is running on development workspace.
- Adds support for all write modes accepted by spark dataframe writer.
- Adds support for schema evolution
- Adds support for table partitioning
- Adds unit tests in CI/CD pipeline
- Publish unittests results on CI/CD pipeline