Every data team is chasing speed and quicker time to value for their data programs and products. In this post, I’ll explain how Fivetran accelerates building ML applications in the Databricks Lakehouse with AutoML. Fivetran, Databricks and AutoML streamline the process, from automated data movement to efficient model creation.
The Databricks Lakehouse supports multiple data workloads, including BI, data warehousing, AI and data science, while AutoML automates the most challenging aspects of machine learning: creating models, testing multiple candidates and recommending the best one. On the front end, Fivetran is an automated data movement platform delivering any data source faithfully to Databricks.
I’ll show you how to set up a relational database connector to the Databricks Lakehouse and then move a wine quality dataset over to the Lakehouse. This way I can use Databricks and AutoML to run classification experiments and predict wine quality based on various parameters. In order to run those experiments, I need high-quality, usable data.
That's where Fivetran's automated data platform comes in. Fivetran allows you to centralize data while modernizing your data infrastructure, achieving greater data self-service and building differentiating data solutions in ML applications. Fivetran is fully automated and fully managed, and everything is out of the box and ready to go with zero code and zero maintenance.
Additionally, Fivetran is exceptionally fast to stand up, taking just a few minutes to set up and configure.
I want to give a special shout-out to the University of California, Irvine for making the wine quality dataset available. You can grab the dataset I’m using from UCI here.
Also, another big thank you to Prasad Kona with Databricks, who was generous enough to point me to the data set and recently walked me through AutoML in Databricks in preparation for a hands-on lab that we delivered together. Thank you so much, Prasad!
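If you want a quick local look at that dataset before it ever lands in a source database, here's a minimal pandas sketch. The UCI URL below is an assumption on my part (it's the commonly used path for the red wine file), and note that the file is semicolon-delimited.

```python
import pandas as pd

# Commonly used UCI path for the red wine file (an assumption); the file is semicolon-delimited.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url, sep=";")

print(wine.shape)             # roughly 1,600 rows and 12 columns
print(wine.columns.tolist())  # 11 physicochemical features plus the "quality" target
```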
Foundational to any ML project are the following:
- A concrete business problem. For example, which fruit characteristics are going to yield the highest quality wine?
- Ensuring you have a solid data infrastructure foundation, which is what Fivetran and Databricks provide together.
You can check out this wine quality solution approach with Fivetran and Databricks on YouTube linked below.
So let's get our wine quality data set flowing into Databricks with Fivetran and start experimenting with AutoML.
Adding a new source connector with Fivetran
I have three Databricks destinations on this Fivetran account, so I'll pick the one I want to use. Fivetran supports a classic Databricks SQL warehouse as well as Databricks serverless. I also have my choice of catalogs. I can take advantage of the functionality in Databricks Unity Catalog, but if I’m not ready for UC, then the Hive Metastore catalog works great as well.
I already have multiple data sources flowing into the Databricks Lakehouse, including SAP ERP, SQL Server and Oracle operational databases, Salesforce, Google Analytics 4, Kafka, S3 and Workday HCM.
Fivetran has over 500 source connectors out of the box, including databases, applications, file systems and event systems. That number is growing by the day.
My UCI wine quality dataset is sitting in Google Cloud PostgreSQL. That's what I’ll use today as my Fivetran source.
The Fivetran engineering, product and technical documentation teams do a fantastic job of laying out the steps to get data flowing quickly into the Databricks Lakehouse. If you’d like to take a look outside of the Fivetran UI, read this: Google Cloud PostgreSQL Setup Guide.
Additionally, for each source and destination, documentation includes details on version support, configuration support, feature support, any limitations, a sync overview, schema information, type transformations and much more. Here is the PostgreSQL source detail page.
Importantly, in the Fivetran UI, every source setup page is framed by the setup guide in the gray navigation on the right. It’s the fast path to ensuring you understand the options to connect quickly to any source with Fivetran.
I can name the destination schema anything that I choose. Also, schemas and tables do not have to be created ahead of time in Databricks. I’m going to use the following:
automate_building_mlapps_databricks_fivetran
From there, Fivetran needs to know how to authenticate to the PostgreSQL database, so that's what I'll provide next. The setup form includes the hostname, the user and password info, and the database I want to access in that PostgreSQL instance. I want to build my dataset from the industry database.
Fivetran handles both the initial sync and incremental change data capture (CDC) automatically. Once the initial historical sync is complete, Fivetran performs incremental updates of any new or modified data from the PostgreSQL source database. I can choose WAL, XMIN or Fivetran Teleport (log-free change detection) to perform incremental updates.
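If you want to sanity-check the source yourself before saving the connector, here's an optional sketch using psycopg2. The host, user and password are placeholders, and the WAL method generally expects logical decoding (wal_level = logical) to be enabled on the Postgres instance.

```python
import psycopg2  # pip install psycopg2-binary

# Placeholder connection details: use the same values you'll give Fivetran.
conn = psycopg2.connect(
    host="your-cloud-sql-host",
    dbname="industry",          # the source database used in this walkthrough
    user="fivetran_user",
    password="********",
    port=5432,
)
with conn.cursor() as cur:
    cur.execute("SHOW wal_level;")  # 'logical' is typically required for the WAL method
    print(cur.fetchone())
    cur.execute("SELECT count(*) FROM agriculture.wine_quality_red;")
    print(cur.fetchone())
conn.close()
```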
It’s reassuring that Fivetran encrypts all data in motion and uses Transport Layer Security (TLS) for a direct connection to PostgreSQL. TLS allows for both encryption and authentication of the connections that Fivetran makes to the PostgreSQL instance. Additionally, any data that sits ephemerally in the Fivetran service is encrypted with AES256.
Selecting the wine quality dataset to use with Databricks and AutoML
Once Fivetran runs the connection tests to PostgreSQL, it’s time to fetch the source schemas, tables and columns from the PostgreSQL database.
I connected to the industry database, which has 12 schemas available to me. I could sync all schemas, tables and columns for this source, or I can selectively determine my dataset based on my use case, which is what I’m going to do. I just want the following dataset:
Schema: agriculture
Table: wine_quality_red
I want to block the other schemas and tables from moving into Databricks.
If you hover over any columns in the tables, you’ll see an option for hashing at the column level, allowing for additional data privacy and anonymization on any PII data that you don’t want to move from your database source to Databricks. Importantly, Fivetran’s hashing still enables that column to be used in the downstream wine-quality AutoML application workflow.
Managing source changes and schema drift
Fivetran then needs to know how I want to handle incremental changes since the schema may change in the future. I’m going to “Allow all”, but I have many options here. Any and all DML and DDL changes are automatically captured by Fivetran and delivered to Databricks. No coding is required, and I can determine the polling frequency to the PostgreSQL source for change detection and subsequent movement to Databricks.
Starting the initial sync from PostgreSQL to the Databricks Lakehouse
Once I have those selections saved and tested, I'm ready to start syncing my wine quality dataset from PostgreSQL to Databricks. Remember that not only will Fivetran move the wine quality dataset I just selected during the initial historical sync, but if there are any changes to that table, CDC for those columns is automatically set up for me. It’s 100% no code, and any incremental changes are captured at whatever schedule I specify.
Once Fivetran has moved the data and completed error checking, no data is retained in the Fivetran service. Only a cursor is maintained at the sync point so the next incremental sync can capture changes.
This was a small dataset, just a single table, and you can see that the initial sync completed very quickly, even though this connector was set up and synced on a United Airlines flight with exceptionally low bandwidth (a 4 Mbps download speed) and my Databricks warehouse was in suspend mode until I synced the connector.
For incremental syncs, Fivetran defaults to every six hours, but I can change that to an interval of anything from one minute to 24 hours depending on my use case and the data freshness requirement for the downstream data product, which in this case, will be the Databricks AutoML wine quality application.
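If you'd rather manage that schedule programmatically, here's a hedged sketch against what I understand to be the Fivetran REST API's connector modify endpoint and its sync_frequency field (in minutes); the connector ID and API credentials are placeholders.

```python
import requests

# Placeholders: your connector ID plus a Fivetran API key and secret (basic auth).
resp = requests.patch(
    "https://api.fivetran.com/v1/connectors/your_connector_id",
    auth=("FIVETRAN_API_KEY", "FIVETRAN_API_SECRET"),
    json={"sync_frequency": 60},  # minutes; e.g. hourly instead of the six-hour default
    timeout=30,
)
print(resp.status_code, resp.json())
```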
My PostgreSQL connector (automate_building_mlapps_databricks_fivetran) is now in the list with all my other connectors, and all of those data sources are available in Databricks.
Here’s the new wine quality dataset in Databricks
I jumped out to my Databricks workspace and the Catalog Explorer, and you can see the Salesforce, SAP and other existing datasets that are already here in Databricks Serverless, along with my new PostgreSQL dataset (highlighted in yellow).
Fivetran delivers a faithful 1:1 representation of the source wine quality dataset to Databricks, and the data arrives Databricks AutoML-ready. It is organized, understandable and ready to be used in a Databricks AutoML experiment.
I selected data preview on the new dataset to give you an idea of what Databricks AutoML will have to work with for the wine quality use case.
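If you prefer a notebook over the UI for that preview, here's a quick sketch. The three-level table name is an assumption on my part, so substitute the catalog and schema your Fivetran destination actually writes to.

```python
# The three-level catalog.schema.table name is illustrative; use the catalog and schema
# your Fivetran destination writes to.
df = spark.table("main.automate_building_mlapps_databricks_fivetran_agriculture.wine_quality_red")

df.printSchema()
display(df.limit(10))                 # the same preview you get in Catalog Explorer
print(f"Rows synced: {df.count()}")
```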
I also want to call out the AI-generated table descriptions that Databricks provides out of the box with Unity Catalog, which are handy for identifying and classifying each table. I can edit the description if needed and add tags as well.
Building a wine quality application with Databricks AutoML
If you haven't used Databricks AutoML before, it automatically applies machine learning to a dataset: you provide the data, identify the prediction target, and AutoML prepares the dataset for model training.
First, I’ll create a personal compute cluster within Databricks, where I'll run my AutoML wine quality experiments. I want to be sure to choose a runtime version suited for ML. I'll keep the node type the same, and that's it.
Once my compute cluster is ready to go, I’ll jump out to the Databricks Machine Learning Experiments setup page. I will stick with a classification ML problem type, and then I want to use the wine quality dataset that Fivetran moved into Databricks a short while ago. I'll let AutoML know what I want to use as the prediction target; I’m looking to assess wine quality.
There are a number of advanced configuration options, including evaluation metrics and a range of training frameworks. The only option that I want to change is the experiment timeout. I will go with 10 minutes for this example, but for best results, you’d want to run the experiment for a longer duration.
Lastly, I don't want to include several columns in that wine quality table as part of the AutoML experiment as they are irrelevant to the model. From there, I'll kick it off and sit back and relax.
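For anyone who prefers notebooks to the Experiments UI, here's a minimal sketch of the same setup using the Databricks AutoML Python API. The table path and excluded column name are assumptions on my part, and quality is the prediction target.

```python
from databricks import automl  # available on Databricks ML runtimes

# Illustrative table path; "quality" is the UCI prediction target.
df = spark.table("main.automate_building_mlapps_databricks_fivetran_agriculture.wine_quality_red")

summary = automl.classify(
    dataset=df,
    target_col="quality",                  # what I'm trying to predict
    exclude_cols=["_fivetran_synced"],     # Fivetran metadata column, irrelevant to the model
    timeout_minutes=10,                    # the 10-minute budget used in this example
)
```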
AutoML will automatically create, tune and test multiple model variations to find the best model for our wine quality experiment. Behind the scenes, it's handling data prep, feature engineering and hyperparameter tuning. Those tasks are usually really time-consuming for data scientists.
AutoML also continuously tries different algorithms and fine-tunes settings to get the best results. Remember, I set the timeout at 10 minutes.
Once the AutoML training is done, it's time to evaluate. AutoML presents a leaderboard of the top-performing models. Those are ranked by the metrics that I chose in the configuration step.
I can dive into detailed model information, training logs and even the code that AutoML generated to understand how those models were built and how they performed, and then ultimately choose the model that best meets my needs based on accuracy and various deployment considerations. The models are stack-ranked, and from there, I can either deploy the selected model for real-world predictions on wine quality and the fruit that is being produced, or I can start a new AutoML experiment with different settings to achieve even better results.
It's really up to me. I'll take a peek at the AutoML-generated notebook for the best model. This will highlight the algorithm used by the best model, its hyperparameter settings and performance metrics.
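As a rough sketch of what that looks like programmatically (assuming the summary object from the classify() call above), you can pull the best trial's metrics, open its generated notebook and load the model through MLflow:

```python
import mlflow

best = summary.best_trial        # summary comes from the automl.classify() sketch above
print(best.metrics)              # evaluation metrics for the top-ranked model
print(best.notebook_path)        # the AutoML-generated notebook for that model

# Load the best model and score a few rows (dropping the label column first).
model = mlflow.pyfunc.load_model(best.model_path)
sample = df.drop("quality").limit(10).toPandas()
print(model.predict(sample))
```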
I also like how AutoML within Databricks provides some nice visual charts and explanations so that I can understand feature importance. I'm also going to get details about the best hyperparameter combinations. There are code snippets in there that are reproducible and deployment-ready. I also have full access to the experiment's run history.
Lastly, the generated code is a starting point for further model customization or further experiments that I might want to run separately.
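Because every AutoML trial is logged to MLflow, that run history is also queryable programmatically. Here's a small sketch, again assuming the summary object from the earlier classify() call:

```python
import mlflow

# summary comes from the automl.classify() sketch above.
runs = mlflow.search_runs(experiment_ids=[summary.experiment.experiment_id])
print(len(runs))                                              # one row per AutoML trial
print([c for c in runs.columns if c.startswith("metrics.")])  # the metrics AutoML logged
```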
There are multiple data transformation options with Fivetran
If you want to combine datasets, do some pre-processing or do some data enrichment prior to using AutoML, Fivetran provides seamless integration with dbt. Options include Quickstart data models, dbt Core integration with your dbt project, a range of out-of-the-box dbt packages (in addition to the Quickstarts) and also dbt Cloud.
Fivetran Quickstarts allow you to automatically produce analytics-ready tables using pre-built data models that transform your data with no code and no additional dbt project or third-party tools required. You can also connect to your own dbt Core project.
Transformations also include integrated scheduling, and they automatically trigger model runs following the completion of Fivetran connector syncs in the Databricks Lakehouse. You can check out the wide range of connectors and packages that support Quickstart data models or transformations for dbt Core here.
Don’t forget Databricks Partner Connect with Fivetran
If you don't have a Fivetran account right now, you can set one up easily directly from Databricks Partner Connect. Databricks creates a pre-configured SQL endpoint, a service user and access credentials, and then Fivetran creates a pre-configured Databricks destination ready to go.
Get started now
It’s that easy to run a quick ML wine quality experiment with Databricks, AutoML and Fivetran. Fivetran ensures that any and all data movement to the Databricks Lakehouse is standardized and predictable across any data source, with each fully automated and fully managed pipeline providing reliability, scalability, predictability, and security.
If you want to give Fivetran a spin for a Databricks AutoML use case or any other Databricks Lakehouse workload, Fivetran makes it easy with a 14-day free trial.
I would love to hear from you about any connectors, data workloads, or use cases you’d like to see profiled next. Take care!
About the author
Kelly Kohlleffel leads the Fivetran Global Partner Sales Engineering organization, working with a broad ecosystem of technology partners and consulting services partners on modern data product and solution approaches. He also hosts the Fivetran Data Drip podcast where some of the brightest minds across the data community talk about their data journey, vision, and challenges. Before Fivetran, he spent time at Hashmap and NTT DATA (data solution and service consulting), Hortonworks (in Hadoop-land), and Oracle. You can connect with Kelly on LinkedIn or follow him on Twitter.