We recently did a deep dive into how our dbt data models turn the data your Fivetran connectors load into analytics-ready tables in your warehouse. Now we're revealing how we actually create those data models.
And the secret? While the data models take a lot of manual work and collaboration with you all - our customers and peers - some of that work has been scaled using ChatGPT. That’s right, the nascent AI tool that’s revolutionizing business and dominating the national conversation also helps us model data.
[CTA_MODULE]
Here are a few ways we’ve been able to use ChatGPT to expedite our dbt data model creation:
Documenting tables and finding column descriptions
When developing new - and updating existing - data models, we spend a large amount of time documenting the tables and fields from the raw data layer. This is a cumbersome, research-intensive process.
However, with the advent of AI, we have been able to streamline this process by generating a significant amount of source data definitions (and even data integrity tests) with ChatGPT. We can feed data source schemas directly into ChatGPT and ask it to generate a dbt-flavored yml file that documents the defined tables and fields. ChatGPT is able to produce this yml by drawing on the data source's API documentation.
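For illustration, the output looks something like the sketch below (the source, table and column names here are hypothetical, not from an actual Fivetran schema):

```yaml
version: 2

sources:
  - name: app_platform  # hypothetical source name for illustration
    tables:
      - name: user
        description: One record per registered user of the application.
        columns:
          - name: id
            description: Unique identifier of the user.
            tests:
              - unique
              - not_null
          - name: created_at
            description: Timestamp (UTC) when the user account was created.
          - name: plan_type
            description: Subscription tier of the user, e.g. free, pro or enterprise.
```

Because ChatGPT returns the descriptions and basic data integrity tests in valid dbt yml, validating the result is mostly a matter of spot-checking it against the API docs.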
A process that used to take significantly longer is now automated down to a few minutes, and we reallocate time from research and design to validating results. This allows us to spend more time on the data transformations and on understanding the raw data, as opposed to meticulously combing through the API documentation.
Building Python scripts to help automate maintenance efforts
AI is very helpful when you have a basic knowledge of a topic but lack the expertise to complete specific tasks without a fair amount of research. AI can assist in that research and complete it in a fraction of the time.
For example, I wanted to use the GitHub API to duplicate a GitHub issue into multiple repositories that all suffered from the same bugs; duplicating the GitHub issues would ensure that the recurring problems were flagged wherever they were present. I generally knew how APIs functioned, but had no experience with GitHub’s. I also had limited experience with creating API scripts.
This presented me with a roadblock. To tackle this use case, I would need to do research and improve my base skills. I would need to read the API documentation, learn about any necessary packages, learn syntax, then troubleshoot and iterate on my code until I found success.
Instead, I turned to ChatGPT. With AI, I was able to ask simple questions and end up with code I could start testing within a few minutes. I could also ask follow-up questions if I didn't get the expected results, got an error or didn't understand an output. For example, when I received a "403" API error, ChatGPT was able to tell me what the error meant as well as suggest how to remedy it.
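For a sense of what that session produced, here is a minimal sketch of the script (the repository names and issue text are hypothetical, and the error handling is simplified; this is not our production script):

```python
import os

import requests

# Hypothetical list of repositories that share the same bug.
REPOS = ["my-org/repo-one", "my-org/repo-two", "my-org/repo-three"]

# A personal access token with permission to open issues in each repo.
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def duplicate_issue(title: str, body: str) -> None:
    """Open the same issue in every repository in REPOS."""
    for repo in REPOS:
        response = requests.post(
            f"https://api.github.com/repos/{repo}/issues",
            headers=HEADERS,
            json={"title": title, "body": body},
        )
        # A 403 here typically means the token lacks access to the repo;
        # that was the exact error ChatGPT helped me diagnose.
        response.raise_for_status()
        print(f"Created {response.json()['html_url']}")


duplicate_issue(
    "Recurring bug: duplicate records in incremental runs",
    "Tracking issue duplicated across all affected repositories.",
)
```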
Aid when building complex SQL transformations
Many folks may be familiar with the rubber duck debugging philosophy: you keep a rubber duck on your desk and explain your code to it. While explaining the code, you hit points of realization where you need to adjust or update your code to make it more accurate and performant.
I have always been a firm believer in this method, but lately I have adopted a new ChatGPT equivalent which I call “robot duck debugging.” With this, I treat my sessions with ChatGPT as I would my very real rubber duck on my desk. I go through my code and explain it in detail and also highlight what I am trying to achieve with the result.
The robot duck then provides insights and oftentimes highlights problematic areas of my code. For example, it once called out that one of my joins would likely result in a cartesian join (which it did) and suggested I consider a different approach to ensure data validity in my transformation.
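To make that concrete, here is a simplified, hypothetical version of the situation (the table names are illustrative): joining directly to a table with many rows per key fans out the result, so collapsing it to one row per key first preserves the grain.

```sql
-- Problem: payments can hold several rows per order_id, so joining it
-- directly to orders duplicates order rows (a fan-out / cartesian effect).

-- Fix: aggregate payments to one row per order before joining.
with payment_totals as (

    select
        order_id,
        sum(amount) as total_paid,
        max(paid_at) as last_paid_at
    from payments
    group by order_id

)

select
    orders.order_id,
    orders.order_total,
    payment_totals.total_paid,
    payment_totals.last_paid_at
from orders
left join payment_totals
    on orders.order_id = payment_totals.order_id
```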
The robot duck even once told me that my approach was unsound and that I should consider new approaches to achieve my desired result. I asked what kinds of approaches I should take, and after some back and forth I arrived at a desirable conclusion - one I had not considered before robot duck debugging.
In examples like the two above, ChatGPT is extremely valuable for working through complex code and solutions. It is important to note that ChatGPT can sometimes jump the gun and suggest solutions that are not practical. I have found that correcting the robot duck and coaching it to stay on track is most effective. I have been able to unlock new approaches simply because I have had the robot duck to accompany me.
But, I still talk to my very real rubber duck from time to time.
Understanding the business outcomes of specific data models
When building out new data models, our analytics team conducts interviews with our customers to understand the business analytics use case for their source data. Because of everyone’s busy schedules, we are only able to interview a few customers.
Fortunately, ChatGPT is able to help us scale that business knowledge. With a simple query, we can draw on the breadth of ChatGPT's training data to understand common pain points and questions others have experienced when working with a certain data source.
For example, when building our latest Twilio dbt data model, we were struggling to understand exactly what types of questions customers wanted to answer with the data. ChatGPT revealed that customers frequently want to analyze the breakdown of all messages ever sent to uncover trends and costs over time. Similarly, aggregating data by phone number and account is key for any Twilio analysis.
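As a rough sketch of where that guidance leads (the table and column names here are illustrative, not those of the actual model), the core aggregation is straightforward:

```sql
-- Message volume and spend per account, phone number and month.
select
    account_id,
    from_phone_number,
    date_trunc('month', sent_at) as sent_month,
    count(*) as total_messages,
    sum(price) as total_spend
from messages
group by 1, 2, 3
order by sent_month, total_spend desc
```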
Help with cross database compatibility
We want to ensure that the data models we make are helpful for all organizations - regardless of the data warehouse you deploy. When building models, there are nuanced differences between warehouses that dbt's built-in cross-database macros (like the dbt.type_* helpers) don't automatically handle. For instance, most warehouses can process JSON data in similar ways, but the syntax varies between them.
So, when building our models, we can ask ChatGPT, "How can I make the below code compatible with Databricks?" While it might not give the exact answer, it points us in the right direction.
To operationalize this, we fed a Fivetran Utils macro into ChatGPT and asked it to create a dispatch compatible with a given warehouse. You can take any Fivetran Utils macro, copy its contents into your ChatGPT session and ask it to provide a new dispatch for a new adapter.
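As a rough sketch of the pattern (the macro, package and function choices below are illustrative; the real Fivetran Utils macros are more involved), a dispatched macro for extracting a value from a JSON column might look like this:

```sql
{# Dispatch picks the implementation matching the adapter in use. #}
{% macro json_value(field, path) %}
    {{ return(adapter.dispatch('json_value', 'my_package')(field, path)) }}
{% endmacro %}

{% macro default__json_value(field, path) %}
    json_extract_path_text({{ field }}, '{{ path }}')
{% endmacro %}

{% macro databricks__json_value(field, path) %}
    get_json_object({{ field }}, '$.{{ path }}')
{% endmacro %}

{% macro bigquery__json_value(field, path) %}
    json_extract_scalar({{ field }}, '$.{{ path }}')
{% endmacro %}
```

Pasting an existing macro into ChatGPT and asking for, say, a databricks__ implementation gets you most of the way there; we then verify the generated syntax against the warehouse's documentation.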
Using AI to help you get the best analytics-ready tables
Generative AI like ChatGPT is set to become a fantastic productivity aid for anyone who creates content for a living – code, text, images and more. While ChatGPT cannot yet make data models for us from scratch, it certainly helps us more efficiently and effectively release and update models for your most needed use cases. These models are free and available to everyone. You can download and deploy them from our library here.
And while AI is helpful, your feedback is invaluable. We would love to hear from you! What data models do you want to see us create? What PRs would improve our existing models? You can submit these ideas in GitHub or start a conversation with us in Fivetran’s Community.
[CTA_MODULE]