Apr 27, 2023

Creating data pipeline with Dataform in BigQuery (Part 1)

What is Dataform and how does it work? By Fassil S. Yehuala

Dataform is an IDE (Integrated Development Environment) for building modern data pipelines in cloud data warehouses such as Google BigQuery, Amazon Redshift, and Snowflake. Managing data transformation and integration in contemporary data warehouses has never been more intuitive, as Dataform “helps data teams build data pipelines that turn raw data into new tables and views that can be used for analytics”.

Dataform was founded by two ex-Googlers, Lewis Hemens and Guillaume-Henri Huon, to modernize data warehouse management by positioning Dataform as an operating system for the data warehouse. In December 2020, Dataform became part of Google Cloud, and as part of the Google Cloud Platform it is now available for free. However, since Dataform on GCP is still in the preview stage, some features may not yet be available.

Dataform is dedicated to the transformation step of ELT (Extract, Load, Transform) processes. Its main purpose is to “transform data already loaded in your warehouse”. It allows data teams (analysts, engineers, etc.) to collaborate more effectively by using Git to version-control their scripts.

Source: https://dataform.co

The diagram above illustrates where Dataform sits in a data transformation chain. Dataform applies best practices from the world of software development.

What can Dataform do?

Dataform stores all your data definitions in a single directory and synchronizes that directory with Git, making it accessible to the entire team. In this way, Dataform brings the following software engineering best practices to the world of data engineering.

  • Version control: the Dataform IDE is integrated with Git, so users can work in their own branches and either publish changes directly or submit them for code review.
  • Automated data quality tests: Dataform lets you define and apply quality tests to all your data, and automatic alerts help detect potential errors as early as possible.
  • Secure deployments: Dataform helps you apply sandboxing, development environments, and CI/CD to integrate new changes safely.
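As an illustration of such quality tests, a Dataform table definition can declare assertions directly in its SQLX config block; failing assertions stop the run and raise an alert. The table and column names below are hypothetical:

```sql
config {
  type: "table",
  description: "Orders cleaned from the raw landing table",
  assertions: {
    uniqueKey: ["order_id"],   // fail if order_id is ever duplicated
    nonNull: ["customer_id"]   // fail if customer_id is ever NULL
  }
}

SELECT order_id, customer_id, amount
FROM ${ref("raw_orders")}
WHERE amount IS NOT NULL
```

Dataform compiles these assertions into SQL queries that run after the table is created, so the quality checks live next to the transformation they protect.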

Where does Dataform lie in the data pipeline?

Source: https://dataform.co

With Dataform, data engineers and analysts can manage all data transformation processes inside the warehouse, making the path from raw data to datasets ready for analysis in BI tools far more transparent.
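To sketch how this works in practice: each transformation lives in its own SQLX file, and Dataform derives the execution order from ref() calls rather than from a hand-written schedule. The file and field names here are hypothetical:

```sql
-- definitions/staging_orders.sqlx
config { type: "view" }
SELECT order_id, customer_id, amount
FROM ${ref("raw_orders")}

-- definitions/customer_totals.sqlx (Dataform runs this after staging_orders,
-- because the ref() below creates the dependency)
config { type: "table" }
SELECT customer_id, SUM(amount) AS total_spent
FROM ${ref("staging_orders")}
GROUP BY customer_id
```

Because the dependency graph is inferred from the code itself, adding a new step is just adding a new file that ref()s its inputs.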

How does Dataform work?

Development in Dataform is done in SQLX, an extension of SQL. Dataform claims that “SQLX brings additional features to SQL to make development faster, more reliable, and more scalable.” The diagram below shows the relationship between standard SQL and SQLX.

Source: https://dataform.co
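A minimal SQLX file, sketched here with hypothetical names, combines a small configuration block with an otherwise standard SQL SELECT statement:

```sql
config {
  type: "view",          // materialize as a view; "table" or "incremental" also work
  schema: "analytics",   // target dataset in BigQuery
  description: "Daily order counts"
}

SELECT
  DATE(created_at) AS order_date,
  COUNT(*) AS orders
FROM ${ref("raw_orders")}  -- ref() resolves the source table and records the dependency
GROUP BY order_date
```

Everything below the config block is the SQL your warehouse already understands, which is what keeps SQLX accessible to analysts.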

How Does Dataform Compare with Other Similar Tools?

Dataform vs dbt (Data Build Tool)

Like Dataform, dbt is an open-source data transformation tool for data warehouses (Snowflake, BigQuery, Redshift, Postgres). It is based on models written in SQL that define the transformations, which makes it easily accessible to data analysts. At run time, the models are materialized in the data warehouse as tables or views. Again, like Dataform, dbt lets you attach metadata to models, for example to generate documentation and define data validation tests.

Because Dataform is now part of GCP and directly integrated with BigQuery, it is a compelling choice for startups in the GCP ecosystem. dbt, on the other hand, has a huge ecosystem built around it, which gives it an advantage in compatibility with analysis tools. Both Dataform and dbt offer a SaaS product, effectively a cloud IDE plus CI/CD, which is ideal for development.

Dataform vs Apache Airflow

Apache Airflow is an open-source workflow scheduling platform widely used in data engineering for architecting and orchestrating complex data pipelines. One of Airflow's advantages is flexibility, but that flexibility demands extensive coding skills to get a data pipeline running; a startup may prefer a more straightforward option. With Airflow, data management, dependency handling, and maintenance can be challenging. Dataform takes care of all of these issues.

Conclusion

Many companies, from high-tech startups to major retailers, use Dataform daily to manage their data effectively. Dataform is a promising product, positioning itself as a tool of the future to accelerate the transformation of data pipelines on GCP.

For more information on how to get started with Dataform in BigQuery, please refer to the following documentation.

Further Links

This article is the first part of the series Creating data pipeline with Dataform in BigQuery from datadice. It introduced Dataform, an integrated development environment for your data team.

Check out our LinkedIn account to get insights into our daily working life and important updates about BigQuery, Data Studio, and marketing analytics.

We have also started our own YouTube channel, where we talk about data warehousing, BigQuery, Data Studio, and many more topics. Check out the channel here.

If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check out our Udemy course here.

If you are looking for help to set up a modern and cost-efficient data warehouse or analytical dashboards, send us an email to hello@datadice.io and we will schedule a call.