Understanding Data Pipeline Orchestration Using Dagster

What is Dagster?

First of all, it’s very important to understand the concept of data pipelines: raw data is ingested from different source systems, processed, and delivered to a target system. This can be done for many reasons, from data analysis to application integrations. One common type of data pipeline is the ETL pipeline, the process of extracting, transforming, and loading data into a database such as a data warehouse.
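To make the ETL idea concrete, here is a minimal sketch in plain Python; the function names, the sample rows, and the in-memory “warehouse” are only illustrative, not part of any specific tool.

```python
def extract() -> list[dict]:
    # In a real pipeline this would read from a source system (API, file, database).
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]


def transform(rows: list[dict]) -> list[dict]:
    # Normalize types so the target system receives clean data.
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]


def load(rows: list[dict], target: list[dict]) -> None:
    # Here the "warehouse" is just an in-memory list standing in for a real table.
    target.extend(rows)


warehouse: list[dict] = []
load(transform(extract()), warehouse)
```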

When architecting a data pipeline there are two levels to take into consideration: logical and platform. The logical level describes how the data is going to be processed and transformed from the source to the target. The platform level is the design of the tooling and implementation that each environment needs. That way, the same logical design can work with different platform designs.

One important concept in data pipelines is data orchestration, which defines how the data flow is managed and scheduled based on specific needs. This is where Dagster comes into play.

Dagster is an open-source data orchestration tool created to fill the needs identified by data engineers and software engineers when orchestrating data.

A quote from Brian Kernighan, the creator of the first “hello, world” program, captures nicely why Dagster was created: “Controlling complexity is the essence of computer programming.”

Why Dagster?

It was the first orchestrator that can be used across all stages of the data development life cycle:

  • local development
  • unit tests
  • integration tests
  • staging environments
  • all the way up to production

It solves the challenge of having to maintain different infrastructures for the local environment and the production environment.

All orchestrators have the notion of tasks; Dagster created the concept of software-defined assets. At one level, Dagster orchestrates computations: defining the operations we want to execute, specifying the dependencies between them, and scheduling them.

An asset can be a database table, a machine learning model, a report: any persistent object that captures some understanding of the real world.
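As a rough sketch of what that looks like, the snippet below defines two dependent software-defined assets and a daily schedule; the asset names, the sample data, and the cron expression are all illustrative.

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


@asset
def raw_customers():
    # Stand-in extraction step; a real asset would read from a source system.
    return [{"id": 1, "name": "Ada"}, {"id": 2, "name": ""}]


@asset
def cleaned_customers(raw_customers):
    # Dagster infers the dependency on raw_customers from the parameter name.
    return [c for c in raw_customers if c["name"]]


# A job targeting every asset above, refreshed by a daily cron schedule.
refresh_customers = define_asset_job("refresh_customers", selection="*")
daily_refresh = ScheduleDefinition(job=refresh_customers, cron_schedule="0 6 * * *")

defs = Definitions(
    assets=[raw_customers, cleaned_customers],
    schedules=[daily_refresh],
)
```

Pointing the dagster dev command at a file like this makes both assets and the schedule show up in the web UI.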

With the concept of asset materialization, it helps to selectively create or update key data. If some step of the pipeline fails, there is no need to recompute everything in order to re-run that particular step, as long as the other ones have already been materialized.
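Below is a minimal sketch of that behaviour using the materialize helper and the two assets from the previous snippet; it assumes the default filesystem IO manager, which stores each materialization so that a later run can load the upstream value instead of recomputing it.

```python
from dagster import DagsterInstance, materialize

# Assumes raw_customers and cleaned_customers from the earlier sketch.
instance = DagsterInstance.ephemeral()

# First run: materialize both assets once.
materialize([raw_customers, cleaned_customers], instance=instance)

# Later run: only cleaned_customers is recomputed; raw_customers is loaded
# from its previous materialization instead of being executed again.
result = materialize(
    [raw_customers, cleaned_customers],
    selection=[cleaned_customers],
    instance=instance,
)
assert result.success
```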

It brings order to the code base, making it more maintainable. It uses asset observations with metadata, making the data observable, so it is easier to debug the flow of data going through the pipeline and to identify where the error is happening when there is corrupted or outdated data at the end of the process.
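One way to get that observability, sketched below, is to attach metadata to each materialization; this is a variant of the earlier cleaned_customers asset, and the metadata keys are illustrative. The recorded values show up in the Dagster UI next to every run of the asset.

```python
from dagster import MetadataValue, Output, asset


@asset
def cleaned_customers(raw_customers):
    cleaned = [c for c in raw_customers if c["name"]]
    # Metadata recorded with the materialization: a row count and a small
    # preview make it easier to spot corrupted or outdated data in the UI.
    return Output(
        cleaned,
        metadata={
            "row_count": len(cleaned),
            "preview": MetadataValue.md(str(cleaned[:5])),
        },
    )
```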

It also separates the concepts of I/O and compute, so you can build locally and ship confidently: wherever the inputs come from and wherever the outputs go, the computation itself stays the same and can be trusted.
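The sketch below shows that separation with a custom IO manager; InMemoryDictIOManager is an illustrative name, not a built-in Dagster class, and a production deployment could swap it for an IO manager that writes to S3 or a warehouse without changing any asset code.

```python
from dagster import ConfigurableIOManager, Definitions, InputContext, OutputContext

# Process-local storage standing in for a real warehouse or object store.
_STORAGE: dict = {}


class InMemoryDictIOManager(ConfigurableIOManager):
    def handle_output(self, context: OutputContext, obj) -> None:
        # Called after an asset's compute function returns its value.
        _STORAGE[context.asset_key.to_user_string()] = obj

    def load_input(self, context: InputContext):
        # Called to feed a stored upstream value into a downstream asset.
        return _STORAGE[context.asset_key.to_user_string()]


# Reusing the assets from the earlier sketch; only the io_manager resource changes.
defs = Definitions(
    assets=[raw_customers, cleaned_customers],
    resources={"io_manager": InMemoryDictIOManager()},
)
```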

Personal thoughts

I’m a software engineer exploring the world of data engineering because of a real business need. That being said, as a software engineer without much experience with other pipeline orchestration tools, I’m pretty happy with how Dagster works, with the software abstractions and the developer experience it provides.

We have been able to separate the business logic from the pipeline logic quite solidly, applying good design patterns and object-oriented concepts, guaranteeing quality with unit, integration, and e2e tests, simulating the production inputs and outputs in the local environment, and observing the data across the full pipeline process while debugging.
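As a small example of what that testing can look like, software-defined assets can be invoked directly as plain Python functions, so the business logic can be unit-tested without running the orchestrator; the test below reuses the cleaned_customers asset from the first sketch (the plain variant without metadata).

```python
def test_cleaned_customers_drops_rows_without_a_name():
    # Direct invocation: pass an in-memory value for the upstream asset.
    rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": ""}]
    assert cleaned_customers(rows) == [{"id": 1, "name": "Ada"}]
```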

My experience so far has been with a customer-based ETL pipeline, but I’m pretty sure we took a great step by choosing Dagster as the starting point for a group of software engineers who will help a data engineering culture grow inside the company.

As I understand it, the data engineering world is evolving much like the frontend world did ten years ago, so it’s very important to have tools like Dagster that help a lot with the software development process. Unlike the frontend, where only developers take care of the platform and other stakeholders only care about inputs and outputs, in the data field the data platform needs to be handled by multiple teams and multiple stakeholders with different skill sets across the whole data pipeline process.

I know that to manage data we might not even need an orchestration tool if the context doesn’t require it, and maybe that will turn out to be our case. But choosing an opinionated framework for data orchestration, with such a good developer experience and solid software engineering concepts, certainly made our team much more motivated and gave us a much clearer picture of how to spread the data culture inside the company. I’m grateful to Elementl for creating this tool, and I’m looking forward to exploring it further and seeing how much more value it can bring to business needs.

Author

Marco William

Software Engineer
