
11 posts tagged with "etl"


· 12 min read
Adrian Brudaru
info

TL;DR: dlt is a library for Normies: Problem solvers with antipathy for black boxes, gratuitous complexity and external dependencies.

This post tells the story of how we got here.

Try it in Colab:

But if you want to load pandas dfs to production databases, with all the best practices built-in, check out this documentation or this Colab notebook that shows easy handling of complex API data.

Or check out more resources at the end of the article.

I. The background story: Normal people load data too

Hey, I’m Adrian, cofounder of dlt. I’ve been working in the data industry since 2012, doing all kinds of end-to-end things.

In 2017, a hiring team called me a data engineer. As I saw that title brought me a lot of work offers, I kept it and went with it.

But was I doing data engineering? Yes and no. Since my studies were not technical, I always felt some impostor syndrome calling myself a data engineer. I had started as an analyst, did more and more, and became an end-to-end data professional who does everything from building the tech stack and collecting requirements to getting managers to agree on the metrics used 🙄, creating a roadmap, and hiring a team.

Back in 2022 there was an online conference called Normconf and I ‘felt seen’. As I watched Normconf participants, I could relate more to them than to the data engineer label. No, I am not just writing code and pushing best practices - I am actually just trying to get things done without getting bogged down in bad practice gotchas. And it seemed at this conference that many people felt this way.

normal

Normies: Problem solvers with antipathy for black boxes, gratuitous complexity and external dependencies

At Normconf, "normie" participants often embodied the three fundamental psychological needs identified in Self-Determination Theory: autonomy, competence, and relatedness.

They talked about how they autonomously solved all kinds of problems, related over the pains and gains of their roles, and showed off their competence across the board.

What they did was what I also did as a data engineer: we start from a business problem and work back through what needs to be done to understand and solve it.

By definition, a Normie is someone not overly specialised in one thing or another, and in our field even data engineers are jacks of all trades.

What undermines the Normie mission are things that clash with the basic needs, from uncustomisable products, to vendors that add bottlenecks and unreliable dependencies.

Encountering friction between data engineers and Python-first analysts

Before becoming a co-founder of dlt I had 5 interesting years as a startup employee, a half-year nightmare in a corporation with no autonomy or mastery (I got fired for refusing the madness, and it was such a huge relief), followed by 5 fun, rewarding and adventure-filled years of freelancing. Much of my work was "build & hire", which usually meant building a first-time data warehouse and hiring a team for it. The setups that I did were bespoke to the businesses that were getting them, including the teams - meaning the technical complexity was also tailored to the (lack of) technical culture of the companies I was building for.

In this time, I saw an acute friction between data engineers and Python-first analysts, mostly around the fact that data engineers easily become a bottleneck and data scientists are forced to pick up the slack. And of course, this causes other issues that might further complicate the life of the data engineer, while still not being a good solution for the data consumers.

So at this point I started building boilerplate code for data warehouses and learning how to better cater to the entire team.

II. The initial idea: pandas.df.to_sql() with data engineering best practices

After a few attempts I ended up with the hypothesis that df.to_sql() is the natural abstraction a data person would use - I have a table here, I want a table there, shouldn't be harder than a function call, right?

Right.

Except that particular function call is anything but data-engineering complete. A single run will do what it promises, but a production pipeline has many additional requirements. In the early days, we wrote up an ideal list of features that should be auto-handled (spoiler alert: today dlt does all that and more). Read on for the wish list:

Our dream: a tool that meets production pipeline requirements

  • Wouldn’t it be nice if we could auto-flatten and unpack nested structures into tables with generated join keys?
  • Wouldn’t it be nice if data types were properly defined and managed?
  • Wouldn’t it be nice if we could load the data incrementally, meaning retain some state to know where to start from?
  • Wouldn’t it be nice if this incremental load was bound to a way to do incremental extraction?
  • Wouldn’t it be nice if we didn’t run out of memory?
  • Wouldn’t it be nice if we got alerted/notified when schemas change?
  • Wouldn’t it be nice if schema changes were self healing?
  • Wouldn’t it be nice if I could run it all in parallel, or do async calls?
  • Wouldn’t it be nice if it ran on different databases too, from dev to prod?
  • Wouldn’t it be nice if it offered requests with built-in retries for those nasty unreliable APIs (Hey Zendesk, why do you fail on call 99998/100000?)
  • Wouldn’t it be nice if we had some extraction helpers like pagination detection?

Auto typing and unpacking with generated keys: keys

Performance docs
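To make the contrast with plain df.to_sql() concrete, here is a minimal sketch of the two abstractions side by side. Everything here is illustrative (table, file and pipeline names are made up), not a prescribed setup.

import pandas as pd
import sqlalchemy as sa
import dlt

df = pd.DataFrame([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])

# The familiar one-liner: fine for a single run, but no schema evolution,
# no incremental state, no retries, no alerting.
engine = sa.create_engine("sqlite:///local.db")
df.to_sql("users", engine, if_exists="append", index=False)

# The dlt equivalent: same mental model, with the behaviours from the
# wish list above handled for you.
pipeline = dlt.pipeline(pipeline_name="users_pipeline", destination="duckdb", dataset_name="raw")
pipeline.run(df, table_name="users", write_disposition="append")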

The initial steps

How did we go about it? At first, dlt was created as an engine to iron out its functionality. During this time, it was deployed in several projects, from startups to enterprises, particularly to accelerate data pipeline building in a robust way.

A while later, to prepare this engine for the general public, we created the current interface on top of it. We then tested it in a workshop with many “Normies” of which over 50% were pre-employment learners.

For the workshop we broke down the steps to build an incremental pipeline into 20 steps. In the 6-hour workshop we asked people to react on Slack to each "checkpoint". We then exported the Slack data and loaded it with dlt, exposing the completion rate per checkpoint. Turns out, it was 100%: everyone who started managed to build the pipeline. "This is it!" we thought, and spent the next 6 months preparing our docs and adding some plugins for easy deployment.

III. Launching dlt

We finally launched dlt to the general public in mid-2023. Our initial community was mostly data engineers who had been using dlt without docs, managing by reading the code. As we hoped, a lot of "normies" are using dlt too!

dlt = code + docs + Slack support

A product is a sum of many parts. For us dlt is not only the dlt library and interface, but also our docs and Slack community and the support and discussions there.

In the early days of dlt we talked to Sebastian Ramirez from FastAPI who told us that he spends 2/3 of his FastAPI time writing documentation.

In this vein, from the beginning docs were very important to us and we quickly adopted our own docs standard.

However, when we originally launched dlt, we found that different user types, especially Normies, expect different things from our docs, and because we asked for feedback, they told us.

So overall, we were not satisfied to stop there.

"Can you make your docs more like my favorite tool's docs?"

To this end we built and embedded our own docs helper in our docs.

The result? The docs helper has been running for a year and we currently see around 300 questions per day. Comparing this to other communities that do AI support on Slack, that’s almost 2 orders of magnitude difference in question volume by community size.

We think this is a good thing, and a result of several factors.

  • Embedded in docs means at the right place at the right time. Available to anyone, whether they use Slack or not.
  • Conversations are private and anonymous. This reduces the emotional barrier of asking. We suspect this is great for the many “Normies” / “problem solvers” that work in data.
  • The questions are different than in our Slack community: Many questions are around “Setup and configuration”, “Troubleshooting” and “General questions” about dlt architecture. In Slack, we see the questions that our docs or assistant could not answer.
  • The bot is conversational and will remember recent context, enabling it to be particularly helpful. This is different from the “question answering service” that many Slack bots offer, which do not keep context once a question was answered. By retaining context, it’s possible to reach a useful outcome even if it doesn’t come in the first reply.

dlt = “pip install and go” - the fastest way to create a pipeline and source

dlt offers a small number of verified sources, but encourages you to build your own. As we mentioned, creating an ad hoc dlt pipeline and source is dramatically simpler compared to other Python libraries. Maintaining a custom dlt source in production takes very little time, because the pipeline won't break unless the source stops existing.

The sources you build and run that are not shared back into the verified sources are what we call “private sources”.

By the end of 2023, our community had already built 1,000 private sources, and 2,000 by early March. We are now at the end of Q2 2024 and we see 5,000 private sources.

Embracing LLM-free code generation

We recently launched additional tooling that helps our users build sources. If you wish to try our python-first dict-based declarative approach to building sources, check out the relevant post.

  • REST API connector
  • OpenAPI-based pipeline generator that configures the REST API connector.

Alena introduces the generator and troubleshoots the outcome in 4min:

Community videos for the REST API source: playlist.

Both tools are LLM-free pipeline generators. I stress LLM-free because, in our experience, GPT can do some things to some extent - so if we ask it to complete 10 tasks to produce a pipeline, each with 50-90% accuracy, we can expect very low success rates (at 80% accuracy per task, the chance that all 10 succeed is about 0.8^10 ≈ 11%).

To get around this problem, we built from the OpenAPI standard, which contains information that can be turned into a pipeline algorithmically. OpenAPI is an API spec that's also used by FastAPI and is constantly growing in popularity, with around 50% of APIs currently supporting it.

By leveraging the data in the spec, we are able to have a basic pipeline. Our generator also infers some other pieces of information algorithmically to make the pipeline incremental and add some other useful details.

When generation doesn’t work

Of course, generation doesn't always work, but you can take the generated pipeline and make the final adjustments to arrive at a standard REST API config-based pipeline that won't suffer from code smells.
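For a flavour of what such a config-based pipeline can look like, here is a hedged sketch using dlt's rest_api source. The exact import path and config keys depend on your dlt version, and the API URL, resource names and parameters below are placeholders.

import dlt
from dlt.sources.rest_api import rest_api_source  # import path may differ by dlt version

# Declarative description of the API: one client block, a list of resources.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",  # placeholder URL
    },
    "resources": [
        "posts",  # simple resource: GET /posts
        {
            "name": "comments",
            "endpoint": {
                "path": "comments",
                "params": {"updated_since": "2024-01-01"},  # placeholder query param
            },
        },
    ],
})

pipeline = dlt.pipeline(pipeline_name="rest_api_demo", destination="duckdb", dataset_name="raw")
pipeline.run(source)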

The benefit of minimalistic sources

The real benefit of this declarative source is not at building time - a declarative interface requires more upfront knowledge. Instead, by having this option, we enable minimalistic pipelines that anyone could maintain, including non-coders or human-assisted LLMs. After all, LLMs are particularly proficient at translating configurations back and forth.

Want to influence us? We listen, so you're welcome to discuss with us in our Slack channel #4-discussions.

Towards a paid offering

dlt is an open core product, meaning it won’t be gated to push you to the paid version at some point. Instead, much like Kafka and Confluent, we will offer things around dlt to help you leverage it in your context.

If you are interested in helping us research what's needed, you can apply for our design partnership program, which aims to help you deploy dlt while helping us learn about your challenges.

Call to action.

If you like the idea of dlt, there is one thing that would help us:

Set aside 30min and try it.

See resource below.

We often hear variations of “oh i postponed dlt so long but it only took a few minutes to get going, wish I hadn’t installed [other tool] which took 2 weeks to set up properly and now we need to maintain or replace”, so don't be that guy.

Here are some notebooks and docs to whet your appetite:

· 4 min read
Aman Gupta

Hello, I'm Aman Gupta. Over the past eight years, I have navigated the structured world of civil engineering, but recently, I have found myself captivated by data engineering. Initially, I knew how to stack bricks and build structural pipelines. But this newfound interest has helped me build data pipelines, and most of all, it was sparked by a workshop hosted by dlt.

info

dlt (data loading tool) is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.

The dlt workshop took place in November 2022, co-hosted by Adrian Brudaru, my former mentor and co-founder of dlt.

An opportunity arose when another client needed data migration from FreshDesk to BigQuery. I crafted a basic pipeline version, initially designed to support my use case. Upon presenting my basic pipeline to the dlt team, Alena Astrakhatseva, a team member, generously offered to review it and refine it into a community-verified source.

image

My first iteration was straightforward—loading data in replace mode. While adequate for initial purposes, a verified source demanded features like pagination and incremental loading. To achieve this, I developed an API client tailored for the Freshdesk API, integrating rate limit handling and pagination:

from typing import Any, Iterable, Optional

import requests

from dlt.common.typing import TDataItem


class FreshdeskClient:
    """
    Client for making authenticated requests to the Freshdesk API. It incorporates API requests with
    rate limit and pagination.
    """

    def __init__(self, api_key: str, domain: str):
        # Contains stuff like domain, credentials and base URL.
        pass

    def _request_with_rate_limit(self, url: str, **kwargs: Any) -> requests.Response:
        # Handles rate limits in HTTP requests and ensures that the client doesn't exceed the limit set by the server.
        pass

    def paginated_response(
        self,
        endpoint: str,
        per_page: int,
        updated_at: Optional[str] = None,
    ) -> Iterable[TDataItem]:
        # Fetches a paginated response from a specified endpoint.
        pass

To make the pipeline more effective, I developed dlt resources that could handle incremental data loading. This involved creating resources that used dlt's incremental functionality to fetch only new or updated data:

def incremental_resource(
    endpoint: str,
    updated_at: Optional[Any] = dlt.sources.incremental(
        "updated_at", initial_value="2022-01-01T00:00:00Z"
    ),
) -> Generator[Dict[Any, Any], Any, None]:
    """
    Fetches and yields paginated data from a specified API endpoint.
    Each page of data is fetched based on the `updated_at` timestamp
    to ensure incremental loading.
    """

    # Retrieve the last updated timestamp to fetch only new or updated records.
    updated_at = updated_at.last_value

    # Use the FreshdeskClient instance to fetch paginated responses.
    # `freshdesk` and `per_page` are assumed to be defined in the enclosing source.
    yield from freshdesk.paginated_response(
        endpoint=endpoint,
        per_page=per_page,
        updated_at=updated_at,
    )

With the steps defined above, I was able to load the data from Freshdesk to BigQuery and use the pipeline in production. Here’s a summary of the steps I followed:

  1. Created a Freshdesk API token with sufficient privileges.
  2. Created an API client to make requests to the Freshdesk API with rate limit and pagination.
  3. Made incremental requests to this client based on the “updated_at” field in the response.
  4. Ran the pipeline using the Python script (a sketch of this follows below).
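To tie the steps together, here is a hedged sketch of what running such a pipeline can look like. The `freshdesk_source` import is a hypothetical name for illustration, not the exact verified-source code; credentials are assumed to come from dlt's secrets/config.

import dlt

from freshdesk import freshdesk_source  # hypothetical module name for illustration

pipeline = dlt.pipeline(
    pipeline_name="freshdesk_to_bigquery",
    destination="bigquery",
    dataset_name="freshdesk_data",
)

# Incremental resources only fetch records updated since the last run.
load_info = pipeline.run(freshdesk_source())
print(load_info)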

While my journey from civil engineering to data engineering was initially intimidating, it has proved to be a profound learning experience. Writing a pipeline with dlt mirrors the simplicity of a GET request: you request data, yield it, and it flows from the source to its destination. Now, I help other clients integrate dlt to streamline their data workflows, which has been an invaluable part of my professional growth.

In conclusion, diving into data engineering has expanded my technical skill set and provided a new lens through which I view challenges and solutions. A couple of years back that lens mostly saw concrete and steel; now it has begun to notice the pipelines of the data world.

Data engineering has proved challenging, satisfying, and a good career option for me so far. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's GitHub repository or diving into the documentation.

· 3 min read
Adrian Brudaru

About Yummy.eu

Yummy is a lean-ops meal-kit company that streamlines the entire food preparation process for customers in emerging markets by providing personalized recipes, nutritional guidance, and even shopping services. Their innovative approach ensures a hassle-free, nutritionally optimized meal experience, making daily cooking convenient and enjoyable.

Yummy is a food box business. At the intersection of gastronomy and logistics, this market is very competitive. To make it in this market, Yummy needs to be fast and informed in their operations.

Pipelines are not yet a commodity.

At Yummy, efficiency and timeliness are paramount. Initially, Martin, Yummy’s CTO, chose to purchase data pipelining tools for their operational and analytical needs, aiming to maximize time efficiency. However, the real-world performance of these purchased solutions did not meet expectations, which led to a reassessment of their approach.

What's important: velocity, reliability, speed, and time. Money is secondary.

Martin was initially satisfied with the ease of setup provided by the SaaS services.

The tipping point came when an update to Yummy’s database introduced a new log table, leading to unexpectedly high fees due to the vendor’s default settings that automatically replicated new tables fully on every refresh. This situation highlighted the need for greater control over data management processes and prompted a shift towards more transparent and cost-effective solutions.

10x faster, 182x cheaper with dlt + async + modal

Motivated to find a solution that balanced cost with performance, Martin explored using dlt, a tool known for its simplicity in building data pipelines. By combining dlt with asynchronous operations and using Modal for managed execution, the improvements were substantial:

  • Data processing speed increased tenfold.
  • Cost reduced by 182 times compared to the traditional SaaS tool.
  • The new system supports extracting data once and writing to multiple destinations without additional costs.

For a peek into how Martin implemented this solution, please see Martin's async Postgres source on GitHub. A simplified sketch of the general pattern follows.
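This is an illustration of the pattern only, not Martin's implementation: pages are fetched concurrently with asyncio inside a dlt resource so that network-bound extraction doesn't run serially. The endpoint URL, page count and httpx client are assumptions.

import asyncio

import dlt
import httpx  # assumed async-capable HTTP client; any equivalent works

@dlt.resource(table_name="orders", write_disposition="append")
def orders():
    async def fetch_all():
        async with httpx.AsyncClient() as client:
            # Fire off page requests concurrently instead of one by one.
            tasks = [client.get(f"https://api.example.com/orders?page={p}") for p in range(1, 11)]
            responses = await asyncio.gather(*tasks)
            return [item for r in responses for item in r.json()]

    # Run the async fetch and yield the combined result to dlt.
    yield asyncio.run(fetch_all())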

salo-martin-tweet

Taking back control with open source has never been easier

Taking control of your data stack is more accessible than ever with the broad array of open-source tools available. SQL copy pipelines, often seen as a basic utility in data management, do not generally differ significantly between platforms. They perform similar transformations and schema management, making them a commodity available at minimal cost.

SQL to SQL copy pipelines are widespread, yet many service providers charge exorbitant fees for these simple tasks. In contrast, these pipelines can often be set up and run at a fraction of the cost—sometimes just the price of a few coffees.

At dltHub, we advocate for leveraging straightforward, freely available resources to regain control over your data processes and budget effectively.

Setting up a SQL pipeline can take just a few minutes with the right tools. Explore these resources to enhance your data operations:
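As one such resource, here is a rough sketch of a minimal SQL-to-SQL copy pipeline with dlt's sql_database source. The connection string, table names and destination are placeholders, and the import path may differ by dlt version.

import dlt
from dlt.sources.sql_database import sql_database  # import path may differ by dlt version

# Copy selected tables from a Postgres database into BigQuery.
source = sql_database(
    "postgresql://user:password@localhost:5432/shop"  # placeholder connection string
).with_resources("orders", "customers")

pipeline = dlt.pipeline(pipeline_name="pg_to_bq", destination="bigquery", dataset_name="shop_raw")
print(pipeline.run(source))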

For additional support or to connect with fellow data professionals, join our community.

· 4 min read
Adrian Brudaru

Statistical Data and Metadata eXchange (SDMX) is an international standard used extensively by global organizations, government agencies, and financial institutions to facilitate the efficient exchange, sharing, and processing of statistical data.

Utilizing SDMX enables seamless integration and access to a broad spectrum of statistical datasets covering economics, finance, population demographics, health, and education, among others.

These capabilities make it invaluable for creating robust, data-driven solutions that rely on accurate and comprehensive data sources.

embeddable etl

Why SDMX?

SDMX not only standardizes data formats across disparate systems but also simplifies the access to data provided by institutions such as Eurostat, the ECB (European Central Bank), the IMF (International Monetary Fund), and many national statistics offices.

This standardization allows data engineers and scientists to focus more on analyzing data rather than spending time on data cleaning and preparation.

Installation and Basic Usage

To start integrating SDMX data sources into your Python applications, install the sdmx library using pip:

pip install sdmx1

Here's an example of how to fetch data from multiple SDMX sources, illustrating the diversity of data flows and the ease of access:

from sdmx_source import sdmx_source

source = sdmx_source([
    {"data_source": "ESTAT", "dataflow": "PRC_PPP_IND", "key": {"freq": "A", "na_item": "PLI_EU28", "ppp_cat": "A0101", "geo": ["EE", "FI"]}, "table_name": "food_price_index"},
    {"data_source": "ESTAT", "dataflow": "sts_inpr_m", "key": "M.PROD.B-D+C+D.CA.I15+I10.EE"},
    {"data_source": "ECB", "dataflow": "EXR", "key": {"FREQ": "A", "CURRENCY": "USD"}},
])
print(list(source))

This configuration retrieves data from:

  • Eurostat (ESTAT) for the Purchasing Power Parity (PPP) and Price Level Indices providing insights into economic factors across different regions.
  • Eurostat's short-term statistics (sts_inpr_m) on industrial production, which is crucial for economic analysis.
  • European Central Bank (ECB) for exchange rates, essential for financial and trade-related analyses.

Loading the data with dlt, leveraging best practices

After retrieving data using the sdmx library, the next challenge is effectively integrating this data into databases. The dlt library excels in this area by offering a robust solution for data loading that adheres to best practices in several key ways (a short loading sketch follows this list):

  • Automated schema management -> dlt infers types and evolves schema as needed. It automatically handles nested structures too. You can customise this behavior, or turn the schema into a data contract.
  • Declarative configuration -> You can easily switch between write dispositions (append/replace/merge) or destinations.
  • Scalability -> dlt is designed to handle large volumes of data efficiently, making it suitable for enterprise-level applications and high-volume data streams. This scalability ensures that as your data needs grow, your data processing pipeline can grow with them without requiring significant redesign or resource allocation.
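Putting the two together, loading the SDMX data from the configuration above is roughly a three-liner. The destination and dataset name here are illustrative.

import dlt

# `source` is the sdmx_source(...) configured in the snippet above.
pipeline = dlt.pipeline(pipeline_name="sdmx_demo", destination="duckdb", dataset_name="statistics")

# dlt infers the schema, unpacks nested structures and loads each configured
# dataflow into its own table.
load_info = pipeline.run(source)
print(load_info)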

Martin Salo, CTO at Yummy, a food logistics company, uses dlt to efficiently manage complex data flows from SDMX sources. By leveraging dlt, Martin ensures that his data pipelines are not only easy to build, robust and error-resistant but also optimized for performance and scalability.

View Martin Salo's implementation

Martin Salo's implementation of the sdmx_source package effectively simplifies the retrieval of statistical data from diverse SDMX data sources using the Python sdmx library. The design is user-friendly, allowing both simple and complex data queries, and integrates the results directly into pandas DataFrames for immediate analysis.

This implementation enhances data accessibility and prepares it for analytical applications, with built-in logging and error handling to improve reliability.

Conclusion

Integrating sdmx and dlt into your data pipelines significantly enhances data management practices, ensuring operations are robust, scalable, and efficient. These tools provide essential capabilities for data professionals looking to seamlessly integrate complex statistical data into their workflows, enabling more effective data-driven decision-making.

By engaging with the data engineering community and sharing strategies and insights on effective data integration, data engineers can continue to refine their practices and achieve better outcomes in their projects.

Join the conversation and share your insights in our Slack community.

· 8 min read
Adrian Brudaru

embeddable etl

The versatility that enables "one way to rule them all"... requires a devtool

A unified approach to ETL processes centers around standardization without compromising flexibility. To achieve this, we need to be able to build and run custom code, but also have helpers that let us standardise and simplify our work.

In the data space, we have a few custom code options, some of which are portable. But what is needed to achieve universality and portability is more than just a code standard.

So what do we expect from such a tool?

  • It should be created for our developers
  • It should be easily pluggable into existing tools and workflows
  • It should perform across a variety of hardware and environments.

Data teams don't speak Object Oriented Programming (OOP)

Connectors are nice, but when they don't exist or break, what do we do? We need to be able to build and maintain those connectors simply, as we work with the rest of our scripts.

The data person has a very mixed spectrum of activities and responsibilities, and programming is often a minor one. Thus, across a data team, while some members can read or even speak OOP, the team will not be able to do so without sacrificing other capabilities.

This means that in order to cater to a data team as a dev team, we need to acknowledge that a different abstraction is needed.

Goodbye OOP, hello @decorators!

Data teams often navigate complex systems and workflows that prioritize functional clarity over object-oriented programming (OOP) principles. They require tools that simplify process definition, enabling quick, readable, and maintainable data transformation and movement. Decorators serve this purpose well, providing a straightforward way to extend functionality without the overhead of class hierarchies and inheritance.

Decorators in Python allow data teams to annotate functions with metadata and operational characteristics, effectively wrapping additional behavior around core logic. This approach aligns with the procedural mindset commonly found in data workflows, where the emphasis is on the transformation steps and data flow rather than the objects that encapsulate them.

By leveraging decorators, data engineers can focus on defining what each part of the ETL process does—extract, transform, load—without delving into the complexities of OOP. This simplification makes the code more accessible to professionals who may not be OOP experts but are deeply involved in the practicalities of data handling and analysis.
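As a small illustration of that shift, here is what an extraction step can look like when expressed with dlt's decorators rather than class hierarchies. The endpoint and field names are made up; this is a sketch, not a prescribed pattern.

import dlt
from dlt.sources.helpers import requests  # dlt's requests wrapper with built-in retries

@dlt.resource(table_name="users", write_disposition="merge", primary_key="id")
def users(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):
    # A plain generator function: no classes, no inheritance, just the extraction logic.
    response = requests.get(
        "https://api.example.com/users",  # placeholder endpoint
        params={"updated_since": updated_at.last_value},
    )
    yield response.json()

pipeline = dlt.pipeline(pipeline_name="users_demo", destination="duckdb", dataset_name="raw")
pipeline.run(users())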

The ability to run embedded is more than just scalability

Most traditional ETL frameworks are architected with the assumption of relatively abundant computational resources. This makes sense given the resource-intensive nature of ETL tasks when dealing with massive datasets.

However, this assumption often overlooks the potential for running these processes on smaller, more constrained infrastructures, such as directly embedded within an orchestrator or on edge devices.

The perspective that ETL processes necessarily require large-scale infrastructure is ripe for challenge. In fact, there is a compelling argument to be made for the efficiency and simplicity of executing ETL tasks, particularly web requests for data integration, on smaller systems. This approach can offer significant cost savings and agility, especially when dealing with less intensive data loads or when seeking to maintain a smaller digital footprint.

Small infrastructure ETL runs can be particularly efficient in situations where real-time data processing is not required, or where data volumes are modest. By utilizing the orchestrator's inherent scheduling and management capabilities, one can execute ETL jobs in a leaner, more cost-effective manner. This can be an excellent fit for organizations that have variable data processing needs, where the infrastructure can scale down to match lower demands, thereby avoiding the costs associated with maintaining larger, underutilized systems.

Running on small workers is easier than spinning up infra

Running ETL processes directly on an orchestrator can simplify architecture by reducing the number of moving parts and dependencies. It allows data teams to quickly integrate new data sources and destinations with minimal overhead. This methodology promotes a more agile and responsive data architecture, enabling businesses to adapt more swiftly to changing data requirements.

It's important to recognize that this lean approach won't be suitable for all scenarios, particularly where data volumes are large or where the complexity of transformations requires the robust computational capabilities of larger systems. Nevertheless, for a significant subset of ETL tasks, particularly those involving straightforward data integrations via web requests, running on smaller infrastructures presents an appealing alternative that is both cost-effective and simplifies the overall data processing landscape.

Dealing with spiky loads is easier on highly parallel infras like serverless functions

Serverless functions are particularly adept at managing spiky data loads due to their highly parallel and elastic nature. These platforms automatically scale up to handle bursts of data requests and scale down immediately after processing, ensuring that resources are utilized only when necessary. This dynamic scaling not only improves resource efficiency but also reduces costs, as billing is based on actual usage rather than reserved capacity.

The stateless design of serverless functions allows them to process multiple, independent tasks concurrently. This capability is crucial for handling simultaneous data streams during peak times, facilitating rapid data processing that aligns with sudden increases in load. Each function operates in isolation, mitigating the risk of one process impacting another, which enhances overall system reliability and performance.

Moreover, serverless architectures eliminate the need for ongoing server management and capacity planning. Data engineers can focus solely on the development of ETL logic without concerning themselves with underlying infrastructure issues. This shift away from operational overhead to pure development accelerates deployment cycles and fosters innovation.

Some examples of embedded portability with dlt

Dagster's embedded ETL now supports dlt - enabling devs to do what they love - build.

The "Stop Reinventing Orchestration: Embedded ELT in the Orchestrator" blog post by Pedram from Dagster Labs, introduces the concept of Embedded ELT within an orchestration framework, highlighting the transition in data engineering from bulky, complex systems towards more streamlined, embedded solutions that simplify data ingestion and management. This evolution is seen in the move away from heavy tools like Airbyte or Meltano towards utilizing lightweight, performant libraries which integrate seamlessly into existing orchestration platforms, reducing deployment complexity and operational overhead. This approach leverages the inherent capabilities of orchestration systems to handle concerns typical to data ingestion, such as state management, error handling, and observability, thereby enhancing efficiency and developer experience.

dlt was built for just such a scenario and we are happy to be adopted into it. Besides adding connectors, dlt adds a simple way to build custom pipelines.

Read more about it on Dagster blog post on dlt.

Dagworks' dlt + duckdb + ibis + Hamilton demo

The DAGWorks Substack post introduces a highly portable pipeline built entirely from libraries, leveraging a blend of open-source Python tools: dlt, Ibis, and Hamilton. This integration exemplifies the trend towards modular, decentralized data systems, where each component specializes in a segment of the data handling process - dlt for extraction and loading, Ibis for transformation, and Hamilton for orchestrating complex data flows. These technologies are not just tools but represent a paradigm shift in data engineering, promoting agility, scalability, and cost-efficiency in deploying serverless microservices.

The post not only highlights the technical prowess of combining these libraries to solve practical problems like message retention and thread summarization on Slack but also delves into the meta aspects of such integrations. It reflects on the broader implications of adopting a lightweight stack that can operate within diverse infrastructures, from cloud environments to embedded systems, underscoring the shift towards interoperability and backend agnosticism in data engineering practices. This approach illustrates a shift in the data landscape, moving from monolithic systems to flexible, adaptive solutions that can meet specific organizational needs without heavy dependencies or extensive infrastructure.

Read more about it on Dagworks blog post on dlt.

Closing thoughts

The concepts discussed here—portability, simplicity, and scalability—are central to modern data engineering practices. They reflect a shift towards tools that not only perform well but also integrate seamlessly across different environments, from high-powered servers to minimal infrastructures like edge devices. This shift emphasizes the importance of adaptability in tools used by data teams, catering to a broad spectrum of deployment scenarios without sacrificing performance.

In this landscape, dlt exemplifies the type of tool that embodies these principles. It's not just about being another platform; it's about providing a framework that supports the diverse needs of developers and engineers. dlt's design allows it to be embedded directly within various architectures, enabling teams to implement robust data processes with minimal overhead. This approach reduces complexity and fosters an environment where innovation is not hindered by the constraints of traditional data platforms.

We invite the community to engage with these concepts through dlt, contributing to its evolution and refinement. By participating in this collaborative effort, you can help ensure that the tool remains at the forefront of data engineering technology, providing effective solutions that address the real-world challenges of data management and integration.

Join the conversation and share your insights in our Slack community or contribute directly to the growing list of projects using us. Your expertise can drive the continuous improvement of dlt, shaping it into a tool that not only meets current demands but also anticipates future needs in the data engineering field.

· 8 min read
Adrian Brudaru

The concept of simplicity and automation in a programming language is not new. The Perl scripting language had the motto "Perl makes easy things easy and hard things possible".

The reason for this motto was the difficulty of working with C, which requires more manual handling of resources and also a compilation step.

Perl scripts could be written and executed rapidly, making it ideal for tasks that needed quick development cycles. This ease of use and ability to handle complex tasks without cumbersome syntax made Perl incredibly popular in its heyday.

Perl was introduced as a scripting language that emphasized getting things done. It was created as a practical extraction and reporting tool, which quickly found its place in system administration, web development, and network programming.

History repeats, Python is a language for humans

human-building

Python took the philosophy of making programming more accessible and human-friendly even further. Guido van Rossum created Python with the goal of removing the drudgery from coding, choosing to prioritize readability and simplicity. This design philosophy makes Python an intuitive language not just for seasoned developers but for beginners as well. Its syntax is clean and expressive, allowing developers to write fewer lines of code for tasks that would require more in Perl or other languages. Python's extensive standard library, along with its powerful data structures, contribute to its ability to handle complex applications with ease.

Python's widespread adoption across various domains, from web development to data science and machine learning, is largely attributed to its accessibility.

Its simple syntax resembles natural language, which lowers the barrier to entry for programming. Compared to Perl, Python offers an even more organized and readable approach to coding, making it an ideal teaching language that prepares new developers for future challenges in software development.

And just like Perl, it's used for data extraction and visualisation - but now it's done by normie humans, not sysadmins or devs.

dlt makes easy things fast, and hard things accessible

Following the principles of Perl and Python, dlt aimed to simplify the data engineering process. dlt focuses on making the extraction and loading of data as straightforward as possible.

dlt makes easy things fast

Starting from a simple abstraction like pipeline.run(data, table_name="table"), where data can be any iterable such as a generator or dataframe, dlt enables robust loading. Here is what the above function does, so you don't have to.

  • It will (optionally) unpack nested lists into separate tables with generated join keys, and flatten nested dictionaries into a main row.
  • If given a generator, it will consume it via microbatching, buffering to disk or external drives, never running out of memory (customisable).
  • It will create "extract packages" of extracted data so if the downstream steps fail, it can resume/retry later.
  • It will normalise the data into a shape that naturally fits the database (customisable).
  • It will create "load packages" of normalised data so if the downstream steps fail, it can retry later.
  • It infers and loads with the correct data types, for example from ISO timestamp strings (configurable).
  • It can accept different types of write dispositions declaratively, such as 'append', 'merge' and 'replace'.
  • It will evolve the schema if we later load data with new columns, and it can alert on schema changes.
  • It will even create type variant columns if data types change (and alert if desired).
  • Or you can stop the schema from evolving and use the inferred schema, or a modified one, as a data contract.
  • It will report load packages associated with new columns, enabling column-level lineage to be passed downstream.

That's a lot of development and maintenance pain solved at just its simplest. You could say the dlt loader doesn't break as long as it encounters common data types. If an obscure type is in your data, it would need to be added to dlt or converted beforehand.
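A minimal sketch of that abstraction in action (destination and names are illustrative):

import dlt

# Nested data: dlt flattens the dict and unpacks the list into a child table
# with generated join keys.
data = [
    {"id": 1, "name": "alice", "address": {"city": "Berlin"}, "pets": [{"name": "rex"}]},
    {"id": 2, "name": "bob", "address": {"city": "Paris"}, "pets": []},
]

pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb", dataset_name="raw")
info = pipeline.run(data, table_name="users", write_disposition="append")
print(info)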

From robust loading to robust extraction

Building on the simple loading abstraction, dlt is more than a tool for simple things.

The next step in dlt usage is to leverage it for extraction. dlt offers the concepts of 'source' and 'resource'. A resource is the equivalent of a single endpoint or table, while a source is the group we put resources in to bundle them for usage.

For example, an API extractor from a single API with multiple endpoints, would be built as a source with multiple resources.

Resources enable you to easily configure how the data in that resource is loaded. You can create a resource by decorating a function with the @dlt.resource decorator, or you can generate resources dynamically (a sketch follows the examples below).

Examples of dynamic resources

  • If we have an api with multiple endpoints, we can put the endpoints in a list and iterate over it to generate resources
  • If we have an endpoint that gives us datapoints with different schemas, we could split them by a column in the data.
  • Similarly, if we have a webhook that listens to multiple types of events, it can dispatch each event type to its own table based on the data.
  • Or, if we want to shard a data stream into day-shards, we could append a date suffix in the resource name dynamically.
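A hedged sketch of the first pattern, generating one resource per endpoint from a list. The API URL and endpoint names are placeholders.

import dlt
from dlt.sources.helpers import requests

@dlt.source
def my_api_source(base_url: str = "https://api.example.com"):  # placeholder API
    endpoints = ["users", "orders", "invoices"]

    def fetch(endpoint):
        # Each generated resource simply pulls its own endpoint.
        yield requests.get(f"{base_url}/{endpoint}").json()

    # One resource per endpoint, each loading into a table of the same name.
    return [dlt.resource(fetch(endpoint), name=endpoint) for endpoint in endpoints]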

Once we group resources into a source, we can run them together (or, we could still run the resources independently)

Examples of reasons to group resources into sources.

  • We want to run (load) them together on the same schedule
  • We want to configure them together or keep their schemas together
  • They represent a single API and we want to publish them in a coherent, easy to use way.

So what are the efforts you spare when using dlt here?

  • A source can function similar to a class, but simpler, encouraging code reuse and simplicity.
  • Resources offer more granular configuration options.
  • Resources can also be transformers, passing data between them in a microbatched way, enabling patterns like enrichments or list/detail endpoints (sketched after this list).
  • Source schemas can be configured with various options, such as pushing down top-level columns into nested structures.
  • dlt's requests replacement has built-in retries for non-permanent error codes. This safeguards the progress of long extraction jobs that could otherwise break over and over (if retried as a whole) due to network or source API issues.
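A hedged sketch of the list/detail pattern with a transformer; the endpoints and fields are placeholders.

import dlt
from dlt.sources.helpers import requests

@dlt.resource
def orders_list():
    # List endpoint: yields one lightweight stub per order.
    yield from requests.get("https://api.example.com/orders").json()

@dlt.transformer
def order_details(order):
    # Detail endpoint: called per item (microbatched), enriching each order.
    yield requests.get(f"https://api.example.com/orders/{order['id']}").json()

pipeline = dlt.pipeline(pipeline_name="orders_demo", destination="duckdb", dataset_name="raw")
# Pipe the list resource into the transformer so details are fetched per order.
pipeline.run(orders_list | order_details)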

What else does dlt bring to the table?

Beyond the ease of data extraction and loading, dlt introduces several advanced features that further simplify data engineering tasks:

Asynchronous operations: dlt harnesses the power of asynchronous programming to manage I/O-bound and network operations efficiently. This means faster data processing, better resource utilization, and more responsive applications, especially when dealing with high volumes of data or remote data sources.

Flexible destinations and reverse ETL: dlt isn't just about pulling data in; it's about sending it where it needs to go. Whether it's a SQL database, a data lake, a cloud-based storage solution, or a custom reverse ETL destination, dlt provides the flexibility to integrate with various destinations.

Optional T in ETL: With dlt, transformations are not an afterthought but a core feature. You can define transformations as part of your data pipelines, ensuring that the data is not just moved but refined, enriched, and shaped to fit your analytical needs. This capability allows for more sophisticated data modeling and preparation tasks to be streamlined within your ELT processes.

Data quality and observability: dlt places a strong emphasis on data quality and observability. It includes features for schema evolution tracking, data type validation, error handling, and data contracts, which are critical for maintaining the integrity of your data ecosystem. Observability tools integrated within dlt help monitor the health and performance of your pipelines, providing insights into data flows, bottlenecks, and potential issues before they escalate.

Community and ecosystem: One of the most significant advantages of dlt is its growing community and ecosystem. Similar to Python, dlt benefits from contributions that extend its capabilities, including connectors, plugins, and integrations. This collaborative environment ensures that dlt remains at the forefront of data engineering innovation, adapting to new challenges and opportunities.

In essence, dlt is not just a tool but a comprehensive one-stop shop that addresses the end-to-end needs of modern data ingestion. By combining the simplicity of Python with the robustness of enterprise-grade tools, dlt democratizes data engineering, making it accessible to a broader audience. Whether you're a data scientist, analyst, or engineer, dlt empowers you to focus on what matters most: deriving insights and value from your data.

Conclusion

As Perl and Python have made programming more accessible, dlt is set to transform data engineering by making sophisticated data operations accessible to all. This marks a significant shift towards the democratization of technology, enabling more individuals to contribute to and benefit from the digital landscape. dlt isn't just about making easy things fast and hard things accessible; it's about preparing a future where data engineering becomes an integral part of every data professional's toolkit.

· 9 min read
Aman Gupta
info

TL;DR: This article compares deploying dbt-core standalone and using dlt-dbt runner on Google Cloud Functions. The comparison covers various aspects, along with a step-by-step deployment guide.

dbt or “data build tool” has become a standard for transforming data in analytical environments. Most data pipelines nowadays start with ingestion and finish with running a dbt package.

dlt or “data load tool” is an open-source Python library for easily creating data ingestion pipelines. And of course, after ingesting the data, we want to transform it into an analytical model. For this reason, dlt offers a dbt runner that’s able to just run a dbt model on top of where dlt loaded the data, without setting up any additional things like dbt credentials.

Using dbt in Google Cloud functions

To use dbt in cloud functions, we employed two methods:

  1. dbt-core on GCP cloud functions.
  2. dlt-dbt runner on GCP cloud functions.

Let’s discuss these methods one by one.

1. Deploying dbt-core on Google Cloud functions

Let's dive into running dbt-core on cloud functions.

You should use this option for scenarios where you have already collected and housed your data in a data warehouse, and you need further transformations or modeling of the data. This is a good option if you have used dbt before and want to leverage the power of dbt-core. If you are new to dbt, please refer to dbt documentation: Link Here.

Let’s start with setting up the following directory structure:

dbt_setup
|-- main.py
|-- requirements.txt
|-- profiles.yml
|-- dbt_project.yml
|-- dbt_transform
|   |-- models
|   |   |-- model1.sql
|   |   |-- model2.sql
|   |   |-- sources.yml
|-- (other dbt related contents, if required)

You can set up the contents of the dbt_transform folder by initializing a new dbt project; for details, refer to the documentation.

note

We recommend setting up and testing dbt-core locally before using it in cloud functions.

To run dbt-core on GCP cloud functions:

  1. Once you've tested the dbt-core package locally, update the profiles.yml before migrating the folder to the cloud function as follows:

    dbt_gcp: # project name
      target: dev # environment
      outputs:
        dev:
          type: bigquery
          method: oauth
          project: please_set_me_up! # your GCP project name
          dataset: please_set_me_up! # your project dataset name
          threads: 4
          impersonate_service_account: please_set_me_up! # GCP service account

    This service account should have BigQuery read and write permissions.

  2. Next, modify the main.py as follows:

    import os
    import subprocess
    import logging

    # Configure logging
    logging.basicConfig(level=logging.INFO)

    def run_dbt(request):
        try:
            # Set your dbt profiles directory (assuming it's in /workspace)
            os.environ['DBT_PROFILES_DIR'] = '/workspace/dbt_transform'

            # Change into the dbt project directory
            dbt_project_dir = '/workspace/dbt_transform'
            os.chdir(dbt_project_dir)

            # Log the current working directory and list files
            logging.info(f"Current working directory: {os.getcwd()}")
            logging.info(f"Files in the current directory: {os.listdir('.')}")

            # Run dbt command (e.g., dbt run)
            result = subprocess.run(
                ['dbt', 'run'],
                capture_output=True,
                text=True
            )

            # Return dbt output
            return result.stdout

        except Exception as e:
            logging.error(f"Error running dbt: {str(e)}")
            return f"Error running dbt: {str(e)}"
  3. Next, list runtime-installable modules in requirements.txt:

    dbt-core
    dbt-bigquery
  4. Finally, you can deploy the function using gcloud CLI as:

    gcloud functions deploy YOUR_FUNCTION_NAME \
    --gen2 \
    --region=YOUR_REGION \
    --runtime=python310 \
    --source=YOUR_SOURCE_LOCATION \
    --entry-point=YOUR_CODE_ENTRYPOINT \
    TRIGGER_FLAGS

    You also have the option to deploy the function via the GCP Cloud Functions GUI.

2. Deploying the function using the dlt-dbt runner

The second option is running dbt using the data load tool (dlt).

I work at dlthub and often create dlt pipelines. These often need dbt for modeling the data, making the dlt-dbt combination highly effective. For using this combination on cloud functions, we used dlt-dbt runner developed at dlthub.

The main reason I use this runner is because I load data with dlt and can re-use dlt’s connection to the warehouse to run my dbt package, saving me the time and code complexity I’d need to set up and run dbt standalone.

To integrate dlt and dbt in cloud functions, use the dlt-dbt runner; here’s how:

  1. Let's start by creating the following directory structure:

    dbt_setup
    |-- main.py
    |-- requirements.txt
    |-- dbt_project.yml
    |-- dbt_transform
    |   |-- models
    |   |   |-- model1.sql
    |   |   |-- model2.sql
    |   |   |-- sources.yml
    |-- (other dbt related contents, if required)

    You can set up the dbt project by initializing a new one; for details, refer to the documentation.

    note

    With the dlt-dbt runner configuration, setting up a profiles.yml is unnecessary. dlt seamlessly shares credentials with dbt, and on Google Cloud Functions it automatically retrieves service account credentials if none are provided.

  2. Next, configure the dbt_project.yml and set the model directory, for example:

    model-paths: ["dbt_transform/models"]
  3. Next, configure the main.py as follows:

    import dlt
    import logging
    from flask import jsonify
    from dlt.common.runtime.slack import send_slack_message
    from dlt.common import json

    def run_pipeline(request):
        """
        Set up and execute a data processing pipeline, returning its status
        and model information.

        This function initializes a dlt pipeline with pre-defined settings,
        runs the pipeline with a sample dataset, and then applies dbt
        transformations. It compiles and returns the information about
        each dbt model's execution.

        Args:
            request: The Flask request object. Not used in this function.

        Returns:
            Flask Response: A JSON response with the pipeline's status
            and dbt model information.
        """
        try:
            # Sample data to be processed
            data = [{"name": "Alice Smith", "id": 1, "country": "Germany"},
                    {"name": "Carlos Ruiz", "id": 2, "country": "Romania"},
                    {"name": "Sunita Gupta", "id": 3, "country": "India"}]

            # Initialize a dlt pipeline with specified settings
            pipeline = dlt.pipeline(
                pipeline_name="user_data_pipeline",
                destination="bigquery",
                dataset_name="dlt_dbt_test"
            )

            # Run the pipeline with the sample data
            pipeline.run(data, table_name="sample_data")

            # Apply dbt transformations and collect model information
            models = transform_data(pipeline)
            model_info = [
                {
                    "model_name": m.model_name,
                    "time": m.time,
                    "status": m.status,
                    "message": m.message
                }
                for m in models
            ]

            # Convert the model information to a string
            model_info_str = json.dumps(model_info)

            # Send the model information to Slack
            send_slack_message(
                pipeline.runtime_config.slack_incoming_hook,
                model_info_str
            )

            # Return a success response with model information
            return jsonify({"status": "success", "model_info": model_info})
        except Exception as e:
            # Log and return an error response in case of any exceptions
            logging.error(f"Error in running pipeline: {e}", exc_info=True)

            return jsonify({"status": "error", "error": str(e)}), 500

    def transform_data(pipeline):
        """
        Execute dbt models for data transformation within a dlt pipeline.

        This function packages and runs all dbt models associated with the
        pipeline, applying defined transformations to the data.

        Args:
            pipeline (dlt.Pipeline): The pipeline object for which dbt
                transformations are run.

        Returns:
            list: A list of dbt model run information, indicating the
                outcome of each model.

        Raises:
            Exception: If there is an error in running the dbt models.
        """
        try:
            # Initialize dbt with the given pipeline and virtual environment
            dbt = dlt.dbt.package(
                pipeline,
                "/workspace/dbt_transform",
                venv=dlt.dbt.get_venv(pipeline)
            )
            logging.info("Running dbt models...")
            # Run all dbt models and return their run information
            return dbt.run_all()
        except Exception as e:
            # Log and re-raise any errors encountered during dbt model execution
            logging.error(f"Error in running dbt models: {e}", exc_info=True)
            raise

    # Main execution block
    if __name__ == "__main__":
        # Execute the pipeline function.
        run_pipeline(None)
  4. The send_slack_message function is utilized for sending messages to Slack, triggered by both success and error events. For setup instructions, please refer to the official documentation here.

    RUNTIME__SLACK_INCOMING_HOOK was set up as an environment variable in the above code.

  5. Next, list runtime-installable modules in requirements.txt:

    dbt-core
    dbt-bigquery
  6. Finally, you can deploy the function using gcloud CLI as:

    gcloud functions deploy YOUR_FUNCTION_NAME \
    --gen2 \
    --region=YOUR_REGION \
    --runtime=python310 \
    --source=YOUR_SOURCE_LOCATION \
    --entry-point=YOUR_CODE_ENTRYPOINT \
    TRIGGER_FLAGS

The merit of this method is that it can be used to load and transform data simultaneously. Using dlt for data loading and dbt for modeling makes using dlt-dbt a killer combination for data engineers and scientists, and my preferred choice. This method is especially effective for batched data and event-driven pipelines with small to medium workloads. For larger data loads nearing timeout limits, consider separating dlt and dbt into different cloud functions.

For more info on using the dlt-dbt runner, please refer to the official documentation by clicking here.

Deployment considerations: How do cloud functions compare to GitHub Actions?

At dlthub we already natively support deploying to GitHub Actions, enabling you to have a serverless setup with a 1-command deployment.

GitHub Actions is an orchestrator that most would not find suitable for a data warehouse setup - but it certainly could do the job for a minimalistic setup. GitHub Actions provides 2,000 free minutes per month, so if our pipelines run for 66 minutes per day, we fit in the free tier. If our pipelines took another 1h per day, we would need to pay ~15 USD/month for the smallest machine (2 vCPUs), but you can see how that would get expensive if we wanted to run continuously or had multiple pipelines always-on in parallel.

Cloud functions are serverless lightweight computing solutions that can handle small computational workloads and are cost-effective. dbt doesn't require the high computing power of the machine because it uses the computing power of the data warehouse to perform the transformations. This makes running dbt-core on cloud functions a good choice. The free tier would suffice for about 1.5h per day of running a 1 vCPU and 2 GB RAM machine, and if we wanted an additional 1h per day for this hardware it would cost us around 3-5 USD/month.

DLT-DBT-RUNNER_IMAGE

When deploying dbt-core on cloud functions, there are certain constraints to keep in mind. For instance, there is a 9-minute time-out limit for all 1st Gen functions. For 2nd Gen functions, there is a 9-minute limit for event-driven functions and a 60-minute limit for HTTP functions. Since dbt works on the processing power of the data warehouse it's operating on, 60 minutes is sufficient for most cases with small to medium workloads. However, it is important to remember the 9-minute cap when using event-driven functions.

Conclusion

When creating lightweight pipelines, using the two tools together on one cloud function makes a lot of sense, simplifying the setup process and the handover between loading and transformation.

However, for more resource-intensive pipelines, we might want to improve resource utilisation by separating the dlt loading from the dbt running because while dbt’s run speed is determined by the database, dlt can utilize the cloud function’s hardware resources.

When it comes to setting up just a dbt package to run on cloud functions, I guess it comes down to personal preference: I prefer dlt as it simplifies credential management. It automatically shares credentials with dbt, making setup easier. Streamlining the process further, dlt on Google Cloud Functions efficiently retrieves service account credentials when none are provided. I also used dlt's Slack error reporting function, which sends success and error notifications from your runs directly to your Slack channel, helping me manage and monitor my runs.

· 5 min read
Rahul Joshi
info

TL;DR: While most companies continue to build their businesses on top of SAP, when it comes to analytics, they prefer to take advantage of the price and elastic compute of modern cloud infrastructure. As a consequence, we get several dlt users asking for a simple and low-cost way to migrate from SAP to cloud data warehouses like Snowflake. In this blog, I show how you can build a custom SAP connector with dlt and use it to load SAP HANA tables into Snowflake.

Blog image

In case you haven’t figured it out already, we at dltHub love creating blogs and demos. It’s fun, creative, and gives us a chance to play around with many new tools. We are able to do this mostly because, like any other modern tooling, dlt just fits in the modern ecosystem. Not only does dlt have existing integrations (to, for example, GCP, AWS, dbt, airflow etc.) that can simply be “plugged in”, but it is also very simple to customize it to integrate with almost any other modern tool (such as Metabase, Holistics, Dagster, Prefect etc.).

But what about enterprise systems like SAP? They are, after all, the most ubiquitous tooling out there: according to SAP data, 99 of the 100 largest companies are SAP customers. A huge part of the reason is that their ERP system is still the gold standard for effectiveness and reliability. However, when it comes to OLAP workloads like analytics, machine learning, and predictive modelling, many companies prefer the convenience and cost savings of modern cloud solutions like GCP, AWS, Azure, etc.

So, wouldn’t it be nice to be able to integrate SAP into the modern ecosystem?

Unfortunately, this is not that simple. SAP does not integrate easily with non-SAP systems, and migrating data out from SAP is complicated and/or costly. This often means that ERP data stays separate from analytics data.

Creating a dlt integration

Our users have been asking for SAP HANA data, so I decided to create a custom dlt integration for SAP’s in-memory data warehouse: SAP HANA. Given its SQL backend and Python API, I figured dlt should have no problem connecting to it.

I then used this pipeline to load SAP HANA tables into Snowflake, since Snowflake is cloud agnostic and can run in different environments (such as AWS, GCP, Azure, or any combination of the three). This is how I did it:

Step 1: I created an instance in SAP HANA cloud.

(I used this helpful tutorial to navigate SAP HANA.)

SAP instance

Step 2: I inserted some sample data.
SAP insert data

Step 3: With the tables created in SAP HANA, I was now ready to create a dlt pipeline to extract them into Snowflake:

Since SAP HANA has a SQL backend, I decided to extract the data using dlt’s SQL source.

  1. I first created a dlt pipeline

    dlt init sql_database snowflake

  2. I then passed the connection string for my HANA instance inside the loading function in sql_database_pipeline.py. (Optional: I also specified the tables I wanted to load via sql_database().with_resources("v_city", "v_hotel", "room").)

  3. Before running the pipeline I installed all necessary requirements using

    pip install -r requirements.txt

    The dependencies inside requirements.txt are for the general SQL source. To extract data specifically from HANA, I also installed the packages hdbcli and sqlalchemy-hana.

Step 4: I finally ran the pipeline using python sql_database_pipeline.py. This loaded the tables into Snowflake.

Data in Snowflake
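
Putting steps 2-4 together, a minimal sketch of the loading function could look like the following; the connection string, credentials, and dataset name are placeholders.

    # sql_database_pipeline.py (sketch)
    import dlt
    from sql_database import sql_database

    def load_hana_to_snowflake() -> None:
        # SQLAlchemy connection string for SAP HANA via the sqlalchemy-hana dialect (placeholder credentials)
        credentials = "hana+hdbcli://YOUR_USER:YOUR_PASSWORD@YOUR_HOST:443"

        # Load only the selected tables
        source = sql_database(credentials).with_resources("v_city", "v_hotel", "room")

        pipeline = dlt.pipeline(
            pipeline_name="sap_hana",
            destination="snowflake",
            dataset_name="hana_data",
        )
        print(pipeline.run(source))

    if __name__ == "__main__":
        load_hana_to_snowflake()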

Takeaway

The dlt SAP HANA connector constructed in this demo works like any other dlt connector, and is able to successfully load data from SAP HANA into data warehouses like Snowflake.

Furthermore, the demo only used a toy example, but the SQL source is a production-ready source with incremental loading, merges, data contracts etc., which means that this pipeline could also be configured for production use-cases.

Finally, the dlt-SAP integration has bigger consequences: it allows you to add other tools like dbt, airflow etc. easily into an SAP workflow, since all of these tools integrate well with dlt.

Next steps

This was just a first step into exploring what’s possible. Creating a custom dlt connector worked well for SAP HANA, and there are several possible next steps, such as converting this into a verified source or building other SAP connectors.

  1. Creating a verified source for SAP HANA: This should be pretty straightforward, since it would only require a small modification of the existing SQL source.
  2. Creating a dlt connector for SAP S/4HANA: S/4HANA is SAP’s ERP software that runs on the HANA database. The use case would be to load ERP tables from S/4HANA into other data warehouses like Snowflake. Depending on the requirements, there are two ways to go about it:
    1. Low volume data: This would again be straightforward. SAP offers REST API endpoints to access ERP tables, and dlt is designed to load data from any such endpoint.
    2. High volume data: dlt can also be configured to migrate large volumes of data with fast incremental or merge syncs. This would require some additional steps, such as configuring the pipeline to access the HANA backend directly via the Python hdbcli client (see the sketch below).
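
To illustrate option 2.2, here is a minimal sketch of a dlt resource that reads from HANA directly via hdbcli with incremental loading. The host, credentials, table, and column names are placeholders, not part of the demo above.

    import dlt
    from hdbcli import dbapi  # SAP's official Python client for HANA

    @dlt.resource(table_name="sales_orders", write_disposition="merge", primary_key="ORDER_ID")
    def sales_orders(updated_at=dlt.sources.incremental("UPDATED_AT", initial_value="1970-01-01")):
        # Placeholder connection details
        conn = dbapi.connect(address="YOUR_HOST", port=443, user="YOUR_USER", password="YOUR_PASSWORD")
        cursor = conn.cursor()
        # Only fetch rows newer than the last loaded value
        cursor.execute("SELECT * FROM SALES_ORDERS WHERE UPDATED_AT > ?", (updated_at.last_value,))
        columns = [c[0] for c in cursor.description]
        while True:
            rows = cursor.fetchmany(10_000)
            if not rows:
                break
            yield [dict(zip(columns, row)) for row in rows]
        conn.close()

    pipeline = dlt.pipeline(pipeline_name="hana_direct", destination="snowflake", dataset_name="erp_data")
    pipeline.run(sales_orders)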

· 8 min read
Aman Gupta

💡 This article explores methods for monitoring transactional events, allowing immediate action and capturing data that might otherwise be lost. We focus on GitHub, Slack, and HubSpot, demonstrating techniques applicable to low-volume transactional events (under 500k/month) within the free tier. For clickstream tracking or higher volumes, we recommend more scalable solutions.

There’s more than one way to sync data. Pulling data from APIs after it has been collected is the classic way, but some types of data are better transmitted as events at the moment they happen. Our approach is event-triggered and can include actions like:

  • Slack: sending messages in Slack
  • GitHub: commit, comment, or PR actions
  • HubSpot: object creation or meeting specific criteria

These actions initiate a webhook that sends a POST request to trigger a DLT pipeline for event ingestion. The data is then loaded into BigQuery.

pictorial_demonstration

This setup enables real-time alerts or event storage for later use. For example, say you want to be alerted every time something happens - you need to capture the event as it is sent to you and act on it. In other cases, you simply store the event for later use. This guide covers a use case for deploying and setting up webhooks.

Why do we use webhooks?

Whenever we want to receive an event from an external source, we need a “recipient address” to which it can send the data. The simplest way to solve this is to use a URL as that address and accept a payload as the data.

Why cloud functions?

The key reasons for using cloud functions include:

  1. To have a URL up and accepting the data payload, we need some service or API that is always up and ready to listen for the data.

  2. Creating our own application for this would be cumbersome and expensive. For low volumes of events, it makes sense to use a serverless service.

  3. On AWS, you would use API Gateway + Lambda to handle incoming events, but for GCP users the option is more straightforward: Google Cloud Functions come with an HTTP trigger, which lets you create a URL and accept a payload.

  4. The pricing for cloud functions is unbeatable for low volumes: for ingesting an event with a minor function, assuming a processing time of a few seconds, we could make a few hundred thousand invocations every month for free. For more pricing details, see the GCP pricing page for cloud functions.

Let's dive into deploying the webhooks and setting up the apps, focusing on triggers from GitHub, Slack, and HubSpot for the use cases discussed above.

1. GitHub Webhook

This GitHub webhook is triggered upon specified events such as pull requests (PRs), commits, or comments. It relays relevant data to BigQuery. Set up the GitHub webhook by creating the cloud function URL and configuring it in the GitHub repository settings.

1.1 Initialize GitHub webhook deployment

To set up the webhook, start by creating a cloud function. Follow these brief steps, and for an in-depth guide, please refer to the detailed documentation.

  1. Log into GCP and activate the Cloud Functions API.

  2. Click 'Create Function' in Cloud Functions, and select your region and environment setup.

  3. Choose HTTP as the trigger, enable 'Allow unauthenticated invocations', save, and click 'Next'.

  4. Set the environment to Python 3.10 and prepare to insert code into main.py:

    import dlt
    import time
    from google.cloud import bigquery
    from dlt.common import json

    def github_webhook(request):
        # Extract relevant data from the request payload
        data = request.get_json()

        event = [data]

        pipeline = dlt.pipeline(
            pipeline_name='platform_to_bigquery',
            destination='bigquery',
            dataset_name='github_data',
        )

        pipeline.run(event, table_name='webhook')  # table_name can be customized
        return 'Event received and processed successfully.'
  5. Name the function entry point "github_webhook" and list required modules in requirements.txt.

    # requirements.txt
    dlt[bigquery]
  6. Post-deployment, a webhook URL is generated, typically following this format:

    https://{region}-{project-id}.cloudfunctions.net/{cloud-function-name}

Once the cloud function is configured, it provides a URL for GitHub webhooks to send POST requests to, funneling data directly into BigQuery.
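
Before wiring up GitHub, you can sanity-check the deployed function with a manual POST request; the URL and payload below are placeholders.

    import requests

    # Placeholder URL; use the one shown after deployment
    url = "https://YOUR_REGION-YOUR_PROJECT.cloudfunctions.net/github_webhook"
    payload = {"action": "opened", "repository": {"full_name": "org/repo"}}

    response = requests.post(url, json=payload, timeout=30)
    print(response.status_code, response.text)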

1.2 Configure the repository webhook in GitHub

Set up a GitHub repository webhook to trigger the cloud function on specified events by following these steps:

  1. Log into GitHub and go to your repository.
  2. Click "Settings" > "Webhooks" > "Add webhook."
  3. Enter the cloud function URL in "Payload URL."
  4. Choose "Content-Type" and select events to trigger the webhook, or select "Just send me everything."
  5. Click "Add webhook."

With these steps complete, any chosen events in the repository will push data to BigQuery, ready for analysis.

2. Slack Webhook

This Slack webhook fires when a user sends a message in a channel where the Slack app is installed. To set it up, create a cloud function as below, obtain its URL, and then configure the message events in the Slack app settings.

2.1 Initialize Slack webhook deployment

Set up the webhook by creating a cloud function, using the same steps as for the GitHub webhook.

  1. Here’s what main.py looks like:

    import dlt
    from flask import jsonify

    def slack_webhook(request):
        # Handles webhook POST requests
        if request.method == 'POST':
            data = request.get_json()

            # Responds to Slack's verification challenge
            if 'challenge' in data:
                return jsonify({'challenge': data['challenge']})

            # Processes a message event
            if 'event' in data and 'channel' in data['event']:
                message_data = process_webhook_event(data['event'])

                # Configures and initiates a DLT pipeline
                pipeline = dlt.pipeline(
                    pipeline_name='platform_to_bigquery',
                    destination='bigquery',
                    dataset_name='slack_data',
                )

                # Runs the pipeline with the processed event data
                pipeline.run([message_data], table_name='webhook')
                return 'Event processed.'
            else:
                return 'Event type not supported', 400
        else:
            return 'Only POST requests are accepted', 405

    def process_webhook_event(event_data):
        # Formats the event data for the DLT pipeline
        message_data = {
            'channel': event_data.get('channel'),
            'user': event_data.get('user'),
            'text': event_data.get('text'),
            'ts': event_data.get('ts'),
            # Potentially add more fields according to event_data structure
        }
        return message_data
  2. Name the entry point "slack_webhook" and include the necessary modules in requirements.txt, the same as the GitHub webhook setup.

  3. Once the cloud function is configured, you get a URL for Slack events to send POST requests to, funneling data directly into BigQuery.

2.2 Set up and configure a Slack app

Create and install a Slack app in your workspace to link channel messages from Slack to BigQuery as follows:

  1. Go to "Manage apps" in workspace settings; click "Build" and "Create New App".
  2. Choose "from scratch", name the app, select the workspace, and create the app.
  3. Under "Features", select "Event Subscription", enable it, and input the Cloud Function URL.
  4. Add message.channels under "Subscribe to bot events".
  5. Save and integrate the app to the desired channel.

With these steps complete, any message sent on the channel will push data to BigQuery, ready for analysis.

3. Hubspot webhook

A HubSpot webhook can be configured within an automation workflow, applicable to contacts, companies, deals, tickets, quotes, conversations, feedback submissions, goals, and invoices. It triggers upon specific conditions or data filters. To set it up, create a cloud function, retrieve its URL, and input it in HubSpot's automation workflow settings.

3.1 Initialize Hubspot webhook deployment

Set up the webhook by creating a cloud function, using the same steps as for the GitHub webhook.

  1. Here’s what main.py looks like:

    import dlt
    from flask import jsonify

    def hubspot_webhook(request):
        # Endpoint for handling webhook POST requests from HubSpot
        if request.method == 'POST':
            # Get JSON data from the POST request
            data = request.get_json()

            # Initialize and configure the DLT pipeline
            pipeline = dlt.pipeline(
                pipeline_name="hubspot",
                destination='bigquery',  # Destination service for the data
                dataset_name='hubspot_webhooks_dataset',  # BigQuery dataset name
            )

            # Execute the pipeline with the incoming data
            pipeline.run([data], table_name='hubspot_contact_events')

            # Return a success response
            return jsonify(message='HubSpot event processed.'), 200
        else:
            # Return an error response for non-POST requests
            return jsonify(error='Only POST requests are accepted'), 405

  2. Name the entry point "hubspot_webhook" (matching the function name above) and include the necessary modules in requirements.txt, the same as in the GitHub webhook setup.

  3. Once the cloud function is configured, you get a URL for HubSpot events to send POST requests to, funneling data directly into BigQuery.

3.2 Configure a Hubspot automation workflow

To activate a Hubspot workflow with your webhook:

  1. Go to Hubspot: "Automation" > "Workflows" > "Create workflow".
  2. Start from scratch; choose "Company-based" for this example.
  3. Set "Object created" as the trigger.
  4. Add the "Send a webhook" action, use the "POST" method, and input your webhook URL.
  5. Select the company properties to include, test, and save.

This triggers the webhook upon new company creation, sending data to BigQuery via dlt.

In conclusion

Setting up a webhook is straightforward.

Using dlt with schema evolution, we can accept the events without worrying about their schema. However, for events with custom schemas, or events vulnerable to bad data quality or abuse, consider using dlt’s data contracts.
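
For illustration, here is a minimal sketch of applying a schema contract to the webhook pipeline above; the sample event and the chosen contract modes are assumptions to adapt to your data.

    import dlt

    pipeline = dlt.pipeline(
        pipeline_name='platform_to_bigquery',
        destination='bigquery',
        dataset_name='slack_data',
    )

    # Allow new tables to appear, but freeze columns and data types so unexpected fields raise an error
    pipeline.run(
        [{"channel": "C123", "user": "U456", "text": "hello"}],
        table_name='webhook',
        schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "freeze"},
    )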

· 9 min read
Adrian Brudaru

In a recent article, Anna Geller, product manager at Kestra, highlighted why data ingestion will never be solved. In her article, she described the many obstacles around data ingestion, and detailed how various companies and open-source tools approached this problem.

I’m Adrian, data builder. Before starting dltHub, I was building data warehouses and teams for startups and corporations. As a power-builder, I have spent many years looking into how this space could be solved.

The conviction on which we started dlt is that, to solve the data ingestion problem, we need to identify the motivated problem solver and turbocharge them with the right tooling.

The current state of data ingestion: dependent on vendors or engineers.

When building a data pipeline, we can start from scratch, or we can look for existing solutions.

How can we build an ingestion pipeline?

  • SaaS tools: We could use ready-made pipelines or use building blocks to configure a new API call.
  • SDKs: We could ask a software developer to build a Singer or Airbyte source. Or we could learn object-oriented programming and the SDKs and become the software developer - but the latter is an unreasonable pathway for most.
  • Custom pipelines: We could ask a data engineer to build custom pipelines. Unfortunately, everyone is building from scratch, so we usually end up reinventing the flat tire. Pipelines often break and have a high maintenance effort, bottlenecking the amount that can be built and maintained per data engineer.

Besides the persona-tool fit, the current tooling forces a major trade-off between simplicity and flexibility. For example, SaaS tools and SaaS SDKs offer “building blocks” but leave little room for customization. On the other hand, custom pipelines let you do anything you could want, but come with a high burden of code, complexity, and maintenance. And classic SDKs are simply too difficult for the majority of data people.

etl_by_others.png

So how can we solve ingestion?

Ask first, who should solve ingestion. Afterwards, we can look into the right tools.

The builder persona should be invested in solving the problem, not in preserving it.

UI first? We already established that people dependent on a UI with building blocks are non-builders - they use what exists. They are part of the demand, not part of the solution.

SDK first? Having a community of software engineers whose only reason to maintain pipelines is financial incentive doesn’t work either. For example, Singer has a large community of agencies that will help - for a price. But the open-source sources are not maintained, PRs are not accepted, etc. It’s just another indirect vendor community that benefits from the problem persisting.

The reasonable approach is to offer something to a person who wants to use the data but also has some capability to do something about it, and willingness to make an effort. So the problem has to be solved in code, and it logically follows that if we want the data person to use this without friction, it has to be Python.

So the existing tools are a dead end: What do custom pipeline builders do?

Unfortunately, the industry has very little standardization, but we can note some patterns.

df.to_sql() was a great first step

For the Python-first users, pandas df.to_sql() automated loading dataframes to SQL without having to worry about database-specific commands or APIs.

Unfortunately, this way of loading is limited and not very robust. There is no support for merge/upsert loading or for advanced configuration like performance hints. The automatic typing might sometimes also lead to issues over time with incremental loading.

Additionally, putting the data into a dataframe means loading it into memory, leading to limitations. So a data engineer considering how to create a boilerplate loading solution would not end up relying on this method because it would offer too little while taking away fine-grain control.

So while this method works well for quick and dirty work, it doesn’t work so well in production. And for a data engineer, this method adds little while taking away a lot. The good news: we can all use it. The bad news: it’s not engineering-ready.
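
To make the contrast concrete, here is a minimal sketch that loads the same dataframe with df.to_sql() and with dlt; the dataframe, destinations, and column names are illustrative, and the merge behaviour shown is what dlt adds on top.

    import dlt
    import pandas as pd
    from sqlalchemy import create_engine

    df = pd.DataFrame([{"id": 1, "name": "anna"}, {"id": 2, "name": "bob"}])

    # pandas: quick and dirty - append-only, typing left to the database
    engine = create_engine("sqlite:///quick_and_dirty.db")
    df.to_sql("users", engine, if_exists="append", index=False)

    # dlt: similar one-liner feel, but with schema inference, typing and merge/upsert support
    pipeline = dlt.pipeline(pipeline_name="users", destination="duckdb", dataset_name="prod")
    pipeline.run(df, table_name="users", write_disposition="merge", primary_key="id")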

Inserting JSON directly is a common antipattern. However, many developers use it because it solves a real problem.

Inserting JSON “as is” is a common antipattern in data loading. We do it because it’s a quick fix for compatibility issues between untyped semi-structured data and strongly typed databases. It lets us feed raw data straight to the analyst, who can sort through, clean, and curate it, which in turn keeps the data team from being bottlenecked at the data engineer.

So, inserting JSON is not all bad. It solves some real problems, but it has some unpleasant side effects:

  • Without an explicit schema, you do not know if there are schema changes in the data.
  • Without an explicit schema, you don’t know if your JSON extract path is unique. Many applications output inconsistent types, for example, a dictionary for a single record or a list of dicts for multiple records, causing JSON path inconsistencies.
  • Without an explicit schema, data discovery and exploration are harder, requiring more effort.
  • Reading a JSON record in a database usually scans the entire record, multiplying cost or degrading performance significantly.
  • Without types, you might incorrectly guess and suffer from frequent maintenance or incorrect parsing.
  • Dashboarding tools usually cannot handle nested data - but they often have options to model tabular data.
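
For contrast, here is a minimal sketch of how dlt normalizes a nested JSON event into typed relational tables instead of storing raw JSON; the event shape and destination are illustrative.

    import dlt

    event = {
        "id": 1,
        "user": {"name": "anna", "plan": "pro"},
        "items": [{"sku": "a"}, {"sku": "b"}],
    }

    pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb", dataset_name="raw")
    pipeline.run([event], table_name="events")

    # Result: a typed "events" table with flattened columns (user__name, user__plan)
    # and a child table "events__items" linked back to the parent row.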

Boilerplate code vs one-offs

Companies that have the capacity will generally create some kind of common, boilerplate methods that let their team re-use the same glue code. This has major advantages but also disadvantages: building something like this in-house is hard, and the result is often a major source of frustration for its users. What we usually see implemented solves the problem, but is usually too immature to be nice technology and far from being a good product that people can use.

One-offs have their advantage: they are easy to create and can generally take a shortened path to loading data. However, as soon as you have more of them, you will want to have a single point of maintenance as above.

The solution: A pipeline-building dev tool for the Python layman

Let’s let Drake recap for us:

what would drake do

So what does our desired solution look like?

  • Usable by any Python user in any Python environment, like df.to_sql()
  • Automate difficult things: Normalize JSON into relational tables automatically. Alert schema changes or contract violations. Add robustness, scaling.
  • Keep code low: Declarative hints are better than imperative spaghetti.
  • Enable fine-grained control: Builders should be enabled to control finer aspects such as performance, cost, compliance.
  • Community: Builders should be enabled to share content that they create

We formulated our product principles and went from there.

And how far did we get?

  • dlt is usable by any Python user and has a very shallow learning curve.
  • dlt runs where Python runs: Cloud functions, notebooks, etc.
  • Automate difficult things: Dlt’s schema automations and extraction helpers do 80% of the pipeline work.
  • Keep code low: by automating a large chunk and offering declarative configuration, dlt keeps code as short as it can be.
  • Fine-grained control: Engineers with advanced requirements can easily fulfill them by using building blocks or custom code.
  • Community: We have a sharing mechanism (add a source to dlt’s sources) but it’s too complex for the target audience. There is a trade-off between the quality of code and strictness of requirements which we will continue exploring. We are also considering how LLMs can be used to assist with code quality and pipeline generation in the future.

What about automating the builder further?

LLMs are changing the world. They are particularly well-suited to language tasks. Here, a library shines over any other tool - simple code like you would write with dlt can be written automatically by GPT.

The same cannot be said for SDK code or UI tools: because they use abstractions like classes or configurations, they deviate much further from natural language, significantly increasing the complexity of using LLMs to generate for them.

LLMs aside, technology is advancing faster than our ability to build better interfaces - and a UI builder has been an obsolete choice for years. With the advent of self-documenting APIs following the OpenAPI standard, there is no longer a need for a human to compose building blocks in a UI - the entire code can be generated even without LLM assistance (demo of how we do it). An LLM could then improve it from there. And if the APIs do not follow the standard, the building blocks of a UI builder are even less useful, while an LLM can read the docs and brute-force a solution.

So, will data ingestion ever be a fully solved problem? Yes, by you and us together.

In summary, data ingestion is a complex challenge that has seen various attempts at solutions, from SDKs to custom pipelines. The landscape is marked by trade-offs, with existing tools often lacking the perfect balance between simplicity and flexibility.

dlt, as a pipeline-building dev tool designed for Python users, aims to bridge this gap by offering an approachable, yet powerful solution. It enables users to automate complex tasks, keep their code concise, and maintain fine-grained control over their data pipelines. The community aspect is also a crucial part of the dlt vision, allowing builders to share their content and insights.

The journey toward solving data ingestion challenges is not just possible; it's promising, and it's one that data professionals together with dlt are uniquely equipped to undertake.

Resources:

This demo works on Codespaces. Codespaces is a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through further steps.
The demo uses the Continue VSCode extension.

Off to codespaces!
