Why Airflow and Snowflake?
Apache Airflow and Snowflake have become the de facto standard stack for modern data engineering, and for good reason. Airflow provides the orchestration layer — scheduling, dependency management, retries, alerting, and monitoring for data workflows. Snowflake provides the analytical database — scalable, performant, and managed cloud storage and compute for analytical queries. Together, they form the backbone of a data pipeline that can grow from a handful of tables to petabyte-scale analytical workloads.
This guide walks through building a production-grade ETL pipeline step by step. We'll cover the architecture decisions, the code patterns, the operational considerations, and the mistakes that trip up most teams on their first pipeline. By the end, you'll have a mental model for how data should flow from source systems through transformation to analytical consumption.
Architecture Overview
A typical Airflow-Snowflake pipeline follows this flow:
- Extract: Pull data from source systems (APIs, databases, files, SaaS tools) into a staging area. This could be S3 buckets, a staging schema in Snowflake, or local files on the Airflow worker.
- Load: Load the raw extracted data into Snowflake's raw or staging layer. Snowflake's COPY INTO command efficiently loads data from cloud storage (S3, Azure Blob, GCS) into tables.
- Transform: Use SQL transformations (often managed by dbt) to clean, deduplicate, join, and reshape the raw data into analytical models. This is where business logic lives — calculating metrics, applying business rules, and creating the dimensional models that analysts query.
- Serve: The transformed data is available in Snowflake for BI tools, analytical queries, ML model training, and operational applications.
Airflow orchestrates all of these steps — managing the sequence, handling failures, and providing visibility into pipeline health. Each step is implemented as an Airflow task, and the dependencies between tasks form a DAG (Directed Acyclic Graph) that defines the pipeline's execution order.
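The extract-load-transform flow above might be sketched as a minimal DAG. Task names and callables here are illustrative, not a prescribed layout; note that the schedule parameter is named schedule in Airflow 2.4+ and schedule_interval in older releases.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    """Pull new records from the source system into a staging area."""


def load_orders():
    """Load staged files into Snowflake's raw layer."""


def transform_orders():
    """Run SQL transformations to build the analytical models."""


with DAG(
    dag_id="orders_pipeline",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)

    # Dependencies define the DAG's execution order: extract, then load, then transform.
    extract >> load >> transform
```

Because this is a DAG definition rather than standalone logic, it only runs inside an Airflow environment.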
Setting Up Your Environment
For a production Airflow deployment, we strongly recommend using a managed service: Astronomer (the most popular managed Airflow platform), Amazon MWAA (AWS-managed), or Google Cloud Composer (GCP-managed). Self-hosting Airflow is operationally demanding — it requires managing a web server, a scheduler, workers, a metadata database, and a message broker (Redis or RabbitMQ). Managed services handle all of this for you.
For Snowflake, create a dedicated warehouse for ETL workloads, separate from your analytical warehouses. This prevents pipeline processing from competing with analyst queries for compute resources. Start with an X-Small warehouse — Snowflake's auto-scaling and auto-suspend features mean you only pay for compute when the pipeline is actually running.
Set up a three-schema architecture in Snowflake: RAW for unprocessed source data, STAGING for intermediate transformations, and ANALYTICS for the final models consumed by BI tools and analysts. This separation makes it clear what stage of processing data is in and allows you to apply different access controls to each layer.
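The warehouse and three-schema setup can be sketched as a short list of DDL statements. The database and warehouse names below are illustrative; in practice you would run these once via the Snowflake console or a connector.

```python
def setup_statements(database: str = "ANALYTICS_DB") -> list[str]:
    """Return DDL for a dedicated ETL warehouse and the RAW/STAGING/ANALYTICS layout."""
    return [
        # X-Small warehouse that suspends after 60 idle seconds, so you only
        # pay for compute while the pipeline is actually running.
        "CREATE WAREHOUSE IF NOT EXISTS ETL_WH "
        "WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE",
        f"CREATE SCHEMA IF NOT EXISTS {database}.RAW",
        f"CREATE SCHEMA IF NOT EXISTS {database}.STAGING",
        f"CREATE SCHEMA IF NOT EXISTS {database}.ANALYTICS",
    ]
```

Keeping the DDL in code makes the layout reviewable and repeatable across environments.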
Building the DAG
An Airflow DAG defines your pipeline's structure — what tasks run, in what order, and what happens when they fail. Here's how to think about structuring your first DAG:
Task granularity: Each task should do one thing. Don't create a single task that extracts, loads, and transforms all data. Create separate tasks for each source system's extraction, each table's loading, and each transformation step. This granularity means that when a task fails, only that task needs to be retried — not the entire pipeline. It also makes the pipeline easier to debug and monitor.
Dependencies: Use Airflow's dependency operators to define the execution order. Transformation tasks should depend on loading tasks. Loading tasks should depend on extraction tasks. Cross-source transformations (joins between data from different sources) should depend on all relevant source loads completing successfully.
Idempotency: Every task should be idempotent — running it twice with the same input should produce the same result. This is critical for reliability because Airflow will retry failed tasks, and you need to know that a retry won't create duplicate records. For load operations, use Snowflake's MERGE statement or a truncate-and-reload pattern. For file operations, use unique file naming that prevents reprocessing.
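The MERGE pattern for idempotent loads can be sketched as a small SQL builder. Table and column names here are hypothetical; the point is that re-running the same MERGE updates existing rows rather than duplicating them.

```python
def build_merge_sql(target: str, source: str,
                    key_cols: list[str], update_cols: list[str]) -> str:
    """Build an idempotent Snowflake MERGE: update matched rows, insert new ones."""
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"{c} = s.{c}" for c in update_cols)
    cols = key_cols + update_cols
    return (
        f"MERGE INTO {target} AS t USING {source} AS s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({', '.join(cols)}) "
        f"VALUES ({', '.join('s.' + c for c in cols)})"
    )
```

Generating the statement from the key and update columns keeps one tested pattern in use across every table's load task.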
Error handling: Configure task retries (typically 2-3 retries with exponential backoff), set up alerting for failed tasks (Slack, email, or PagerDuty), and implement circuit breakers for external APIs that may be temporarily unavailable. A good pipeline fails gracefully — it retries transient errors, alerts on persistent failures, and never corrupts the target data.
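Airflow's retries and retry_exponential_backoff settings cover task-level retries; inside a task, a small helper like this hypothetical one can wrap flaky calls to external APIs:

```python
import time


def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts; re-raises the
    last exception once max_retries is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Injecting the sleep function keeps the helper unit-testable without real delays.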
Extraction Patterns
Different source systems require different extraction approaches:
Database sources (PostgreSQL, MySQL, MongoDB): Use incremental extraction based on a timestamp column (e.g., updated_at). On each run, extract only records modified since the last successful run. This is dramatically faster than full table extraction and reduces load on the source database. Use the matching Airflow provider hook (PostgresHook, MySqlHook, or MongoHook) to connect and extract data. Store the high-water mark (the timestamp of the last extracted record) in Airflow Variables or an extraction metadata table in Snowflake.
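The high-water-mark logic can be sketched as follows. This is a minimal illustration; in a real extraction you would push the filter into the source query's WHERE clause (updated_at > :hwm) rather than filtering rows in Python.

```python
from datetime import datetime


def filter_since(rows: list[dict], high_water_mark: datetime):
    """Keep only rows modified after the stored high-water mark.

    Returns the fresh rows plus the new mark to persist after a
    successful run (unchanged if nothing new arrived).
    """
    fresh = [r for r in rows if r["updated_at"] > high_water_mark]
    new_mark = max((r["updated_at"] for r in fresh), default=high_water_mark)
    return fresh, new_mark
```

Only advance the stored mark after the load succeeds, so a failed run is safely re-extracted.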
API sources (Salesforce, HubSpot, Stripe): Use the API's pagination and filtering capabilities to extract data incrementally. Respect rate limits — add backoff logic to your extraction code. Consider using managed ETL tools like Fivetran or Airbyte for complex APIs, and use Airflow to orchestrate and monitor these tools rather than writing raw API integration code.
File sources (SFTP, S3, email attachments): Use Airflow's file sensors to detect new files, then load them into Snowflake. Implement file tracking to prevent reprocessing — record which files have been processed in a metadata table and skip them on subsequent runs.
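The file-tracking idea reduces to two small operations, sketched here with an in-memory set standing in for the metadata table:

```python
def unprocessed_files(arrived: list[str], processed: set[str]) -> list[str]:
    """Return files that have not been loaded yet."""
    return [f for f in arrived if f not in processed]


def mark_processed(files: list[str], processed: set[str]) -> None:
    """Record files as loaded (in practice, insert rows into a Snowflake
    metadata table) so retries and reruns skip them."""
    processed.update(files)
```

Because the check runs before every load, retrying a failed DAG run never reprocesses files that already landed.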
Loading into Snowflake
Snowflake's COPY INTO command is the most efficient way to load data. The pattern is: extract data to cloud storage (S3/GCS/Azure Blob), then use COPY INTO to load from storage into Snowflake tables. This approach leverages Snowflake's massively parallel loading architecture, which can ingest files at rates of hundreds of MB per second.
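A COPY INTO statement for this pattern can be assembled like so. The stage and table names are illustrative; stage_path is a Snowflake stage reference such as @raw_stage/orders/.

```python
def build_copy_into(table: str, stage_path: str,
                    file_format: str = "(TYPE = 'CSV' SKIP_HEADER = 1)",
                    on_error: str = "ABORT_STATEMENT") -> str:
    """Build a Snowflake COPY INTO statement loading files from an external stage."""
    return (
        f"COPY INTO {table} FROM {stage_path} "
        f"FILE_FORMAT = {file_format} ON_ERROR = '{on_error}'"
    )
```

ABORT_STATEMENT (the default) stops the load on the first bad record; CONTINUE or SKIP_FILE are looser alternatives when sources are known to be messy.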
For smaller datasets or real-time needs, you can also use Snowflake's Snowpipe for continuous, auto-triggered loading from cloud storage. Snowpipe monitors an S3 bucket (or equivalent) and automatically loads new files within minutes of their arrival.
Always load into raw tables that mirror the source schema exactly — don't transform during loading. This preserves the source data in its original form, which is invaluable for debugging transformation issues, replaying failed loads, and auditing data lineage.
Transformation with dbt
While you can write transformation SQL directly in Airflow tasks, we strongly recommend using dbt (data build tool) for the transformation layer. dbt provides version-controlled SQL models, built-in testing, documentation generation, and a dependency graph that parallels Airflow's DAG structure but is purpose-built for SQL transformations.
The integration pattern is straightforward: Airflow triggers dbt runs as tasks within the DAG, after the loading tasks complete. The typical approaches are the DbtCloudRunJobOperator (for dbt Cloud) or a BashOperator running dbt run (for dbt Core). dbt handles the transformation logic; Airflow handles the scheduling and orchestration.
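For dbt Core, the BashOperator approach might look like this sketch. The project path is an assumption; it must point at wherever your dbt project and profiles live on the Airflow workers.

```python
from airflow.operators.bash import BashOperator

# Hypothetical project location on the Airflow worker.
DBT_DIR = "/usr/local/airflow/dbt"

dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
)

dbt_test = BashOperator(
    task_id="dbt_test",
    bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
)

# Transform first, then validate; both sit downstream of the loading tasks.
dbt_run >> dbt_test
```

These tasks belong inside the same DAG as the loading tasks so the dependency chain is enforced end to end.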
Structure your dbt project with three layers: staging models that clean and standardize raw data (one model per source table), intermediate models that implement business logic and joins, and marts models that create the final analytical tables consumed by BI tools. This layered architecture makes the transformation logic modular, testable, and easy to debug.
Monitoring and Alerting
A pipeline without monitoring is a pipeline waiting to fail silently. Implement monitoring at three levels:
Pipeline-level: Track whether the DAG ran successfully, how long it took, and whether any tasks were retried. Airflow's built-in UI provides this, and you can enhance it with Slack or email notifications on failure. Set up SLA alerts for DAGs that must complete by a specific time (e.g., "daily refresh must complete by 7 AM").
Data-level: Use dbt tests or Great Expectations to validate the transformed data after each run. Check for nulls in required columns, unexpected duplicates, row count anomalies (a sudden 50% drop in records suggests an extraction problem), and range violations. These tests should run as tasks in your Airflow DAG, after the transformation tasks.
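The row-count anomaly check mentioned above reduces to a simple threshold test; the 50% cutoff here is illustrative and should be tuned per table.

```python
def row_count_anomaly(current: int, previous: int, max_drop: float = 0.5) -> bool:
    """Flag a run whose row count dropped by more than max_drop vs. the prior run.

    A sudden large drop usually means an upstream extraction problem,
    not a genuine change in the data.
    """
    if previous == 0:
        return False  # no baseline to compare against
    return (previous - current) / previous > max_drop
```

Run the check as an Airflow task after the transformation tasks and fail (or alert) when it returns True.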
Infrastructure-level: Monitor Snowflake warehouse utilization, query performance, and credit consumption. Set up alerts for long-running queries, warehouse queue depth, and budget thresholds. Snowflake's Resource Monitors can automatically suspend warehouses that exceed credit budgets.
The best data pipeline is the one nobody has to think about — because it runs reliably, recovers from failures automatically, and alerts you before problems affect downstream users.
Common Mistakes to Avoid
- Not handling schema evolution. Source systems change their schemas — new columns appear, columns are renamed, data types change. Your pipeline should handle these changes gracefully, not crash. Use Snowflake's schema detection features and build alerting for schema drift.
- Full refreshes for everything. Loading entire tables on every run is simple but wasteful. For large tables, incremental loading reduces processing time from hours to minutes and significantly reduces Snowflake compute costs.
- Hardcoding credentials. Never put database passwords, API keys, or Snowflake credentials in your DAG code. Use Airflow Connections (stored encrypted in Airflow's metadata database) or a secrets manager (AWS Secrets Manager, HashiCorp Vault).
- Ignoring data freshness requirements. Not every table needs to be refreshed at the same frequency. Dimension tables (products, employees, regions) might need daily updates. Fact tables (transactions, events) might need hourly or real-time updates. Match your pipeline schedule to business requirements, not a one-size-fits-all cron expression.
Need Help With This?
Neural Vector Insights helps organizations turn these concepts into production reality. Let's talk about your project.