This project bundles a resource-aware analytics stack—Airflow, MinIO, a custom ETL worker (Polars + DuckDB + Pandas), and Beszel monitoring—into a single-node environment. The start_pipeline.sh script provisions everything with Docker Compose and gives you helper commands to run, observe, and optimize the ETL workloads.
- A machine/VM with ≥16 GB RAM and ≥200 GB free disk, as the ETL jobs cache data locally.
- Docker and Docker Compose (v2) installed and running.
- `bash`, `curl`, and `chmod` available (default on macOS/Linux).
- `start_pipeline.sh` – entrypoint script that sets up and controls the stack.
- `docker-compose.airflow.yml` – official Airflow Docker stack (downloaded automatically if missing).
- `docker-compose.override.yml` – adds MinIO, Beszel, and the custom ETL worker.
- `scripts/etl_pipeline.py` – the core ETL job (extract from MinIO, transform with Polars/DuckDB, load back to MinIO).
- `scripts/generate_sample_data.py` – utility to seed MinIO with CSV data.
- `dags/` – Airflow DAGs, including `etl_pipeline` to orchestrate the Python ETL script.
Make the entrypoint executable first:

```bash
chmod +x start_pipeline.sh
```

The `start` command calls the `start_pipeline` function inside the script. Each sub-step below includes a quick validation tip.
- **Prepare project folders**
  - Creates `dags`, `logs`, `plugins`, `data`, `scripts`, and `config` if they do not exist.
  - Writes `.env` with your `AIRFLOW_UID` so Airflow containers run as your user.
  - Ensures `scripts/monitor.py` exists and is executable.
  - Validate: `ls dags logs plugins data scripts config` should show populated directories; `.env` should contain `AIRFLOW_UID=...`.
- **Fetch the Airflow Compose bundle (idempotent)**
  - Downloads `docker-compose.airflow.yml` from the Apache Airflow docs site when the file is absent.
  - Validate: `ls docker-compose.airflow.yml` or view the first lines with `head`.
- **Pull container images**
  - Runs `docker-compose -f docker-compose.airflow.yml -f docker-compose.override.yml --profile flower --profile extras pull` to grab all Airflow/MinIO/Beszel/ETL images.
  - Validate: Docker reports the image digests; reruns skip already cached layers.
- **Initialize the Airflow metadata database**
  - Starts the transient `airflow-init` container to set up the PostgreSQL metadata DB and create the admin user.
  - Validate: `docker compose ... ps airflow-init` should finish with `Exit 0`; logs show `Airflow is ready`.
- **Start all services in the background**
  - Brings up Airflow (API server, scheduler, workers, triggerer, Flower), Postgres, Redis, MinIO (plus bucket bootstrapper), Beszel, and the `etl-worker` container.
  - Validate: `./start_pipeline.sh status` (or the full `docker-compose ... ps`) lists every container as `Up`.
- **Check core service health**
  - Polls the HTTP health endpoints for Airflow (http://localhost:8080), MinIO (http://localhost:9000/minio/health/live), and Beszel (http://localhost:8090).
  - Validate: run the same `curl` commands manually or open the URLs in a browser.
- **Display quick reference output**
  - Prints service URLs, default credentials, and helpful ETL commands once everything is ready.
  - Validate: look for `Pipeline ready ✅` in your terminal output.
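The health polling in the steps above can also be reproduced outside the script. Below is a stdlib-only Python sketch: the endpoint URLs come from this README, but the retry count, delay, and function names are assumptions rather than part of the bundled tooling.

```python
import time
import urllib.error
import urllib.request

# Endpoints the start script polls (ports taken from this README).
ENDPOINTS = {
    "airflow": "http://localhost:8080",
    "minio": "http://localhost:9000/minio/health/live",
    "beszel": "http://localhost:8090",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        return exc.code < 500  # a 4xx still means the server is alive
    except (urllib.error.URLError, OSError):
        return False


def wait_for_services(endpoints, retries: int = 10, delay: float = 5.0, probe=is_up):
    """Poll each endpoint until it responds or retries are exhausted."""
    status = {}
    for name, url in endpoints.items():
        for _attempt in range(retries):
            if probe(url):
                status[name] = True
                break
            time.sleep(delay)
        else:
            status[name] = False
    return status
```

The `probe` parameter is injectable so the loop can be exercised without a running stack; in practice you would call `wait_for_services(ENDPOINTS)` after `./start_pipeline.sh start`.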
```bash
./start_pipeline.sh start         # provision + start everything
./start_pipeline.sh status        # show container state and URLs
./start_pipeline.sh logs <svc>    # tail logs (e.g., airflow-scheduler, etl-worker)
./start_pipeline.sh etl [chunk]   # run ETL manually with optional row chunk size
./start_pipeline.sh shell <svc>   # get an interactive shell inside a container
./start_pipeline.sh stop          # stop containers but keep volumes/data
./start_pipeline.sh cleanup       # stop and remove containers, networks, volumes
```

- Airflow Web UI: http://localhost:8080 (default credentials: `admin` / `admin`)
- Airflow Flower (Celery monitoring): http://localhost:5555
- MinIO Console: http://localhost:9001 (default credentials: `minioadmin` / `minioadmin123`)
- Beszel Monitor: http://localhost:8090
Tip: Once the Airflow UI is up, unpause the `etl_pipeline` DAG and trigger it manually to schedule ETL runs.
The ETL pipeline extracts CSV files from the `raw-data` MinIO bucket, enriches them with Polars/DuckDB, and writes Parquet + summary JSON back to `processed-data`.
```bash
./start_pipeline.sh shell etl-worker
python /app/scripts/generate_sample_data.py --records 5000
exit
```

This stores a `sales_data_5000.csv` file in the `raw-data` bucket.
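`generate_sample_data.py` ships with the repo; the sketch below only illustrates the kind of CSV such a seeder might emit. The column names, value ranges, and fixed seed are assumptions, not the bundled script's actual schema.

```python
import csv
import io
import random


def generate_sales_csv(records: int, seed: int = 42) -> str:
    """Return CSV text with `records` synthetic sales rows (hypothetical schema)."""
    rng = random.Random(seed)  # fixed seed -> reproducible sample data
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["order_id", "region", "quantity", "unit_price"])
    for i in range(records):
        writer.writerow([
            i + 1,
            rng.choice(["north", "south", "east", "west"]),
            rng.randint(1, 20),
            round(rng.uniform(1.0, 500.0), 2),
        ])
    return buf.getvalue()
```

A real run would then upload the file to the `raw-data` bucket, which the bundled script handles via its MinIO client.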
```bash
./start_pipeline.sh etl           # default chunk size from ETL_CHUNK_SIZE
./start_pipeline.sh etl 25000     # override chunk size for large files
```

Success emits `ETL Pipeline completed successfully` in the logs and pushes processed Parquet + JSON summary files to `processed-data`.
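The chunk size bounds how many rows are resident in memory at once. The real worker transforms data with Polars/DuckDB; this stdlib-only sketch just shows the chunking pattern, and the function names and revenue columns are illustrative rather than taken from `etl_pipeline.py`.

```python
import csv


def read_chunks(fh, chunk_size):
    """Yield rows from an open CSV file in lists of at most `chunk_size`."""
    reader = csv.DictReader(fh)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk


def total_revenue(fh, chunk_size=25_000):
    """Aggregate per chunk so peak memory stays proportional to chunk_size."""
    total = 0.0
    for chunk in read_chunks(fh, chunk_size):
        total += sum(int(r["quantity"]) * float(r["unit_price"]) for r in chunk)
    return round(total, 2)
```

Lowering `chunk_size` trades throughput for a smaller memory footprint, which is the same trade-off the `etl [chunk]` command exposes.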
- Open the Airflow UI → DAGs.
- Locate `etl_pipeline` and flip its toggle to On.
- Trigger a run (⚡ icon) to execute the DAG: `run_etl_pipeline` → `cleanup_temp_data` → `system_health_check`.
- Inspect task logs directly in Airflow.
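For orientation, here is a hypothetical sketch of what a DAG file wiring those three tasks together might look like. The bundled `dags/` code is authoritative; the operator choice, schedule, commands, and cleanup pattern below are all assumptions.

```python
# Sketch only -- not the repo's actual DAG definition.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # assumption: tune to your workload
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_etl = BashOperator(
        task_id="run_etl_pipeline",
        bash_command="python /app/scripts/etl_pipeline.py",
    )
    cleanup = BashOperator(
        task_id="cleanup_temp_data",
        bash_command="rm -rf /tmp/etl_*",  # assumed temp-file pattern
    )
    health = BashOperator(
        task_id="system_health_check",
        bash_command="python /app/scripts/monitor.py",
    )

    # Enforce the run order described above.
    run_etl >> cleanup >> health
```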
- Beszel captures CPU, RAM, and I/O metrics per container; watch it during heavy ETL runs.
- Resource guardrails:
  - RAM > 70% → reduce `--chunk-size` or break the DAG into more granular tasks.
  - CPU > 80% → parallelize transforms or stagger DAG schedules.
  - Disk I/O > 75% → optimize DuckDB SQL, limit temp outputs, or leverage Parquet pushdowns.
- Scale by uploading larger datasets to MinIO (`--large 500` produces ~500 MB), then rerun the ETL and iterate on chunk size or DAG timing to maximize throughput without exhausting the single node.
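The guardrail thresholds above can be encoded as a small advisor. The thresholds and advice come from this README; the function itself, its metric key names, and the metrics dict shape are assumptions (Beszel's own API is not used here).

```python
def guardrail_advice(metrics):
    """Map utilization percentages (0-100) to the tuning advice in this README."""
    advice = []
    if metrics.get("ram_pct", 0) > 70:
        advice.append("RAM > 70%: reduce --chunk-size or split the DAG into smaller tasks")
    if metrics.get("cpu_pct", 0) > 80:
        advice.append("CPU > 80%: parallelize transforms or stagger DAG schedules")
    if metrics.get("disk_io_pct", 0) > 75:
        advice.append("Disk I/O > 75%: optimize DuckDB SQL or limit temp outputs")
    return advice
```

Fed with readings scraped from Beszel during a heavy run, this gives a quick checklist of which knob to turn first.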
- Health check warnings: if the script warns a service is slow to start, re-run `./start_pipeline.sh status` and inspect logs (`./start_pipeline.sh logs airflow-scheduler`).
- MinIO authentication errors: ensure buckets exist (the `createbuckets` container runs on start) and credentials match the override file.
- ETL failures: use `./start_pipeline.sh logs etl-worker` for stack traces. Files remain in `/tmp` until the `cleanup_temp_data` task runs.
- Port conflicts: stop any services already using ports 8080, 9000/9001, 8090, or 5555 before starting the stack.
- Customize `scripts/etl_pipeline.py` with domain-specific transforms.
- Add more Airflow DAGs and stagger schedules to balance resource usage.
- Plug in alerting (email, Slack) based on Beszel or Airflow events.