End-to-end customer analytics pipeline that ingests Snowflake data into Dataiku DSS, computes RFM scores, CLV estimates, and churn risk, writes results back to Snowflake, and is mirrored on Databricks with validated parity.
    Snowflake (DEV.DATAIKU_DEMO)
      ├── CUSTOMERS (1,000 rows)
      └── TRANSACTIONS (8,000 rows)
            │
            ▼  Dataiku DSS (DEMO project)
      ┌───────────────────────────────────────────┐
      │  [Shaker]  filter STATUS = 'completed'    │
      │      → transactions_completed             │
      │                                           │
      │  [Join]    LEFT JOIN on CUSTOMER_ID       │
      │      → customer_transactions_joined       │
      │                                           │
      │  [Python]  RFM + CLV + Churn analytics    │
      │      → CUSTOMER_ANALYTICS_OUTPUT          │
      └───────────────────────────────────────────┘
            │
            ▼
    Snowflake  DEV.DATAIKU_DEMO.CUSTOMER_ANALYTICS_OUTPUT
    Databricks dev.dataiku_demo.customer_analytics_output  ← migrated, parity verified
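
The three recipe steps in the flow (filter, left join, per-customer metrics) can be sketched in plain Python as below. This is a minimal illustration, not the project's actual recipe code: the `AMOUNT` and `TXN_DATE` column names, the `rfm_metrics` helper, and the `as_of` reference date are assumptions made for the example, and it computes only raw recency, frequency, and monetary values, leaving scoring, CLV, and churn risk aside.

```python
from collections import defaultdict
from datetime import date


def rfm_metrics(customers, transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    Hypothetical helper: AMOUNT and TXN_DATE are assumed column names.
    """
    # Step 1 (Shaker): keep only completed transactions.
    completed = [t for t in transactions if t["STATUS"] == "completed"]

    # Index completed transactions by customer for the join.
    by_customer = defaultdict(list)
    for t in completed:
        by_customer[t["CUSTOMER_ID"]].append(t)

    # Steps 2 and 3 (Join, Python): LEFT JOIN semantics keep every customer,
    # including those with no completed transactions.
    result = {}
    for c in customers:
        txns = by_customer.get(c["CUSTOMER_ID"], [])
        if txns:
            last_purchase = max(t["TXN_DATE"] for t in txns)
            recency = (as_of - last_purchase).days
        else:
            recency = None  # no completed purchase on record
        frequency = len(txns)
        monetary = sum(t["AMOUNT"] for t in txns)
        result[c["CUSTOMER_ID"]] = (recency, frequency, monetary)
    return result


customers = [{"CUSTOMER_ID": 1}, {"CUSTOMER_ID": 2}]
transactions = [
    {"CUSTOMER_ID": 1, "STATUS": "completed", "AMOUNT": 40.0, "TXN_DATE": date(2024, 1, 10)},
    {"CUSTOMER_ID": 1, "STATUS": "completed", "AMOUNT": 60.0, "TXN_DATE": date(2024, 1, 20)},
    {"CUSTOMER_ID": 1, "STATUS": "refunded", "AMOUNT": 99.0, "TXN_DATE": date(2024, 1, 25)},
]
metrics = rfm_metrics(customers, transactions, as_of=date(2024, 1, 31))
```

Customer 2 has no transactions but still appears in `metrics`, which is the reason for a LEFT JOIN rather than an inner join: the output keeps one row per customer.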
Parity was validated using Datafold, a data reliability platform that runs cross-database diffs at scale using bisection hashing.
- Datadiff run: https://app.datafold.com/datadiffs/13857162
- Algorithm: bisection hash on CUSTOMER_ID
- Result: 0 differences across all 1,000 rows
The validate_parity.py script uses the same open-source data-diff library that powers Datafold Cloud.
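
The idea behind bisection hashing can be sketched in memory as follows. This is a simplified illustration under stated assumptions, not the data-diff library's implementation: tables are modelled as dicts keyed by `CUSTOMER_ID`, the `segment_hash` and `bisect_diff` helpers and the `threshold` parameter are hypothetical, and a real tool would compute the segment hashes inside each database rather than in Python.

```python
import hashlib


def segment_hash(table, keys):
    """Hash all rows of `table` for the given keys into a single digest."""
    h = hashlib.sha256()
    for k in keys:
        # A key missing from the table hashes as (k, None).
        h.update(repr((k, table.get(k))).encode("utf-8"))
    return h.hexdigest()


def bisect_diff(a, b, keys=None, threshold=4):
    """Return the sorted keys whose rows differ between tables a and b."""
    if keys is None:
        keys = sorted(set(a) | set(b))
    # Matching digests: the whole segment is treated as identical.
    if not keys or segment_hash(a, keys) == segment_hash(b, keys):
        return []
    # Small segment: compare row by row instead of splitting further.
    if len(keys) <= threshold:
        return [k for k in keys if a.get(k) != b.get(k)]
    # Otherwise split the key range in half and recurse into each half.
    mid = len(keys) // 2
    return (bisect_diff(a, b, keys[:mid], threshold)
            + bisect_diff(a, b, keys[mid:], threshold))


# Two identical 1,000-row tables keyed by CUSTOMER_ID.
source = {i: ("customer", i * 10) for i in range(1, 1001)}
target = dict(source)
identical = bisect_diff(source, target)

# Introduce one changed row and one missing row in the target.
target[42] = ("customer", -1)
del target[900]
changed = bisect_diff(source, target)
```

When the tables match, a single hash comparison over the full key range settles the result. When they differ, only the segments whose hashes disagree are split further, so the number of rows compared directly stays small relative to the table size.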
