| 1 | +--- |
| 2 | +name: optimize-storage-costs |
| 3 | +description: Optimize BigQuery storage costs by identifying and removing dead-end and unused tables. USE FOR analyzing storage waste, reviewing tables with no consumption, cleaning up unused datasets, or implementing storage cost reduction strategies. |
| 4 | +--- |
| 5 | + |
| 6 | +# Optimize Storage Costs (Dead-end and Unused Tables) |
| 7 | + |
| 8 | +## Purpose |
| 9 | + |
| 10 | +Identify and remove BigQuery tables that contribute to storage costs but have no active consumption, based on Masthead Data lineage analysis. |
| 11 | + |
| 12 | +## Table Categories |
| 13 | + |
| 14 | +| Type | Definition | Indicators | |
| 15 | +|------|------------|------------| |
| 16 | +| **Dead-end** | Regularly updated, no downstream consumption | Updated but never read in 30+ days | |
| 17 | +| **Unused** | No upstream or downstream activity | No reads/writes in 30+ days | |
| 18 | + |
| 19 | +## When to Use |
| 20 | + |
| 21 | +- Reducing storage costs when budget is constrained |
| 22 | +- Cleaning up abandoned tables and pipelines |
| 23 | +- Implementing regular storage hygiene |
| 24 | +- Investigating sudden storage cost increases |
| 25 | + |
| 26 | +## Prerequisites |
| 27 | + |
| 28 | +- Masthead Data agent v0.2.7+ installed (for accurate lineage) |
| 29 | +- Access to Masthead insights dataset: `masthead-prod.{DATASET_NAME}.insights` |
| 30 | +- BigQuery permissions to query insights and drop tables |
| 31 | + |
| 32 | +## Implementation Steps |
| 33 | + |
| 34 | +### Step 1: Query Storage Waste |
| 35 | + |
| 36 | +```bash |
| 37 | +bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \ |
| 38 | +"SELECT |
| 39 | + subtype, |
| 40 | + project_id, |
| 41 | + target_resource, |
| 42 | + SAFE.STRING(operations[0].resource_type) AS resource_type, |
| 43 | + SAFE.INT64(overview.num_bytes) / POW(1024, 4) AS total_tib, |
| 44 | + SAFE.FLOAT64(overview.cost_30d) AS cost_usd_30d, |
| 45 | + SAFE.FLOAT64(overview.savings_30d) AS savings_usd_30d |
| 46 | +FROM \`masthead-prod.{DATASET_NAME}.insights\` |
| 47 | +WHERE category = 'Cost' |
| 48 | + AND subtype IN ('Dead end table', 'Unused table') |
| 49 | + AND overview.num_bytes IS NOT NULL |
| 50 | + AND SAFE.FLOAT64(overview.savings_30d) > 10 |
| 51 | +ORDER BY total_tib DESC" > storage_waste.csv |
| 52 | +``` |
| 53 | + |
| 54 | +**Alternative: Use Masthead UI** |
| 55 | +- Navigate to [Dictionary page](https://app.mastheadata.com/dictionary?tab=Tables&deadEnd=true) |
| 56 | +- Filter by `Dead-end` or `Unused` labels |
| 57 | +- Export table list for review |
| 58 | + |
| 59 | +### Step 2: Review and Decide |
| 60 | + |
| 61 | +Review `storage_waste.csv` and append a `status` column as the last column (the script in Step 3 reads the last field), using one of: |
| 62 | +- `keep` - Table is needed |
| 63 | +- `to drop` - Safe to remove |
| 64 | +- `investigate` - Needs further analysis |
| 65 | + |
| 66 | +**Review criteria:** |
| 67 | +- Is this a backup or archive table? (consider alternative storage) |
| 68 | +- Is there a downstream dependency not captured in lineage? |
| 69 | +- Is this table part of an active experiment or migration? |
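
Before generating any commands, it can help to confirm that every reviewed row carries one of the three expected status values. The following is a minimal sketch, not part of the original workflow. It assumes the reviewed file has a header row, keeps `status` as the last column, and contains no quoted commas inside fields. `check_status` is a hypothetical helper name.

```shell
# check_status FILE: report rows whose last column is not a known status.
# Assumes a header row, comma-separated fields, and status as the last column.
check_status() {
  awk -F',' '
    NR == 1 { next }                       # skip the header row
    NF == 0 { next }                       # skip blank lines
    { sub(/\r$/, "", $NF) }                # tolerate Windows line endings
    $NF != "keep" && $NF != "to drop" && $NF != "investigate" {
      printf "line %d: unexpected status \"%s\"\n", NR, $NF
      bad = 1
    }
    END { exit bad + 0 }                   # non-zero exit if any row failed
  ' "$1"
}
```

Running `check_status storage_waste.csv` prints one line per offending row and returns a non-zero exit code if any are found, so it can gate the drop step in a script.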
| 70 | + |
| 71 | +### Step 3: Drop Approved Tables |
| 72 | + |
| 73 | +```bash |
| 74 | +# Generate bq rm commands for rows whose last column (status) is "to drop" |
| 75 | +awk -F',' '$NF=="to drop" { |
| 76 | + print "bq rm -f -t " $3 # column 3 = target_resource |
| 77 | +}' storage_waste.csv > drop_tables.sh |
| 78 | + |
| 79 | +# Review generated commands |
| 80 | +cat drop_tables.sh |
| 81 | + |
| 82 | +# Execute (after review!) |
| 83 | +bash drop_tables.sh |
| 84 | +``` |
| 85 | + |
| 86 | +**Safe mode (preview first):** |
| 87 | +```bash |
| 88 | +# bq rm has no dry-run flag, so preview each target with bq show instead |
| 89 | +sed 's/^bq rm -f -t /bq show /' drop_tables.sh > drop_tables_preview.sh |
| 90 | +bash drop_tables_preview.sh |
| 91 | +``` |
| 92 | + |
| 93 | +### Step 4: Verify Savings |
| 94 | + |
| 95 | +After 24-48 hours, check storage reduction in Masthead: |
| 96 | +- [Storage Cost Insights](https://app.mastheadata.com/costs?tab=Storage+costs) |
| 97 | +- Compare before/after storage size and costs |
| 98 | + |
| 99 | +## Alternative: Notebook-based Workflow |
| 100 | + |
| 101 | +For interactive review with Google Sheets integration: |
| 102 | + |
| 103 | +1. Use notebook at: `github.com/masthead-data/templates/blob/main/notebooks/save_on_unused_storage.ipynb` |
| 104 | +2. Export results to Google Sheets for team review |
| 105 | +3. Pull back reviewed data and execute drops |
| 106 | + |
| 107 | +## Decision Framework |
| 108 | + |
| 109 | +| Monthly Savings | Action | |
| 110 | +|-----------------|--------| |
| 111 | +| < $10 | Consider keeping (low ROI) | |
| 112 | +| $10-$100 | Review and drop if unused | |
| 113 | +| $100-$1000 | Priority review, likely drop | |
| 114 | +| > $1000 | Immediate investigation required | |
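
The thresholds in the table above can be applied mechanically to the `savings_usd_30d` column. This is a small sketch under stated assumptions: `savings_action` is a hypothetical helper, and the handling of exact boundary values (for example, exactly $100 falling into the higher tier) is an assumption, because the table's ranges overlap at their edges.

```shell
# savings_action USD: print the suggested action for a monthly savings figure.
# Boundary handling is an assumption: 10 and 100 go to the higher tier,
# 1000 stays in the "priority review" tier because the top tier is "> $1000".
savings_action() {
  awk -v s="$1" 'BEGIN {
    if      (s < 10)    print "Consider keeping (low ROI)"
    else if (s < 100)   print "Review and drop if unused"
    else if (s <= 1000) print "Priority review, likely drop"
    else                print "Immediate investigation required"
  }'
}
```

For example, `savings_action 250.5` prints `Priority review, likely drop`.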
| 115 | + |
| 116 | +## Key Notes |
| 117 | + |
| 118 | +- **Dead-end tables** may indicate pipeline issues - investigate before dropping |
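| 119 | +- Dropped tables can be restored within the time travel window (7 days by default, configurable from 2 to 7 days); fail-safe retains data for a further 7 days, but recovery from it requires contacting Google Cloud support |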
| 120 | +- Consider archiving to Cloud Storage if compliance requires retention |
| 121 | +- Coordinate with data teams before dropping shared datasets |
| 122 | +- Wait 14 days after storage billing model changes before dropping tables |
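
If a table is dropped by mistake, one documented recovery path within the time travel window is to copy it from a point in time using the `table@<epoch_ms>` decorator with `bq cp`. The sketch below only prints the command so it can be reviewed first. `restore_cmd` is a hypothetical helper, and the dataset and table names are placeholders.

```shell
# restore_cmd TABLE EPOCH_MS RESTORED_TABLE: print (do not run) a bq cp command
# that copies TABLE as it existed at EPOCH_MS into RESTORED_TABLE.
restore_cmd() {
  printf 'bq cp %s@%s %s\n' "$1" "$2" "$3"
}

# Example: a point in time one hour ago, in milliseconds since the epoch.
# The table names below are placeholders, not real resources.
one_hour_ago_ms=$(( ($(date +%s) - 3600) * 1000 ))
restore_cmd "mydataset.mytable" "$one_hour_ago_ms" "mydataset.mytable_restored"
```

The chosen timestamp must fall before the drop and inside the dataset's time travel window for the copy to succeed.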
| 123 | + |
| 124 | +## Related Optimizations |
| 125 | + |
| 126 | +- **Storage billing model**: Switch between Logical/Physical pricing (see docs) |
| 127 | +- **Table expiration**: Set automatic expiration for temporary tables |
| 128 | +- **Partitioning**: Use partitioned tables with expiration policies |
| 129 | + |
| 130 | +## Documentation |
| 131 | + |
| 132 | +- [Masthead Storage Costs](https://docs.mastheadata.com/cost-insights/storage-costs) |
| 133 | +- [BigQuery Storage Pricing](https://cloud.google.com/bigquery/pricing#storage) |