Commit dc909f2

Enhance SKILL documentation for optimizing BigQuery storage costs
1 parent 96f3d66 commit dc909f2

1 file changed

+24
-13
lines changed

.agents/skills/optimize-storage-costs/SKILL.md

Lines changed: 24 additions & 13 deletions
@@ -11,10 +11,15 @@ Identify and remove BigQuery tables that contribute to storage costs but have no
1111

1212
## Table Categories
1313

14-
| Type | Definition | Indicators |
15-
|------|------------|------------|
16-
| **Dead-end** | Regularly updated, no downstream consumption | Updated but never read in 30+ days |
17-
| **Unused** | No upstream or downstream activity | No reads/writes in 30+ days |
14+
Masthead Data identifies these tables through lineage analysis, so it only sees readers and writers that appear in visible pipeline references. Check modification timestamps to catch activity outside that graph:
15+
16+
| Type | Definition | Indicators | Watch for |
17+
|------|------------|------------|---|
18+
| **Dead-end** | Regularly updated, no downstream consumption | Updated but never read in 30+ days | External writers outside lineage graph (manual jobs, independent pipelines) |
19+
| **Unused** | No upstream or downstream activity | No reads/writes in 30+ days | Recent `lastModifiedTime` despite "Unused" flag suggests external writer—**do not drop without verification** |
20+
21+
### Key Signal
22+
If a table is flagged `Unused` **and** has a recent modification timestamp, something outside Masthead's visibility is writing to it. This always warrants investigation before dropping.
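
The recency check behind this signal can be sketched as a small shell helper. This is a minimal sketch, not part of Masthead: the function name `modified_recently` is hypothetical, and it assumes the timestamp is the `lastModifiedTime` field of the BigQuery table resource, which is expressed in milliseconds since the epoch.

```bash
# Hypothetical helper: succeeds (exit 0) if a table was modified within the window.
# $1 = lastModifiedTime in epoch milliseconds
# $2 = window in days (defaults to 30)
# $3 = "now" in epoch seconds (defaults to the current time; useful for reproducible checks)
modified_recently() {
  local last_modified_ms="$1"
  local window_days="${2:-30}"
  local now_s="${3:-$(date +%s)}"
  local age_s=$(( now_s - last_modified_ms / 1000 ))
  [ "$age_s" -le $(( window_days * 86400 )) ]
}

# Usage sketch (assumes the bq CLI and jq are installed; DATASET.TABLE is a placeholder):
# ms=$(bq show --format=json "YOUR_PROJECT:DATASET.TABLE" | jq -r '.lastModifiedTime')
# if modified_recently "$ms"; then echo "recent write: verify the owner before dropping"; fi
```

A table flagged `Unused` for which this check succeeds is a candidate for the investigation described above.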
1823

1924
## When to Use
2025

@@ -26,7 +31,7 @@ Identify and remove BigQuery tables that contribute to storage costs but have no
2631
## Prerequisites
2732

2833
- Masthead Data agent v0.2.7+ installed (for accurate lineage)
29-
- Access to Masthead insights dataset: `masthead-prod.{DATASET_NAME}.insights`
34+
- Access to Masthead insights dataset: `masthead-prod.httparchive.insights`
3035
- BigQuery permissions to query insights and drop tables
3136

3237
## Implementation Steps
@@ -43,14 +48,17 @@ bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
4348
SAFE.INT64(overview.num_bytes) / POW(1024, 4) AS total_tib,
4449
SAFE.FLOAT64(overview.cost_30d) AS cost_usd_30d,
4550
SAFE.FLOAT64(overview.savings_30d) AS savings_usd_30d
46-
FROM \`masthead-prod.{DATASET_NAME}.insights\`
51+
FROM \`masthead-prod.httparchive.insights\`
4752
WHERE category = 'Cost'
4853
AND subtype IN ('Dead end table', 'Unused table')
4954
AND overview.num_bytes IS NOT NULL
5055
AND SAFE.FLOAT64(overview.savings_30d) > 10
51-
ORDER BY total_tib DESC" > storage_waste.csv
56+
AND target_resource NOT LIKE '%analytics_%' -- Filter out low-impact GA intraday tables
57+
ORDER BY savings_usd_30d DESC" > storage_waste.csv
5258
```
5359

60+
**Note:** Sorting by `savings_usd_30d` instead of `total_tib` prioritizes high-impact targets for review.
61+
5462
**Alternative: Use Masthead UI**
5563
- Navigate to [Dictionary page](https://app.mastheadata.com/dictionary?tab=Tables&deadEnd=true)
5664
- Filter by `Dead-end` or `Unused` labels
@@ -67,6 +75,8 @@ Review `storage_waste.csv` and add a `status` column with values:
6775
- Is this a backup or archive table? (consider alternative storage)
6876
- Is there a downstream dependency not captured in lineage?
6977
- Is this table part of an active experiment or migration?
78+
- **For repo-managed projects:** Search the codebase (e.g., `grep` for table name in model definitions, scripts) to confirm ownership. Table naming can be misleading (e.g., `cwv_tech_*` may seem like current outputs but could be legacy).
79+
- **Check for disabled producers:** If a Dataform `publish()` has `disabled: true` but the underlying BigQuery table still exists and has recent modifications, either the table is abandoned or an external process took over—both warrant investigation.
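
The codebase search suggested in the questions above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `find_table_references` is hypothetical, and it treats the table name as a literal string, so a wildcard family such as `cwv_tech_*` would be searched by its common prefix.

```bash
# Hypothetical helper: list files under a repo directory that mention a table name.
# -r recurses, -l prints only the names of matching files, -F matches the name literally.
# Exits non-zero when nothing references the table.
find_table_references() {
  local table_name="$1"
  local repo_dir="${2:-.}"
  grep -rlF -- "$table_name" "$repo_dir"
}

# Usage sketch: search the current checkout for a table family by prefix.
# find_table_references "cwv_tech_" . || echo "no references found: check for external writers"
```

An empty result does not prove the table is abandoned; it only shows that this repository is not the producer.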
7080

7181
### Step 3: Drop Approved Tables
7282

@@ -106,16 +116,17 @@ For interactive review with Google Sheets integration:
106116

107117
## Decision Framework
108118

109-
| Monthly Savings | Action |
110-
|-----------------|--------|
111-
| < $10 | Consider keeping (low ROI) |
112-
| $10-$100 | Review and drop if unused |
113-
| $100-$1000 | Priority review, likely drop |
114-
| > $1000 | Immediate investigation required |
119+
| Monthly Savings | Action | Recency Check |
120+
|-----------------|--------|---------------|
121+
| < $10 | Consider keeping (low ROI) | Skip if `lastModifiedTime` > 12 months old (hygiene only) |
122+
| $10-$100 | Review and drop if unused | Check modification date; recent writes require owner verification |
123+
| $100-$1000 | Priority review, likely drop | Mandatory verification if modified in last 30 days |
124+
| > $1000 | Immediate investigation required | Always verify external writer before any action |
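
The savings tiers in this table can be applied mechanically to the `savings_usd_30d` column. The helper below is a minimal sketch: `savings_action` is a hypothetical name, and because the table's ranges share their endpoints, assigning exactly $100 and $1000 to the lower tier is an assumption.

```bash
# Hypothetical helper: map a monthly savings figure (USD) to the action from the table.
# awk does the comparison because shell arithmetic is integer-only.
savings_action() {
  awk -v s="$1" 'BEGIN {
    if (s < 10)         print "Consider keeping (low ROI)"
    else if (s <= 100)  print "Review and drop if unused"
    else if (s <= 1000) print "Priority review, likely drop"
    else                print "Immediate investigation required"
  }'
}

# Usage sketch:
# savings_action 42.5
```

The recency check still applies on top of the tier: a recent write requires owner verification regardless of the action printed here.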
115125

116126
## Key Notes
117127

118128
- **Dead-end tables** may indicate pipeline issues - investigate before dropping
129+
- **Unused tables with recent modifications** are the highest-priority cases to investigate. The gap between Masthead's "no lineage" and actual writes means an external dependency exists.
119130
- Tables can be restored from time travel (7 days) or fail-safe (7 days after time travel)
120131
- Consider archiving to Cloud Storage if compliance requires retention
121132
- Coordinate with data teams before dropping shared datasets
