Commit 96f3d66: Add SKILL documentation for optimizing BigQuery storage and compute costs (parent 6c020fe)
---
name: optimize-model-compute
description: Optimize BigQuery compute costs by assigning Dataform actions to slot reservations. USE FOR managing which models use reserved slots vs on-demand pricing, updating reservation assignments, or analyzing cost vs priority tradeoffs for data pipelines.
---

# Optimize Model Compute (BigQuery Reservations)

## Purpose

Automatically assign Dataform actions to BigQuery slot reservations based on priority and cost optimization strategy. Routes high-priority workloads to reserved slots while using on-demand pricing for low-priority tasks.

## When to Use

- Assigning new models/actions to appropriate compute tiers (reserved vs on-demand)
- Rebalancing reservation assignments based on priority changes
- Optimizing costs by moving low-priority workloads to on-demand
- Ensuring critical pipelines get guaranteed compute resources

## Configuration File

Reservations are configured in `definitions/_reservations.js`:

```javascript
const { autoAssignActions } = require('@masthead-data/dataform-package')

const RESERVATION_CONFIG = [
  {
    tag: 'reservation', // Human-readable identifier
    reservation: 'projects/.../reservations/...', // BigQuery reservation path
    actions: [ // Models assigned to this tier
      'httparchive.crawl.pages',
      'httparchive.f1.pages_latest'
    ]
  },
  {
    tag: 'on_demand',
    reservation: 'none', // On-demand pricing
    actions: [
      'httparchive.sample_data.pages_10k'
    ]
  }
]

autoAssignActions(RESERVATION_CONFIG)
```

47+
## Implementation Steps
48+
49+
### Step 1: Source Configuration
50+
51+
**TODO**: _User will provide details on how to determine which models should use reserved vs on-demand compute_
52+
53+
### Step 2: Update Configuration
54+
55+
1. Open `definitions/_reservations.js`
56+
2. Add or move actions between reservation tiers:
57+
- **Reserved slots** (`reservation: 'projects/...'`): Critical, high-priority, SLA-sensitive workloads
58+
- **On-demand** (`reservation: 'none'`): Low-priority, ad-hoc, or experimental workloads
59+
60+
### Step 3: Verify Changes

```bash
# Check syntax
dataform compile

# Validate no duplicate assignments: list any quoted dotted action IDs
# that appear more than once across tiers (should print nothing)
grep -oE "'[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)+'" definitions/_reservations.js | sort | uniq -d
```

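The same uniqueness check can be scripted in Node; a minimal sketch assuming the `RESERVATION_CONFIG` shape shown above (the config literal and `findDuplicateActions` helper here are illustrative, not part of the package):

```javascript
// Illustrative config, same shape as definitions/_reservations.js
const RESERVATION_CONFIG = [
  {
    tag: 'reservation',
    reservation: 'projects/p/locations/l/reservations/r',
    actions: ['httparchive.crawl.pages', 'httparchive.f1.pages_latest']
  },
  {
    tag: 'on_demand',
    reservation: 'none',
    actions: ['httparchive.sample_data.pages_10k']
  }
]

// Return every action ID that is claimed by more than one tier
function findDuplicateActions(config) {
  const seen = new Map() // action -> tag of the tier that first claimed it
  const duplicates = []
  for (const tier of config) {
    for (const action of tier.actions) {
      if (seen.has(action)) duplicates.push(action)
      else seen.set(action, tier.tag)
    }
  }
  return duplicates
}

console.log(findDuplicateActions(RESERVATION_CONFIG)) // empty array when every action is unique
```

An empty result confirms the "each action in only ONE reservation config" rule from the Key Notes below.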
70+
## Decision Criteria
71+
72+
| Factor | Reserved Slots | On-Demand |
73+
|--------|----------------|-----------|
74+
| **Priority** | High, SLA-bound | Low, flexible |
75+
| **Frequency** | Regular, scheduled | Ad-hoc, occasional |
76+
| **Cost Pattern** | Predictable usage | Variable, sporadic |
77+
| **Impact** | Critical pipelines | Experimental, samples |
78+
79+
## Key Notes
80+
81+
- Each action should appear in only ONE reservation config
82+
- File starts with `_` to ensure it runs first in Dataform queue
83+
- Changes take effect on next Dataform workflow run
84+
- Package automatically handles global assignment (no per-file edits needed)
85+
86+
## Package Reference
87+
88+
Using `@masthead-data/dataform-package` (see [package.json](../../../package.json))
---
name: optimize-storage-costs
description: Optimize BigQuery storage costs by identifying and removing dead-end and unused tables. USE FOR analyzing storage waste, reviewing tables with no consumption, cleaning up unused datasets, or implementing storage cost reduction strategies.
---

# Optimize Storage Costs (Dead-end and Unused Tables)

## Purpose

Identify and remove BigQuery tables that contribute to storage costs but have no active consumption, based on Masthead Data lineage analysis.

## Table Categories

| Type | Definition | Indicators |
|------|------------|------------|
| **Dead-end** | Regularly updated, no downstream consumption | Updated but never read in 30+ days |
| **Unused** | No upstream or downstream activity | No reads/writes in 30+ days |

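The two categories reduce to a tiny classifier over 30-day activity; a sketch for clarity only (the `writtenLast30d`/`readLast30d` flags are hypothetical names, not Masthead fields):

```javascript
// Classify a table from its 30-day activity flags (hypothetical inputs)
function classifyTable({ writtenLast30d, readLast30d }) {
  if (writtenLast30d && !readLast30d) return 'dead-end' // updated but never read
  if (!writtenLast30d && !readLast30d) return 'unused'  // no activity at all
  return 'active'                                       // still consumed downstream
}

console.log(classifyTable({ writtenLast30d: true, readLast30d: false })) // prints "dead-end"
```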
19+
## When to Use
20+
21+
- Reducing storage costs when budget is constrained
22+
- Cleaning up abandoned tables and pipelines
23+
- Implementing regular storage hygiene
24+
- Investigating sudden storage cost increases
25+
26+
## Prerequisites
27+
28+
- Masthead Data agent v0.2.7+ installed (for accurate lineage)
29+
- Access to Masthead insights dataset: `masthead-prod.{DATASET_NAME}.insights`
30+
- BigQuery permissions to query insights and drop tables
31+
32+
## Implementation Steps

### Step 1: Query Storage Waste

```bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  subtype,
  project_id,
  target_resource,
  SAFE.STRING(operations[0].resource_type) AS resource_type,
  SAFE.INT64(overview.num_bytes) / POW(1024, 4) AS total_tib,
  SAFE.FLOAT64(overview.cost_30d) AS cost_usd_30d,
  SAFE.FLOAT64(overview.savings_30d) AS savings_usd_30d
FROM \`masthead-prod.{DATASET_NAME}.insights\`
WHERE category = 'Cost'
  AND subtype IN ('Dead end table', 'Unused table')
  AND overview.num_bytes IS NOT NULL
  AND SAFE.FLOAT64(overview.savings_30d) > 10
ORDER BY total_tib DESC" > storage_waste.csv
```

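Once exported, the rows can be aggregated before the per-table review; a minimal sketch with made-up values (field names follow the query above):

```javascript
// Illustrative rows matching the CSV columns from the query; values are made up
const rows = [
  { target_resource: 'proj:ds.dead_end_a', savings_usd_30d: 42.0 },
  { target_resource: 'proj:ds.unused_b', savings_usd_30d: 17.5 }
]

// Total potential 30-day savings across all flagged tables
const totalSavings = rows.reduce((sum, r) => sum + r.savings_usd_30d, 0)
console.log(totalSavings.toFixed(2)) // prints "59.50"
```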
54+
**Alternative: Use Masthead UI**
55+
- Navigate to [Dictionary page](https://app.mastheadata.com/dictionary?tab=Tables&deadEnd=true)
56+
- Filter by `Dead-end` or `Unused` labels
57+
- Export table list for review
58+
59+
### Step 2: Review and Decide

Review `storage_waste.csv` and add a `status` column with values:

- `keep` - Table is needed
- `to drop` - Safe to remove
- `investigate` - Needs further analysis

**Review criteria:**

- Is this a backup or archive table? (consider alternative storage)
- Is there a downstream dependency not captured in lineage?
- Is this table part of an active experiment or migration?

### Step 3: Drop Approved Tables

```bash
# Generate DROP statements; column 3 (target_resource) holds the table ID
awk -F',' '$NF=="to drop" {
  print "bq rm -f -t " $3
}' storage_waste.csv > drop_tables.sh

# Review generated commands
cat drop_tables.sh

# Execute (after review!)
bash drop_tables.sh
```

**Safe mode (preview first):**

`bq rm` has no dry-run flag, so preview by swapping each drop for a metadata lookup; any failure flags a table that no longer exists or is misidentified:

```bash
sed 's/bq rm -f -t/bq show/' drop_tables.sh > preview_tables.sh
bash preview_tables.sh
```

93+
### Step 4: Verify Savings
94+
95+
After 24-48 hours, check storage reduction in Masthead:
96+
- [Storage Cost Insights](https://app.mastheadata.com/costs?tab=Storage+costs)
97+
- Compare before/after storage size and costs
98+
99+
## Alternative: Notebook-based Workflow

For interactive review with Google Sheets integration:

1. Use the notebook at `github.com/masthead-data/templates/blob/main/notebooks/save_on_unused_storage.ipynb`
2. Export results to Google Sheets for team review
3. Pull back reviewed data and execute drops

## Decision Framework

| Monthly Savings | Action |
|-----------------|--------|
| < $10 | Consider keeping (low ROI) |
| $10-$100 | Review and drop if unused |
| $100-$1000 | Priority review, likely drop |
| > $1000 | Immediate investigation required |

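The thresholds above can be read as a simple decision function; a sketch of the table, not part of any package:

```javascript
// Decision framework from the table, keyed on monthly savings in USD
function recommendedAction(monthlySavingsUsd) {
  if (monthlySavingsUsd < 10) return 'consider keeping (low ROI)'
  if (monthlySavingsUsd <= 100) return 'review and drop if unused'
  if (monthlySavingsUsd <= 1000) return 'priority review, likely drop'
  return 'immediate investigation required'
}

console.log(recommendedAction(250)) // prints "priority review, likely drop"
```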
## Key Notes

- **Dead-end tables** may indicate pipeline issues - investigate before dropping
- Tables can be restored from time travel (7 days) or fail-safe (7 days after time travel)
- Consider archiving to Cloud Storage if compliance requires retention
- Coordinate with data teams before dropping shared datasets
- Wait 14 days after storage billing model changes before dropping tables

124+
## Related Optimizations
125+
126+
- **Storage billing model**: Switch between Logical/Physical pricing (see docs)
127+
- **Table expiration**: Set automatic expiration for temporary tables
128+
- **Partitioning**: Use partitioned tables with expiration policies
129+
130+
## Documentation
131+
132+
- [Masthead Storage Costs](https://docs.mastheadata.com/cost-insights/storage-costs)
133+
- [BigQuery Storage Pricing](https://cloud.google.com/bigquery/pricing#storage)
