Commit 5d4ae25

Updated Claude md
1 parent ad80993 commit 5d4ae25

1 file changed

Lines changed: 99 additions & 20 deletions

CLAUDE.md

@@ -22,48 +22,115 @@ cached-data/
2222
## How It Works
2323

2424
1. `fetch_all_categories.py` reads 3 per-category tokens (`GH_TOKEN_TRENDING`, `GH_TOKEN_NEW_RELEASES`, `GH_TOKEN_MOST_POPULAR`), falling back to `GITHUB_TOKEN`
25-
2. Each category gets its own `GitHubClient` with a dedicated token (5,000 req/hr each = 15,000 total)
26-
3. For each category × platform (12 combos), checks if cached JSON is fresh (<23h)
27-
4. If stale, queries GitHub Search API with platform-specific topics/languages/keywords
28-
5. Filters repos that have **real releases with platform installers** (e.g. `.apk` for android, `.exe`/`.msi` for windows)
29-
6. Verifies ALL candidates — no artificial caps. Stops gracefully when rate limit drops below `RATE_LIMIT_FLOOR` (50)
30-
7. Saves results to `cached-data/{category}/{platform}.json`
31-
8. GitHub Actions commits and pushes changes
25+
2. Each category gets its own `GitHubClient` with a dedicated token
26+
3. If tokens are shared (same underlying user), the budget is split evenly across categories
27+
4. For each category × platform (12 combos), checks if cached JSON is fresh (<23h)
28+
5. If stale, queries GitHub Search API with platform-specific topics/languages/keywords
29+
6. Filters repos that have **real releases with platform installers** via two methods:
30+
- **Extension matching**: Direct installer files (`.apk`, `.exe`, `.dmg`, `.deb`, etc.)
31+
- **Keyword matching**: Generic archives (`.zip`, `.tar.gz`) with platform keywords in the filename (e.g. `myapp-macos-arm64.zip`, `myapp-win-x64.tar.gz`)
32+
7. Repos with NSFW/inappropriate topics or descriptions are excluded via `BLOCKED_TOPICS`
33+
8. Verifies ALL candidates — no artificial caps. Stops gracefully when per-platform budget is exhausted or rate limit drops below `RATE_LIMIT_FLOOR` (50)
34+
9. Never saves 0-repo results; never overwrites good cached data with poor results
35+
10. Waits 65s between platforms so the Search API rate limit (30 req/min) can reset
36+
11. Saves results to `cached-data/{category}/{platform}.json`
37+
12. GitHub Actions commits and pushes changes
3238

3339
### Token Strategy
34-
- 3 GitHub Classic PATs (scope: `public_repo`), one per category
35-
- Stored as GitHub Actions secrets: `GH_TOKEN_TRENDING`, `GH_TOKEN_NEW_RELEASES`, `GH_TOKEN_MOST_POPULAR`
40+
- 3 GitHub Classic PATs (scope: `public_repo`), each from a **separate GitHub account**
41+
- GitHub rate limits are per-user (not per-token), so 3 accounts = 3 independent 5,000 req/hr pools = 15,000 total
42+
- Stored as GitHub Actions repository secrets: `GH_TOKEN_TRENDING`, `GH_TOKEN_NEW_RELEASES`, `GH_TOKEN_MOST_POPULAR`
3643
- Backward compatible: falls back to single `GITHUB_TOKEN` if per-category tokens aren't set
44+
- If shared tokens are detected, the budget is automatically split evenly across categories
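
The token fallback and shared-budget split described above could look roughly like the sketch below. It is an illustration only: `resolve_tokens` and `category_budgets` are hypothetical names, and sharing is detected here by comparing token strings, whereas the real script may identify the underlying user in some other way.

```python
from collections import Counter

# Env var names come from this document; the helper names are hypothetical.
CATEGORY_ENV_VARS = {
    "trending": "GH_TOKEN_TRENDING",
    "new-releases": "GH_TOKEN_NEW_RELEASES",
    "most-popular": "GH_TOKEN_MOST_POPULAR",
}
CORE_LIMIT_PER_USER = 5000  # core API requests per hour per user


def resolve_tokens(env):
    """Pick a token per category, falling back to GITHUB_TOKEN."""
    fallback = env.get("GITHUB_TOKEN")
    tokens = {}
    for category, var in CATEGORY_ENV_VARS.items():
        token = env.get(var) or fallback
        if not token:
            raise RuntimeError(f"No token for {category}: set {var} or GITHUB_TOKEN")
        tokens[category] = token
    return tokens


def category_budgets(tokens):
    """Split the hourly pool evenly among categories that share a token.

    Assumption: identical token strings mean a shared pool. Two different
    tokens from the same account would not be caught by this check.
    """
    sharers = Counter(tokens.values())
    return {cat: CORE_LIMIT_PER_USER // sharers[tok] for cat, tok in tokens.items()}
```

With only `GITHUB_TOKEN` set, all three categories share one pool and each gets 1,666 requests under this sketch; with three distinct tokens each keeps the full 5,000.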
45+
46+
### Rate Limit Management
47+
48+
**Two independent rate limits at play:**
49+
50+
| Limit | Pool | Scope |
51+
|---|---|---|
52+
| Core API | 5,000/hr per user | Release checks, rate_limit endpoint |
53+
| Search API | 30/min per user | `search/repositories` queries |
54+
55+
**Core API budget system:**
56+
- `main()` detects shared tokens and caps each category to its fair share
57+
- `process_category()` divides the category's budget evenly across 4 platforms
58+
- Budget recalculates after each platform — unused budget carries forward
59+
- `verify_installers()` stops when per-platform budget is exhausted (not just global floor)
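
As a rough sketch of the carry-forward behavior in the list above, assuming the per-platform budget is simply the remaining category budget divided by the platforms still to process, with the 100-request minimum from the safety caps applied. `platform_budget` and `simulate_budgets` are hypothetical helpers, not functions from the script.

```python
MIN_PLATFORM_BUDGET = 100  # minimum per-platform budget named in this document


def platform_budget(remaining_category_budget, platforms_left):
    """Fair share of what is left, never below the minimum budget."""
    if platforms_left < 1:
        raise ValueError("platforms_left must be at least 1")
    fair_share = remaining_category_budget // platforms_left
    return max(fair_share, MIN_PLATFORM_BUDGET)


def simulate_budgets(category_budget, requests_used_per_platform):
    """Return the budget each platform would be given, in processing order.

    Whatever a platform does not use stays in the category pool and is
    redistributed across the platforms that follow.
    """
    remaining = category_budget
    budgets = []
    total = len(requests_used_per_platform)
    for index, used in enumerate(requests_used_per_platform):
        budget = platform_budget(remaining, total - index)
        budgets.append(budget)
        remaining = max(remaining - min(used, budget), 0)
    return budgets
```

For example, `simulate_budgets(2000, [100, 500, 500, 500])` yields `[500, 633, 700, 900]`: the first platform uses only 100 of its 500, so later platforms receive larger shares.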
60+
61+
**Search API throttling:**
62+
- 65-second pause between platforms within a category to let the 30 req/min limit reset
63+
- Only pauses if the previous platform actually ran searches (cached platforms skip it)
64+
- `_update_rate_info()` ignores search API headers so that search-limit values do not pollute core rate tracking
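
The conditional pause can be sketched as follows. This is a minimal illustration, assuming the per-platform processing step reports whether it actually queried the Search API; `run_platforms` and its `process` callback are hypothetical names.

```python
import time

SEARCH_PAUSE_SECONDS = 65  # pause length named in this document


def run_platforms(platforms, process, sleep=time.sleep):
    """Process platforms in order, pausing only after one that ran searches.

    `process(platform)` is assumed to return True when it hit the Search API
    and False when it was served from cache.
    """
    previous_ran_searches = False
    for platform in platforms:
        if previous_ran_searches:
            sleep(SEARCH_PAUSE_SECONDS)
        previous_ran_searches = bool(process(platform))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.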
65+
66+
**Safety caps:**
67+
- `_wait_for_rate_limit()` never sleeps more than 60s (prevents workflow timeout)
68+
- Minimum budget of 100 requests per platform regardless of remaining
69+
- Workflow timeout: 45 minutes
3770

3871
### Categories
39-
- **trending**: High star velocity + recent activity. Sorted by trending score (platform score + velocity x 10)
72+
- **trending**: High star velocity + recent activity. Sorted by trending score (platform score + velocity × 10)
4073
- **new-releases**: Repos with stable releases in last 14 days. Sorted by release date
41-
- **most-popular**: Repos with 5000+ stars. Sorted by star count
74+
- **most-popular**: Repos with 5,000+ stars. Sorted by star count
4275

4376
### Platform Detection
44-
Each platform has defined: topics, installer file extensions, scoring keywords (high/medium/low), primary/secondary languages, and frameworks. See `PLATFORMS` dict in fetch script.
77+
78+
Each platform has: topics, installer file extensions, scoring keywords (high/medium/low), primary/secondary languages, and frameworks. See `PLATFORMS` dict.
79+
80+
**Installer detection** uses two layers:
81+
1. **Extension matching** — dedicated installer files:
82+
- Android: `.apk`, `.aab`
83+
- Windows: `.msi`, `.exe`, `.msix`
84+
- macOS: `.dmg`, `.pkg`, `.app.zip`
85+
- Linux: `.appimage`, `.deb`, `.rpm`
86+
2. **Keyword matching** — generic archives (`.zip`, `.tar.gz`, `.tar.xz`, `.tar.bz2`, `.7z`) with platform keywords in the filename:
87+
- Android: `android`
88+
- Windows: `win64`, `win32`, `windows`, `-win-`, etc.
89+
- macOS: `macos`, `darwin`, `osx`, `-mac-`, etc.
90+
- Linux: `linux`, `-linux-`, etc.
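
A minimal sketch of the two detection layers, using only the extensions and keywords listed above. The entries hidden behind "etc." are not reproduced, and `detect_platforms` is a hypothetical name rather than the script's own function.

```python
INSTALLER_EXTENSIONS = {
    "android": (".apk", ".aab"),
    "windows": (".msi", ".exe", ".msix"),
    "macos": (".dmg", ".pkg", ".app.zip"),
    "linux": (".appimage", ".deb", ".rpm"),
}
ARCHIVE_EXTENSIONS = (".zip", ".tar.gz", ".tar.xz", ".tar.bz2", ".7z")
PLATFORM_KEYWORDS = {
    "android": ("android",),
    "windows": ("win64", "win32", "windows", "-win-"),
    "macos": ("macos", "darwin", "osx", "-mac-"),
    "linux": ("linux",),
}


def detect_platforms(filename):
    """Return the set of platforms a release asset appears to target."""
    name = filename.lower()
    # Layer 1: a dedicated installer extension decides the platform outright.
    for platform, extensions in INSTALLER_EXTENSIONS.items():
        if name.endswith(extensions):
            return {platform}
    # Layer 2: generic archives count only if the name has a platform keyword.
    if name.endswith(ARCHIVE_EXTENSIONS):
        return {
            platform
            for platform, keywords in PLATFORM_KEYWORDS.items()
            if any(keyword in name for keyword in keywords)
        }
    return set()
```

Under these assumptions, `myapp-macos-arm64.zip` maps to `{"macos"}`, `myapp-win-x64.tar.gz` to `{"windows"}`, and a plain `source-code.zip` to an empty set.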
91+
92+
### Content Filtering
93+
94+
`BLOCKED_TOPICS` set (~40 terms) excludes repos with NSFW/inappropriate content. Checked against both repo topics (set intersection) and description (substring match) during candidate collection, before any API calls are wasted on verification.
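
The two checks might be combined as in the sketch below. The contents of `BLOCKED_TOPICS` here are placeholders (the real set has about 40 terms that are not listed in this document), and `is_blocked` is a hypothetical helper name.

```python
# Placeholder terms for illustration only; the real set is not shown here.
BLOCKED_TOPICS = {"nsfw", "adult"}


def is_blocked(topics, description, blocked=BLOCKED_TOPICS):
    """Return True if a repo should be excluded before verification."""
    # Topics: exact match via set intersection.
    if blocked & {topic.lower() for topic in topics}:
        return True
    # Description: case-insensitive substring match. It may be None.
    text = (description or "").lower()
    return any(term in text for term in blocked)
```

Because the description check is a substring match, a short blocked term can also match inside a longer, unrelated word, which is worth keeping in mind when choosing terms.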
95+
96+
### Cache Protection
97+
98+
- Cache files are valid for 23 hours (`CACHE_VALIDITY_HOURS`)
99+
- Stale caches holding fewer repos than the minimum threshold are refetched (30 for trending/most-popular, 10 for new-releases)
100+
- **Never saves 0 repos** — if a fetch returns 0, existing cache is preserved
101+
- **Never overwrites good data with poor results** — if fetch returns fewer than threshold but cache has more, cache is kept
102+
- `FORCE_REFRESH` env var bypasses cache loading entirely
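
The save rules above can be read as a small decision function. The sketch below is one interpretation, assuming "cache has more" means more repos than the new fetch returned; `is_fresh`, `should_save`, and `MIN_REPOS` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

CACHE_VALIDITY_HOURS = 23
# Thresholds named in this document.
MIN_REPOS = {"trending": 30, "most-popular": 30, "new-releases": 10}


def is_fresh(fetched_at, now):
    """A cache is fresh while it is younger than the validity window."""
    return now - fetched_at < timedelta(hours=CACHE_VALIDITY_HOURS)


def should_save(category, new_count, cached_count):
    """Decide whether a new fetch result may replace the cached file."""
    if new_count == 0:
        return False  # never save 0 repos
    if new_count < MIN_REPOS[category] and cached_count > new_count:
        return False  # poor result, and the existing cache is better
    return True
```

For example, a trending fetch that returns 12 repos does not replace a cache of 45, but it does replace a cache of 5.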
103+
104+
### Fork Inclusion
105+
106+
All search queries include `fork:true` to discover forked repositories with platform installers.
45107

46108
## Key Constants
47109

48110
| Constant | Value | Notes |
49111
|---|---|---|
50-
| `RATE_LIMIT_FLOOR` | 50 | Stop verifying when rate limit drops below this |
112+
| `RATE_LIMIT_FLOOR` | 50 | Global minimum — stop verifying below this |
51113
| `CACHE_VALIDITY_HOURS` | 23 | Cache TTL |
52-
| `MAX_CONCURRENT_REQUESTS` | 25 | HTTP concurrency limit |
53-
| `MAX_SEARCH_CONCURRENT` | 5 | Search API concurrency (stricter) |
114+
| `MAX_CONCURRENT_REQUESTS` | 25 | HTTP concurrency (core API) |
115+
| `MAX_SEARCH_CONCURRENT` | 5 | Search API concurrency |
116+
| `RELEASE_CHECK_BATCH` | 40 | Repos verified per batch |
117+
| `REQUEST_TIMEOUT` | 20 | Per-request timeout (seconds) |
54118
| `MAX_RETRIES` | 3 | Per-request retry limit |
55-
| `MIN_STARS` (most-popular) | 5000 | Minimum stars for most-popular |
119+
| `MIN_STARS` (most-popular) | 5000 | Minimum stars |
56120
| `MAX_RELEASE_AGE_DAYS` (new-releases) | 14 | Max release age |
57121

58122
## Commands
59123

60124
```bash
61-
# Run with 3 dedicated tokens (recommended)
125+
# Run with 3 dedicated tokens (recommended — each from a different GitHub account)
62126
GH_TOKEN_TRENDING=ghp_a GH_TOKEN_NEW_RELEASES=ghp_b GH_TOKEN_MOST_POPULAR=ghp_c python scripts/fetch_all_categories.py
63127

64128
# Run with single fallback token
65129
GITHUB_TOKEN=ghp_xxx python scripts/fetch_all_categories.py
66130

131+
# Force refresh (ignore all caches)
132+
FORCE_REFRESH=true GITHUB_TOKEN=ghp_xxx python scripts/fetch_all_categories.py
133+
67134
# Validate release dates
68135
GITHUB_TOKEN=ghp_xxx python scripts/validate_releases.py [platform]
69136

@@ -84,11 +151,23 @@ Each `{platform}.json`:
84151
}
85152
```
86153

154+
## Workflow Details
155+
156+
**Jobs:**
157+
1. `check-rate-limit` — Checks all 3 tokens, reports dedicated vs fallback, gates on >1000 remaining
158+
2. `fetch-and-update` — Runs the script, validates JSON, commits and pushes (retries push up to 3 times with rebase)
159+
3. `notify-on-failure` — Auto-creates a GitHub issue labeled `automation, category-fetch, bug`
160+
161+
**Workflow inputs:**
162+
- `force_refresh` (boolean) — Skip all caches when triggered manually
163+
164+
**Timeout:** 45 minutes
165+
87166
## Development Notes
88167

89168
- Python 3.11, no type-checking or linting configured
90169
- No tests beyond `validate_releases.py`
91170
- Each category creates its own `GitHubClient` — release cache is per-client, shared across platforms within the same category
92-
- Platforms are processed sequentially within each category to avoid rate-limit thrashing
93-
- The workflow retries `git push` up to 3 times with rebase on conflict
94-
- On failure, the workflow auto-creates a GitHub issue labeled `automation, category-fetch, bug`
171+
- Platforms are processed sequentially within each category (with a 65s Search API rate-limit pause between them)
172+
- Typical runtime: 15-25 minutes with 3 dedicated tokens
173+
- The `_check_assets()` helper inside `get_latest_stable_release()` detects installers for ALL platforms in one pass, so cross-platform repos benefit all platforms from a single release check
