Commit 8c5bc2f
Use cloud OCR for per-page content in cloud mode
When PAGEINDEX_API_KEY is set, index_long_document now fetches
per-page markdown via col.get_page_content() instead of running
local pymupdf. Cloud OCR produces cleaner output (preserves
tables, math, and section headers) than raw pymupdf text
extraction. Falls back to local pymupdf if the cloud call raises
or returns an empty result.
1 parent 771452d commit 8c5bc2f
1 file changed
Lines changed: 17 additions & 2 deletions
Diff hunk (line content not preserved): original lines 77 to 89 become new lines 77 to 104. Original line 80 is replaced by one new line, and original line 86 is replaced by 16 new lines (new lines 86 to 101).
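The changed lines themselves are not reproduced here, so the following is only a minimal sketch of the behaviour the commit message describes, not the committed code. The function name `get_page_texts` and the two callables are hypothetical: `cloud_fetch` stands in for a wrapper around `col.get_page_content()` (whose exact signature is not shown in this commit), and `local_extract` stands in for the existing local pymupdf path.

```python
import os
from typing import Callable, List


def get_page_texts(
    cloud_fetch: Callable[[], List[str]],
    local_extract: Callable[[], List[str]],
) -> List[str]:
    """Return per-page text, preferring cloud OCR when an API key is set.

    Both callables are hypothetical stand-ins:
      cloud_fetch   -> would wrap col.get_page_content() (signature assumed)
      local_extract -> would wrap the local pymupdf extraction
    """
    # Cloud mode is active only when PAGEINDEX_API_KEY is set and non-empty.
    if os.environ.get("PAGEINDEX_API_KEY"):
        try:
            pages = cloud_fetch()
            # An empty result counts as a failure and triggers the fallback.
            if pages:
                return pages
        except Exception:
            # Any error raised by the cloud call falls through to local extraction.
            pass
    return local_extract()
```

Passing the two extraction paths in as callables keeps the fallback logic testable without network access or a PDF on disk; the real change presumably calls the collection object and pymupdf directly.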