Commit 8060553

Opencage CLI adapted from batch.py example script (#54)

* Initial release of 'opencage' CLI tool
* prepare version 3.0.0 release

Co-authored-by: marc tobias <mtmail@gmx.net>

sbscully and mtmail authored
1 parent db6829c commit 8060553

21 files changed

Lines changed: 709 additions & 25 deletions

.github/workflows/build.yml

Lines changed: 0 additions & 1 deletion
@@ -18,7 +18,6 @@ jobs:
           - "3.10"
           - "3.9"
           - "3.8"
-          - "3.7"
         os:
           - ubuntu-latest
     steps:

Changes.txt

Lines changed: 5 additions & 1 deletion
@@ -1,7 +1,11 @@
-unreleased
+v3.0.0 Wed Sep 4 2024
+  Requires python 3.7 and asyncio package
+  Inititial release of the 'opencage' CLI tool
+  RateLimitExceededError no longer prints reset date
   Batch example: warn if no API key present earlier
   Batch example: some errors were not printed, e.g. invalid API key
   Batch example: Check latest version of opencage package is used
+  Add python 3.12, no longer test against python 3.7
 
 v2.3.1 Wed Nov 15 2023
   New error 'SSLError' which is more explicit in case of SSL certificate chain issues

README.md

Lines changed: 34 additions & 4 deletions
@@ -15,11 +15,14 @@ A Python module to access the [OpenCage Geocoding API](https://opencagedata.com/
 
 You can find a [comprehensive tutorial for using this module on the OpenCage site](https://opencagedata.com/tutorials/geocode-in-python).
 
-There are also two brief video tutorials on YouTube, one [covering forward geocoding](https://www.youtube.com/watch?v=9bXu8-LPr5c), one [covering reverse geocoding](https://www.youtube.com/watch?v=u-kkE4yA-z0).
+There are two brief video tutorials on YouTube, one [covering forward geocoding](https://www.youtube.com/watch?v=9bXu8-LPr5c), one [covering reverse geocoding](https://www.youtube.com/watch?v=u-kkE4yA-z0).
+
+The module installs an `opencage` CLI tool for geocoding files. Check `opencage --help` or the [CLI tutorial](https://opencagedata.com/tutorials/geocode-in-cli).
+
 
 ## Usage
 
-Supports Python 3.7 or newer. Use the older opencage 1.x releases if you need Python 2.7 support.
+Supports Python 3.8 or newer. Starting opencage version 3.0 depends on asyncio package.
 
 Install the module:
 
@@ -87,7 +90,7 @@ with OpenCageGeocode(key) as geocoder:
 
 You can run requests in parallel with the `geocode_async` and `reverse_geocode_async`
 method which have the same parameters and response as their synronous counterparts.
-You will need at least Python 3.7 and the `asyncio` and `aiohttp` packages installed.
+You will need at least Python 3.8 and the `asyncio` and `aiohttp` packages installed.
 
 ```python
 async with OpenCageGeocode(key) as geocoder:
@@ -109,7 +112,34 @@ geocoder = OpenCageGeocode('your-api-key', 'http')
 
 ### Command-line batch geocoding
 
-See `examples/batch.py` for an example to geocode a CSV file.
+Use `opencage forward` or `opencage reverse`
+
+```
+opencage forward --help
+
+  -h, --help            show this help message and exit
+  --api-key API_KEY     Your OpenCage API key
+  --input INPUT         Input file name
+  --output OUTPUT       Output file name
+  --headers             If the first row should be treated as a header row
+  --input-columns INPUT_COLUMNS
+                        Comma-separated list of integers (default '1')
+  --add-columns ADD_COLUMNS
+                        Comma-separated list of output columns
+  --workers WORKERS     Number of parallel geocoding requests (default 1)
+  --timeout TIMEOUT     Timeout in seconds (default 10)
+  --retries RETRIES     Number of retries (default 5)
+  --api-domain API_DOMAIN
+                        API domain (default api.opencagedata.com)
+  --extra-params EXTRA_PARAMS
+                        Extra parameters for each request (e.g. language=fr,no_dedupe=1)
+  --limit LIMIT         Stop after this number of lines in the input
+  --dry-run             Read the input file but no geocoding
+  --no-progress         Display no progress bar
+  --quiet               No progress bar and no messages
+  --overwrite           Delete the output file first if it exists
+  --verbose             Display debug information for each request
+```
 
 <img src="batch-progress.gif"/>
 
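The help text above describes `--input-columns` as a comma-separated list of integers and `--extra-params` as comma-separated `key=value` pairs, and the new `opencage/batch.py` treats input columns as 1-based. The CLI's own argument parser is not part of this excerpt, so the following is only a minimal sketch of how such values could be turned into Python objects. The helper names `parse_input_columns`, `parse_extra_params` and `select_columns` are hypothetical and are not taken from the package.

```python
def parse_input_columns(value="1"):
    """Turn a comma-separated string such as '1,3' into a list of 1-based ints."""
    columns = [int(part) for part in value.split(",") if part.strip()]
    if any(column < 1 for column in columns):
        raise ValueError("column numbers are 1-based and must be >= 1")
    return columns


def parse_extra_params(value=""):
    """Turn 'language=fr,no_dedupe=1' into {'language': 'fr', 'no_dedupe': '1'}."""
    params = {}
    for pair in value.split(","):
        if not pair.strip():
            continue  # tolerate an empty string or a trailing comma
        key, separator, val = pair.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {pair!r}")
        params[key.strip()] = val.strip()
    return params


def select_columns(row, columns):
    """Pick the 1-based columns from a CSV row and join them into one query string."""
    return ",".join(row[column - 1] for column in columns)


if __name__ == "__main__":
    row = ["id7", "Main St 1", "Berlin"]
    print(select_columns(row, parse_input_columns("2,3")))
    print(parse_extra_params("language=fr,no_dedupe=1"))
```

Under these assumptions, the values stay strings because they are passed on as request parameters, and a column number below 1 is rejected early rather than silently wrapping around to the end of the row.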

examples/batch.py

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@
 # Background tutorial on async programming with Python
 # https://realpython.com/async-io-python/
 
-# Requires Python 3.7 or newer. Tested with 3.8/3.9/3.10/3.11.
+# Requires Python 3.8 or newer. Tested with 3.8/3.9/3.10/3.11.
 
 # Installation:
 # pip3 install --upgrade opencage asyncio aiohttp backoff tqdm
@@ -213,7 +213,7 @@ async def run_worker(worker_name, queue):
 
 async def main():
     global PROGRESS_BAR
-    assert sys.version_info >= (3, 8), "Script requires Python 3.8 or newer"
+    assert sys.version_info >= (3, 8), "Script requires Python 3.8 or newer"
 
     ## 1. Read CSV into a Queue
     ## Each work_item is an address and id. The id will be part of the output,
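The version guard in this hunk is an `assert`, and Python skips `assert` statements when it runs with `-O`. If the check should always run, an explicit comparison is one option. This is a sketch only, not what the example script does. The helper name `require_python` and the constant `MIN_PYTHON` are hypothetical.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum stated in the updated comment and assertion


def require_python(version_info=None, minimum=MIN_PYTHON):
    """Raise RuntimeError when the interpreter is older than `minimum`."""
    # Compare only (major, minor); sys.version_info supports slicing.
    current = tuple((version_info or sys.version_info)[:2])
    if current < minimum:
        wanted = ".".join(str(part) for part in minimum)
        found = ".".join(str(part) for part in current)
        raise RuntimeError(f"Script requires Python {wanted} or newer, found {found}")
    return current


if __name__ == "__main__":
    require_python()
```

Passing a tuple such as `(3, 7, 9)` as `version_info` makes the failure path testable without switching interpreters.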

opencage/batch.py

Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
+import sys
+import ssl
+import asyncio
+import traceback
+import threading
+import backoff
+import certifi
+import random
+
+from tqdm import tqdm
+from urllib.parse import urlencode
+from contextlib import suppress
+from opencage.geocoder import OpenCageGeocode, OpenCageGeocodeError
+
+class OpenCageBatchGeocoder():
+    def __init__(self, options):
+        self.options = options
+        self.sslcontext = ssl.create_default_context(cafile=certifi.where())
+        self.write_counter = 1
+
+    def __call__(self, *args, **kwargs):
+        asyncio.run(self.geocode(*args, **kwargs))
+
+    async def geocode(self, input, output):
+        if not self.options.dry_run:
+            test = await self.test_request()
+            if test['error']:
+                self.log(test['error'])
+                return
+
+        if self.options.headers:
+            header_columns = next(input, None)
+            if header_columns is None:
+                return
+
+        queue = asyncio.Queue(maxsize=self.options.limit)
+
+        await self.read_input(input, queue)
+
+        if self.options.dry_run:
+            return
+
+        if self.options.headers:
+            output.writerow(header_columns + self.options.add_columns)
+
+        progress_bar = not (self.options.no_progress or self.options.quiet) and \
+            tqdm(total=queue.qsize(), position=0, desc="Addresses geocoded", dynamic_ncols=True)
+
+        tasks = []
+        for _ in range(self.options.workers):
+            task = asyncio.create_task(self.worker(output, queue, progress_bar))
+            tasks.append(task)
+
+        # This starts the workers and waits until all are finished
+        await queue.join()
+
+        # All tasks done
+        for task in tasks:
+            task.cancel()
+
+        if progress_bar:
+            progress_bar.close()
+
+    async def test_request(self):
+        try:
+            async with OpenCageGeocode(self.options.api_key, domain=self.options.api_domain, sslcontext=self.sslcontext) as geocoder:
+                result = await geocoder.geocode_async('Kendall Sq, Cambridge, MA', raw_response=True)
+
+                free = False
+                with suppress(KeyError):
+                    free = result['rate']['limit'] == 2500
+
+                return { 'error': None, 'free': free }
+        except Exception as exc:
+            return { 'error': exc }
+
+    async def read_input(self, input, queue):
+        for index, row in enumerate(input):
+            line_number = index + 1
+
+            if len(row) == 0:
+                raise Exception(f"Empty line in input file at line number {line_number}, aborting")
+
+            item = await self.read_one_line(row, line_number)
+            await queue.put(item)
+
+            if queue.full():
+                break
+
+    async def read_one_line(self, row, row_id):
+        if self.options.command == 'reverse':
+            input_columns = [1, 2]
+        elif self.options.input_columns:
+            input_columns = self.options.input_columns
+        else:
+            input_columns = None
+
+        if input_columns:
+            address = []
+            try:
+                for column in input_columns:
+                    # input_columns option uses 1-based indexing
+                    address.append(row[column - 1])
+            except IndexError:
+                self.log(f"Missing input column {column} in {row}")
+        else:
+            address = row
+
+        if self.options.command == 'reverse' and len(address) != 2:
+            self.log(f"Expected two comma-separated values for reverse geocoding, got {address}")
+
+        return { 'row_id': row_id, 'address': ','.join(address), 'original_columns': row }
+
+    async def worker(self, output, queue, progress):
+        while True:
+            item = await queue.get()
+
+            try:
+                await self.geocode_one_address(output, item['row_id'], item['address'], item['original_columns'])
+
+                if progress:
+                    progress.update(1)
+            except Exception as exc:
+                traceback.print_exception(exc, file=sys.stderr)
+            finally:
+                queue.task_done()
+
+    async def geocode_one_address(self, output, row_id, address, original_columns):
+        def on_backoff(details):
+            if not self.options.quiet:
+                sys.stderr.write("Backing off {wait:0.1f} seconds afters {tries} tries "
+                                 "calling function {target} with args {args} and kwargs "
+                                 "{kwargs}\n".format(**details))
+
+        @backoff.on_exception(backoff.expo,
+                              asyncio.TimeoutError,
+                              max_time=self.options.timeout,
+                              max_tries=self.options.retries,
+                              on_backoff=on_backoff)
+        async def _geocode_one_address():
+            async with OpenCageGeocode(self.options.api_key, domain=self.options.api_domain, sslcontext=self.sslcontext) as geocoder:
+                geocoding_results = None
+                params = { 'no_annotations': 1, **self.options.extra_params }
+
+                try:
+                    if self.options.command == 'reverse':
+                        lon, lat = address.split(',')
+                        geocoding_results = await geocoder.reverse_geocode_async(lon, lat, **params)
+                    else:
+                        geocoding_results = await geocoder.geocode_async(address, **params)
+                except OpenCageGeocodeError as exc:
+                    self.log(str(exc))
+                except Exception as exc:
+                    traceback.print_exception(exc, file=sys.stderr)
+
+                try:
+                    if geocoding_results is not None and len(geocoding_results):
+                        geocoding_result = geocoding_results[0]
+                    else:
+                        geocoding_result = None
+
+                    if self.options.verbose:
+                        self.log({
+                            'row_id': row_id,
+                            'thread_id': threading.get_native_id(),
+                            'request': geocoder.url + '?' + urlencode(geocoder._parse_request(address, params)),
+                            'response': geocoding_result
+                        })
+
+                    await self.write_one_geocoding_result(output, row_id, address, geocoding_result, original_columns)
+                except Exception as exc:
+                    traceback.print_exception(exc, file=sys.stderr)
+
+        await _geocode_one_address()
+
+    async def write_one_geocoding_result(self, output, row_id, address, geocoding_result, original_columns = []):
+        row = original_columns
+
+        for column in self.options.add_columns:
+            if geocoding_result is None:
+                row.append('')
+            elif column in geocoding_result:
+                row.append(geocoding_result[column])
+            elif column in geocoding_result['components']:
+                row.append(geocoding_result['components'][column])
+            elif column in geocoding_result['geometry']:
+                row.append(geocoding_result['geometry'][column])
+            else:
+                row.append('')
+
+        # Enforce that row are written ordered. That means we might wait for other threads
+        # to finish a task and make the overall process slower. Alternative would be to
+        # use a second queue, or keep some results in memory.
+        while row_id > self.write_counter:
+            if self.options.verbose:
+                self.log(f"Want to write row {row_id}, but write_counter is at {self.write_counter}")
+            await asyncio.sleep(random.uniform(0.01, 0.1))
+
+        if self.options.verbose:
+            self.log(f"Writing row {row_id}")
+        output.writerow(row)
+        self.write_counter = self.write_counter + 1
+
+    def log(self, message):
+        if not self.options.quiet:
+            sys.stderr.write(f"{message}\n")
+
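The new module combines two patterns: a pool of worker tasks draining an `asyncio.Queue`, and a `write_counter` that holds each result back until all earlier rows have been written. The sketch below isolates both without any network access. `fake_geocode`, `OrderedWriter` and `run` are hypothetical stand-ins, not part of the `opencage` package. It also uses the three-argument form of `traceback.print_exception`, because the single-argument form used in the diff (`print_exception(exc, file=...)`) is only accepted from Python 3.10, while the README in this commit states 3.8 as the minimum.

```python
import asyncio
import random
import sys
import traceback


class OrderedWriter:
    """Collects rows in row_id order, mirroring the write_counter idea."""

    def __init__(self):
        self.rows = []
        self.write_counter = 1  # next row_id that is allowed to be written

    async def write(self, row_id, row):
        # Poll until every earlier row has been written.
        while row_id > self.write_counter:
            await asyncio.sleep(0.001)
        self.rows.append(row)
        self.write_counter += 1


async def fake_geocode(address):
    # Stand-in for the real API call; finishes after a random short delay.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return address.upper()


async def worker(queue, writer):
    while True:
        row_id, address = await queue.get()
        try:
            result = await fake_geocode(address)
            await writer.write(row_id, [address, result])
        except Exception as exc:
            # Three-argument form, accepted on Python 3.8 and newer.
            traceback.print_exception(type(exc), exc, exc.__traceback__, file=sys.stderr)
        finally:
            queue.task_done()


async def run(addresses, workers=3):
    queue = asyncio.Queue()
    for row_id, address in enumerate(addresses, start=1):
        queue.put_nowait((row_id, address))

    writer = OrderedWriter()
    tasks = [asyncio.create_task(worker(queue, writer)) for _ in range(workers)]

    await queue.join()  # returns once task_done() was called for every item
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return writer.rows


if __name__ == "__main__":
    print(asyncio.run(run(["a", "b", "c", "d", "e"])))
```

The ordering only holds if every row eventually reaches the writer. In this sketch, a row that failed before `write` would leave later rows waiting forever. The module in the diff avoids this by catching geocoding errors first and still writing a row with empty columns.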
