A FastAPI microservice wrapping NVIDIA's LocateAnything-3B spatial grounding model.
LocateAnything-3B is a 3B-parameter vision-language model that returns bounding boxes for objects, UI elements, and text given a natural language prompt. This repo exposes it over HTTP with two endpoints: an OpenAI-compatible /v1/chat/completions and a telemetry-rich /api/inference for dataset generation pipelines.
Features:
- /v1/chat/completions: drop-in compatibility with OpenAI vision tool integrations
- /api/inference: returns tokens/sec, boxes/sec, and decoding mode fallback stats alongside results
- Native 4K Image Support: uses FlashAttention-2 in the Vision Encoder for efficient memory scaling on large images.
- Native Video & Multi-Image Support: process .mp4 videos directly or send an array of sequential image frames.
- Dynamic image resizing (short_size) is fully customizable (default cap removed).
- Exposes all three LocateAnything decoding modes: hybrid, fast (MTP), slow (AR)
Requirements:
- NVIDIA GPU (~12GB VRAM minimum)
- Up-to-date NVIDIA Host Drivers (Driver version 580+ required for CUDA 13 support)
- Docker + Docker Compose with NVIDIA Container Toolkit (Linux) or WSL2 GPU passthrough (Windows)
- Hugging Face account with a Read-access token
Clone the repository:
git clone https://github.com/dliebner/LocateAnythingAPI.git
cd LocateAnythingAPI
Create a .env file:
HF_TOKEN=hf_your_token_here
API_PORT=8000 # Defaults to 8000
Build and run:
docker-compose up --build
Note on First Build: The first time you build the container, it compiles FlashAttention-2 from source for your specific GPU architecture. This takes 5-10 minutes, but the resulting Docker layer is cached and reused on later builds.
First boot also downloads the ~6GB model weights from Hugging Face. Subsequent starts are fast.
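Because the first boot can take a while, a script that calls the API right after `docker-compose up` may want to wait until the service is reachable. The following is a minimal sketch, not part of this repo: `wait_for_port` is a hypothetical helper that only checks whether the TCP port accepts connections. An open port does not guarantee that the model has finished loading, so treat it as a first gate rather than a full health check.

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout=900.0, interval=5.0):
    """Poll until host:port accepts a TCP connection or the timeout expires.

    Hypothetical helper for illustration; returns True if the port opened
    in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Call `wait_for_port(port=8000)` before sending the first request, and adjust the port if you changed API_PORT.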
(Note: The examples below use port 8000. If you changed API_PORT in your .env file, update the URLs accordingly.)
/v1/chat/completions: OpenAI-compatible. Accepts the standard parameters (temperature, max_tokens, top_p) plus four custom fields: task, model_mode, short_size, and top_k.
import base64

import requests

# Read the image and base64-encode it so it can be embedded in a data URI.
with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "nvidia/LocateAnything-3B",
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 4096,
    "task": "detect",  # detect | ground_multi | ground_gui | ocr | point
    "model_mode": "hybrid",  # hybrid | fast | slow
    "short_size": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "gem, clover, ring, bat"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }]
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
# <ref>bat</ref><box><100><200><150><250></box>...

Video & Multi-Frame Usage:
To send video, change the content type to video_url and use an .mp4 data URI. To send multiple images (as a sequence), just append multiple image_url objects to the content array.
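For the multi-image case, the message can be assembled programmatically. This is a minimal sketch: `build_multi_image_message` is a hypothetical helper name, and it assumes each frame is passed as a (MIME type, raw bytes) pair in sequence order. It produces one `image_url` entry per frame, as described above.

```python
import base64


def build_multi_image_message(prompt, frames):
    """Build a user message holding a text prompt plus one image_url per frame.

    frames: list of (mime_type, raw_bytes) tuples in sequence order.
    Hypothetical helper for illustration.
    """
    content = [{"type": "text", "text": prompt}]
    for mime_type, data in frames:
        b64 = base64.b64encode(data).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{b64}"},
        })
    return {"role": "user", "content": content}
```

The returned dict goes into the `messages` list of the request payload. The video variant has the same shape, with a single `video_url` entry in place of the images: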
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Find the cursor" },
    { "type": "video_url", "video_url": { "url": "data:video/mp4;base64,AAAA..." } }
  ]
}

/api/inference: Returns the raw output string plus generation stats. Useful for labeling pipelines where you want to track throughput or detect AR fallbacks.
import base64

import requests

# Read the image and base64-encode it; this endpoint takes the bare base64 string.
with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image_b64": img_b64,  # OR a list: [frame1_b64, frame2_b64]
    # "video_b64": "AAAA...",  # Alternatively, send an mp4 base64 string
    "prompt": "gem, clover, ring, bat",
    "task": "detect",
    "mode": "hybrid",
    "short_size": 1024,
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 4096
}

response = requests.post("http://localhost:8000/api/inference", json=payload)
data = response.json()
print(data["raw_text"])  # bounding box string
print(data["stats"])  # {"tps": "84.3", "bps": "13.7", "switch_to_ar": "1", ...}

This repo (Apache 2.0): The API wrapper code is licensed under Apache 2.0.
Model weights (NVIDIA Non-Commercial): The weights are downloaded at runtime from Hugging Face and are governed by the NVIDIA Model License.
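Parsing the output: both endpoints return the raw string in the `<ref>…</ref><box>…</box>` markup shown in the examples above. The following is a minimal parsing sketch, not part of this repo. `parse_boxes` is a hypothetical helper, and it assumes each box holds integer values wrapped in angle brackets, as in the sample output. The coordinate order and scale are not documented here, so the values are returned as-is.

```python
import re

# A label followed by one or more boxes, e.g. <ref>bat</ref><box><1><2><3><4></box>
REF_RE = re.compile(r"<ref>(.*?)</ref>((?:\s*<box>.*?</box>)+)", re.DOTALL)
BOX_RE = re.compile(r"<box>(.*?)</box>", re.DOTALL)
NUM_RE = re.compile(r"<(\d+)>")


def parse_boxes(raw_text):
    """Return a list of (label, coords) pairs found in the raw output string.

    coords is a tuple of ints in the order they appear inside <box>...</box>.
    Hypothetical helper; assumes the markup matches the sample output.
    """
    results = []
    for label, boxes in REF_RE.findall(raw_text):
        for body in BOX_RE.findall(boxes):
            coords = tuple(int(n) for n in NUM_RE.findall(body))
            results.append((label.strip(), coords))
    return results
```

Pass it `data["raw_text"]` from /api/inference or the message content from /v1/chat/completions.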