LocateAnything API

A FastAPI microservice wrapping NVIDIA's LocateAnything-3B spatial grounding model.

Overview

LocateAnything-3B is a 3B-parameter vision-language model that returns bounding boxes for objects, UI elements, and text given a natural language prompt. This repo exposes it over HTTP with two endpoints: an OpenAI-compatible /v1/chat/completions and a telemetry-rich /api/inference for dataset generation pipelines.

Features:

/v1/chat/completions — drop-in compatibility with OpenAI vision tool integrations
/api/inference — returns tokens/sec, boxes/sec, and decoding mode fallback stats alongside results
Native 4K Image Support — Uses FlashAttention-2 in the Vision Encoder for efficient memory scaling on large images.
Native Video & Multi-Image Support — Process .mp4 videos directly or send an array of sequential image frames.
Dynamic image resizing (short_size) is fully customizable (default cap removed).
Exposes all three LocateAnything decoding modes: hybrid, fast (MTP), slow (AR)

Requirements

NVIDIA GPU (~12GB VRAM minimum)
Up-to-date NVIDIA Host Drivers (Driver version 580+ required for CUDA 13 support)
Docker + Docker Compose with NVIDIA Container Toolkit (Linux) or WSL2 GPU passthrough (Windows)
Hugging Face account with a Read-access token

Setup

git clone https://github.com/dliebner/LocateAnythingAPI.git
cd LocateAnythingAPI

Create a .env file:

HF_TOKEN=hf_your_token_here
API_PORT=8000  # Defaults to 8000

Build and run:

docker-compose up --build

Note on First Build: The first time you build the container, it will compile FlashAttention-2 from source for your specific GPU architecture. This takes 5-10 minutes, but the Docker layer is permanently cached.

First boot also downloads the ~6GB model weights from Hugging Face. Subsequent starts are fast.

API

(Note: The examples below use port 8000. If you changed API_PORT in your .env file, update the URLs accordingly.)

`POST /v1/chat/completions`

OpenAI-compatible. Accepts standard parameters (temperature, max_tokens, top_p) plus custom fields: task, model_mode, short_size, and top_k.

import requests, base64

with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "nvidia/LocateAnything-3B",
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 4096,
    "task": "detect",        # detect | ground_multi | ground_gui | ocr | point
    "model_mode": "hybrid",  # hybrid | fast | slow
    "short_size": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "gem, clover, ring, bat"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }]
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
# <ref>bat</ref><box><100><200><150><250></box>...

Video & Multi-Frame Usage: To send video, change the content type to video_url and use an .mp4 data URI. To send multiple images (as a sequence), just append multiple image_url objects to the content array.

{
  "role": "user",
  "content": [
      { "type": "text", "text": "Find the cursor" },
      { "type": "video_url", "video_url": { "url": "data:video/mp4;base64,AAAA..." } }
  ]
}

`POST /api/inference`

Returns the raw output string plus generation stats. Useful for labeling pipelines where you want to track throughput or detect AR fallbacks.

import requests, base64

with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image_b64": img_b64, # OR a list: [frame1_b64, frame2_b64]
    # "video_b64": "AAAA...", # Alternatively, send an mp4 base64 string
    "prompt": "gem, clover, ring, bat",
    "task": "detect",
    "mode": "hybrid",
    "short_size": 1024,
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 4096
}

response = requests.post("http://localhost:8000/api/inference", json=payload)
data = response.json()

print(data["raw_text"])  # bounding box string
print(data["stats"])     # {"tps": "84.3", "bps": "13.7", "switch_to_ar": "1", ...}

License

This repo (Apache 2.0): The API wrapper code is licensed under Apache 2.0.

Model weights (NVIDIA Non-Commercial): The weights are downloaded at runtime from Hugging Face and are governed by the NVIDIA Model License.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LocateAnything API

Overview

Requirements

Setup

API

`POST /v1/chat/completions`

`POST /api/inference`

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LocateAnything API

Overview

Requirements

Setup

API

POST /v1/chat/completions

POST /api/inference

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`POST /v1/chat/completions`

`POST /api/inference`

Packages