dliebner/LocateAnythingAPI

LocateAnything API

A FastAPI microservice wrapping NVIDIA's LocateAnything-3B spatial grounding model.

Overview

LocateAnything-3B is a 3B-parameter vision-language model that returns bounding boxes for objects, UI elements, and text given a natural language prompt. This repo exposes it over HTTP with two endpoints: an OpenAI-compatible /v1/chat/completions and a telemetry-rich /api/inference for dataset generation pipelines.

Features:

  • /v1/chat/completions: drop-in compatibility with OpenAI vision tool integrations
  • /api/inference: returns tokens/sec, boxes/sec, and decoding-mode fallback stats alongside results
  • Native 4K image support: uses FlashAttention-2 in the vision encoder for efficient memory scaling on large images
  • Native video and multi-image support: process .mp4 videos directly or send an array of sequential image frames
  • Configurable image resizing: short_size is fully customizable (the default cap has been removed)
  • All three LocateAnything decoding modes exposed: hybrid, fast (MTP), and slow (AR)

Requirements

  • NVIDIA GPU with at least ~12 GB of VRAM
  • Up-to-date NVIDIA host drivers (version 580 or later, required for CUDA 13 support)
  • Docker + Docker Compose with NVIDIA Container Toolkit (Linux) or WSL2 GPU passthrough (Windows)
  • Hugging Face account with a Read-access token

Setup

git clone https://github.com/dliebner/LocateAnythingAPI.git
cd LocateAnythingAPI

Create a .env file:

HF_TOKEN=hf_your_token_here
API_PORT=8000  # Defaults to 8000

Build and run:

docker-compose up --build

Note on first build: the first time you build the container, it compiles FlashAttention-2 from source for your specific GPU architecture. This takes 5-10 minutes, but the resulting Docker layer is cached, so later builds skip the compile step unless that layer is invalidated.

First boot also downloads the ~6GB model weights from Hugging Face. Subsequent starts are fast.


API

(Note: The examples below use port 8000. If you changed API_PORT in your .env file, update the URLs accordingly.)
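If you do change the port, a client can derive the URLs from the same setting instead of hardcoding them. The sketch below is a minimal example and rests on an assumption: API_PORT must be present in the client's own environment (Docker Compose reads the .env file, but your shell does not load it automatically). The helper name api_base_url is hypothetical and not part of this repo.

```python
import os


def api_base_url(env=None, host="localhost"):
    """Build the service base URL, falling back to the documented default port 8000."""
    # Assumption: API_PORT has been exported in the client's environment.
    env = os.environ if env is None else env
    port = env.get("API_PORT", "8000")
    return f"http://{host}:{port}"


# The two endpoints described in this README.
CHAT_URL = api_base_url() + "/v1/chat/completions"
INFERENCE_URL = api_base_url() + "/api/inference"
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.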

POST /v1/chat/completions

OpenAI-compatible. Accepts standard parameters (temperature, max_tokens, top_p) plus custom fields: task, model_mode, short_size, and top_k.

import base64
import requests

with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "nvidia/LocateAnything-3B",
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 4096,
    "task": "detect",        # detect | ground_multi | ground_gui | ocr | point
    "model_mode": "hybrid",  # hybrid | fast | slow
    "short_size": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "gem, clover, ring, bat"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }]
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
response.raise_for_status()  # fail on HTTP errors instead of a KeyError below
print(response.json()["choices"][0]["message"]["content"])
# Example output: <ref>bat</ref><box><100><200><150><250></box>...
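The returned content is a tagged string, so downstream code usually needs to turn it into structured data. The following is a minimal parsing sketch based only on the sample output shown above. It assumes every label appears as `<ref>label</ref>` followed by one or more `<box>` groups of four integers. The README does not state the coordinate order or scale, so the tuple is returned as-is without interpretation. The function name parse_boxes is hypothetical.

```python
import re

# One <ref>label</ref> followed by one or more <box>...</box> groups (assumed format).
REF_RE = re.compile(r"<ref>(.*?)</ref>((?:\s*<box>(?:<\d+>){4}</box>)+)")
BOX_RE = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")


def parse_boxes(text):
    """Return a list of (label, (a, b, c, d)) tuples found in the model output."""
    results = []
    for label, boxes in REF_RE.findall(text):
        for match in BOX_RE.finditer(boxes):
            # Coordinates are kept in the order they appear; their meaning
            # (for example x1, y1, x2, y2) is not confirmed by the README.
            results.append((label, tuple(int(v) for v in match.groups())))
    return results


sample = "<ref>bat</ref><box><100><200><150><250></box>"
print(parse_boxes(sample))  # [('bat', (100, 200, 150, 250))]
```

Text that does not match the assumed pattern is silently skipped, so in a labeling pipeline it may be worth comparing the number of parsed boxes against the raw string to catch format changes.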

Video & Multi-Frame Usage: To send video, use a content part of type video_url whose URL is an .mp4 data URI. To send a sequence of images, add one image_url part per frame to the content array.

{
  "role": "user",
  "content": [
      { "type": "text", "text": "Find the cursor" },
      { "type": "video_url", "video_url": { "url": "data:video/mp4;base64,AAAA..." } }
  ]
}
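Building these messages by hand gets repetitive, so here is a small sketch of helpers that assemble them from raw bytes. It follows the message shapes shown above; the function names (data_uri, frames_message, video_message) are hypothetical, and the image/png MIME type default is an assumption you should adjust to your actual frames.

```python
import base64


def data_uri(raw: bytes, mime: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"


def frames_message(prompt, frames, mime="image/png"):
    """Build a user message with one image_url part per frame, in the given order."""
    content = [{"type": "text", "text": prompt}]
    for raw in frames:
        content.append({"type": "image_url", "image_url": {"url": data_uri(raw, mime)}})
    return {"role": "user", "content": content}


def video_message(prompt, video_bytes):
    """Build a user message carrying a single .mp4 video as a data URI."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "video_url", "video_url": {"url": data_uri(video_bytes, "video/mp4")}},
        ],
    }
```

Either result can be placed in the "messages" list of the /v1/chat/completions payload, for example `payload["messages"] = [frames_message("Find the cursor", frames)]`, where frames is a list of bytes objects read from your image files.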

POST /api/inference

Returns the raw output string plus generation stats. Useful for labeling pipelines where you want to track throughput or detect AR fallbacks.

import base64
import requests

with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image_b64": img_b64, # OR a list: [frame1_b64, frame2_b64]
    # "video_b64": "AAAA...", # Alternatively, send an mp4 base64 string
    "prompt": "gem, clover, ring, bat",
    "task": "detect",
    "mode": "hybrid",
    "short_size": 1024,
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 50,
    "max_tokens": 4096
}

response = requests.post("http://localhost:8000/api/inference", json=payload)
response.raise_for_status()  # surface HTTP errors before parsing the body
data = response.json()

print(data["raw_text"])  # bounding box string
print(data["stats"])     # {"tps": "84.3", "bps": "13.7", "switch_to_ar": "1", ...}
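The stats values in the example above are strings, so a pipeline that tracks throughput will probably want to convert them first. The sketch below is one hedged way to do that. It assumes that tps and bps correspond to the tokens/sec and boxes/sec figures mentioned under Features, and that a non-zero switch_to_ar indicates a fallback to autoregressive decoding. Neither reading is spelled out in this README, and the stats dict may contain further keys that are ignored here. The function name summarize_stats is hypothetical.

```python
def summarize_stats(stats: dict) -> dict:
    """Convert the string-valued stats from /api/inference into typed fields."""
    return {
        # Assumed meaning: tokens per second and boxes per second.
        "tokens_per_sec": float(stats["tps"]),
        "boxes_per_sec": float(stats["bps"]),
        # Assumed meaning: any non-zero value signals an AR fallback.
        "ar_fallback": int(stats.get("switch_to_ar", "0")) > 0,
    }


example = {"tps": "84.3", "bps": "13.7", "switch_to_ar": "1"}
print(summarize_stats(example))
```

In a labeling run, the ar_fallback flag could be logged per sample so that items produced after a fallback can be reviewed or regenerated separately.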

License

This repo (Apache 2.0): The API wrapper code is licensed under Apache 2.0.

Model weights (NVIDIA Non-Commercial): The weights are downloaded at runtime from Hugging Face and are governed by the NVIDIA Model License.
