7 changes: 1 addition & 6 deletions Dockerfile
@@ -13,9 +13,4 @@ RUN uv sync
COPY --chmod=755 . .

# Container start script
CMD ["uv", "run", "gunicorn", "main:app", \
"--timeout", "300", \
"--graceful-timeout", "30", \
"-k", "uvicorn.workers.UvicornWorker", \
"-w", "1", \
"-b", "0.0.0.0:5000"]
CMD ["uv", "run", "gunicorn", "main:app", "-k", "uvicorn.workers.UvicornWorker", "-w", "2", "--bind", "0.0.0.0:5000", "--timeout", "30", "--graceful-timeout", "15", "--max-requests", "1000", "--max-requests-jitter", "100", "--keep-alive", "5", "--access-logfile", "-", "--error-logfile", "-"]
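The flags on the new `CMD` line could also live in a Gunicorn configuration file, which is an ordinary Python module. The following is only a sketch of what an equivalent `gunicorn.conf.py` might look like; the file is hypothetical and not part of this change. The setting names are Gunicorn's configuration-file counterparts of the CLI flags used above.

```python
# gunicorn.conf.py (hypothetical file, not part of this PR).
# Each setting mirrors one flag from the new CMD line in the Dockerfile.

bind = "0.0.0.0:5000"                           # --bind
workers = 2                                     # -w
worker_class = "uvicorn.workers.UvicornWorker"  # -k
timeout = 30                                    # --timeout (seconds)
graceful_timeout = 15                           # --graceful-timeout (seconds)
max_requests = 1000                             # --max-requests
max_requests_jitter = 100                       # --max-requests-jitter
keepalive = 5                                   # --keep-alive (seconds)
accesslog = "-"                                 # --access-logfile, "-" means stdout
errorlog = "-"                                  # --error-logfile, "-" means stderr
```

With a file like this, the start command would reduce to `gunicorn -c gunicorn.conf.py main:app`, which keeps the Dockerfile line short and lets the settings be commented individually.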
16 changes: 11 additions & 5 deletions README.md
@@ -1,25 +1,30 @@
# Wikidata Textifier

**Wikidata Textifier** is an API that transforms Wikidata entities into compact outputs for LLM and GenAI use cases.
It resolves missing labels for properties and claim values using the Wikidata Action API and caches labels to reduce repeated lookups.
**Wikidata Textifier** is an API that transforms entities into compact outputs for LLM and GenAI use cases.
It resolves missing labels for properties and claim values using the Wikidata/Wikibase Action API and caches labels to reduce repeated lookups.

Live API: [wd-textify.wmcloud.org](https://wd-textify.wmcloud.org/) \
API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)

## Features

- Textify Wikidata entities as `json`, `text`, or `triplet`.
- Textify entities as `json`, `text`, or `triplet`.
- Resolve labels for linked entities and properties.
- Cache labels in MariaDB for faster repeated requests.
- Support multilingual output with fallback language support.
- Avoid SPARQL and use Wikidata Action API / EntityData endpoints.
- Avoid SPARQL and use Wikibase Action API / EntityData endpoints.

## Output Formats

- `json`: Structured representation with claims (and optionally qualifiers/references).
- `text`: Readable summary including label, description, aliases, and attributes.
- `triplet`: Triplet-style lines with labels and IDs for graph-style traversal.

## Future Plan

- Replace Action API with GraphQL once Wikibase GraphQL is available for Wikibases:
[Wikidata: Wikibase GraphQL](https://www.wikidata.org/wiki/Wikidata:Wikibase_GraphQL)

## API

### `GET /`
@@ -28,7 +33,7 @@ API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)

| Name | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Comma-separated Wikidata IDs (for example: `Q42` or `Q42,Q2`). |
| `id` | string | Yes | Comma-separated entity IDs (for example: `Q42` or `Q42,Q2`). |
| `pid` | string | No | Comma-separated property IDs to filter claims (for example: `P31,P279`). |
| `lang` | string | No | Preferred language code (default: `en`). |
| `fallback_lang` | string | No | Fallback language code (default: `en`). |
@@ -37,6 +42,7 @@ API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)
| `all_ranks` | bool | No | Include all statement ranks instead of preferred/normal filtering (default: `false`). |
| `qualifiers` | bool | No | Include qualifiers in claim values (default: `true`). |
| `references` | bool | No | Include references in claim values (default: `false`). |
| `action_api_url` | string | No | Action API URL (default: `https://www.wikidata.org/w/api.php`). |

#### Example requests
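
As a rough illustration of how the query parameters in the table combine, the sketch below builds a request URL with the standard library only and sends nothing over the network. The helper name `build_textify_url` is hypothetical. The `format` parameter is taken from the endpoint docstring in `main.py`, and `action_api_url` is only added when a non-default Wikibase is targeted.

```python
from urllib.parse import urlencode

# Live API base URL as given in the README.
BASE_URL = "https://wd-textify.wmcloud.org/"


def build_textify_url(ids, pids=None, lang="en", fmt="json", action_api_url=None):
    """Build a GET / URL for the textifier (hypothetical helper, for illustration)."""
    params = {"id": ",".join(ids), "lang": lang, "format": fmt}
    if pids:
        # Optional filter: only claims for these properties are returned.
        params["pid"] = ",".join(pids)
    if action_api_url:
        # Optional: point the service at another Wikibase Action API.
        params["action_api_url"] = action_api_url
    return BASE_URL + "?" + urlencode(params)


url = build_textify_url(["Q42", "Q2"], pids=["P31", "P279"], fmt="triplet")
print(url)
```

The resulting string can be passed to any HTTP client. Commas in the ID lists are percent-encoded by `urlencode`, which the server decodes back into the comma-separated form.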

2 changes: 0 additions & 2 deletions docker-compose.yml
@@ -13,8 +13,6 @@ services:
volumes:
- ./data/mysql:/var/lib/mysql
- ./docker-entrypoint-initdb:/docker-entrypoint-initdb.d
ports:
- "3306:3306"
healthcheck:
test: ["CMD-SHELL", "mariadb-admin ping -h 127.0.0.1 -u root -p$${MARIADB_ROOT_PASSWORD} --silent"]
interval: 5s
122 changes: 54 additions & 68 deletions main.py
@@ -1,4 +1,4 @@
"""FastAPI application that exposes Wikidata textification endpoints."""
"""FastAPI application that exposes Wikidata/Wikibase textification endpoints."""

import os
import time
@@ -9,7 +9,7 @@
from fastapi.middleware.cors import CORSMiddleware

from src import utils
from src.Normalizer import JSONNormalizer, TTLNormalizer
from src.Normalizer import JSONNormalizer
from src.WikidataLabel import LazyLabelFactory, WikidataLabel

# Start FastAPI app
@@ -45,7 +45,7 @@ async def startup():
"/",
responses={
200: {
"description": "Returns a list of relevant Wikidata property PIDs with similarity scores",
"description": "Returns textified entities keyed by requested IDs",
"content": {
"application/json": {
"example": [
@@ -57,8 +57,21 @@ async def startup():
},
},
422: {
"description": "Missing or invalid query parameter",
"content": {"application/json": {"example": {"detail": "Invalid format specified"}}},
"description": "Validation error for missing or invalid query parameters",
"content": {
"application/json": {
"example": {
"detail": [
{
"type": "missing",
"loc": ["query", "id"],
"msg": "Field required",
"input": None,
}
]
}
}
},
},
},
)
@@ -74,15 +87,16 @@ async def get_textified_wd(
all_ranks: bool = False,
qualifiers: bool = True,
fallback_lang: str = "en",
action_api_url: str = "https://www.wikidata.org/w/api.php",
):
"""Retrieve Wikidata entities as structured JSON, natural text, or triplet lines.
"""Retrieve entities as structured JSON, natural text, or triplet lines.

This endpoint fetches one or more entities, resolves missing labels, and normalizes
claims into a compact representation suitable for downstream LLM use.

**Args:**

- **id** (str): Comma-separated Wikidata IDs to fetch (for example: `"Q42"` or `"Q42,Q2"`).
- **id** (str): Comma-separated entity IDs to fetch (for example: `"Q42"` or `"Q42,Q2"`).
- **pid** (str, optional): Comma-separated property IDs used to filter returned claims (for example: `"P31,P279"`).
- **lang** (str): Preferred language code for labels and formatted values.
- **format** (str): Output format. One of `"json"`, `"text"`, or `"triplet"`.
@@ -91,6 +105,8 @@ async def get_textified_wd(
- **all_ranks** (bool): If `true`, include preferred, normal, and deprecated statement ranks.
- **qualifiers** (bool): If `true`, include qualifiers for claim values.
- **fallback_lang** (str): Fallback language used when `lang` is unavailable.
- **action_api_url** (str): Action API URL
(default: `https://www.wikidata.org/w/api.php`).

**Returns:**

@@ -107,74 +123,44 @@ async def get_textified_wd(
filter_pids = [p.strip() for p in pid.split(",")]

qids = [q.strip() for q in id.split(",")]
label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang)
label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang, wb_url=action_api_url)

# JSON is used with Action API for bulk retrieval
entities = {}
if len(qids) == 1:
# When one QID is requested, TTL is used
try:
entity_data = utils.get_wikidata_ttl_by_id(qids[0], lang=lang)
except requests.HTTPError:
entity_data = None

if not entity_data:
response = "ID not found"
raise HTTPException(status_code=404, detail=response)

entity_data = TTLNormalizer(
entity_id=qids[0],
ttl_text=entity_data,
try:
entity_data = utils.get_wikidata_json_by_ids(qids, action_api_url=action_api_url)
except requests.HTTPError:
entity_data = None
if not entity_data:
response = "IDs not found"
raise HTTPException(status_code=404, detail=response)

entity_data = {
qid: JSONNormalizer(
entity_id=qid,
entity_json=entity_data[qid],
lang=lang,
fallback_lang=fallback_lang,
label_factory=label_factory,
debug=False,
)

entities = {
qids[0]: entity_data.normalize(
external_ids=external_ids,
all_ranks=all_ranks,
references=references,
filter_pids=filter_pids,
qualifiers=qualifiers,
)
}
else:
# JSON is used with Action API for bulk retrieval
try:
entity_data = utils.get_wikidata_json_by_ids(qids)
except requests.HTTPError:
entity_data = None
if not entity_data:
response = "IDs not found"
raise HTTPException(status_code=404, detail=response)

entity_data = {
qid: JSONNormalizer(
entity_id=qid,
entity_json=entity_data[qid],
lang=lang,
fallback_lang=fallback_lang,
label_factory=label_factory,
debug=False,
)
if entity_data.get(qid)
else None
for qid in qids
}

entities = {
qid: entity.normalize(
external_ids=external_ids,
all_ranks=all_ranks,
references=references,
filter_pids=filter_pids,
qualifiers=qualifiers,
)
if entity
else None
for qid, entity in entity_data.items()
}
if entity_data.get(qid)
else None
for qid in qids
}

entities = {
qid: entity.normalize(
external_ids=external_ids,
all_ranks=all_ranks,
references=references,
filter_pids=filter_pids,
qualifiers=qualifiers,
)
if entity
else None
for qid, entity in entity_data.items()
}

return_data = {}
for qid, entity in entities.items():
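With the TTL branch removed, a single ID and several IDs take the same path: one bulk Action API lookup, then one result key per requested ID, with `None` for IDs that returned no data. The sketch below illustrates that pattern under stated assumptions. The function names are hypothetical and this is not the repository's `utils.get_wikidata_json_by_ids`. It assumes the standard Wikibase `wbgetentities` module, which accepts pipe-separated IDs, and it only builds the URL without performing a request.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

# Default taken from the new `action_api_url` parameter in this change.
DEFAULT_ACTION_API = "https://www.wikidata.org/w/api.php"


def build_wbgetentities_url(ids: List[str], action_api_url: str = DEFAULT_ACTION_API) -> str:
    """Build one Action API URL that asks for several entities at once."""
    params = {"action": "wbgetentities", "ids": "|".join(ids), "format": "json"}
    return f"{action_api_url}?{urlencode(params)}"


def key_by_requested_ids(ids: List[str], fetched: Dict[str, dict]) -> Dict[str, Optional[dict]]:
    """Return one key per requested ID; IDs with no (or empty) data map to None."""
    return {qid: fetched.get(qid) or None for qid in ids}


request_url = build_wbgetentities_url(["Q42", "Q2"])
# Pretend only Q42 came back from the API.
result = key_by_requested_ids(["Q42", "Q0"], {"Q42": {"id": "Q42"}})
print(request_url)
print(result)
```

Keeping a `None` entry for each unresolved ID lets a caller tell "requested but not found" apart from "never requested", which matters once a single call can carry many IDs.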
2 changes: 1 addition & 1 deletion src/Normalizer/JSONNormalizer.py
@@ -1,4 +1,4 @@
"""Normalize Wikidata Action API JSON into internal textifier objects."""
"""Normalize Wikidata/Wikibase Action API JSON into internal textifier objects."""

from __future__ import annotations

27 changes: 7 additions & 20 deletions src/Textifier/WikidataTextifier.py
@@ -52,10 +52,7 @@ def __bool__(self) -> bool:

def to_json(self) -> Dict[str, Any]:
"""Serialize to a JSON object."""
return {
"text": self.text,
"lang": self.lang
}
return {"text": self.text, "lang": self.lang}


@dataclass(slots=True)
@@ -73,10 +70,7 @@ def __str__(self) -> str:
def __bool__(self) -> bool:
"""Return whether both latitude and longitude are present."""
# coordinates are meaningful if we have both lat/lon
return (
self.latitude is not None
and self.longitude is not None
)
return self.latitude is not None and self.longitude is not None

def to_json(self) -> Dict[str, Any]:
"""Serialize coordinates to a JSON object."""
@@ -164,11 +158,7 @@ class WikidataEntity:

def __bool__(self) -> bool:
"""Return whether this entity has a usable id and label."""
return (
bool(self.id)
and self.label is not None
and str(self.label) != ""
)
return bool(self.id) and self.label is not None and str(self.label) != ""

def to_text(self, lang="en") -> str:
"""Render the entity into a readable text."""
@@ -184,7 +174,7 @@ def to_text(self, lang="en") -> str:
string += f" {lang_var[', '].join(map(str, self.aliases))}"

attributes = [c.to_text(lang) for c in self.claims]
attributes= [a for a in attributes if a] # filter out empty attributes
attributes = [a for a in attributes if a] # filter out empty attributes

if len(attributes) > 0:
attributes = "\n- ".join(attributes)
@@ -236,10 +226,7 @@ class WikidataClaim:

def __bool__(self) -> bool:
"""Return whether this claim contains a value."""
return (
bool(self.property)
and any(bool(v) for v in self.values)
)
return bool(self.property) and any(bool(v) for v in self.values)

def to_text(self, lang="en") -> str:
"""Render the claim into a readable text."""
@@ -302,8 +289,8 @@ class WikidataClaimValue:
value: Optional[
Union[
WikidataEntity, WikidataQuantity, WikidataTime, WikidataCoordinates, WikidataText, WikidataMonolingualText
]
] = None
]
] = None
qualifiers: List[WikidataClaim] = field(default_factory=list)
references: List[List[WikidataClaim]] = field(default_factory=list)
rank: Optional[str] = None # preferred|normal|deprecated