docs: improve the find data page to include information about queries, cache tables, and MCP by dbirman · Pull Request #69 · AllenNeuralDynamics/aind-software-docs

dbirman · 2026-03-20T16:07:07Z

No description provided.

saskiad · 2026-03-20T20:46:19Z

 # Find data

-The [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report.
+Each raw asset uploaded from a platform at AIND produces a group of derived assets, one per modality. You can find these assets easily by performing a query on our metadata database using your project name and other fields unique to your project. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**.


This organization holds for phys/behavior, but not for other modalities.
For some SPIM, I think it's just one derived asset, which is fine because it's one modality.
But for other SPIM, I think there are many different derived assets that have more to do with clustering results in time.

Maybe we can just get rid of the first sentence and start with "you can find data assets by performing a query".

saskiad · 2026-03-20T20:50:03Z

+
+## Query DocDB
+
+DocDB queries are dictionaries (key-value pairs) that return a set of data assets. Analysis pipelines are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries. 


I actually think we need more information here.
It's a MongoDB query that uses a particular language/organization. These can be run in Python.
Helen can probably point us to some resources to direct people to, but I do think the last line, about using the MCP to develop the queries, is important.

oh, how is this meant to be different from the aind-data-access-api? I think I'm conflating the two - where/how would one do DocDB queries separate from the aind-data-access-api?

The query I anticipate for analysis workflows is using the aind-data-access-api, is that not true?

I cleaned it up, hopefully it makes more sense now
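
For readers following this thread, the "dictionary that returns a set of data assets" idea can be illustrated without a database connection. The sketch below is only an illustration under stated assumptions: the records, the field names (`data_description.project_name`, `subject.subject_id`), and the helper functions `get_nested`, `matches`, and `find` are all hypothetical, and the matcher handles equality on dotted keys only. It is not how DocDB or aind-data-access-api evaluates a query.

```python
from typing import Any, Dict, Iterable, List


def get_nested(record: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'data_description.project_name'.

    Returns None when any part of the path is missing.
    """
    value: Any = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(record: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Equality-only matching: every key in the query must equal the record's value."""
    return all(get_nested(record, key) == expected for key, expected in query.items())


def find(records: Iterable[Dict[str, Any]], query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records that satisfy the query dictionary."""
    return [record for record in records if matches(record, query)]


# Hypothetical records and field names, for illustration only.
records = [
    {"name": "asset_a", "data_description": {"project_name": "Example Project"},
     "subject": {"subject_id": "000001"}},
    {"name": "asset_b", "data_description": {"project_name": "Other Project"},
     "subject": {"subject_id": "000001"}},
    {"name": "asset_c", "data_description": {"project_name": "Example Project"},
     "subject": {"subject_id": "000002"}},
]

# The query itself is just a dictionary of key-value pairs.
query = {"data_description.project_name": "Example Project"}
matching_names = [record["name"] for record in find(records, query)]
print(matching_names)  # ['asset_a', 'asset_c']
```

Adding a second key to the dictionary, for example a subject ID, narrows the result because every key-value pair has to match.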

saskiad · 2026-03-20T20:50:24Z

+
+DocDB queries are dictionaries (key-value pairs) that return a set of data assets. Analysis pipelines are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries. 
+
+### AI (MCP Server)


I'd title this as "MCP Server (AI)"

saskiad · 2026-03-20T20:51:40Z

+
+### Fast queries through the cache
+
+Metadata queries to the database can be very slow. The [`zombie-squirrel`](https://github.com/AllenNeuralDynamics/zombie-squirrel/) package exposes a cache of some fields in the V2 metadata making them available with much lower latency. The metadata cache is updated at midnight, do not use it if you need immediate access to assets.


I'd put the last sentence as a parenthetical note.

Is it possible to list the fields that it caches so people know what they can use this for?

The tables are listed in the readme, I linked there. I'll also update the readme so it has more information about what fields are cached.

saskiad · 2026-03-25T17:58:27Z

+    qc_df = qc(subject_id=subject_id)
+    if qc_df.empty or "status" not in qc_df.columns:
+        continue
+    for _, row in subject_assets.iterrows():


I feel like there's an easier way to just ask if all metrics are status==Pass?

I'm fine with this example, but it feels complicated in a way that might overwhelm people. But not a deal breaker for me.

I think we need to wrap this in a helper function then. I'll revisit this once I have this implemented in biodata-query
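
One possible shape for such a helper is sketched below, purely as an illustration of the "all metrics are status==Pass" check raised above. It is not the biodata-query implementation. The function name `assets_passing_qc` and the `asset_name` and `metric` keys are hypothetical; only the `status` column and the `Pass` value come from this thread. The sketch works on plain row dictionaries so it runs without extra dependencies.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def assets_passing_qc(
    rows: Iterable[Mapping[str, Any]],
    asset_key: str = "asset_name",  # hypothetical column name
    status_key: str = "status",
    passing: str = "Pass",
) -> List[str]:
    """Return the sorted names of assets whose every QC metric has the passing status.

    Assets with no QC rows never appear in the result, so they are excluded
    rather than treated as passing.
    """
    statuses: Dict[str, List[Any]] = defaultdict(list)
    for row in rows:
        # A row without a status is recorded as None, which fails the check below.
        statuses[row[asset_key]].append(row.get(status_key))
    return sorted(
        name
        for name, values in statuses.items()
        if all(value == passing for value in values)
    )


# Hypothetical QC rows, one per metric per asset.
rows = [
    {"asset_name": "asset_a", "metric": "drift", "status": "Pass"},
    {"asset_name": "asset_a", "metric": "noise", "status": "Pass"},
    {"asset_name": "asset_b", "metric": "drift", "status": "Pass"},
    {"asset_name": "asset_b", "metric": "noise", "status": "Fail"},
    {"asset_name": "asset_c", "metric": "drift", "status": "Pending"},
]

passing_assets = assets_passing_qc(rows)
print(passing_assets)  # ['asset_a']
```

If the QC table is a pandas DataFrame, as `qc_df` appears to be in the example under review, `qc_df.to_dict("records")` would produce rows in this shape.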

saskiad

two small comments - one is a typo, the other a small suggestion that you are free to ignore.

tmchartrand · 2026-04-16T19:08:51Z

 # Find data

-The [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report.
+Raw assets uploaded from platforms at AIND are run through automated pipelines that produce derived assets. You can find these assets by performing a query on our metadata database using your project name and other fields unique to your experiment. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**.


This and the statement below ("Analysis workflows are required to use a query...") are quite a bit stricter than I've heard this expressed before. I think if this is actually a new requirement we should be clearer about it and introduce it explicitly somehow rather than hidden in these how-to docs. Even if I'm missing something and this isn't actually new, I still feel like requirements or "shoulds" are clearer and more effective if separated into their own pages and linked - they seem like a different category of docs in the diataxis sense.

I've been saying that this needs to be a requirement since we started the upgrade to V2 and pushing people to do QC on assets. I'm not sure how many other requirements there are though and whether an entire page for them is necessary.

dbirman added 3 commits March 19, 2026 16:26

docs: added content to the find_data page

8767dd4

docs: fix qc filter example with zs

e605b25

fix: link out to adap docs

e852b51

dbirman linked an issue Mar 20, 2026 that may be closed by this pull request

find data should include information on using MCP #49

Open

dbirman requested a review from saskiad March 20, 2026 19:54

saskiad reviewed Mar 20, 2026

Comment thread docs/source/explore_analyze/find_data.md

saskiad reviewed Mar 20, 2026

Comment thread docs/source/explore_analyze/find_data.md

saskiad requested changes Mar 20, 2026

docs: changes from review

221f134

dbirman requested a review from saskiad March 24, 2026 03:15

saskiad reviewed Mar 25, 2026

Comment thread docs/source/explore_analyze/find_data.md Outdated

saskiad approved these changes Mar 25, 2026

docs: fix typo

4a8bd80

tmchartrand reviewed Apr 16, 2026

