docs: improve the find data page to include information about queries, cache tables, and MCP #69
| # Find data | ||
| The [data portal](https://data.allenneuraldynamics.org/assets) is a tool for finding and exploring data assets. Currently, you can search all assets that have V2 metadata and easily click links to go to the Code Ocean data asset, metadata, and QC report. | ||
| Each raw asset uploaded from a platform at AIND produces a group of derived assets, one per modality. You can find these assets easily by performing a query on our metadata database using your project name and other fields unique to your project. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**. |
This organization is true for phys/behavior, but not for other modalities.
For some SPIM, I think it's just one derived asset, which is fine because it's one modality.
But for other SPIM, I think there are many different derived assets that have more to do with clustering results in time.
Maybe we can just get rid of the first sentence and start with "You can find data assets by performing a query".
| ## Query DocDB | ||
| DocDB queries are dictionaries (key-value pairs) that return a set of data assets. Analysis pipelines are required to use a query as the first step in gathering data for analysis **and** to filter assets according to passing quality control criteria. We recommend using the MCP server to gain familiarity with the patterns used for creating queries. |
I actually think we need more information here.
It's a MongoDB query that uses a particular language/organization. These can be run in Python.
Helen can probably point us to some resources to direct people to, but I do think the last line about using the MCP to develop the queries is important.
oh, how is this meant to be different from the aind-data-access-api? I think I'm conflating the two - where/how would one do DocDB queries separate from the aind-data-access-api?
The query I anticipate for analysis workflows is using the aind-data-access-api, is that not true?
I cleaned it up, hopefully it makes more sense now
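For anyone following this thread, here is a minimal sketch of what "a query is a dictionary" can look like in Python. The field names (`data_description.project_name`, `data_description.data_level`), the project name, and the sample records are assumptions made up for illustration and are not checked against the V2 schema. `matches` is a hypothetical local stand-in that handles only equality and `$in`; in a real workflow the same dictionary would presumably be sent to the database through aind-data-access-api rather than evaluated locally.

```python
# Hypothetical example: field names and values below are assumptions.
query = {
    "data_description.project_name": "Example Project",
    "data_description.data_level": {"$in": ["derived"]},
}


def get_path(doc, dotted_key):
    """Follow a dotted key such as 'a.b' into nested dictionaries."""
    current = doc
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(doc, query):
    """Local stand-in for the database: supports equality and $in only."""
    for key, condition in query.items():
        value = get_path(doc, key)
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Made-up records standing in for metadata documents.
records = [
    {"name": "asset_a",
     "data_description": {"project_name": "Example Project", "data_level": "derived"}},
    {"name": "asset_b",
     "data_description": {"project_name": "Example Project", "data_level": "raw"}},
    {"name": "asset_c",
     "data_description": {"project_name": "Other Project", "data_level": "derived"}},
]

# Keep only the records that satisfy every key in the query.
selected = [r["name"] for r in records if matches(r, query)]
print(selected)
```

The point is only that each key names a (possibly nested) metadata field and each value is either a literal to match or an operator expression, which is the pattern the MCP server can help people discover.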
| ### AI (MCP Server) |
I'd title this as "MCP Server (AI)"
| ### Fast queries through the cache | ||
| Metadata queries to the database can be very slow. The [`zombie-squirrel`](https://github.com/AllenNeuralDynamics/zombie-squirrel/) package exposes a cache of some fields in the V2 metadata, making them available with much lower latency. The metadata cache is updated at midnight; do not use it if you need immediate access to assets. |
I'd put the last sentence as a parenthetical note.
Is it possible to list the fields that it caches so people know what they can use this for?
The tables are listed in the readme, I linked there. I'll also update the readme so it has more information about what fields are cached.
| qc_df = qc(subject_id=subject_id) | ||
| if qc_df.empty or "status" not in qc_df.columns: | ||
| continue | ||
| for _, row in subject_assets.iterrows(): |
I feel like there's an easier way to just ask if all metrics are status==Pass?
I'm fine with this example, but it feels complicated in a way that might overwhelm people. But not a deal breaker for me.
I think we need to wrap this in a helper function then. I'll revisit this once I have this implemented in biodata-query
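A rough sketch of what such a helper might look like, in case it is useful before the biodata-query version exists. The name `all_metrics_pass` is hypothetical, and the passing value `"Pass"` is taken from the comment above rather than from the QC schema. Treating "no metrics at all" as not passing is also an assumption, chosen to mirror the `continue` on an empty `qc_df` in the example under review.

```python
def all_metrics_pass(statuses, passing="Pass"):
    """Return True only if there is at least one status and every one equals `passing`.

    `statuses` can be any iterable of status values, for example a list
    or a DataFrame column such as qc_df["status"].
    """
    statuses = list(statuses)
    # An empty collection is treated as "not passing" (an assumption).
    return bool(statuses) and all(status == passing for status in statuses)


# Usage with plain lists; a DataFrame column would be passed the same way.
print(all_metrics_pass(["Pass", "Pass"]))
print(all_metrics_pass(["Pass", "Fail"]))
print(all_metrics_pass([]))
```

With something like this, the loop in the docs example could shrink to a single check per asset, which might address the concern about overwhelming readers.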
saskiad left a comment
two small comments - one is a typo, the other a small suggestion that you are free to ignore.
| Raw assets uploaded from platforms at AIND are run through automated pipelines that produce derived assets. You can find these assets by performing a query on our metadata database using your project name and other fields unique to your experiment. **All analyses at AIND should begin with a query that returns a group of data assets, filtered by passing quality control**. |
This and the statement below ("Analysis workflows are required to use a query...") are quite a bit stricter than I've heard this expressed before. I think if this is actually a new requirement we should be clearer about it and introduce it explicitly somehow rather than hidden in these how-to docs. Even if I'm missing something and this isn't actually new, I still feel like requirements or "shoulds" are clearer and more effective if separated into their own pages and linked - they seem like a different category of docs in the diataxis sense.
I've been saying that this needs to be a requirement since we started the upgrade to V2 and pushing people to do QC on assets. I'm not sure how many other requirements there are though and whether an entire page for them is necessary.
No description provided.