Vertex AI Agent Engine: reasoning engine with secret-backed env vars (secret_env) deploys but has zero running instances #6863

#### Environment details
  - OS type and version: managed Agent Engine runtime (deployed from macOS)
  - Python version: 3.11 (runtime `python_spec.version=3.11`)
  - `google-cloud-aiplatform` version: 1.156.0
  - Region: us-central1
#### Summary
Deploying a reasoning engine (Agent Engine) whose `deployment_spec` contains **secret-backed environment variables** — `secret_env` (Terraform) / `env_vars` with a `{"secret": SECRET_ID, "version": ...}` value (SDK) — creates the engine **successfully** (`done: true`, no error) but it **never starts any running instances**. `stream_query` then fails with:
```
FAILED_PRECONDITION: ... does not have running instances. It's likely that it does
not have a valid 'spec.package_spec' configuration.
```
An **identical** deployment with the same value passed as a **plain** env var (no secret reference) starts instances and serves normally. The only difference between "works" and "0 instances" is the presence of `secretEnv` in the `deploymentSpec`.
#### Steps to reproduce
  1. Create a Secret Manager secret in the same project (one enabled version).
  2. Grant `roles/secretmanager.secretAccessor` on it to the runtime service account (and, per the docs, to the Vertex AI Service Agent `service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com` and the Reasoning Engine Service Agent `service-PROJECT_NUMBER@gcp-sa-aiplatform-re.iam.gserviceaccount.com`).
  3. Deploy an agent with a secret-backed env var (code below).
  4. Wait for the create LRO to finish — it succeeds (`done: true`, no `error`).
  5. Call `stream_query` → `FAILED_PRECONDITION ... does not have running instances`.
  6. Deploy the same agent with the value as a **plain** env var instead → instances start and it serves.
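
For step 2, the three principals can be written out in one place. The sketch below only builds the IAM member strings and a single binding dict; it does not call any Google Cloud API. `secret_accessor_binding`, the project number, and the runtime service account email are hypothetical placeholders.

```python
def secret_accessor_binding(project_number: str, runtime_sa: str) -> dict:
    """Build one IAM binding that grants secretAccessor to all three principals."""
    members = [
        # Runtime service account the engine is deployed with (placeholder).
        f"serviceAccount:{runtime_sa}",
        # Vertex AI Service Agent.
        f"serviceAccount:service-{project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com",
        # Reasoning Engine Service Agent.
        f"serviceAccount:service-{project_number}@gcp-sa-aiplatform-re.iam.gserviceaccount.com",
    ]
    return {"role": "roles/secretmanager.secretAccessor", "members": members}


# Placeholder values, for illustration only.
binding = secret_accessor_binding(
    "123456789012", "runtime-sa@my-project.iam.gserviceaccount.com"
)
for member in binding["members"]:
    print(member)
```

The resulting binding would be added to the secret's IAM policy with whatever tool is in use (console, `gcloud`, Terraform).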
#### Code example
```python
import vertexai
from vertexai import agent_engines
vertexai.init(project="PROJECT", location="us-central1",
              staging_bucket="gs://STAGING_BUCKET")
# `app` is any AdkApp / custom ReasoningEngine. The trigger is purely the
# secret-backed env var, not the agent code.
# CONTROL: same value passed as a plain env var; starts running instances, serves fine:
agent_engines.create(
    app, requirements=[...], display_name="control",
    env_vars={"MY_SECRET": "plain-value"},
)
# REPRO — created successfully, but 0 running instances:
agent_engines.create(
    app, requirements=[...], display_name="repro-secret",
    env_vars={"MY_SECRET": {"secret": "my-secret", "version": "latest"}},
)
```
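
To keep the compared deployments otherwise identical, the `env_vars` argument can be generated from one helper. This is a sketch: `bisect_env_vars` is a hypothetical name, and the three variants (plain value, `latest`, pinned version) are an assumed way to split the comparison, not necessarily the exact set of engines that was deployed.

```python
def bisect_env_vars(name: str, secret_id: str, plain_value: str, pinned_version: str) -> dict:
    """Return display_name -> env_vars for a control and two secret-backed variants."""
    return {
        # Control: the value is passed directly, with no secret reference.
        "control-plain": {name: plain_value},
        # Repro: secret reference resolved via the "latest" alias.
        "repro-secret-latest": {name: {"secret": secret_id, "version": "latest"}},
        # Repro: secret reference pinned to a numeric version.
        "repro-secret-pinned": {name: {"secret": secret_id, "version": pinned_version}},
    }


variants = bisect_env_vars("MY_SECRET", "my-secret", "plain-value", "1")
for display_name, env_vars in variants.items():
    # Each pair would be passed to agent_engines.create(..., display_name=..., env_vars=...).
    print(display_name, env_vars)
```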
#### Stack trace (stream_query against the secret_env engine)
```
google.api_core.exceptions.FailedPrecondition: 400 The requested resource
[projects/.../locations/us-central1/reasoningEngines/...] does not have running
instances. It's likely that it does not have a valid 'spec.package_spec'
configuration. Please update the resource with a valid 'spec.package_spec' and
then try again.
```
#### Diagnosis already performed (rules out the usual causes)
With Secret Manager **DATA_READ audit logging** enabled and a controlled A/B/C bisect of otherwise-identical engines:
- The secret **value is read successfully** at instance startup by the **runtime service account** (`AccessSecretVersion`, `code=OK`, zero denials).
- The application **starts normally** — runtime logs show `Started server process`, `Application startup complete`, `Uvicorn running on http://0.0.0.0:8080`. The only anomaly is that the platform never marks the instance "running", and **no request/health-probe logs follow startup**.
- **Independent of IAM grants:** reproduced with `secretAccessor` granted to the runtime SA, the Reasoning Engine Service Agent (`gcp-sa-aiplatform-re`), **and** the Vertex AI Service Agent (`gcp-sa-aiplatform`) — all three, fully propagated → still 0 instances. Audit logs show **only the runtime SA** ever reads the secret; the platform service agents never attempt a read.
- **Independent of secret version format:** `"latest"` and a pinned numeric version both fail identically.
- **Independent of Python version** (3.11) and **deploy tool** (reproduced via both the `vertexai` SDK and Terraform's `google_vertex_ai_reasoning_engine` `secret_env`).
- The **control** (same engine, value as a plain env var) → running instances.

**Expected:** a reasoning engine with secret-backed env vars should start instances, since the secret is readable and the app boots.

**Actual:** zero running instances whenever `secretEnv` is present in the `deploymentSpec`, regardless of IAM/version/Python/tooling.
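
The audit-log check from the diagnosis can be repeated with a Cloud Logging filter along these lines. This is a sketch that only builds the filter string; `access_secret_filter` is a hypothetical helper, and the field names follow the Cloud Audit Logs entry format as I understand it, so treat them as assumptions to verify against a real entry.

```python
def access_secret_filter(project_id: str, secret_id: str) -> str:
    """Build a Logs Explorer filter for AccessSecretVersion calls on one secret."""
    log_name = f"projects/{project_id}/logs/cloudaudit.googleapis.com%2Fdata_access"
    method = "google.cloud.secretmanager.v1.SecretManagerService.AccessSecretVersion"
    return " AND ".join([
        f'logName="{log_name}"',
        'protoPayload.serviceName="secretmanager.googleapis.com"',
        f'protoPayload.methodName="{method}"',
        # Substring match so that every version of the secret is included.
        f'protoPayload.resourceName:"secrets/{secret_id}/"',
    ])


# Placeholder project and secret names.
print(access_secret_filter("my-project", "my-secret"))
```

In the matching entries, the caller should appear under `protoPayload.authenticationInfo.principalEmail`, which is how the runtime service account can be distinguished from the platform service agents.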

---
_Related: #5647 — same surface symptom (secret-backed env vars in Agent Engine), but that report was resolved via the custom-service-account / `secretAccessor` IAM fix. This issue is distinct: the IAM fix is explicitly ruled out above (secret is read OK by the runtime SA; granting all relevant service agents still yields 0 instances)._

