Vertex AI Agent Engine: reasoning engine with secret-backed env vars (secret_env) deploys but has zero running instances #6863

@itaykz

Environment details

  • OS type and version: managed Agent Engine runtime (deployed from macOS)
  • Python version: 3.11 (runtime python_spec.version=3.11; reproduced on the 3.11 deploy path)
  • google-cloud-aiplatform version: 1.156.0
  • Region: us-central1

Summary

Deploying a reasoning engine (Agent Engine) whose deployment_spec contains secret-backed environment variables (secret_env in Terraform, or env_vars with a {"secret": SECRET_ID, "version": ...} value in the SDK) creates the engine successfully (done: true, no error), but the engine never starts any running instances. stream_query then fails with:

FAILED_PRECONDITION: ... does not have running instances. It's likely that it does
not have a valid 'spec.package_spec' configuration.

An identical deployment with the same value passed as a plain env var (no secret reference) starts instances and serves normally. The only difference between "works" and "0 instances" is the presence of secretEnv in the deploymentSpec.

Steps to reproduce

  1. Create a Secret Manager secret in the same project (one enabled version).
  2. Grant roles/secretmanager.secretAccessor on it to the runtime service account (and, per the docs, to the Vertex AI Service Agent service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com and the Reasoning Engine Service Agent service-PROJECT_NUMBER@gcp-sa-aiplatform-re.iam.gserviceaccount.com).
  3. Deploy an agent with a secret-backed env var (code below).
  4. Wait for the create LRO to finish — it succeeds (done: true, no error).
  5. Call stream_query → FAILED_PRECONDITION ... does not have running instances.
  6. Deploy the same agent with the value as a plain env var instead → instances start and it serves.
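
The principals named in step 2 can be enumerated with a small helper before applying the grants. This is a minimal sketch: the function name and the example runtime service account are hypothetical, and only the two service-agent address patterns are taken from the steps above.

```python
def secret_accessor_members(project_number: str, runtime_sa: str) -> list[str]:
    """Return the IAM members that step 2 grants roles/secretmanager.secretAccessor to.

    `runtime_sa` is the email of the engine's runtime service account
    (hypothetical input; use whatever account your deployment runs as).
    """
    return [
        # Runtime service account that the agent process runs as.
        f"serviceAccount:{runtime_sa}",
        # Vertex AI Service Agent.
        f"serviceAccount:service-{project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com",
        # Reasoning Engine Service Agent.
        f"serviceAccount:service-{project_number}@gcp-sa-aiplatform-re.iam.gserviceaccount.com",
    ]


if __name__ == "__main__":
    # Placeholder project number and service account, for illustration only.
    for member in secret_accessor_members(
        "123456789012", "my-runtime-sa@my-project.iam.gserviceaccount.com"
    ):
        print(member)
```

Each printed member would then receive roles/secretmanager.secretAccessor on the secret, by whatever IAM tooling the project already uses.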

Code example

import vertexai
from vertexai import agent_engines

vertexai.init(project="PROJECT", location="us-central1",
              staging_bucket="gs://STAGING_BUCKET")

# `app` is any AdkApp / custom ReasoningEngine. The trigger is purely the
# secret-backed env var, not the agent code.

# CONTROL: starts running instances, serves fine.
agent_engines.create(app, requirements=[...], display_name="control")

# REPRO: created successfully, but 0 running instances.
agent_engines.create(
    app, requirements=[...], display_name="repro-secret",
    env_vars={"MY_SECRET": {"secret": "my-secret", "version": "latest"}},
)

Stack trace (stream_query against the secret_env engine)

google.api_core.exceptions.FailedPrecondition: 400 The requested resource
[projects/.../locations/us-central1/reasoningEngines/...] does not have running
instances. It's likely that it does not have a valid 'spec.package_spec'
configuration. Please update the resource with a valid 'spec.package_spec' and
then try again.
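
One way to tell a slow startup apart from a permanent zero-instance state is to retry the query a few times and match on the error text above. The following is a sketch only: the helper name, the attempt count, and the delay are assumptions, and `probe` stands in for any callable that issues a stream_query against the engine.

```python
import time
from typing import Callable

# Substring taken from the FailedPrecondition message in the stack trace.
NO_INSTANCES_MARKER = "does not have running instances"


def wait_for_running_instances(
    probe: Callable[[], object],
    attempts: int = 5,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it stops raising the 'no running instances' error.

    Returns True as soon as probe() succeeds, and False if every attempt
    raised an error containing NO_INSTANCES_MARKER. Any other exception is
    re-raised unchanged so unrelated failures are not hidden.
    """
    for attempt in range(attempts):
        try:
            probe()
            return True
        except Exception as exc:
            if NO_INSTANCES_MARKER not in str(exc):
                raise
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(delay_s)
    return False
```

A False result after a generous number of attempts is consistent with the behaviour reported here, where the engine stays at zero instances rather than starting late.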

Diagnosis already performed (rules out the usual causes)

With Secret Manager DATA_READ audit logging enabled and a controlled A/B/C bisect of otherwise-identical engines:

  • The secret value is read successfully at instance startup by the runtime service account (AccessSecretVersion, code=OK, zero denials).
  • The application starts normally — runtime logs show Started server process, Application startup complete, Uvicorn running on http://0.0.0.0:8080. The only anomaly is that the platform never marks the instance "running", and no request/health-probe logs follow startup.
  • Independent of IAM grants: reproduced with secretAccessor granted to the runtime SA, the Reasoning Engine Service Agent (gcp-sa-aiplatform-re), and the Vertex AI Service Agent (gcp-sa-aiplatform) — all three, fully propagated → still 0 instances. Audit logs show only the runtime SA ever reads the secret; the platform service agents never attempt a read.
  • Independent of secret version format: "latest" and a pinned numeric version both fail identically.
  • Independent of Python version (3.11) and deploy tool (reproduced via both the vertexai SDK and Terraform's google_vertex_ai_reasoning_engine secret_env).
  • The control (same engine, value as a plain env var) → running instances.

Expected: a reasoning engine with secret-backed env vars should start instances, since the secret is readable and the app boots.

Actual: zero running instances whenever secretEnv is present in the deployment_spec, regardless of IAM/version/Python/tooling.
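
For anyone repeating the bisect, the env_vars variants and a Cloud Logging filter for the secret reads can be generated as below. This is a sketch under assumptions: the variant names and both helper functions are hypothetical, the three variants are one plausible reading of the A/B/C comparison, and the filter relies on the standard audit-log fields protoPayload.serviceName, protoPayload.methodName, and protoPayload.resourceName.

```python
def bisect_env_vars(
    name: str, plain_value: str, secret_id: str, pinned_version: str
) -> dict[str, dict]:
    """Build env_vars payloads that differ only in how `name` is supplied."""
    return {
        # Control: value passed directly as a plain env var.
        "control-plain": {name: plain_value},
        # Secret reference resolved to the newest enabled version.
        "secret-latest": {name: {"secret": secret_id, "version": "latest"}},
        # Secret reference pinned to one numeric version.
        "secret-pinned": {name: {"secret": secret_id, "version": pinned_version}},
    }


def access_audit_filter(secret_id: str) -> str:
    """Return a Cloud Logging filter for AccessSecretVersion calls on one secret.

    Requires DATA_READ audit logging for Secret Manager to be enabled.
    """
    return (
        'protoPayload.serviceName="secretmanager.googleapis.com" '
        'AND protoPayload.methodName:"AccessSecretVersion" '
        f'AND protoPayload.resourceName:"secrets/{secret_id}/"'
    )


if __name__ == "__main__":
    for label, env_vars in bisect_env_vars("MY_SECRET", "value", "my-secret", "1").items():
        print(label, env_vars)
    print(access_audit_filter("my-secret"))
```

Each payload would be passed as env_vars to an otherwise identical agent_engines.create call, so that any difference in instance state can be attributed to the secret reference alone.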

Related: #5647 — same surface symptom (secret-backed env vars in Agent Engine), but that report was resolved via the custom-service-account / secretAccessor IAM fix. This issue is distinct: the IAM fix is explicitly ruled out above (secret is read OK by the runtime SA; granting all relevant service agents still yields 0 instances).
