Skip to content

feat: add caching on visibility layer#2

Open
shahsurajk wants to merge 3 commits into
mainfrom
ft/visibility-caching
Open

feat: add caching on visibility layer#2
shahsurajk wants to merge 3 commits into
mainfrom
ft/visibility-caching

Conversation

@shahsurajk

Copy link
Copy Markdown

What changed?

Describe what has changed in this PR.

Why?

Tell your future self why have you made these changes.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Any change is risky. Identify all risks you are aware of. If none, remove this section.

Implements a simple in-memory caching layer for the visibility store to
address performance issues with expensive PostgreSQL queries on the
executions_visibility table.

Problem:
- CountWorkflowExecutions with GROUP BY (e.g., status counts) causes
  full table scans and parallel workers, firing every ~2 minutes
- Scheduler count queries (TemporalNamespaceDivision='TemporalScheduler')
  are frequent and expensive
- ListWorkflowExecutions with ORDER BY coalesce(close_time,...) does
  full table scans on 721K-page, ~3M row table

Solution:
- Simple cachingVisibilityManager wrapping VisibilityManager interface
- Uses map[string]*cacheEntry protected by sync.RWMutex (no external deps)
- SHA256 cache keys from (method, namespaceID, query, pageSize, pageToken)
- Hardcoded TTLs: ListWorkflowExecutions=15s, CountWorkflowExecutions=20s,
  scheduler queries=30s
- Namespace-wide cache invalidation on write operations
- Lazy expiry (no background goroutines)
- Prometheus metrics: temporal_visibility_cache_hits_total and
  temporal_visibility_cache_misses_total with method label

Configuration:
- Disabled by default (opt-in)
- Enable via: visibilityCache.enabled=true
- Configurable TTL override: visibilityCache.cacheTTLSeconds

Files Added:
- caching_visibility_manager.go: Main caching wrapper
- caching_visibility_manager_test.go: Comprehensive test suite (9 tests)
- visibility_cache.go: Config structures
- visibility-cache-config-example.yaml: Documentation

Files Modified:
- factory.go: Wire caching manager into visibility manager creation
- visibility_manager_test.go: Add cacheConfig parameter to tests
- config.go: Add VisibilityCache field to SQL config

Also includes legacy store-level cache implementation (dormant):
- cached_visibility_store.go: Store-level wrapper
- cache/inmemory.go, redis.go, factory.go, noop.go: Cache backends
- Can be enabled later for Redis support or multi-tier caching

Expected Impact:
- 50-80% reduction in PostgreSQL visibility query load
- Faster dashboard/UI rendering
- Cache hit rate >60% after warm-up
- Minimal memory overhead (~10-50MB per pod)

Thread-safe, no goroutine leaks, no global state, no external dependencies.
Adds CI/CD pipeline following organization's standard workflow pattern
for building Temporal server with visibility caching and pushing to AWS ECR.

Workflow Features:
- Builds for test (ap-south-1) and production (us-east-1) environments
- Test builds triggered on branch pushes
- Production builds triggered on tag pushes
- Automatic image tagging with version/branch/sha/date
- Multi-environment AWS credential management

Files Added:
- .github/workflows/build-and-push.yml: GitHub Actions workflow
- ci/Dockerfile: Minimal Alpine-based runtime image
- ci/README.md: CI/CD documentation

Image Details:
- Base: Alpine 3.19
- Size: ~50-100MB
- User: Non-root (temporal:1000)
- Registry: AWS ECR container-images/temporal-server

Deployment:
- Test: Push to branch → test-<branch>-<sha>-<date>
- Production: Push tag → <version> (e.g., v1.25.0-cache)
@shahsurajk

Copy link
Copy Markdown
Author

@ReviewBot review

@shahsurajk shahsurajk changed the title Ft/visibility caching feat: add caching on visibility layer Jun 9, 2026

@vulcanho vulcanho Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Summary

This PR adds a visibility caching layer to reduce PostgreSQL load, plus a GitHub Actions CI pipeline and Dockerfile for ECR deployment. The core caching logic in caching_visibility_manager.go is wired up and functional, but has two significant correctness bugs: (1) invalidateNamespace purges the entire cache regardless of namespace, causing cross-namespace stampedes in multi-tenant deployments; and (2) the cacheTTLSeconds config field is silently ignored because defaultCacheTTL is stored but never used—hardcoded constants govern all TTLs. Additionally, the cache/ sub-package (factory, redis, inmemory, noop) and store/cached_visibility_store.go are entirely dead code that isn't imported or wired anywhere; cache/factory.go also references struct fields that don't exist on *config.VisibilityCacheConfig (compile error on go build ./...), and cache/redis.go imports github.com/redis/go-redis/v9 which is absent from go.mod. The CI workflow has a flavour vs flavor typo that silently drops the latest=true tag configuration, and the Docker healthcheck validates nothing meaningful.

💬 General Feedback

  • code_quality: The common/persistence/visibility/cache/ package (factory.go, inmemory.go, noop.go, redis.go) and common/persistence/visibility/store/cached_visibility_store.go (with its NewCachedVisibilityStore constructor) are entirely dead code — no file in the module imports go.temporal.io/server/common/persistence/visibility/cache, and NewCachedVisibilityStore is never called outside its own file. This appears to be an earlier or alternative caching design that was superseded by caching_visibility_manager.go. Beyond the compile errors already flagged, shipping dead code increases maintenance burden and will confuse future contributors. Decide: either wire up the store-level Redis caching path (fixing the struct mismatch and adding the go-redis dependency), or delete all files in cache/ and store/cached_visibility_store.go to keep the codebase clean.

🔀 Architecture

flowchart TD
    A[Caller / gRPC handler] --> B[cachingVisibilityManager]
    B -->|Read: cache hit| B2[return cached result]
    B -->|Read: cache miss| C[VisibilityManagerImpl]
    C --> D[(PostgreSQL visibility store)]
    C -->|response| B
    B -->|putInCache — hardcoded TTL constants| E[in-memory map\nmap[string]*cacheEntry\nunbounded, lazy expiry]
    B -->|Write ops: start/close/upsert/delete| C
    C -->|success| B3[invalidateNamespace]
    B3 -->|deletes ALL keys, not namespace-scoped| E

    subgraph Dead Code — not wired
        F[cache/ package\nNewVisibilityCache]
        G[store/cached_visibility_store.go\nNewCachedVisibilityStore]
    end

    style F fill:#f99,stroke:#c00
    style G fill:#f99,stroke:#c00
    style B3 fill:#ff9,stroke:#c80
Loading

Comment on lines +20 to +56
switch cfg.Type {
case config.VisibilityCacheTypeInMemory:
maxSize := cfg.InMemory.MaxSize
if maxSize <= 0 {
maxSize = 10000
}
ttl := cfg.InMemory.TTL
if ttl <= 0 {
ttl = 0
}
return NewInMemoryVisibilityCache(maxSize, ttl), nil

case config.VisibilityCacheTypeRedis:
maxRetries := cfg.Redis.MaxRetries
if maxRetries <= 0 {
maxRetries = 3
}
poolSize := cfg.Redis.PoolSize
if poolSize <= 0 {
poolSize = 10
}
ttl := cfg.Redis.TTL
if ttl <= 0 {
ttl = 0
}
return NewRedisVisibilityCache(
cfg.Redis.Endpoints,
cfg.Redis.Password,
cfg.Redis.DB,
ttl,
maxRetries,
poolSize,
logger,
)

default:
return nil, fmt.Errorf("unknown visibility cache type: %s", cfg.Type)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache/factory.go references fields absent from VisibilityCacheConfig, causing compile failure

NewVisibilityCache takes *config.VisibilityCacheConfig but then accesses cfg.Type, cfg.InMemory, and cfg.Redis, none of which exist on that struct (they live on VisibilityCacheConfigLegacy). go build ./... will fail to compile this package. Either change the parameter type to *config.VisibilityCacheConfigLegacy and wire that type through config.go and factory.go, or delete this file and the entire cache/ sub-package if it is superseded by caching_visibility_manager.go.

Prompt for agents
In `common/persistence/visibility/cache/factory.go`, `NewVisibilityCache` takes `cfg *config.VisibilityCacheConfig` but references `cfg.Type`, `cfg.InMemory`, and `cfg.Redis` which don't exist on that struct—only on `VisibilityCacheConfigLegacy`. Fix options: (a) change the parameter to `*config.VisibilityCacheConfigLegacy` and update all callers, or (b) delete the entire `cache/` sub-package if it is dead code superseded by `caching_visibility_manager.go`. Verify no other files import `go.temporal.io/server/common/persistence/visibility/cache` before deleting.

"encoding/json"
"time"

"github.com/redis/go-redis/v9"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github.com/redis/go-redis/v9 is imported but missing from go.mod

cache/redis.go imports github.com/redis/go-redis/v9, but this dependency is not declared in go.mod or go.sum. go build ./... will fail with a module resolution error. Either run go get github.com/redis/go-redis/v9 to add it, or—if the entire cache/ sub-package is dead code—delete the package.

Prompt for agents
In `common/persistence/visibility/cache/redis.go` line 8, `github.com/redis/go-redis/v9` is imported but not present in `go.mod`. Either: (a) run `go get github.com/redis/go-redis/v9` and `go mod tidy` in the repo root, or (b) delete the entire `common/persistence/visibility/cache/` directory if it is dead code (it is currently not imported anywhere in the codebase—confirmed by grepping for `visibility/cache"`). Choose option (b) if Redis caching is not part of this PR's scope.

Comment on lines +257 to +273
func (c *cachingVisibilityManager) invalidateNamespace(namespaceID namespace.ID) {
c.mu.Lock()
defer c.mu.Unlock()

nsIDStr := namespaceID.String()
keysToDelete := make([]string, 0)

for key := range c.cache {
keysToDelete = append(keysToDelete, key)
}

for _, key := range keysToDelete {
delete(c.cache, key)
}

c.logger.Debug("Invalidated visibility cache for namespace", tag.WorkflowNamespaceID(nsIDStr), tag.Counter(len(keysToDelete)))
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalidateNamespace flushes the entire cache, not just the target namespace

The function deletes every entry in c.cache regardless of which namespace the write occurred in. The namespaceID argument is only used in the log message. Because the cache keys are SHA-256 hashes of (method:namespaceID:query:...), it is not possible to filter by namespace without storing a reverse index.

Fix: maintain a secondary namespaceKeys map[string][]string that maps namespace ID strings to their cache keys. In putInCache, append the new key to namespaceKeys[namespaceID]; in invalidateNamespace, range over namespaceKeys[nsIDStr], delete each from c.cache, and then clear the slice. This keeps invalidation scoped to the namespace that changed.

Prompt for agents
In `common/persistence/visibility/caching_visibility_manager.go`, `invalidateNamespace` (lines 257–273) deletes ALL entries from `c.cache` regardless of namespace. Fix: add a field `namespaceKeys map[string][]string` to `cachingVisibilityManager`; in `putInCache`, extract the namespace ID from the context or pass it as a parameter and append the new cache key to `namespaceKeys[nsID]`; in `invalidateNamespace`, delete only the keys listed under `namespaceKeys[namespaceID.String()]` and clear the slice. Remember to hold `c.mu.Lock()` around both `c.cache` and `namespaceKeys` accesses. The namespace ID is already embedded in the pre-hash string in `buildCacheKey` so you can extract it from there, or pass it as an extra parameter.

Comment on lines +39 to +64
defaultCacheTTL time.Duration
}

func NewCachingVisibilityManager(
delegate manager.VisibilityManager,
cachingEnabled bool,
cacheTTLSeconds int,
logger log.Logger,
metricsHandler metrics.Handler,
) manager.VisibilityManager {
if !cachingEnabled {
return delegate
}

defaultTTL := 20 * time.Second
if cacheTTLSeconds > 0 {
defaultTTL = time.Duration(cacheTTLSeconds) * time.Second
}

cvm := &cachingVisibilityManager{
delegate: delegate,
cache: make(map[string]*cacheEntry),
logger: logger,
metricsHandler: metricsHandler,
cachingEnabled: cachingEnabled,
defaultCacheTTL: defaultTTL,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cacheTTLSeconds config is silently ignored; defaultCacheTTL is set but never used

The cacheTTLSeconds value is parsed from config and stored as c.defaultCacheTTL, but every putInCache call passes a hardcoded package-level constant instead. Users who set cacheTTLSeconds: 5 expecting shorter staleness will get 15–30 second stale results. Fix: replace the hardcoded constants in ListWorkflowExecutions and CountWorkflowExecutions with c.defaultCacheTTL—or define per-method TTL ratios relative to defaultCacheTTL if the differing TTLs for list vs count vs scheduler are intentional.

Suggested change
defaultCacheTTL time.Duration
}
func NewCachingVisibilityManager(
delegate manager.VisibilityManager,
cachingEnabled bool,
cacheTTLSeconds int,
logger log.Logger,
metricsHandler metrics.Handler,
) manager.VisibilityManager {
if !cachingEnabled {
return delegate
}
defaultTTL := 20 * time.Second
if cacheTTLSeconds > 0 {
defaultTTL = time.Duration(cacheTTLSeconds) * time.Second
}
cvm := &cachingVisibilityManager{
delegate: delegate,
cache: make(map[string]*cacheEntry),
logger: logger,
metricsHandler: metricsHandler,
cachingEnabled: cachingEnabled,
defaultCacheTTL: defaultTTL,
func (c *cachingVisibilityManager) ListWorkflowExecutions(
ctx context.Context,
request *manager.ListWorkflowExecutionsRequestV2,
) (*manager.ListWorkflowExecutionsResponse, error) {
cacheKey := c.buildCacheKey("ListWorkflowExecutions", request.NamespaceID.String(), request.Query, request.PageSize, request.NextPageToken)
if cached := c.getFromCache(cacheKey, "ListWorkflowExecutions"); cached != nil {
if resp, ok := cached.(*manager.ListWorkflowExecutionsResponse); ok {
return resp, nil
}
}
resp, err := c.delegate.ListWorkflowExecutions(ctx, request)
if err == nil {
c.putInCache(cacheKey, resp, c.defaultCacheTTL)
}
return resp, err
}
Prompt for agents
In `common/persistence/visibility/caching_visibility_manager.go`, `defaultCacheTTL` is stored (lines 39, 64) but never read. Fix: in `ListWorkflowExecutions` (line 157) replace `listWorkflowExecutionsTTL` with `c.defaultCacheTTL`; in `CountWorkflowExecutions` (lines 167-169) replace `countWorkflowExecutionsTTL` with `c.defaultCacheTTL` and `schedulerCountTTL` with a computed value like `c.defaultCacheTTL * 3/2` (or keep a separate multiplier). Then remove the now-unused package-level TTL constants. Update `TestCachingVisibilityManager_CacheExpiry` to sleep for `defaultTTL + 1s` instead of the hardcoded 16s.

Comment on lines +58 to +65
cvm := &cachingVisibilityManager{
delegate: delegate,
cache: make(map[string]*cacheEntry),
logger: logger,
metricsHandler: metricsHandler,
cachingEnabled: cachingEnabled,
defaultCacheTTL: defaultTTL,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded in-memory cache map — expired entries are never evicted

Stale entries are detected lazily on read but left in the map, so the map only shrinks on a write-triggered full flush (which itself causes a thundering herd). Add a max-size cap and background eviction, or use a proper LRU cache implementation. At minimum, delete the stale entry when detected during getFromCache (requires upgrading to a write lock or using a delete-on-expiry pass). The existing go.temporal.io/server/common/cache package used by inmemory.go already provides a bounded LRU with TTL—consider using it here instead of the hand-rolled map.

Prompt for agents
In `common/persistence/visibility/caching_visibility_manager.go`, `c.cache` is an unbounded `map[string]*cacheEntry` (line 60). Expired entries are detected in `getFromCache` (line 234) but never deleted—the key stays in the map forever. Fix: when `time.Now().After(entry.expiresAt)` is true in `getFromCache`, delete the key from the map (requires either upgrading to a write lock for this check, or a periodic cleanup goroutine). Alternatively, replace the hand-rolled map with the existing `go.temporal.io/server/common/cache.Cache` (bounded LRU + TTL) already used by `cache/inmemory.go`—this would also fix the unbounded growth.

Comment on lines +68 to +69
flavour: |
latest=true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flavour: is a misspelling of flavor: — the latest=true configuration is silently ignored

GitHub Actions silently ignores unknown with: keys. The correct input name for docker/metadata-action is flavor (American English), not flavour. As written, the latest=true line does nothing. Fix: rename flavour: to flavor:.

Suggested change
flavour: |
latest=true
- name: Docker meta
id: docker_meta
uses: docker/metadata-action@v3
with:
images: ${{ steps.login-ecr.outputs.registry }}/container-images/temporal-server
flavor: |
latest=true
tags: |
type=semver,pattern={{version}}
type=ref,event=branch,prefix=${{ matrix.env }}-,suffix=-{{sha}}-{{date 'YYYYMMDD'}}
type=raw,value=latest,enable=${{ github.ref == format('refs/heads/{0}', 'main') }}
Prompt for agents
In `.github/workflows/build-and-push.yml` line 68, change `flavour:` to `flavor:`. The `docker/metadata-action` input is named `flavor` (American English); the current British spelling is silently ignored, so `latest=true` has no effect.

Comment thread ci/Dockerfile
Comment on lines +46 to +47
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
CMD ["/usr/local/bin/temporal-server", "--version"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker HEALTHCHECK uses --version — always passes, never validates service health

The HEALTHCHECK command temporal-server --version is a version-print that exits 0 immediately regardless of service state. It cannot detect that the server is not listening. Replace it with a real connectivity check, for example using tctl or a gRPC health probe. If you don't want to add tctl to the image, a simple TCP check via nc or the grpc-health-probe binary is standard for gRPC services:

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD grpc-health-probe -addr=:7233 || exit 1

Alternatively, install tctl in the image and use tctl --address 127.0.0.1:7233 cluster health.

Prompt for agents
In `ci/Dockerfile` lines 46-47, the HEALTHCHECK command is `temporal-server --version`, which always exits 0 immediately and tells the orchestrator nothing about real service health. Replace it with a check that verifies the gRPC frontend (port 7233) is accepting connections. Option 1: add `grpc-health-probe` to the image (`RUN apk add grpc-health-probe` or download from GitHub releases) and use `CMD grpc-health-probe -addr=:7233`. Option 2: install `temporalcli` and call `temporal cluster health`. Update `start-period` to 60s to give Temporal time to initialize.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant