All notable changes to the RustFS Kubernetes Operator will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Expanded the root `README.md` with overview, quick start, development commands, CI vs `make pre-commit`, and a documentation index.
- Aligned `CLAUDE.md` and `ROADMAP.md` with the current code: Tenant status conditions and StatefulSet updates on the successful reconcile path are documented as implemented; remaining work (status on early errors, integration tests, rollout extras) is listed explicitly.
- Clarified the documentation map: `CONTRIBUTING.md` (quality gates and CI alignment), `docs/DEVELOPMENT.md` (environment setup), `docs/DEVELOPMENT-NOTES.md` (historical notes, not normative).
- Updated `examples/README.md`: Tenant Services document S3 on 9000 and the RustFS Console on 9001; distinguished the Operator HTTP Console (default 9090, `cargo run -- console`) from the Tenant `{tenant}-console` Service.
- Standardized `README.md`, `scripts/README.md`, and the shell scripts under `scripts/` to English for consistency with the architecture and development docs.
- Translated Rust doc and line comments in `src/console/` to English (no behavior change).
- Console RBAC: the console `ClusterRole` (Helm `deploy/rustfs-operator/templates/console-clusterrole.yaml` and `deploy/k8s-dev/console-rbac.yaml`) now includes `get`/`list`/`watch` on `events.k8s.io` `events`, required for Tenant Events aggregation (in addition to core `""` `events`).
- Operator RBAC: the operator `ClusterRole` (`deploy/rustfs-operator/templates/clusterrole.yaml` and `deploy/k8s-dev/operator-rbac.yaml`) now includes `events.k8s.io` `events` (`get`/`list`/`watch`/`create`/`patch`). Dev scripts (e.g. `scripts/deploy/deploy-rustfs-4node.sh`) often use `kubectl create token rustfs-operator` for Console login; that identity must be able to list `events.k8s.io` Events for the Tenant Events SSE stream.
- Operator RBAC: the operator ServiceAccount's `ClusterRole` now includes `get`/`list`/`watch` on `persistentvolumeclaims` (Helm `deploy/rustfs-operator/templates/clusterrole.yaml` and `deploy/k8s-dev/operator-rbac.yaml`). Tenant event scope discovery lists PVCs labeled for the tenant; without this rule the API returned `Forbidden` when the request identity was `rustfs-system:rustfs-operator`.
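The new operator rules could be sketched as the following `ClusterRole` fragment (illustrative only — the Helm template and `deploy/k8s-dev/operator-rbac.yaml` are authoritative, and the `metadata.name` here is an assumption):

```yaml
# Sketch of the added operator ClusterRole rules described above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rustfs-operator   # assumed name for illustration
rules:
  # Tenant Events aggregation reads/writes events.k8s.io Events.
  - apiGroups: ["events.k8s.io"]
    resources: ["events"]
    verbs: ["get", "list", "watch", "create", "patch"]
  # Tenant event scope discovery lists PVCs labeled for the tenant.
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]
```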
- `console-web/` `make pre-commit`: `npm run lint` now runs `eslint .` (bare `eslint` only printed CLI help). Added `format`/`format:check` scripts; the `Makefile` targets `console-fmt` and `console-fmt-check` call them so Prettier resolves from `node_modules` after `npm install` in `console-web/`.
- Tenant `Pool` CRD validation (CEL): match the operator console API — require `servers × volumesPerServer >= 4` for every pool, and `>= 6` total volumes when `servers == 3` (fixes the previous 3-server rule, which used `< 4` in CEL). Regenerated `deploy/rustfs-operator/crds/tenant-crd.yaml` and `tenant.yaml`. Added `validate_pool_total_volumes` as the shared Rust implementation used by `src/console/handlers/pools.rs`.
- Tenant name length: `validate_dns1035_label` now caps `metadata.name` at 55 characters so derived names like `{name}-console` remain valid Kubernetes DNS labels (≤ 63 characters).
- Encryption validation on reconcile: `validate_kms_secret` now runs whenever `spec.encryption.enabled` is true (previously it was skipped when `kmsSecret` was unset).
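The pool constraint above could be expressed as CEL validation rules on the pool schema roughly like this (an illustrative sketch, not the generated CRD — `servers` and `volumesPerServer` are the field names from the text; the messages are invented):

```yaml
# Illustrative CEL rules for the pool volume constraints;
# the regenerated tenant-crd.yaml is authoritative.
x-kubernetes-validations:
  - rule: "self.servers * self.volumesPerServer >= 4"
    message: "each pool needs at least 4 total volumes (servers × volumesPerServer)"
  - rule: "self.servers != 3 || self.servers * self.volumesPerServer >= 6"
    message: "pools with exactly 3 servers need at least 6 total volumes"
```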
- Console Tenant Events (breaking): removed `GET /api/v1/namespaces/{namespace}/tenants/{tenant}/events`. Events are delivered via SSE: `GET .../tenants/{tenant}/events/stream` (`text/event-stream`). Payloads use named events: `snapshot` (JSON `EventListResponse`) and `stream_error` (JSON `{ "message" }` on watch/snapshot failures). Listing uses `events.k8s.io/v1` with per-resource field selectors `regarding.kind` + `regarding.name` (bounded concurrency) instead of listing all namespace events. The Events tab uses `EventSource` (`withCredentials`) and listens for `snapshot`/`stream_error`; transport `error` toasts are deduplicated until `onopen`. Aggregates events for the Tenant CR, Pods, StatefulSets, and PVCs per PRD scope; legacy `core/v1` Events not mirrored to `events.k8s.io` may be absent.
- Tenant `spec.encryption.vault`: removed `tlsSkipVerify` and `customCertificates` (they were never wired to `rustfs-kms`). Vault TLS should rely on system-trusted CAs or TLS termination upstream. The project is still pre-production; if you have old YAML with these keys, remove them before applying.
- Tenant `spec.encryption` (breaking): the CRD and Console API now match RustFS server startup (`rustfs/src/init.rs` / `config/cli.rs`) only. `vault` retains `endpoint`; `local` retains `keyDirectory`; optional `defaultKeyId` maps to `RUSTFS_KMS_DEFAULT_KEY_ID`. Removed `pingSeconds`, Vault `engine`/`namespace`/`prefix`/`authType`/`appRole`, and `local.masterKeyId`. Injected pod env vars are only those the RustFS binary reads (no unused `RUSTFS_KMS_VAULT_*` tuning). Regenerated `deploy/rustfs-operator/crds/tenant-crd.yaml` and `tenant.yaml`.
- Local KMS (`context.rs`): validate that `keyDirectory` is absolute and require a single server replica across pools (multi-replica tenants need Vault or shared storage).
- Deploy scripts (`scripts/deploy/deploy-rustfs.sh`, `deploy-rustfs-4node.sh`): Docker builds use the layer cache by default (`docker_build_cached`); set `RUSTFS_DOCKER_NO_CACHE=true` for a full rebuild. Documented in `scripts/README.md`.
- 4-node deploy: help text moved to an early heredoc (avoids trailing `case`/parse issues); see the script header.
- 4-node cleanup (`cleanup-rustfs-4node.sh`): host storage dirs under `/tmp/rustfs-storage-*` may require `sudo rm -rf` after Kind (root-owned bind mounts).
- Dockerfile (operator and `console-web/Dockerfile`): build caching and reproducibility tweaks (cargo-chef pin, pnpm in the frontend image as applicable).
- Cursor Agent skill `.cursor/skills/rustfs-operator-contribute/SKILL.md` for `make pre-commit`, committing, pushing to the fork remote `my`, and opening PRs to `rustfs/operator` with the project template.
Implemented intelligent StatefulSet update detection and validation to improve reconciliation efficiency and safety:

- Diff Detection: added a `statefulset_needs_update()` method to detect actual changes
  - Compares existing vs desired StatefulSet specs semantically
  - Avoids unnecessary API calls when no changes are needed
  - Checks: replicas, image, env vars, resources, scheduling, pod management policy, etc.
- Immutable Field Validation: added a `validate_statefulset_update()` method
  - Prevents modifications to immutable StatefulSet fields (`selector`, `volumeClaimTemplates`, `serviceName`)
  - Provides clear error messages for invalid updates (e.g., changing `volumesPerServer`)
  - Protects against API rejections during reconciliation
- Enhanced Reconciliation Logic: refactored the StatefulSet reconciliation loop
  - Checks whether the StatefulSet exists before attempting an update
  - Validates update safety before applying changes
  - Only applies updates when actual changes are detected
  - Records Kubernetes events for the update lifecycle (`Created`, `UpdateStarted`, `UpdateValidationFailed`)
- Error Handling: extended the error policy
  - Added a 60-second requeue for immutable field modification errors (user-fixable)
  - Consistent error handling across credential and validation failures
- New Error Types: added to `types::error::Error`
  - `InternalError` - for unexpected internal conditions
  - `ImmutableFieldModified` - for attempted modifications to immutable fields
  - `SerdeJson` - for JSON serialization errors during comparisons
- Comprehensive Test Coverage: added 9 new unit tests (35 tests total)
  - Tests for diff detection (no changes, image, replicas, env vars, resources)
  - Tests for validation (`selector`, `serviceName`, and `volumesPerServer` changes rejected)
  - Test for safe updates (image changes allowed)

Benefits:

- Reduces unnecessary API calls and reconciliation overhead
- Prevents reconciliation failures from invalid updates
- Provides better error messages for users
- Foundation for rollout monitoring (Phase 2)
- Renamed: `get_tenant_credentials()` → `validate_credential_secret()`
  - The function now only validates Secret structure (exists, has the required keys)
  - No longer extracts or returns credential values
  - Removed the environment variable fallback logic
  - Returns `Result<(), Error>` instead of `BTreeMap<String, String>`
  - Added: minimum length validation (8 characters for both `accesskey` and `secretkey`)
- Purpose: eliminate duplication between validation and runtime credential injection
  - Validation: performed by `validate_credential_secret()` in the reconciliation loop
  - Runtime: handled by Kubernetes via `secretKeyRef` in the StatefulSet environment variables
- Benefits:
  - Clearer separation of concerns
  - Credentials are never loaded into operator memory (more secure)
  - Simpler code with a single responsibility
  - Consistent behavior between validation and runtime
  - Better security with minimum length requirements
- Field Renamed: `spec.configuration` → `spec.credsSecret`
  - Rationale: the name `configuration` was too generic and didn't clearly indicate its purpose (referencing a Secret containing RustFS credentials)
  - New Name: `credsSecret` follows Kubernetes naming conventions (similar to `imagePullSecrets`) and clearly indicates it references a Secret with credentials
  - Migration Required: update your Tenant manifests to use `credsSecret` instead of `configuration`

Before (v0.1.0):

```yaml
spec:
  configuration:
    name: rustfs-credentials
```

After (v0.2.0):

```yaml
spec:
  credsSecret:
    name: rustfs-credentials
```

- Impact: all Tenant resources using `spec.configuration` must be updated
- Migration: simple find-and-replace: `configuration:` → `credsSecret:`
- Note: this breaking change is acceptable at the v0.1.0 (pre-release) stage, before production adoption
- Secure Credentials via Kubernetes Secrets: new `spec.credsSecret` field for referencing a credentials Secret
  - Recommended for production: store RustFS admin credentials in Kubernetes Secrets
  - Secret Structure: must contain `accesskey` and `secretkey` keys
  - Automatic Injection: credentials are automatically injected as the `RUSTFS_ACCESS_KEY` and `RUSTFS_SECRET_KEY` environment variables
  - Validation: optional validation when a Secret is configured
    - The Secret must exist in the same namespace
    - It must have both `accesskey` and `secretkey` keys
    - Both values must be valid UTF-8 strings
    - Both values must be at least 8 characters long
  - Priority: Secret credentials take precedence over environment variables
  - Backward Compatible: environment variable-based credentials are still supported
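Taken together, a minimal Secret plus Tenant pairing might look like this (an illustrative sketch: the `accesskey`/`secretkey` keys and the `credsSecret` field come from the entries above, while the names, namespace, and the Tenant `apiVersion` are assumptions for the example):

```yaml
# Hypothetical example pairing a credentials Secret with a Tenant.
apiVersion: v1
kind: Secret
metadata:
  name: rustfs-credentials
  namespace: rustfs            # must match the Tenant's namespace
stringData:
  accesskey: rustfsadmin       # at least 8 characters
  secretkey: rustfssecretkey   # at least 8 characters
---
apiVersion: rustfs.com/v1alpha1   # illustrative group/version
kind: Tenant
metadata:
  name: demo
  namespace: rustfs
spec:
  credsSecret:
    name: rustfs-credentials
```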
- Smart Error Retry Logic:
  - Credential validation errors (user-fixable): 60-second retry interval (reduces log spam)
  - Transient API errors: 5-second retry (fast recovery)
  - Other validation errors: 15-second retry
  - Auto-recovery once the Secret is fixed
- New Example: `examples/secret-credentials-tenant.yaml`
  - Complete working example with Secret + Tenant
  - Production security best practices
  - Troubleshooting guide
  - Error retry behavior documentation
- Documentation Updates:
  - Updated `CLAUDE.md` with a credential management section
  - Updated `ROADMAP.md` (marked the feature as completed ✅)
  - Enhanced `examples/README.md` with security guidance
- Per-Pool Kubernetes Scheduling: added comprehensive scheduling configuration to the Pool struct
  - `nodeSelector` - target specific nodes by labels
  - `affinity` - complex node/pod affinity rules
  - `tolerations` - schedule on tainted nodes (e.g., spot instances)
  - `topologySpreadConstraints` - distribute pods across failure domains
  - `resources` - CPU/memory requests and limits per pool
  - `priorityClassName` - override the tenant-level priority per pool
- SchedulingConfig Struct: grouped the scheduling fields for better code organization
  - Uses `#[serde(flatten)]` to maintain a flat YAML structure
  - Follows an industry-standard pattern (MongoDB, PostgreSQL operators)
  - 100% backward compatible
- New Examples:
  - `cluster-expansion-tenant.yaml` - demonstrates capacity expansion and pool migration
  - `hardware-pools-tenant.yaml` - shows heterogeneous disk sizes (same storage class)
  - `geographic-pools-tenant.yaml` - multi-region deployment for compliance and DR
  - `spot-instance-tenant.yaml` - cost optimization using spot instances
- Documentation:
  - `docs/multi-pool-use-cases.md` - comprehensive multi-pool use case guide
  - `docs/architecture-decisions.md` - critical architecture understanding
  - Updated `examples/README.md` with architecture warnings
- Tests: added 5 new tests for scheduling field propagation (20 → 25 tests)
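Because the scheduling fields are flattened, they sit directly on each pool entry. A hypothetical pool using a few of them (the field names are from the list above; the label key, toleration, counts, and sizes are invented for illustration):

```yaml
# Hypothetical pool spec using the flattened scheduling fields.
pools:
  - name: pool-0
    servers: 4
    volumesPerServer: 4
    nodeSelector:
      disktype: ssd              # invented label for illustration
    tolerations:
      - key: spot
        operator: Exists
        effect: NoSchedule
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
```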
- Operator now automatically sets the required RustFS environment variables:
  - `RUSTFS_VOLUMES` - multi-node volume configuration (already existed)
  - `RUSTFS_ADDRESS` - server binding address (`0.0.0.0:9000`)
  - `RUSTFS_CONSOLE_ADDRESS` - console binding address (`0.0.0.0:9001`)
  - `RUSTFS_CONSOLE_ENABLE` - enable the console UI (`true`)
- Console Port: changed from 9090 to 9001 (the correct RustFS default)
  - Fixed in `services.rs` and `workloads.rs`
  - Verified against RustFS source code constants
- IO Service Port: changed from 90 to 9000 (the S3 API standard)
  - Fixed in `services.rs`
  - Now matches S3-compatible service expectations
- Volume Mount Paths: changed from `/data/{N}` to `/data/rustfs{N}`
  - Matches the RustFS official Helm chart convention
  - Aligns with the RustFS docker-compose examples
  - Verified against the RustFS MNMD deployment guide
- RUSTFS_VOLUMES Format: updated the path from `/data/{0...N}` to `/data/rustfs{0...N}`
  - Consistent with RustFS ecosystem standards
  - Uses the 3-dot ellipsis notation for RustFS expansion
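As a sketch, the injected container env for a single-node tenant with 4 volumes would then look roughly like this (illustrative values only; the variable names and the `/data/rustfs{0...3}` ellipsis form follow the entries above):

```yaml
# Illustrative container env after the port and path fixes above.
env:
  - name: RUSTFS_VOLUMES
    value: "/data/rustfs{0...3}"   # 3-dot ellipsis expanded by RustFS
  - name: RUSTFS_ADDRESS
    value: "0.0.0.0:9000"          # S3 API port
  - name: RUSTFS_CONSOLE_ADDRESS
    value: "0.0.0.0:9001"          # RustFS console port
  - name: RUSTFS_CONSOLE_ENABLE
    value: "true"
```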
- Storage Class Mixing: corrected examples that incorrectly mixed storage classes
  - Updated `hardware-pools-tenant.yaml` to use the same storage class with different sizes
  - Fixed `spot-instance-tenant.yaml` to use a uniform storage class
  - Added warnings to `geographic-pools-tenant.yaml` about unified cluster behavior
- Architectural Clarifications:
  - All pools form ONE unified RustFS erasure-coded cluster
  - Data is striped uniformly across ALL volumes regardless of storage class
  - Mixing NVMe/SSD/HDD results in HDD-level performance for the entire cluster
  - RustFS has no intelligent storage class-based data placement
- Fixed a `multi-pool-tenant.yaml` syntax error (missing `persistence:` nesting)
- Moved examples from `deploy/rustfs-operator/examples/` to `examples/` at the project root
- Created a comprehensive `examples/README.md` with a usage guide
- `simple-tenant.yaml`: added documentation for all scheduling fields
- `production-ha-tenant.yaml`: added topology spread constraints and resource requirements
- `minimal-dev-tenant.yaml`: corrected port references and added verification commands
- `custom-rbac-tenant.yaml`: clarified RBAC patterns
- `tiered-storage-tenant.yaml` (2025-11-05): removed an example with fabricated RustFS features
  - Contained non-existent environment variables
  - Made false claims about automatic storage tiering
  - Replaced with architecturally sound examples
Key architectural facts now documented:
- Unified Cluster Architecture: All pools in a Tenant form ONE erasure-coded cluster
- Uniform Data Distribution: Erasure coding stripes data across ALL volumes equally
- No Storage Class Awareness: RustFS does not prefer fast disks over slow disks
- Performance Limitation: Cluster performs at speed of SLOWEST storage class
- External Tiering: RustFS tiering uses lifecycle policies to external cloud storage (S3, Azure, GCS)
Documented valid uses:
- ✅ Cluster capacity expansion and hardware migration
- ✅ Geographic distribution for compliance and disaster recovery
- ✅ Spot vs on-demand instance optimization (compute cost savings)
- ✅ Same storage class with different disk sizes
- ✅ Resource differentiation (CPU/memory) per pool
- ✅ Topology-aware distribution across failure domains
Invalid uses clarified:
- ❌ Storage class mixing for performance tiering (NVMe for hot, HDD for cold)
- ❌ Automatic intelligent data placement based on access patterns
- Basic Tenant CRD with pool support
- RBAC resource creation (Role, ServiceAccount, RoleBinding)
- Service creation (IO, Console, Headless)
- StatefulSet creation per pool
- Volume claim template generation
- RUSTFS_VOLUMES automatic configuration
- Incorrect console port (9090 instead of 9001)
- Incorrect IO service port (90 instead of 9000)
- Missing required RustFS environment variables
- Non-standard volume mount paths
- Limited multi-pool scheduling capabilities
- Misleading examples with fabricated features
All changes verified against:
- RustFS source code (`~/git/rustfs`)
- RustFS Helm chart (`helm/rustfs/`)
- RustFS docker-compose examples
- RustFS MNMD deployment guide
- RustFS configuration constants
- Test Count: 25 tests
- Status: All passing ✅
- Build: Successful ✅
- Backward Compatibility: 100% maintained ✅
Branch: feature/pool-scheduling-enhancements
Status: Ready for merge