Improvement Backlog

This page turns the correctness and scalability recommendations into implementation-ready work items. Use it as the execution plan after reading Data Correctness Gotchas.

Phase 1: Correctness-critical tickets

P0/P1 backlog

Ticket	Problem	Implementation	Acceptance criteria	Effort
CE-001: State transition guard	Model output can force invalid `intent/state` jumps.	Add a transition validator before final persistence; block or quarantine disallowed transitions.	Invalid transitions never persist; blocked transitions produce an explicit error stage and a fallback response.	M
CE-002: Conversation optimistic locking	Concurrent same-conversation writes produce last-write-wins corruption.	Add a `version` column and optimistic-lock handling on conversation updates.	Conflicting parallel updates produce deterministic conflict behavior (retry, or fail with a known error code).	M
CE-003: LLM context lifecycle safety	ThreadLocal LLM context can leak across pooled threads if not cleared.	Wrap every `LlmInvocationContext.set(...)` call in `try`/`finally` and call `clear()` in the `finally` block.	No stale context observed under a stress test with mixed conversation IDs.	S
CE-004: Prompt variable allowlist	All input params are currently exposed to prompt rendering.	Introduce an allowlist plus redaction for sensitive or unexpected keys.	Only approved prompt keys appear in rendered prompt payloads.	M
CE-005: Stale context eviction rules	Partial schema merges can keep incompatible old fields.	Add per-intent/state field-retention policy and evict on transitions.	Transition tests show old incompatible fields are removed deterministically.	M
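
As a minimal sketch of the CE-003 pattern, the following routes the context lifecycle through one helper so that `clear()` always runs, including when the wrapped work throws. The `LlmInvocationContext` class below is a hypothetical stand-in that stores only a conversation ID; the real class and its fields are not shown on this page, and `LlmContextScope.callWithContext` is an illustrative name.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the real LlmInvocationContext, which this page does not show.
final class LlmInvocationContext {
    private static final ThreadLocal<String> CONVERSATION_ID = new ThreadLocal<>();

    static void set(String conversationId) { CONVERSATION_ID.set(conversationId); }
    static String get() { return CONVERSATION_ID.get(); }
    // remove() drops the value for the current thread, so a pooled thread starts clean.
    static void clear() { CONVERSATION_ID.remove(); }
}

// Illustrative helper: one place that owns the set/clear pairing.
final class LlmContextScope {
    static <T> T callWithContext(String conversationId, Supplier<T> work) {
        LlmInvocationContext.set(conversationId);
        try {
            return work.get();
        } finally {
            // Runs on normal return and on exception alike.
            LlmInvocationContext.clear();
        }
    }
}
```

Funnelling call sites through a single helper like this makes a missing `finally` block harder to introduce than repeating the pattern by hand at each `set(...)`.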

Phase 2: Operability and convenience tickets

Developer and operator UX

Ticket	Problem	Implementation	Acceptance criteria	Effort
CE-006: Config lint and dry-run	Broken rules/prompts are discovered too late.	Add a validator command/endpoint that checks response-mapping coverage, unresolved variables, rule loops, and MCP safety.	Invalid config sets fail lint in CI and are blocked from promotion.	M
CE-007: Deterministic replay tool	Wrong-output incidents are hard to reproduce.	Replay conversation turns against frozen config snapshot and compare expected vs actual transitions.	At least one production incident can be replayed locally with identical state progression.	M-L
CE-008: Scenario test harness	Manual QA misses edge-path regressions.	Add fixture-driven conversation tests (turn sequence + expected intent/state/output assertions).	Regression suite catches known sticky-intent, rule-collision, and reset-flow bugs.	M
CE-009: Transition map generator	State machine behavior is opaque to integrators.	Generate graph from rules/responses/schema transitions with dead-end warnings.	Docs include generated transition map and dead-end detection report.	S-M
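
One way the dead-end warning in CE-009 could work, sketched under the assumption that the generated transition map can be reduced to a map from state name to successor state names. `TransitionMapLint`, the explicit set of terminal states, and the state names in the usage note are illustrative, not part of the existing codebase.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative lint: flags states that can be reached but never left.
final class TransitionMapLint {
    // Returns every state that has no outgoing transition and is not declared terminal.
    // States that only ever appear as a target are included in the scan.
    static Set<String> deadEnds(Map<String, List<String>> transitions, Set<String> terminalStates) {
        Set<String> allStates = new TreeSet<>(transitions.keySet());
        transitions.values().forEach(allStates::addAll);

        Set<String> dead = new TreeSet<>();
        for (String state : allStates) {
            List<String> next = transitions.getOrDefault(state, List.of());
            if (next.isEmpty() && !terminalStates.contains(state)) {
                dead.add(state);
            }
        }
        return dead;
    }
}
```

For a hypothetical map in which `COLLECT_INFO` can move to `CONFIRM` or `STUCK`, `CONFIRM` can move to `DONE`, and only `DONE` is declared terminal, the lint reports `STUCK`.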

Phase 3: Scalability tickets

Throughput and horizontal scale

Ticket	Problem	Implementation	Acceptance criteria	Effort
CE-010: Hot-path query refactor	`findAll().stream()` in request path degrades with config size.	Replace with indexed query methods for response/template/schema selection.	p95 latency remains stable when control-plane rows scale 10x.	M
CE-011: Config cache with version invalidation	Repeated config reads increase latency variability.	Add cache per intent/state with invalidation on config mutation.	Cache hit ratio > 90% in steady state without stale-config incidents.	M-L
CE-012: Per-conversation execution serialization	Concurrent turns create races as scale increases.	Route requests by conversation key to single active worker/partition.	No race-induced state drift in concurrency stress tests.	L
CE-013: Canonical turn store	History reconstructed from audit can be incomplete/noisy.	Persist normalized user/assistant turns and switch history provider to it.	History quality checks pass even when audit levels change.	M-L
CE-014: Bounded enrichment budgets	Optional enrichments can inflate synchronous latency.	Apply strict timeout budget for container/MCP enrichments and degrade gracefully.	SLO maintained under downstream slowdown with deterministic fallback behavior.	M
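
The timeout budget in CE-014 could be sketched with the standard `CompletableFuture.completeOnTimeout` method (Java 9+). This is an assumption about approach, not the project's actual enrichment code: `EnrichmentBudget`, `enrichOrFallback`, and the fallback value are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative helper: bound an optional enrichment by a fixed time budget.
final class EnrichmentBudget {
    // Runs the enrichment on the common pool. If it has not finished within
    // budgetMillis, or if it fails, the fallback is returned instead.
    // Note: the slow task is not cancelled; it keeps running in the background.
    static <T> T enrichOrFallback(Supplier<T> enrichment, T fallback, long budgetMillis) {
        return CompletableFuture.supplyAsync(enrichment)
                .completeOnTimeout(fallback, budgetMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback)
                .join();
    }
}
```

Because the abandoned task continues to occupy a worker thread, a production version would likely also need a bounded executor or cooperative cancellation so that a slow downstream cannot exhaust the pool.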

Recommended rollout order

1. CE-003 and CE-001 first (cheap, high-impact correctness guards).
2. CE-002 before any high-concurrency scale work.
3. CE-004 and CE-005 before prompt/template expansion.
4. CE-006 and CE-008 to stop regressions while refactoring.
5. CE-010 and CE-011 to stabilize throughput.
6. CE-012 for horizontal scale and race elimination.
7. CE-013 and CE-014 for long-term quality and latency control.

Done criteria for the program

Exit gates

Gate	Target
Correctness	No illegal transition is persisted in the test suite or in canary runtime.
Concurrency	No race-induced state drift under parallel same-conversation load tests.
Security	Prompt exposure allowlist and MCP safety policy enforcement enabled by default.
Scalability	Stable p95/p99 latency under 10x config growth and peak expected QPS.
Operability	Config lint, replay, and scenario tests integrated into release workflow.
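
The Correctness gate rests on the CE-001 guard running before persistence. A minimal sketch, assuming legal transitions can be expressed as an allowlist from the current state to its permitted next states; `TransitionGuard`, `IllegalTransitionException`, and the state names in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical exception type; a caller would map it to the error stage and fallback response.
final class IllegalTransitionException extends RuntimeException {
    IllegalTransitionException(String from, String to) {
        super("Disallowed transition: " + from + " -> " + to);
    }
}

// Illustrative guard: consult an allowlist before a proposed state is persisted.
// Assumes non-null state names.
final class TransitionGuard {
    private final Map<String, Set<String>> allowed;

    TransitionGuard(Map<String, Set<String>> allowed) {
        this.allowed = allowed;
    }

    // True only if the target is explicitly listed for the current state.
    boolean isAllowed(String from, String to) {
        Set<String> targets = allowed.get(from);
        return targets != null && targets.contains(to);
    }

    // Call immediately before persistence; throws instead of letting the jump through.
    void check(String from, String to) {
        if (!isAllowed(from, to)) {
            throw new IllegalTransitionException(from, to);
        }
    }
}
```

With an allowlist in which `COLLECT_INFO` may only move to `CONFIRM`, a model-proposed jump from `COLLECT_INFO` straight to `DONE` is rejected by `check`, and an unknown source state is rejected as well because nothing is allowed by default.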

How to use this backlog

Treat each ticket as a tracked, ADR-backed change. For every ticket, define an owner, rollout guardrails, a migration/rollback plan, and an evidence artifact (test report or benchmark).
