/health returns 200" = readiness probeagent-orchestrator:8050 to healthy pods = Service/certs = SecretsDockerfile, runs the same everywhere — laptop, lab, productionreplicas: 3 with HPAhttp://agent-orchestrator:8050nextwave namespacemosaic namespace{{ .Values.x }} placeholders{{ .Values.agentOrchestrator.replicaCount }}cluster-admin access. Rebuilt regularly — safe to break things.create_incident on the Glide REST API| Service | Tech | Port | K8s Kind | Namespace | Role (Restaurant Analogy) |
|---|---|---|---|---|---|
| LBF Client | React/TS | 8060 | Deployment | nextwave | Chat UI — captures user input, renders streaming response |
| Conversation Server | Java/Spring | 8040 | Deployment | nextwave | API gateway — validates auth, proxies SSE stream |
| Agent Orchestrator | Python/FastAPI | 8050 | Deployment+HPA | nextwave | AI brain — runs 18-stage pipeline, makes LLM calls |
| MCP Server | Python/FastAPI | 8030 | Deployment | nextwave | Tool execution — creates incidents, searches KB via Glide |
| Central Cache | Java/WebFlux | 8090 | Deployment | nextwave | Cache layer — tool defs, state, citations via Valkey |
| Mosaic (GAIC) | Java 21 | 18443 | Deployment | mosaic | LLM proxy — PII masking, audit to Hermes, multi-provider |
| Valkey | Redis-compat | 6379 | Deployment (per shard) | both | In-memory state — sessions, execution plans, cache backend |
| Envoy Proxy | Sidecar | — | Sidecar container | nextwave | mTLS enforcement — intercepts all service-to-service traffic |
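
Each service is reachable in-cluster by its Service name. A minimal wiring check, assuming shell access to some pod in `nextwave` (the pod name below is a placeholder):

```bash
# resolve the Service name and hit the Agent Orchestrator readiness path
kubectl -n nextwave exec -it conversation-server-xxx -- \
  curl -s http://agent-orchestrator:8050/health
```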
No manual `kubectl apply` or `helm install` is allowed on SnowK8s clusters. A typical flow: a merge to `agent-orchestrator/va_agentic/native_agent/` on the `release/prod` branch triggers CI, which runs `pytest` (hard gate — fails = blocked), then builds a Docker image tagged `agent-orchestrator:a1b2c3d4` and pushes it to `registry-snapshots.devsnc.com`.

- Build: `./gradlew build` (Java) or `uv sync && uv build` (Python)
- Test: `./gradlew test` / `pytest`
- Image naming: `registry.devsnc.com/team/service:a1b2c3d4` (commit SHA as tag)
- Registries: `registry-snapshots` for snapshot builds, `registry-releases` for release builds

A representative Dockerfile (Java service):

```dockerfile
FROM registry.devsnc.com/eclipse-temurin:21-jdk
RUN apt-get update && apt-get install -y \
vim-tiny nano jq less procps \
&& rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=./build/libs/*SNAPSHOT.jar
WORKDIR /app
COPY ${JAR_FILE} app.jar
COPY config/ config/
RUN chown -R ubuntu:ubuntu /app
# non-root mandatory
USER ubuntu
ENTRYPOINT ["sh", "-c", \
"exec java $DEBUG_OPTS \
-jar /app/app.jar"]
A sync job mirrors code from the internal Git host (`code.devsnc.com`) to GitLab (`gitlab.servicenow.net`) and copies container images across registries. Production deploys track the `release/prod` tag, and Helm's `atomic: true` rolls back the entire release — the previous version keeps running.

The chart layout:

```
workload/charts/nextwave/
Chart.yaml # name: nextwave
values.yaml # defaults for ALL services
templates/
conversation-server/ # Deploy + Svc + ConfigMap
agent-orchestrator/ # Deploy + HPA + Svc
central-cache/
mcp-server-og/
envoy-proxy/ # mTLS sidecar
valkey/ # Per-shard Deployments
webClient/ # LBF Client
monitoring/ # Grafana dashboards
network-policies/ # CiliumNetworkPolicy
```

```yaml
# Template (write once):
replicas: {{ .Values.ao.replicaCount }}
image: "{{ .Values.ao.image.tag }}"
cpu: {{ .Values.ao.resources.limits.cpu }}
# values.yaml (change per env):
ao:
replicaCount: 3
image:
tag: a1b2c3d4
resources:
limits: { cpu: 6, memory: 2Gi }
# Rendered output (what K8s sees):
replicas: 3
image: "a1b2c3d4"
cpu: 6
```

Values are merged in priority order (highest wins):

| Priority | Source | Location | Example |
|---|---|---|---|
| 4 (highest) | CMDB Config | Workload Instance config tab | replicaCount, secrets, DNS |
| 3 | Git Config | config/prod/config.yaml | imageRegistry = Bifrost prod |
| 2 | Env Overlays | overlays/nextwave/prod/values.yaml | AO: cpu 6, Valkey: 24Gi |
| 1 (lowest) | Chart Defaults | charts/nextwave/values.yaml | cpu 100m, 1 replica |
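
Layers 1–3 can be reproduced locally with `helm template`, since later `-f` files override earlier ones; CMDB overrides (priority 4) are applied by Heimdall and have no local equivalent. File paths below follow the table and are illustrative:

```bash
# chart defaults < env overlay < git config: later -f wins
helm template nextwave workload/charts/nextwave \
  -f workload/charts/nextwave/values.yaml \
  -f overlays/nextwave/prod/values.yaml \
  -f config/prod/config.yaml
```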
- The `snc-prod` environment layers value files as `[snc, snc-prod, bifrost-images]`.
- `atomic: true` = full rollback on failure.
- Never run `helm install` or `kubectl apply` directly. All deploys go through Heimdall.

Per-service resource settings:

| Service | CPU Req | CPU Lim | Mem Req | Mem Lim | Replicas |
|---|---|---|---|---|---|
| Agent Orchestrator | 2 | 6 | 1Gi | 2Gi | 3 (HPA) |
| Conversation Server | 100m | 1 | 512Mi | 1Gi | 2 |
| Mosaic (default) | 500m | 1 | 2Gi | 4Gi | 3 |
| Mosaic (prod) | 500m | 4 | 2Gi | 8Gi | 3 |
| Valkey (Mosaic prod) | 1 | 4 | 6Gi | 8Gi | 6 shards |
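
Agent Orchestrator is the only HPA-backed service in the table. To watch scaling and compare live usage against these requests/limits (the HPA object name is an assumption):

```bash
kubectl -n nextwave get hpa agent-orchestrator --watch
kubectl -n nextwave top pods
```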
Probe and lifecycle configuration:

```yaml
startupProbe:                 # boot time
  httpGet:
    path: /health
    scheme: HTTPS
    port: https               # named container port (assumed)
  failureThreshold: 30        # 30 × 10s = 5 min max boot
  periodSeconds: 10
livenessProbe:                # is it alive?
  httpGet:
    path: /
    scheme: HTTPS
    port: https
  failureThreshold: 3         # 3 fails → restart
readinessProbe:               # ready for traffic?
  httpGet:
    path: /health
    scheme: HTTPS
    port: https
  failureThreshold: 3         # 3 fails → remove from LB
securityContext:
runAsNonRoot: true # mandatory
runAsUser: 1000
runAsGroup: 1000
terminationGracePeriodSeconds: 120
lifecycle:
preStop: # zero dropped requests
exec:
command:
- prestop-graceful-shutdown.sh
# drain LB (30s) + wait for
      #   in-flight requests (60s)
```
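
A plausible shape for `prestop-graceful-shutdown.sh`, matching the timings in the comments; this is a sketch, not the actual script:

```bash
#!/bin/sh
# By the time preStop runs, the pod is being removed from Service endpoints;
# give load balancers 30s to stop routing new requests here.
sleep 30
# Then allow in-flight requests up to 60s to complete before SIGTERM.
# 30s + 60s + shutdown fits inside terminationGracePeriodSeconds: 120.
sleep 60
```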
The CiliumNetworkPolicies allow only these flows:

| Rule | From | To | Port |
|---|---|---|---|
| ADC Ingress | External (ADCv2) | Service pods | HTTPS |
| Service → Valkey | App pods | Valkey shards | 6379 |
| Service → S3 | App pods | MinIO / ext S3 | 443 |
| Prometheus scrape | radium namespace | App pods | HTTP |
| DNS | All pods | kube-dns | 53 |
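
These rules ship from the chart's `network-policies/` directory. To see what is actually applied (CiliumNetworkPolicy is a namespaced CRD):

```bash
kubectl -n nextwave get ciliumnetworkpolicies
kubectl -n nextwave describe ciliumnetworkpolicy <name>
```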
Prometheus scrapes `/prometheus` on every Mosaic pod every 15 seconds, collecting metrics such as `http_server_requests_total` and `valkey_cache_hit_total`. Alerts fire on outages: `MosaicNoPodsReady` (critical) if all Mosaic pods go down, or `MosaicValkeyInstancesDown` if a cache shard dies. Every log line and metric is tagged with `cluster_name` + `namespace` + `pod_name`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: mosaic-service
spec:
selector:
matchLabels:
app.kubernetes.io/component: service
podMetricsEndpoints:
- path: /prometheus
port: http
    interval: 15s  # scrape every 15 seconds
```
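
To confirm scraping is wired up, check the PodMonitor and curl the endpoint it targets; the numeric container port below is an assumption (the PodMonitor references it only by the name `http`):

```bash
kubectl -n mosaic get podmonitor mosaic-service
kubectl -n mosaic exec -it <mosaic-pod> -- \
  curl -s http://localhost:8080/prometheus | head
```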
The Prometheus instance in the `radium` namespace discovers PodMonitors and scrapes the endpoints automatically.

HTTP metrics:

| Metric | Type |
|---|---|
| `http_server_requests_total` | Counter (by method, uri, status) |
| `http_server_requests_duration_seconds` | Histogram (latency percentiles) |

Valkey metrics:

| Metric | Type |
|---|---|
| `valkey_cache_hit_total` | Counter (hit/miss) |
| `valkey_cache_read_duration_seconds` | Histogram |
| `valkey_pool_active_connections` | Gauge (per shard) |
| `valkey_health_node_status` | Gauge (1 = healthy) |

```yaml
# commons-py/observability/default_config.yaml
otlp:
endpoint: "observation:9060"
protocol: "grpc"
batch_processor:
max_export_batch_size: 20
max_queue_size: 2048
```

| Service | Library | Format |
|---|---|---|
| Agent Orchestrator | structlog | JSON with context |
| MCP Server | structlog | JSON with context |
| Conversation Server | SLF4J + MDC | Logback |
| Central Cache | SLF4J + MDC | Logback |
| Mosaic | SLF4J + MDC | Logback |

Common log prefixes:

| Prefix | Purpose |
|---|---|
| `[Task Execution]` | Pipeline stage logs |
| `[HTTP Request]` | Outbound calls |
| `[HTTP Stream]` | SSE streaming |
| `[ERROR_HANDLING]` | Exception handling |
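
The prefixes make targeted greps possible straight from kubectl before reaching for Splunk; the deployment name here is assumed to match the Service name:

```bash
# follow only pipeline-stage logs from the Agent Orchestrator
kubectl -n nextwave logs -f deploy/agent-orchestrator | grep "\[Task Execution\]"
```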
Useful Splunk searches:

```
# Agent Orchestrator logs on one cluster
index=cloudapps cluster_name=c003.bwi namespace=nextwave pod_name=agent-orchestrator*

# any error in the nextwave namespace
index=cloudapps namespace=nextwave "ERROR" OR "Exception"

# follow one conversation through the pipeline
index=cloudapps namespace=nextwave "[Task Execution]" conversation_id="abc123"

# pod kill / crash-loop events
index=cloudapps cluster_name=skuld004.ycg "OOMKilled" OR "CrashLoopBackOff"
```

Mosaic (GAIC) log lines carry the `[GAIC_N]` tag. Audit events flow to Hermes, a KRaft-mode Kafka cluster:

- 1 controller node
- 3 broker nodes (ports 9093, 9094, 9095)

Mosaic is configured with:

```properties
glideexport.hermes.supports-token=false
```

Mosaic service alerts:

| Alert | Condition | Severity |
|---|---|---|
| `MosaicPodInCrashLoopBackOff` | >1 restart in 15m | critical |
| `MosaicPodRestartingFrequently` | >2 restarts in 15m | warning |
| `MosaicMultiplePodsRestarting` | 2+ pods restarting | critical |
| `MosaicNoPodsReady` | Complete outage | critical |
| `MosaicSelfTestFailure` | Self-test fails | critical |

Valkey and infrastructure alerts:

| Alert | Condition | Severity |
|---|---|---|
| `MosaicValkeyMultiplePodsRestarting` | Multiple shards restarting | critical |
| `MosaicValkeyNoPodsReady` | Complete cache outage | critical |
| `MosaicValkeyInstancesDown` | Enabled shards not running | critical |
| `MosaicHeimdallFailedToApply` | Workload CR not applying | critical |

Alert thresholds are configurable per environment:

```yaml
monitoring:
prometheusRule:
valkeyMemoryAlert:
warningThresholdPercent: 75 # warn at 75%
criticalThresholdPercent: 90 # critical at 90%
for: 5m # sustain 5 min before firing
```

Local development runs the same services with Docker Compose, with certs mounted at `/certs` as in-cluster:

```bash
docker compose up -d --build
# UI: localhost:8060
# CS: localhost:8040
# AO: localhost:8050
```
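
A quick local smoke test once the stack is up; that the UI answers on `/` and the other services on `/health` is an assumption carried over from the probe config:

```bash
curl -s -o /dev/null -w "UI  %{http_code}\n" http://localhost:8060/
curl -s -o /dev/null -w "CS  %{http_code}\n" http://localhost:8040/health
curl -s -o /dev/null -w "AO  %{http_code}\n" http://localhost:8050/health
```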
Promotion flows from `main`/`develop` up to `release/prod`; merging to `release/prod` requires Director approval.

| Branch | Environment | Clusters |
|---|---|---|
| `release/prod` | Production | All Skuld DCs |
| `release/preview` | Preview | Select Skuld DCs |
| `main` / `develop` | Lab / Skuld | Carthagelab, Skuld004 |
| Feature branches | Lab only | Carthagelab |
Cluster access: `bssh sk8sops01.ycg0` → `start_k8toolbox` → `k login c003.ycg` → `k9s`.

| k9s Key | Action |
|---|---|
| `:ns` | Browse namespaces |
| `:po` | List pods |
| `l` | View logs |
| `s` | Shell into pod |
| `d` | Describe resource |

```bash
# Pods
kubectl -n nextwave get pods
kubectl -n nextwave logs -f ao-xxx
kubectl -n nextwave logs --previous ao-xxx
kubectl -n nextwave describe pod ao-xxx
kubectl -n nextwave exec -it ao-xxx -- /bin/bash
# Deployments
kubectl -n nextwave get deploy
kubectl -n nextwave rollout restart deploy/ao
kubectl -n nextwave rollout status deploy/ao
kubectl -n nextwave top pods
kubectl -n nextwave get events --sort-by='.lastTimestamp'
```

Triage quick reference:

- Crashing pod: `kubectl describe pod` (check Events), then `kubectl logs --previous` for the last crash log; look for `Reason: OOMKilled`.
- Failed deploy: with `atomic: true` the release was rolled back and the previous version is still running.
- Valkey problems: `kubectl get pods | grep valkey`, check `valkey_health_node_status` in Grafana, then `valkey-cli -h valkey-0 ping`.
- Certificate problems: `kubectl get certificate` (expect `Ready: True`), then `kubectl get secret`.

Local dev helper scripts:

| Script | Purpose |
|---|---|
| `1-clone-repositories.sh` | Clone/update all service repos (SSH or HTTPS) |
| `2-build-and-deploy.sh` | Build all with Docker Compose; flags: `--forceenv`, `--purge`, `--parallel` |
| `3-validate-deployment.sh` | Health-check all services |
| `4-mosaic-gaic-deployment.sh` | Local Mosaic in Docker (setup/start/stop) |
| `5-update-repos.sh` | Safe update: stash → pull → restore |
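
A typical first bring-up with these scripts, assuming they are run from the repo root:

```bash
./1-clone-repositories.sh
./2-build-and-deploy.sh --parallel   # see the flags in the table above
./3-validate-deployment.sh
```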