01 Kubernetes Fundamentals
Kubernetes Basics
- Before diving into SnowK8s and Offglide, let's understand the building blocks
- No prior K8s knowledge assumed
What is Kubernetes?
- "A portable, extensible, open-source platform for managing containerized workloads and services."
- Originated from Google's Borg Project, open-sourced in 2014
- Name means helmsman in Greek — K8s = K + 8 letters + s
Evolution: How We Got Here
Traditional: bare servers
Virtual Machines: dedicated OS per VM
Containers: shared OS, lightweight
Kubernetes: orchestrates containers
How This Maps to Offglide
- When you deploy Agent Orchestrator to SnowK8s, you declare what you need and K8s handles everything else
Offglide Team Declares:
- "Keep 3 Agent Orchestrator pods running" = replica count
- "Each AO pod needs 2 CPU, 1Gi memory" = resource requests/limits
- "AO is ready when
/healthreturns 200" = readiness probe - "Drain connections before killing AO pods" = preStop lifecycle hook
- "Scale AO to 6 pods when CPU > 70%" = HPA autoscaling
Kubernetes Handles:
- Places AO pods on nodes with enough CPU = scheduler
- Remembers "3 AO replicas desired" even after restarts = etcd
- Notices a worker node died, moves AO-2 elsewhere = self-healing
- Routes agent-orchestrator:8050 to healthy pods = Service
- Injects Mosaic TLS certs into AO pods at /certs = Secrets
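A minimal sketch of what the "declare" side looks like as a Deployment manifest (values mirror the bullets above; the real definition lives in the NextWave Helm chart):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-orchestrator            # illustrative name
spec:
  replicas: 3                         # "keep 3 AO pods running"
  selector:
    matchLabels:
      app: agent-orchestrator
  template:
    metadata:
      labels:
        app: agent-orchestrator
    spec:
      containers:
      - name: agent-orchestrator
        image: registry.devsnc.com/team/agent-orchestrator:a1b2c3d4   # tag = git commit hash
        resources:
          requests: { cpu: "2", memory: 1Gi }    # "each AO pod needs 2 CPU, 1Gi memory"
          limits:   { cpu: "6", memory: 2Gi }
        readinessProbe:
          httpGet: { path: /health, port: 8050 } # "ready when /health returns 200"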
Core Building Blocks
Container
- A zip file of your app + everything it needs (runtime, libraries, config)
- Built via a Dockerfile, runs the same everywhere — laptop, lab, production
In Offglide
- AO container packages Python 3.13 + FastAPI + all dependencies
- CS container packages Java 21 + Spring Boot
- Each runs identically on your laptop, Carthagelab, or Skuld production
Pod
- The smallest thing K8s can run — wraps 1+ containers with a shared IP
- Crashes are auto-restarted; pods are ephemeral — they come and go
In Offglide
- Each AO pod runs two containers: the AO app + an Envoy Proxy sidecar for mTLS
- They share the same network — Envoy intercepts all traffic before it reaches AO
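A rough sketch of such a two-container pod (names, images, and ports are illustrative, not the actual chart):
apiVersion: v1
kind: Pod
metadata:
  name: agent-orchestrator-0           # illustrative; real pods are created by a Deployment
spec:
  containers:
  - name: agent-orchestrator           # the AO app
    image: agent-orchestrator:a1b2c3d4
    ports:
    - containerPort: 8050
  - name: envoy-proxy                  # mTLS sidecar; shares the pod's network namespace and IP
    image: envoyproxy/envoy:v1.30      # illustrative image/tag
    ports:
    - containerPort: 15001             # illustrative listener port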
Deployment
- Tells K8s: "I want 3 copies of this pod always running"
- If one dies, K8s spins up a replacement
- Handles rolling updates for zero-downtime deploys
In Offglide
- AO has replicas: 3 with HPA
- If an AO pod OOMs during a heavy LLM call, K8s spins up a new one instantly
- During deploys, pods are replaced one-by-one — users never see downtime
Service (ClusterIP)
- Pods get random IPs that change on restart
- A Service gives your pods a stable DNS name so other services can always find them
In Offglide
- CS calls AO using http://agent-orchestrator:8050
- Never needs to know which pod IP or which of the 3 replicas handles the request
- K8s load-balances automatically
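A ClusterIP Service for AO could look roughly like this (the selector label is an assumption):
apiVersion: v1
kind: Service
metadata:
  name: agent-orchestrator        # becomes the stable DNS name
spec:
  type: ClusterIP
  selector:
    app: agent-orchestrator       # assumed pod label
  ports:
  - port: 8050                    # callers use http://agent-orchestrator:8050
    targetPort: 8050              # forwarded to the container port on each pod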
ConfigMap & Secret
- Config injected at runtime — not baked into the image
- ConfigMaps for regular settings, Secrets for passwords and TLS certs
- Same image works in all environments
In Offglide
- Same Mosaic Docker image runs in lab and prod
- Lab gets lab TLS certs + lab S3 password from Secrets
- Prod gets prod certs + prod S3 password — zero code changes
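A hypothetical pod-spec fragment showing both injection styles (resource names are made up for illustration):
containers:
- name: mosaic
  envFrom:
  - configMapRef:
      name: mosaic-config          # regular settings as env vars (illustrative name)
  volumeMounts:
  - name: certs
    mountPath: /certs              # TLS certs from a Secret appear as files here
volumes:
- name: certs
  secret:
    secretName: mosaic-tls-cert    # illustrative Secret name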
Namespace
- Like folders on your computer — keeps teams' resources separated
- You can only see resources in your own namespace
In Offglide
- Offglide services live in the nextwave namespace
- Mosaic lives in the mosaic namespace
- They can't accidentally interfere — Cilium network policies enforce this boundary
Node
- A single machine (physical or VM) in the cluster
- Worker nodes run your pods, each has fixed CPU + memory
- K8s schedules pods onto nodes based on resource requests
Cluster
- A set of machines working together
- Control plane (brain): API server, scheduler, etcd
- Worker nodes (muscle): run pods
- In SnowK8s: 3 control nodes + many workers
Helm Chart
- The package manager for K8s (like apt/brew/pip)
- Templates your YAML with {{ .Values.x }} placeholders
- One chart + different values per environment = done
In Offglide
- NextWave Helm chart has {{ .Values.agentOrchestrator.replicaCount }}
- In local values.yaml it's 1, in prod overlay it's 3
- Same chart — Heimdall fills in the right number per cluster
Why Kubernetes?
Self-Healing
- Pod crashes → K8s restarts it
- Node dies → pods rescheduled to healthy nodes
Service Discovery
- Stable DNS names for every service
- Traffic balanced automatically during spikes
Zero-Downtime Deploys
- Rolling updates replace pods one at a time
- Users never see downtime
Autoscaling
- HPA scales pods based on CPU/memory
- More load = more pods automatically
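An HPA manifest wired to the AO numbers used earlier might look like this (a sketch, not the chart's actual template):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-orchestrator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-orchestrator
  minReplicas: 3                      # normal replica count
  maxReplicas: 6                      # "scale AO to 6 pods"
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70        # "when CPU > 70%"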
Secrets Management
- Passwords and certs stored securely
- Injected at runtime — no hardcoding
Multi-Cloud
- Works anywhere: bare metal, AWS, Azure, GCP
- Same manifests everywhere
02 SnowK8s Platform
What is SnowK8s?
- SnowK8s is ServiceNow's internal Kubernetes platform for deploying shared services — services that run outside customer Glide instances
- 30+ data centers, 3+ regions, multi-cluster architecture
Why Offglide Needs SnowK8s
- Glide is the customer's ServiceNow instance — it stores incidents, users, KB articles
- AI features (LLM calls, streaming chat, tool orchestration) need to iterate weekly, not wait for quarterly Glide upgrades
- SnowK8s lets the Offglide team deploy Agent Orchestrator updates to production in hours, while Glide stays unchanged
- The customer gets the latest AI automatically — Glide discovers Offglide via DISHv2/MIMIR and calls it over mTLS
Why SnowK8s Exists
- Faster rollouts — ship AI features weekly, not quarterly with Glide releases
- Independent scaling — AI workloads scale separately from Glide instances
- Standardization — all teams deploy the same way (Helm + Heimdall)
- M&A integration — acquisitions running K8s can onboard to SnowK8s
Key Characteristics
- 3-node control plane per cluster (VMware VMs) for HA quorum
- Bare metal workers (not VMs) for performance
- Cilium CNI for networking + network policies (eBPF-based)
- Inception builds clusters (Ansible + Kubespray)
- ADCv2 VIP as entry point for traffic routing
- Multi-tenant — all clusters shared across teams
The SnowK8s Stack
Heimdall
- Deployment orchestrator. Merges config from CMDB + Git + overlays, renders Helm charts, applies to clusters.
Bifrost
- Replicates Git repos (GHE → GitLab) and container images across registries.
Kaa
- Certificate management (cert-manager + EJBCA). Auto-provisions and renews mTLS certificates.
Radium
- Monitoring stack: Prometheus + Grafana + AlertManager. Deployed per cluster.
Anchore
- Container vulnerability scanning. No critical/high CVEs allowed in production.
DCPS
- DataCenter Password Store. External-secrets operator pulls secrets into K8s.
Cilium
- CNI plugin for pod networking + network policy enforcement using eBPF.
Hermes
- Kafka-based event streaming. Audit logs, analytics, async messaging.
Cluster Families
Ragnarok (Sandbox)
- Purpose: Experimentation & learning. cluster-admin access. Rebuilt regularly — safe to break things.
- Clusters: ragnarok002.ifa, ragnarok002.ifb (4-6 worker nodes each)
Carthagelab (HA Lab)
- Purpose: Pre-production testing. Mirrors prod config. Restricted access (no cluster-admin).
- Clusters: carthagelab004/007 .dva/.dvb (5-6 workers each)
Skuld (Production)
- Purpose: Live customer traffic. Multi-region. Director-level approval required. Full observability.
- DC pairs: bwi/phx, aus/ord, gig/gru, ycg/ytz
Regional Deployment Model
- Each region has at least two clusters behind an ADCv2 VIP (load balancer)
- Workloads deploy to paired datacenters for HA
- If one DC fails, the ADC routes all traffic to the other
Workload Onboarding (ARK Process)
- Before deploying to SnowK8s, teams go through ARK (Architecture Review for Kubernetes):
- Submit resource requirements (CPU, memory, storage per pod)
- Define scaling strategy (replicas, HPA thresholds)
- Specify network policies and security model
- Plan failover and HA strategy
- Get namespace and resource quotas allocated
Resource quotas are enforced at the namespace level. Kyverno policies ensure all pods have resource requests/limits defined.
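As an illustration, a namespace-level quota could be expressed like this (actual numbers are agreed during ARK; these are placeholders):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nextwave-quota        # illustrative name
  namespace: nextwave
spec:
  hard:
    requests.cpu: "40"        # placeholder values
    requests.memory: 64Gi
    limits.cpu: "80"
    limits.memory: 128Gi
    pods: "100"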
03 Offglide on Kubernetes
How Offglide Comes Together
- Offglide is a microservices platform where each service is a container deployed as a K8s Deployment
- Here's how they map to Kubernetes resources
How the Services Fit Together
- Think of a user asking "Create an incident for email server down." The LBF Client captures the message
- The Conversation Server validates the session and passes it on
- The Agent Orchestrator runs Planner1 to classify the intent as SINGLE_AGENT_TOOL_MATCH, then calls Mosaic to talk to GPT-4 (with PII masked)
- Mosaic logs the call to Hermes (Kafka → S3 for audit)
- AO then calls the MCP Server to execute create_incident on the Glide REST API
- The incident number (INC0045123) comes back, AO runs the Output Refiner, and streams the polished response token-by-token via SSE back through CS to the user
- State is saved to Valkey so the next turn remembers context
- Every pod in this chain runs on SnowK8s with mTLS enforced by Envoy sidecars
Service-to-K8s Resource Map
| Service | Tech | Port | K8s Kind | Namespace | Role (Restaurant Analogy) |
|---|---|---|---|---|---|
| LBF Client | React/TS | 8060 | Deployment | nextwave | Chat UI — captures user input, renders streaming response |
| Conversation Server | Java/Spring | 8040 | Deployment | nextwave | API gateway — validates auth, proxies SSE stream |
| Agent Orchestrator | Python/FastAPI | 8050 | Deployment+HPA | nextwave | AI brain — runs 18-stage pipeline, makes LLM calls |
| MCP Server | Python/FastAPI | 8030 | Deployment | nextwave | Tool execution — creates incidents, searches KB via Glide |
| Central Cache | Java/WebFlux | 8090 | Deployment | nextwave | Cache layer — tool defs, state, citations via Valkey |
| Mosaic (GAIC) | Java 21 | 18443 | Deployment | mosaic | LLM proxy — PII masking, audit to Hermes, multi-provider |
| Valkey | Redis-compat | 6379 | Deployment (per shard) | both | In-memory state — sessions, execution plans, cache backend |
| Envoy Proxy | Sidecar | — | Sidecar container | nextwave | mTLS enforcement — intercepts all service-to-service traffic |
The Cluster View (what k9s shows)
Service Communication Flow
How a Message Flows (step by step)
User types in chat UI (LBF Client)
- React sends POST to Conversation Server
- SSE connection opened for streaming
CS validates & forwards to AO
- Session auth, execution context created, request forwarded with SSE proxy
AO runs Planner1 (intent classification)
- LLM call via Mosaic
- Classifies: QUESTION_ANSWERING, TOOL_MATCH, SMALL_TALK, etc.
Pipeline routes based on classification
- TOOL_MATCH → MCP Server
- KB_SEARCH → Planner2
- DIRECT → straight to output
Mosaic proxies LLM calls (PII masked)
- PII tokens replaced
- Audit log sent to Hermes → S3
- Response PII restored
Output Refiner polishes response
- Final LLM pass for quality, formatting, citations
- Streams via SSE CHUNKs
Response arrives token-by-token
- SSE CHUNKs → TEXT → ANNOTATION_ADDED (citations) → SYSTEM:"done"
- State saved to Valkey
04 CI/CD Pipeline
Code to Running Pods
- Every step is automated
- No manual kubectl apply or helm install allowed on SnowK8s clusters
What Happens When You Push an AO Change
- You push a fix to agent-orchestrator/va_agentic/native_agent/ on the release/prod branch
- Jenkins runs pytest (hard gate — fails = blocked), then builds a Docker image tagged agent-orchestrator:a1b2c3d4 and pushes to registry-snapshots.devsnc.com
- Bifrost mirrors the code to GitLab and replicates the image to the prod registry
- Anchore scans the image for CVEs
- Heimdall (running inside each Skuld cluster) detects the new image tag in the CMDB, merges prod overlays (6 CPU, 2Gi memory, 3 replicas), renders the Helm chart, diffs against the running state, and applies
- K8s does a rolling update — AO-1 drains, new AO-1 starts, then AO-2, then AO-3
- Users never see downtime
Code Push (GHE) → Jenkins CI (build + test) → Bifrost (replicate) → Anchore (CVE scan) → Heimdall (config merge) → Helm Deploy (to K8s)
Jenkins CI Stages
Prepare
- Workspace cleanup, SCM checkout, install Java 21 / Python 3.13 + uv
Build
- ./gradlew build (Java) or uv sync && uv build (Python)
Unit Tests — HARD GATE
- Pipeline stops if any test fails
- ./gradlew test / pytest
Integration Tests — HARD GATE
- Service-level integration tests
- Must pass to continue
SonarQube + Coverage
- Code quality analysis and test coverage reporting
Build Docker Image
- Tagged with git commit hash: registry.devsnc.com/team/service:a1b2c3d4
Anchore Security Scan
- No critical/high CVEs allowed before promotion
Push to Registry
- Snapshot → registry-snapshots
- Promoted → registry-releases
Dockerfile Example
FROM registry.devsnc.com/eclipse-temurin:21-jdk
RUN apt-get update && apt-get install -y \
vim-tiny nano jq less procps \
&& rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=./build/libs/*SNAPSHOT.jar
WORKDIR /app
COPY ${JAR_FILE} app.jar
COPY config/ config/
RUN chown -R ubuntu:ubuntu /app
# non-root mandatory
USER ubuntu
ENTRYPOINT ["sh", "-c", \
"exec java $DEBUG_OPTS \
-jar /app/app.jar"]
Image tag = git commit hash. You always know exactly what code is deployed.
Bifrost Replication
What Bifrost Does
- Replicates from GHE (code.devsnc.com) to GitLab (gitlab.servicenow.net) and copies container images across registries
- Required because Heimdall reads from GitLab
GHE repo → GitLab mirror
registry-snapshots → Bifrost prod
05 Helm & Heimdall
Helm Charts & Heimdall
- Helm is the package manager
- Heimdall orchestrates where and how it deploys
- You never run Helm directly
How NextWave Uses This
- The NextWave Helm chart has templates for all 8 services (AO, CS, MCP, Cache, Valkey, Envoy, LBF, monitoring)
- The values.yaml defaults are safe for local dev (1 replica, 100m CPU)
- The prod overlay bumps AO to 3 replicas with 6 CPU
- Heimdall runs inside skuld004.ycg, reads the CMDB Workload Instance for "nextwave-prod", and clones the GitLab repo at the release/prod tag
- Merges chart defaults → snc overlay → prod overlay → CMDB overrides, renders all the YAML, and applies only what changed
- If the render fails, atomic: true rolls back the entire release — the previous version keeps running
NextWave Helm Chart
workload/charts/nextwave/
Chart.yaml # name: nextwave
values.yaml # defaults for ALL services
templates/
conversation-server/ # Deploy + Svc + ConfigMap
agent-orchestrator/ # Deploy + HPA + Svc
central-cache/
mcp-server-og/
envoy-proxy/ # mTLS sidecar
valkey/ # Per-shard Deployments
webClient/ # LBF Client
monitoring/ # Grafana dashboards
network-policies/ # CiliumNetworkPolicy
How Helm Templates Work
# Template (write once):
replicas: {{ .Values.ao.replicaCount }}
image: "{{ .Values.ao.image.tag }}"
cpu: {{ .Values.ao.resources.limits.cpu }}
# values.yaml (change per env):
ao:
replicaCount: 3
image:
tag: a1b2c3d4
resources:
limits: { cpu: 6, memory: 2Gi }
# Rendered output (what K8s sees):
replicas: 3
image: "a1b2c3d4"
cpu: 6
Config Layering (Heimdall merges these, last wins)
| Priority | Source | Location | Example |
|---|---|---|---|
| 4 (highest) | CMDB Config | Workload Instance config tab | replicaCount, secrets, DNS |
| 3 | Git Config | config/prod/config.yaml | imageRegistry = Bifrost prod |
| 2 | Env Overlays | overlays/nextwave/prod/values.yaml | AO: cpu 6, Valkey: 24Gi |
| 1 (lowest) | Chart Defaults | charts/nextwave/values.yaml | cpu 100m, 1 replica |
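A toy example of the "last wins" merge, assuming the values shown in the table:
# Priority 1: charts/nextwave/values.yaml
ao:
  replicaCount: 1
  resources:
    limits: { cpu: 100m }

# Priority 2: overlays/nextwave/prod/values.yaml
ao:
  resources:
    limits: { cpu: 6 }

# Priority 4: CMDB override
ao:
  replicaCount: 3

# Effective values Heimdall renders with (higher priority wins per key)
ao:
  replicaCount: 3
  resources:
    limits: { cpu: 6 }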
Heimdall Controller (runs in every cluster)
Sync workload instances from CMDB
- Polls for changes to the Workload Instance records
Clone Git repo (ref from CMDB)
- Uses the branch/tag from the Workload Instance record
Read heimdall.yaml → resolve environment
- Determines overlays to apply: snc-prod uses [snc, snc-prod, bifrost-images]
Merge config layers
- chart defaults → overlays → git config → CMDB
- Last wins
Render Helm + diff against cluster
- Only applies if changes detected
- atomic: true = full rollback on failure
CMDB Workload Hierarchy
Workload: nextwave
Family: prod / preview / lab
Instance: skuld004.ycg
1 workload = 1 namespace. Sharing namespaces is NOT supported. Never run helm install or kubectl apply directly. All deploys go through Heimdall.
06 K8s Deployment Details
K8s Resource Details
- Real specs from the actual Helm charts
- Resources, probes, security, volumes
Resource Allocation per Service
| Service | CPU Req | CPU Lim | Mem Req | Mem Lim | Replicas |
|---|---|---|---|---|---|
| Agent Orchestrator | 2 | 6 | 1Gi | 2Gi | 3 (HPA) |
| Conversation Server | 100m | 1 | 512Mi | 1Gi | 2 |
| Mosaic (default) | 500m | 1 | 2Gi | 4Gi | 3 |
| Mosaic (prod) | 500m | 4 | 2Gi | 8Gi | 3 |
| Valkey (Mosaic prod) | 1 | 4 | 6Gi | 8Gi | 6 shards |
requests = "I need at least this" (K8s uses for scheduling). limits = "never use more" (exceed memory = OOMKilled; exceed CPU = throttled).
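For example, the AO row above corresponds to a container resources block shaped like this (a sketch of the shape, not the chart itself):
resources:
  requests:
    cpu: "2"          # scheduler only places the pod on a node with 2 CPU free
    memory: 1Gi
  limits:
    cpu: "6"          # beyond this the container is throttled
    memory: 2Gi       # beyond this the container is OOMKilled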
Health Probes (real deployment YAML)
Mosaic Service
startupProbe: # boot time
httpGet: /health (HTTPS)
failureThreshold: 30 # 30×10s = 5min
periodSeconds: 10
livenessProbe: # is it alive?
httpGet: / (HTTPS)
failureThreshold: 3 # 3 fails → restart
readinessProbe: # ready for traffic?
httpGet: /health (HTTPS)
failureThreshold: 3 # 3 fails → remove from LB
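In full manifest syntax, the readiness probe above would look roughly like this (port and timings are assumptions based on the shorthand):
readinessProbe:
  httpGet:
    path: /health
    port: 18443          # Mosaic's HTTPS port from the service table
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 3    # 3 consecutive failures → pod removed from Service endpoints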
Pod Security & Lifecycle
securityContext:
runAsNonRoot: true # mandatory
runAsUser: 1000
runAsGroup: 1000
terminationGracePeriodSeconds: 120
lifecycle:
preStop: # zero dropped requests
exec:
command:
- prestop-graceful-shutdown.sh
# drain LB (30s) + wait for
# in-flight requests (60s)
Network Policies (Cilium)
- Default: deny-all + DNS. Then explicit CiliumNetworkPolicy rules allow specific traffic:
| Rule | From | To | Port |
|---|---|---|---|
| ADC Ingress | External (ADCv2) | Service pods | HTTPS |
| Service → Valkey | App pods | Valkey shards | 6379 |
| Service → S3 | App pods | MinIO / ext S3 | 443 |
| Prometheus scrape | radium namespace | App pods | HTTP |
| DNS | All pods | kube-dns | 53 |
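A sketch of one such rule as a CiliumNetworkPolicy (labels and names are assumptions; the real policies live in the network-policies/ templates):
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-app-to-valkey        # illustrative name
  namespace: nextwave
spec:
  endpointSelector:
    matchLabels:
      app: valkey                  # applies to the Valkey shard pods (assumed label)
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: agent-orchestrator    # only app pods may connect (assumed label)
    toPorts:
    - ports:
      - port: "6379"
        protocol: TCP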
07 Monitoring & Metrics
Monitoring & Grafana
- Live — The monitoring stack (Radium) runs per-cluster
- Prometheus scrapes metrics from every pod every 15s
- Grafana visualizes
- AlertManager fires alerts
How Offglide Is Monitored
- Prometheus scrapes /prometheus on every Mosaic pod every 15 seconds, collecting http_server_requests_total and valkey_cache_hit_total
- Grafana shows dashboards like "Mosaic HTTP Metrics" (request rate, P95 latency by endpoint, error rate) and "Valkey Metrics" (hit/miss ratio, connection pool per shard)
- AlertManager fires MosaicNoPodsReady (critical) if all Mosaic pods go down, or MosaicValkeyInstancesDown if a cache shard dies
- The alert routes through the TNG receiver to Eyrie/ServiceNow event management
- Meanwhile, Splunk holds every log line from every pod, searchable by cluster_name + namespace + pod_name
How Prometheus Scrapes Offglide
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: mosaic-service
spec:
selector:
matchLabels:
app.kubernetes.io/component: service
podMetricsEndpoints:
- path: /prometheus
port: http
interval: 15s # scrape every 15 seconds
- The Prometheus operator in the radium namespace discovers PodMonitors and scrapes endpoints automatically
Metrics Collected
HTTP Metrics
| Metric | Type |
|---|---|
| http_server_requests_total | Counter (by method, uri, status) |
| http_server_requests_duration_seconds | Histogram (latency percentiles) |
Valkey Cache Metrics
| Metric | Type |
|---|---|
| valkey_cache_hit_total | Counter (hit/miss) |
| valkey_cache_read_duration_seconds | Histogram |
| valkey_pool_active_connections | Gauge (per shard) |
| valkey_health_node_status | Gauge (1=healthy) |
Grafana Dashboards
Compute Resources
- Per-pod CPU, memory, network I/O across namespace
K8s default dashboard
Namespace Consumption
- Total resource usage. Compare across teams.
K8s default dashboard
Worker Nodes
- Node-level CPU, memory, disk. Capacity planning.
K8s default dashboard
Pod Deep Dive
- Single pod: CPU, memory, network, filesystem
K8s default dashboard
Mosaic HTTP
- Request rate, error rate, P95, endpoints, status codes
mosaic_http_metrics.json
Mosaic Valkey
- Cache ops, hit/miss, R/W latency, pool health
mosaic_valkey_metrics.json
OpenTelemetry (Distributed Tracing)
# commons-py/observability/default_config.yaml
otlp:
endpoint: "observation:9060"
protocol: "grpc"
batch_processor:
max_export_batch_size: 20
max_queue_size: 2048
- Traces flow: Service → OTLP (gRPC) → Observation service (:9060) → TraceLens (Next.js trace analyzer)
- W3C TraceContext propagated across all services
08 Logging & Splunk
Centralized Logging
- All pod stdout/stderr is collected by the splunk-connect operator and forwarded to Splunk with cluster/pod/service metadata
- No extra config needed from workloads
How Logs Flow
Log Formats by Service
| Service | Library | Format |
|---|---|---|
| Agent Orchestrator | structlog | JSON with context |
| MCP Server | structlog | JSON with context |
| Conversation Server | SLF4J + MDC | Logback |
| Central Cache | SLF4J + MDC | Logback |
| Mosaic | SLF4J + MDC | Logback |
AO Log Prefixes
| Prefix | Meaning |
|---|---|
| [Task Execution] | Pipeline stage logs |
| [HTTP Request] | Outbound calls |
| [HTTP Stream] | SSE streaming |
| [ERROR_HANDLING] | Exception handling |
Splunk Queries
- AO logs in a cluster: index=cloudapps cluster_name=c003.bwi namespace=nextwave pod_name=agent-orchestrator*
- Errors only: index=cloudapps namespace=nextwave "ERROR" OR "Exception"
- Pipeline stage for a conversation: index=cloudapps namespace=nextwave "[Task Execution]" conversation_id="abc123"
- Pod crashes: index=cloudapps cluster_name=skuld004.ycg "OOMKilled" OR "CrashLoopBackOff"
09 Hermes (Kafka)
Hermes & Event Streaming
- Hermes is ServiceNow's Kafka-based event streaming platform
- The backbone for async communication — audit logs, analytics, and cross-service messaging
How Hermes Fits Into the Offglide Flow
- When Agent Orchestrator calls Mosaic to make an LLM request, Mosaic does three things:
- (1) masks PII like email addresses with tokens like [GAIC_N]
- (3) produces an encrypted audit log containing the full request, response, token count, latency, and caller identity
- That audit log is published to a Hermes Kafka topic
- A downstream consumer picks it up and writes it to S3 for long-term compliance storage
- This is all async — it doesn't slow down the LLM response
- Hermes also routes alert events from AlertManager through the TNG receiver to the Eyrie incident management system
What Hermes Handles in Offglide
- LLM Audit Logs — Mosaic sends every LLM call (encrypted) to S3 via Hermes
- Conversation Analytics — usage data, response quality metrics
- Event-Driven Communication — async events between services
- Alert Routing — AlertManager → TNG receiver → Hermes → Eyrie
Hermes is a shared SnowK8s service. Offglide services connect via the service mesh as producers/consumers.
Mosaic → Hermes Audit Flow
Local Development
- For local dev, Hermes runs as KRaft-mode Kafka (no ZooKeeper):
Hermes KRaft cluster:
1 controller node
3 broker nodes (ports 9093, 9094, 9095)
Mosaic config:
glideexport.hermes.supports-token=false
10 Alerting
Prometheus Alerts
- Active rules — Real PrometheusRule definitions from the Mosaic Helm chart
Prometheus (evaluates rules) → AlertManager (fires alert) → TNG Receiver (routes) → Eyrie / ServiceNow (event mgmt)
Service Alerts
| Alert | Condition | Severity |
|---|---|---|
| MosaicPodInCrashLoopBackOff | >1 restart in 15m | critical |
| MosaicPodRestartingFrequently | >2 restarts in 15m | warning |
| MosaicMultiplePodsRestarting | 2+ pods restarting | critical |
| MosaicNoPodsReady | Complete outage | critical |
| MosaicSelfTestFailure | Self-test fails | critical |
Valkey Alerts
| Alert | Condition | Severity |
|---|---|---|
| MosaicValkeyMultiplePodsRestarting | Multiple shards restarting | critical |
| MosaicValkeyNoPodsReady | Complete cache outage | critical |
| MosaicValkeyInstancesDown | Enabled shards not running | critical |
| MosaicHeimdallFailedToApply | Workload CR not applying | critical |
Configurable Thresholds
monitoring:
prometheusRule:
valkeyMemoryAlert:
warningThresholdPercent: 75 # warn at 75%
criticalThresholdPercent: 90 # critical at 90%
for: 5m # sustain 5 min before firing
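Rendered into a PrometheusRule, the warning threshold above might end up roughly like this (the metric name and expression are assumptions, not the chart's actual rule):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mosaic-valkey-memory          # illustrative name
spec:
  groups:
  - name: mosaic-valkey
    rules:
    - alert: MosaicValkeyMemoryWarning
      expr: valkey_memory_used_percent > 75   # hypothetical metric; the real expr comes from the chart
      for: 5m                                 # sustain 5 min before firing
      labels:
        severity: warning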
11 Security
Security & mTLS
- All inter-service communication is mutual TLS
- Certificates managed by Kaa (cert-manager + EJBCA)
Kaa Certificate Lifecycle
- cert-manager watches Certificate resources
- CSR sent to EJBCA CA
- EJBCA issues signed certificate
- Certificate stored as K8s Secret
- Pod mounts Secret at /certs
- Auto-renewal before expiry (sketched below)
Security Layers
- mTLS — Envoy sidecar on every pod
- Network Policies — Cilium deny-all + explicit allows
- RBAC — LDAP-backed access control
- DCPS — External secrets from DataCenter Password Store
- Non-root — All containers run as UID 1000
- Image Scanning — Anchore + Twistlock
- PII Masking — Mosaic strips PII before LLM calls
12 Environments
Environment Hierarchy
- From Docker Compose on your laptop to global production clusters
Local
- Docker Compose or Kind
- Mock Kaa certs. Low resources.
docker compose up -d --build
# UI: localhost:8060
# CS: localhost:8040
# AO: localhost:8050
Lab / Skuld
- Carthagelab/Skuld clusters
- Real Kaa certs. DCPS secrets
- Branch: main/develop
Production
- Skuld clusters. Multi-region DCs
- Full mTLS, Anchore required
- Branch: release/prod. Director approval.
Branch Mapping
| Branch | Environment | Clusters |
|---|---|---|
| release/prod | Production | All Skuld DCs |
| release/preview | Preview | Select Skuld DCs |
| main/develop | Lab / Skuld | Carthagelab, Skuld004 |
| Feature branches | Lab only | Carthagelab |
13 Operations
Operations & Troubleshooting
- Cluster access, essential commands, troubleshooting guides
Cluster Access
bssh sk8sops01.ycg0 → start_k8toolbox → k login c003.ycg → k9s
| k9s Key | Action |
|---|---|
| :ns | Browse namespaces |
| :po | List pods |
| l | View logs |
| s | Shell into pod |
| d | Describe resource |
Common kubectl
# Pods
kubectl -n nextwave get pods
kubectl -n nextwave logs -f ao-xxx
kubectl -n nextwave logs --previous ao-xxx
kubectl -n nextwave describe pod ao-xxx
kubectl -n nextwave exec -it ao-xxx -- /bin/bash
# Deployments
kubectl -n nextwave get deploy
kubectl -n nextwave rollout restart deploy/ao
kubectl -n nextwave rollout status deploy/ao
kubectl -n nextwave top pods
kubectl -n nextwave get events --sort-by='.lastTimestamp'
Troubleshooting
Pod in CrashLoopBackOff (common)
- Cause: App crashes on startup (bad env var, missing secret, OOM)
- kubectl describe pod — check Events
- kubectl logs --previous — last crash log
- Look for OOMKilled, missing env vars, TLS cert errors
- Check Splunk for the pod's last output
OOMKilled (common)
- Cause: Exceeded memory limit. K8s kills the pod immediately.
- Describe pod → look for Reason: OOMKilled
- Increase memory limit in Helm overlay (or CMDB for prod)
- For AO: check if LLM logging is enabled (high memory)
Heimdall deploy failed (critical)
- Cause: Helm render error, config merge issue, or K8s rejected manifest
- Check MosaicHeimdallFailedToApply alert
- Look at Heimdall controller logs
- Verify git ref in CMDB is valid
- Check YAML syntax in overlay values
- atomic: true = rolled back, previous version still running
Valkey connection issues (warning)
- Cause: TLS cert mismatch, shard down, or network policy blocking
- Check Valkey pods: kubectl get pods | grep valkey
- Check valkey_health_node_status in Grafana
- Verify CiliumNetworkPolicy allows service → valkey:6379
- Shell into the app pod and test: valkey-cli -h valkey-0 ping
TLS / mTLS certificate errors (warning)
- Cause: cert-manager failed to issue/renew, or wrong CA
- Check cert-manager: kubectl get certificate
- Describe the Certificate → look for Ready: True
- Verify the Secret exists: kubectl get secret
- For local: use Mock Kaa Toolkit certs
Setup Scripts (offglide-services-setup/)
| Script | Purpose |
|---|---|
| 1-clone-repositories.sh | Clone/update all service repos (SSH or HTTPS) |
| 2-build-and-deploy.sh | Build all with Docker Compose. --forceenv, --purge, --parallel |
| 3-validate-deployment.sh | Health check all services |
| 4-mosaic-gaic-deployment.sh | Local Mosaic Docker (setup/start/stop) |
| 5-update-repos.sh | Safe update: stash → pull → restore |
14 Glossary
Glossary
SnowK8s
- ServiceNow's internal K8s platform for shared services across 30+ data centers globally
Heimdall
- Deployment orchestrator. Merges config layers, renders Helm, applies to clusters. Never run Helm directly.
Bifrost
- Replicates code (GHE → GitLab) and container images across registries
Kaa
- Certificate management (cert-manager + EJBCA). Auto-provisions mTLS certs for all services.
Hermes
- Kafka-based event streaming. Audit logs, analytics, async messaging across Offglide.
Radium
- Per-cluster monitoring stack: Prometheus + Grafana + AlertManager
Valkey
- Redis-compatible in-memory store. Session state, cache, execution plans.
Mosaic (GAIC)
- LLM proxy with PII masking, multi-provider routing, audit logging via Hermes to S3
Agent Orchestrator
- Python/FastAPI brain. Runs 18-stage AI pipeline, makes LLM calls, manages multi-turn state.
Conversation Server
- Java/Spring Boot API gateway. Auth, session mgmt, SSE proxying.
MCP Server
- Tool gateway using Model Context Protocol. Executes tools on Glide REST APIs.
LBF Client
- React/TS chat UI. Streaming tokens, forms, citations. Web component distribution.
Cilium
- CNI plugin for networking + network policy enforcement using eBPF. VxLAN overlay mode.
DCPS
- DataCenter Password Store. External-secrets operator pulls into K8s Secrets.
Anchore
- Container vulnerability scanner. No critical/high CVEs before production.
ARK
- Architecture Review for Kubernetes. Required before onboarding to SnowK8s.
Skuld
- Production cluster family. Live customer traffic. Multi-region. Strict RBAC.
Ragnarok
- Sandbox cluster family. cluster-admin access. Safe experimentation.
DISHv2 / MIMIR
- Service discovery. Glide uses DISHv2 + MIMIR to resolve Offglide endpoints.
SSE
- Server-Sent Events. AO streams response tokens to client in real-time.
Execution Plan
- Multi-turn state tracker. READY → IN_PROGRESS → COMPLETED. Persisted in Valkey.
Inception
- Cluster bootstrap using Ansible + Kubespray. Creates K8s clusters from bare metal.