01 Kubernetes Fundamentals
Kubernetes Basics
- Before diving into SnowK8s and Offglide, let's understand the building blocks
- No prior K8s knowledge assumed
What is Kubernetes?
- "A portable, extensible, open-source platform for managing containerized workloads and services."
- Originated from Google's Borg Project, open-sourced in 2014
- Name means helmsman in Greek — K8s = K + 8 letters + s
Evolution: How We Got Here
Traditional: bare servers
Virtual Machines: dedicated OS per VM
Containers: shared OS, lightweight
Kubernetes: orchestrates containers
How This Maps to Offglide
- When you deploy Agent Orchestrator to SnowK8s, you declare what you need and K8s handles everything else
Offglide Team Declares:
- "Keep 3 Agent Orchestrator pods running" = replica count
- "Each AO pod needs 2 CPU, 1Gi memory" = resource requests/limits
- "AO is ready when
/healthreturns 200" = readiness probe - "Drain connections before killing AO pods" = preStop lifecycle hook
- "Scale AO to 6 pods when CPU > 70%" = HPA autoscaling
Kubernetes Handles:
- Places AO pods on nodes with enough CPU = scheduler
- Remembers "3 AO replicas desired" even after restarts = etcd
- Notices a worker node died, moves AO-2 elsewhere = self-healing
- Routes agent-orchestrator:8050 to healthy pods = Service
- Injects Mosaic TLS certs into AO pods at /certs = Secrets
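A minimal sketch of what the "declare" side looks like as a Deployment manifest (values mirror the bullets above; the real definition lives in the NextWave Helm chart):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-orchestrator            # illustrative name
spec:
  replicas: 3                         # "keep 3 AO pods running"
  selector:
    matchLabels:
      app: agent-orchestrator
  template:
    metadata:
      labels:
        app: agent-orchestrator
    spec:
      containers:
      - name: agent-orchestrator
        image: registry.devsnc.com/team/agent-orchestrator:a1b2c3d4   # tag = git commit hash
        resources:
          requests: { cpu: "2", memory: 1Gi }    # "each AO pod needs 2 CPU, 1Gi memory"
          limits:   { cpu: "6", memory: 2Gi }
        readinessProbe:
          httpGet: { path: /health, port: 8050 } # "ready when /health returns 200"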
Core Building Blocks
Container
- A zip file of your app + everything it needs (runtime, libraries, config)
- Built via a Dockerfile, runs the same everywhere — laptop, lab, production
In Offglide
- AO container packages Python 3.13 + FastAPI + all dependencies
- CS container packages Java 21 + Spring Boot
- Each runs identically on your laptop, Carthagelab, or Skuld production
Pod
- The smallest thing K8s can run — wraps 1+ containers with a shared IP
- Crashes are auto-restarted; pods are ephemeral — they come and go
In Offglide
- Each AO pod runs two containers: the AO app + an Envoy Proxy sidecar for mTLS
- They share the same network — Envoy intercepts all traffic before it reaches AO
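A rough sketch of such a two-container pod (names, images, and ports are illustrative, not the actual chart):
apiVersion: v1
kind: Pod
metadata:
  name: agent-orchestrator-0           # illustrative; real pods are created by a Deployment
spec:
  containers:
  - name: agent-orchestrator           # the AO app
    image: agent-orchestrator:a1b2c3d4
    ports:
    - containerPort: 8050
  - name: envoy-proxy                  # mTLS sidecar; shares the pod's network namespace and IP
    image: envoyproxy/envoy:v1.30      # illustrative image/tag
    ports:
    - containerPort: 15001             # illustrative listener port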
Deployment
- Tells K8s: "I want 3 copies of this pod always running"
- If one dies, K8s spins up a replacement
- Handles rolling updates for zero-downtime deploys
In Offglide
- AO has replicas: 3 with HPA
- If an AO pod OOMs during a heavy LLM call, K8s spins up a new one instantly
- During deploys, pods are replaced one-by-one — users never see downtime
Service (ClusterIP)
- Pods get random IPs that change on restart
- A Service gives your pods a stable DNS name so other services can always find them
In Offglide
- CS calls AO using http://agent-orchestrator:8050
- Never needs to know which pod IP or which of the 3 replicas handles the request
- K8s load-balances automatically
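A ClusterIP Service for AO could look roughly like this (the selector label is an assumption):
apiVersion: v1
kind: Service
metadata:
  name: agent-orchestrator        # becomes the stable DNS name
spec:
  type: ClusterIP
  selector:
    app: agent-orchestrator       # assumed pod label
  ports:
  - port: 8050                    # callers use http://agent-orchestrator:8050
    targetPort: 8050              # forwarded to the container port on each pod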
ConfigMap & Secret
- Config injected at runtime — not baked into the image
- ConfigMaps for regular settings, Secrets for passwords and TLS certs
- Same image works in all environments
In Offglide
- Same Mosaic Docker image runs in lab and prod
- Lab gets lab TLS certs + lab S3 password from Secrets
- Prod gets prod certs + prod S3 password — zero code changes
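A hypothetical pod-spec fragment showing both injection styles (resource names are made up for illustration):
containers:
- name: mosaic
  envFrom:
  - configMapRef:
      name: mosaic-config          # regular settings as env vars (illustrative name)
  volumeMounts:
  - name: certs
    mountPath: /certs              # TLS certs from a Secret appear as files here
volumes:
- name: certs
  secret:
    secretName: mosaic-tls-cert    # illustrative Secret name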
Namespace
- Like folders on your computer — keeps teams' resources separated
- You can only see resources in your own namespace
In Offglide
- Offglide services live in the nextwave namespace
- Mosaic lives in the mosaic namespace
- They can't accidentally interfere — Cilium network policies enforce this boundary
Node
- A single machine (physical or VM) in the cluster
- Worker nodes run your pods, each has fixed CPU + memory
- K8s schedules pods onto nodes based on resource requests
Cluster
- A set of machines working together
- Control plane (brain): API server, scheduler, etcd
- Worker nodes (muscle): run pods
- In SnowK8s: 3 control nodes + many workers
Helm Chart
- The package manager for K8s (like apt/brew/pip)
- Templates your YAML with {{ .Values.x }} placeholders
- One chart + different values per environment = done
In Offglide
- NextWave Helm chart has {{ .Values.agentOrchestrator.replicaCount }}
- In local values.yaml it's 1, in prod overlay it's 3
- Same chart — Heimdall fills in the right number per cluster
Why Kubernetes?
Self-Healing
- Pod crashes → K8s restarts it
- Node dies → pods rescheduled to healthy nodes
Service Discovery
- Stable DNS names for every service
- Traffic balanced automatically during spikes
Zero-Downtime Deploys
- Rolling updates replace pods one at a time
- Users never see downtime
Autoscaling
- HPA scales pods based on CPU/memory
- More load = more pods automatically
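An HPA manifest wired to the AO numbers used earlier might look like this (a sketch, not the chart's actual template):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-orchestrator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-orchestrator
  minReplicas: 3                      # normal replica count
  maxReplicas: 6                      # "scale AO to 6 pods"
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70        # "when CPU > 70%"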
Secrets Management
- Passwords and certs stored securely
- Injected at runtime — no hardcoding
Multi-Cloud
- Works anywhere: bare metal, AWS, Azure, GCP
- Same manifests everywhere
02 SnowK8s Platform
What is SnowK8s?
- SnowK8s is ServiceNow's internal Kubernetes platform for deploying shared services — services that run outside customer Glide instances
- 30+ data centers, 3+ regions, multi-cluster architecture
Why Offglide Needs SnowK8s
- Glide is the customer's ServiceNow instance — it stores incidents, users, KB articles
- AI features (LLM calls, streaming chat, tool orchestration) need to iterate weekly, not wait for quarterly Glide upgrades
- SnowK8s lets the Offglide team deploy Agent Orchestrator updates to production in hours, while Glide stays unchanged
- The customer gets the latest AI automatically — Glide discovers Offglide via DISHv2/MIMIR and calls it over mTLS
Why SnowK8s Exists
- Faster rollouts — ship AI features weekly, not quarterly with Glide releases
- Independent scaling — AI workloads scale separately from Glide instances
- Standardization — all teams deploy the same way (Helm + Heimdall)
- M&A integration — acquisitions running K8s can onboard to SnowK8s
Key Characteristics
- 3-node control plane per cluster (VMware VMs) for HA quorum
- Bare metal workers (not VMs) for performance
- Cilium CNI for networking + network policies (eBPF-based)
- Inception builds clusters (Ansible + Kubespray)
- ADCv2 VIP as entry point for traffic routing
- Multi-tenant — all clusters shared across teams
The SnowK8s Stack
Heimdall
- Deployment orchestrator. Merges config from CMDB + Git + overlays, renders Helm charts, applies to clusters.
Bifrost
- Replicates Git repos (GHE → GitLab) and container images across registries.
Kaa
- Certificate management (cert-manager + EJBCA). Auto-provisions and renews mTLS certificates.
Radium
- Monitoring stack: Prometheus + Grafana + AlertManager. Deployed per cluster.
Anchore
- Container vulnerability scanning. No critical/high CVEs allowed in production.
DCPS
- DataCenter Password Store. External-secrets operator pulls secrets into K8s.
Cilium
- CNI plugin for pod networking + network policy enforcement using eBPF.
Hermes
- Kafka-based event streaming. Audit logs, analytics, async messaging.
Cluster Families
Ragnarok (Sandbox)
- Purpose: Experimentation & learning. cluster-admin access. Rebuilt regularly — safe to break things.
- Clusters: ragnarok002.ifa, ragnarok002.ifb (4-6 worker nodes each)
Carthagelab (HA Lab)
- Purpose: Pre-production testing. Mirrors prod config. Restricted access (no cluster-admin).
- Clusters: carthagelab004/007 .dva/.dvb (5-6 workers each)
Skuld (Production)
- Purpose: Live customer traffic. Multi-region. Director-level approval required. Full observability.
- DC pairs: bwi/phx, aus/ord, gig/gru, ycg/ytz
Regional Deployment Model
- Each region has at least two clusters behind an ADCv2 VIP (load balancer)
- Workloads deploy to paired datacenters for HA
- If one DC fails, the ADC routes all traffic to the other
Workload Onboarding (ARK Process)
- Before deploying to SnowK8s, teams go through ARK (Architecture Review for Kubernetes):
- Submit resource requirements (CPU, memory, storage per pod)
- Define scaling strategy (replicas, HPA thresholds)
- Specify network policies and security model
- Plan failover and HA strategy
- Get namespace and resource quotas allocated
Resource quotas are enforced at the namespace level. Kyverno policies ensure all pods have resource requests/limits defined.
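As an illustration, a namespace-level quota could be expressed like this (actual numbers are agreed during ARK; these are placeholders):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nextwave-quota        # illustrative name
  namespace: nextwave
spec:
  hard:
    requests.cpu: "40"        # placeholder values
    requests.memory: 64Gi
    limits.cpu: "80"
    limits.memory: 128Gi
    pods: "100"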
03 Offglide on Kubernetes
How Offglide Comes Together
- Offglide is a microservices platform where each service is a container deployed as a K8s Deployment
- Here's how they map to Kubernetes resources
How the Services Fit Together
- Think of a user asking "Create an incident for email server down." The LBF Client captures the message
- The Conversation Server validates the session and passes it on
- The Agent Orchestrator runs Planner1 to classify the intent as SINGLE_AGENT_TOOL_MATCH, then calls Mosaic to talk to GPT-4 (with PII masked)
- Mosaic logs the call to Hermes (Kafka → S3 for audit)
- AO then calls the MCP Server to execute create_incident on the Glide REST API
- The incident number (INC0045123) comes back, AO runs the Output Refiner, and streams the polished response token-by-token via SSE back through CS to the user
- State is saved to Valkey so the next turn remembers context
- Every pod in this chain runs on SnowK8s with mTLS enforced by Envoy sidecars
Service-to-K8s Resource Map
| Service | Tech | Port | K8s Kind | Namespace | Role (Restaurant Analogy) |
|---|---|---|---|---|---|
| LBF Client | React/TS | 8060 | Deployment | nextwave | Chat UI — captures user input, renders streaming response |
| Conversation Server | Java/Spring | 8040 | Deployment | nextwave | API gateway — validates auth, proxies SSE stream |
| Agent Orchestrator | Python/FastAPI | 8050 | Deployment+HPA | nextwave | AI brain — runs 18-stage pipeline, makes LLM calls |
| MCP Server | Python/FastAPI | 8030 | Deployment | nextwave | Tool execution — creates incidents, searches KB via Glide |
| Central Cache | Java/WebFlux | 8090 | Deployment | nextwave | Cache layer — tool defs, state, citations via Valkey |
| Mosaic (GAIC) | Java 21 | 18443 | Deployment | mosaic | LLM proxy — PII masking, audit to Hermes, multi-provider |
| Valkey | Redis-compat | 6379 | Deployment (per shard) | both | In-memory state — sessions, execution plans, cache backend |
| Envoy Proxy | Sidecar | — | Sidecar container | nextwave | mTLS enforcement — intercepts all service-to-service traffic |
The Cluster View (what k9s shows)
Service Communication Flow
How a Message Flows (step by step)
User types in chat UI (LBF Client)
- React sends POST to Conversation Server
- SSE connection opened for streaming
CS validates & forwards to AO
- Session auth, execution context created, request forwarded with SSE proxy
AO runs Planner1 (intent classification)
- LLM call via Mosaic
- Classifies: QUESTION_ANSWERING, TOOL_MATCH, SMALL_TALK, etc.
Pipeline routes based on classification
- TOOL_MATCH → MCP Server
- KB_SEARCH → Planner2
- DIRECT → straight to output
Mosaic proxies LLM calls (PII masked)
- PII tokens replaced
- Audit log sent to Hermes → S3
- Response PII restored
Output Refiner polishes response
- Final LLM pass for quality, formatting, citations
- Streams via SSE CHUNKs
Response arrives token-by-token
- SSE CHUNKs → TEXT → ANNOTATION_ADDED (citations) → SYSTEM:"done"
- State saved to Valkey
04 CI/CD Pipeline
Code to Running Pods
- Every step is automated
- No manual kubectl apply or helm install allowed on SnowK8s clusters
What Happens When You Push an AO Change
- You push a fix to agent-orchestrator/va_agentic/native_agent/ on the release/prod branch
- Jenkins runs pytest (hard gate — fails = blocked), then builds a Docker image tagged agent-orchestrator:a1b2c3d4 and pushes to registry-snapshots.devsnc.com
- Bifrost mirrors the code to GitLab and replicates the image to the prod registry
- Anchore scans the image for CVEs
- Heimdall (running inside each Skuld cluster) detects the new image tag in the CMDB, merges prod overlays (6 CPU, 2Gi memory, 3 replicas), renders the Helm chart, diffs against the running state, and applies
- K8s does a rolling update — AO-1 drains, new AO-1 starts, then AO-2, then AO-3
- Users never see downtime
Code Push (GHE) → Jenkins CI (build + test) → Bifrost (replicate) → Anchore (CVE scan) → Heimdall (config merge) → Helm Deploy (to K8s)
Jenkins CI Stages
Prepare
- Workspace cleanup, SCM checkout, install Java 21 / Python 3.13 + uv
Build
- ./gradlew build (Java) or uv sync && uv build (Python)
Unit Tests — HARD GATE
- Pipeline stops if any test fails
- ./gradlew test / pytest
Integration Tests — HARD GATE
- Service-level integration tests
- Must pass to continue
SonarQube + Coverage
- Code quality analysis and test coverage reporting
Build Docker Image
- Tagged with git commit hash: registry.devsnc.com/team/service:a1b2c3d4
Anchore Security Scan
- No critical/high CVEs allowed before promotion
Push to Registry
- Snapshot → registry-snapshots
- Promoted → registry-releases
Dockerfile Example
FROM registry.devsnc.com/eclipse-temurin:21-jdk
RUN apt-get update && apt-get install -y \
vim-tiny nano jq less procps \
&& rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=./build/libs/*SNAPSHOT.jar
WORKDIR /app
COPY ${JAR_FILE} app.jar
COPY config/ config/
RUN chown -R ubuntu:ubuntu /app
# non-root mandatory
USER ubuntu
ENTRYPOINT ["sh", "-c", \
"exec java $DEBUG_OPTS \
-jar /app/app.jar"]
Image tag = git commit hash. You always know exactly what code is deployed.
Bifrost Replication
What Bifrost Does
- Replicates from GHE (code.devsnc.com) to GitLab (gitlab.servicenow.net) and copies container images across registries
- Required because Heimdall reads from GitLab
GHE repo → GitLab mirror
registry-snapshots → Bifrost prod
05 Helm & Heimdall
Helm Charts & Heimdall
- Helm is the package manager
- Heimdall orchestrates where and how it deploys
- You never run Helm directly
How NextWave Uses This
- The NextWave Helm chart has templates for all 8 services (AO, CS, MCP, Cache, Valkey, Envoy, LBF, monitoring)
- The values.yaml defaults are safe for local dev (1 replica, 100m CPU)
- The prod overlay bumps AO to 3 replicas with 6 CPU
- Heimdall runs inside skuld004.ycg, reads the CMDB Workload Instance for "nextwave-prod", and clones the GitLab repo at the release/prod tag
- Merges chart defaults → snc overlay → prod overlay → CMDB overrides, renders all the YAML, and applies only what changed
- If the render fails, atomic: true rolls back the entire release — the previous version keeps running
NextWave Helm Chart
workload/charts/nextwave/
Chart.yaml # name: nextwave
values.yaml # defaults for ALL services
templates/
conversation-server/ # Deploy + Svc + ConfigMap
agent-orchestrator/ # Deploy + HPA + Svc
central-cache/
mcp-server-og/
envoy-proxy/ # mTLS sidecar
valkey/ # Per-shard Deployments
webClient/ # LBF Client
monitoring/ # Grafana dashboards
network-policies/ # CiliumNetworkPolicy
How Helm Templates Work
# Template (write once):
replicas: {{ .Values.ao.replicaCount }}
image: "{{ .Values.ao.image.tag }}"
cpu: {{ .Values.ao.resources.limits.cpu }}
# values.yaml (change per env):
ao:
replicaCount: 3
image:
tag: a1b2c3d4
resources:
limits: { cpu: 6, memory: 2Gi }
# Rendered output (what K8s sees):
replicas: 3
image: "a1b2c3d4"
cpu: 6
Config Layering (Heimdall merges these, last wins)
| Priority | Source | Location | Example |
|---|---|---|---|
| 4 (highest) | CMDB Config | Workload Instance config tab | replicaCount, secrets, DNS |
| 3 | Git Config | config/prod/config.yaml | imageRegistry = Bifrost prod |
| 2 | Env Overlays | overlays/nextwave/prod/values.yaml | AO: cpu 6, Valkey: 24Gi |
| 1 (lowest) | Chart Defaults | charts/nextwave/values.yaml | cpu 100m, 1 replica |
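A toy example of the "last wins" merge, assuming the values shown in the table:
# Priority 1: charts/nextwave/values.yaml
ao:
  replicaCount: 1
  resources:
    limits: { cpu: 100m }

# Priority 2: overlays/nextwave/prod/values.yaml
ao:
  resources:
    limits: { cpu: 6 }

# Priority 4: CMDB override
ao:
  replicaCount: 3

# Effective values Heimdall renders with (higher priority wins per key)
ao:
  replicaCount: 3
  resources:
    limits: { cpu: 6 }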
Heimdall Controller (runs in every cluster)
Sync workload instances from CMDB
- Polls for changes to the Workload Instance records
Clone Git repo (ref from CMDB)
- Uses the branch/tag from the Workload Instance record
Read heimdall.yaml → resolve environment
- Determines overlays to apply: snc-prod uses [snc, snc-prod, bifrost-images]
Merge config layers
- chart defaults → overlays → git config → CMDB
- Last wins
Render Helm + diff against cluster
- Only applies if changes detected
- atomic: true = full rollback on failure
CMDB Workload Hierarchy
Workload: nextwave
Family: prod / preview / lab
Instance: skuld004.ycg
1 workload = 1 namespace. Sharing namespaces is NOT supported. Never run helm install or kubectl apply directly. All deploys go through Heimdall.
06 K8s Deployment Details
K8s Resource Details
- Real specs from the actual Helm charts
- Resources, probes, security, volumes
Resource Allocation per Service
| Service | CPU Req | CPU Lim | Mem Req | Mem Lim | Replicas |
|---|---|---|---|---|---|
| Agent Orchestrator | 2 | 6 | 1Gi | 2Gi | 3 (HPA) |
| Conversation Server | 100m | 1 | 512Mi | 1Gi | 2 |
| Mosaic (default) | 500m | 1 | 2Gi | 4Gi | 3 |
| Mosaic (prod) | 500m | 4 | 2Gi | 8Gi | 3 |
| Valkey (Mosaic prod) | 1 | 4 | 6Gi | 8Gi | 6 shards |
requests = "I need at least this" (K8s uses for scheduling). limits = "never use more" (exceed memory = OOMKilled; exceed CPU = throttled).
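For example, the AO row above corresponds to a container resources block shaped like this (a sketch of the shape, not the chart itself):
resources:
  requests:
    cpu: "2"          # scheduler only places the pod on a node with 2 CPU free
    memory: 1Gi
  limits:
    cpu: "6"          # beyond this the container is throttled
    memory: 2Gi       # beyond this the container is OOMKilled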
Health Probes (real deployment YAML)
Mosaic Service
startupProbe: # boot time
httpGet: /health (HTTPS)
failureThreshold: 30 # 30×10s = 5min
periodSeconds: 10
livenessProbe: # is it alive?
httpGet: / (HTTPS)
failureThreshold: 3 # 3 fails → restart
readinessProbe: # ready for traffic?
httpGet: /health (HTTPS)
failureThreshold: 3 # 3 fails → remove from LB
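In full manifest syntax, the readiness probe above would look roughly like this (port and timings are assumptions based on the shorthand):
readinessProbe:
  httpGet:
    path: /health
    port: 18443          # Mosaic's HTTPS port from the service table
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 3    # 3 consecutive failures → pod removed from Service endpoints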
Pod Security & Lifecycle
securityContext:
runAsNonRoot: true # mandatory
runAsUser: 1000
runAsGroup: 1000
terminationGracePeriodSeconds: 120
lifecycle:
preStop: # zero dropped requests
exec:
command:
- prestop-graceful-shutdown.sh
# drain LB (30s) + wait for
# in-flight requests (60s)
Network Policies (Cilium)
- Default: deny-all + DNS. Then explicit CiliumNetworkPolicy rules allow specific traffic:
| Rule | From | To | Port |
|---|---|---|---|
| ADC Ingress | External (ADCv2) | Service pods | HTTPS |
| Service → Valkey | App pods | Valkey shards | 6379 |
| Service → S3 | App pods | MinIO / ext S3 | 443 |
| Prometheus scrape | radium namespace | App pods | HTTP |
| DNS | All pods | kube-dns | 53 |
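A sketch of one such rule as a CiliumNetworkPolicy (labels and names are assumptions; the real policies live in the network-policies/ templates):
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-app-to-valkey        # illustrative name
  namespace: nextwave
spec:
  endpointSelector:
    matchLabels:
      app: valkey                  # applies to the Valkey shard pods (assumed label)
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: agent-orchestrator    # only app pods may connect (assumed label)
    toPorts:
    - ports:
      - port: "6379"
        protocol: TCP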
07 Monitoring & Metrics
Monitoring & Grafana
- Live — The monitoring stack (Radium) runs per-cluster
- Prometheus scrapes metrics from every pod every 15s
- Grafana visualizes
- AlertManager fires alerts
How Offglide Is Monitored
- Prometheus scrapes /prometheus on every Mosaic pod every 15 seconds, collecting http_server_requests_total and valkey_cache_hit_total
- Grafana shows dashboards like "Mosaic HTTP Metrics" (request rate, P95 latency by endpoint, error rate) and "Valkey Metrics" (hit/miss ratio, connection pool per shard)
- AlertManager fires MosaicNoPodsReady (critical) if all Mosaic pods go down, or MosaicValkeyInstancesDown if a cache shard dies
- The alert routes through the TNG receiver to Eyrie/ServiceNow event management
- Meanwhile, Splunk holds every log line from every pod, searchable by cluster_name + namespace + pod_name
How Prometheus Scrapes Offglide
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: mosaic-service
spec:
selector:
matchLabels:
app.kubernetes.io/component: service
podMetricsEndpoints:
- path: /prometheus
port: http
interval: 15s # scrape every 15 seconds
- The Prometheus operator in the radium namespace discovers PodMonitors and scrapes endpoints automatically
Metrics Collected
HTTP Metrics
| Metric | Type |
|---|---|
| http_server_requests_total | Counter (by method, uri, status) |
| http_server_requests_duration_seconds | Histogram (latency percentiles) |
Valkey Cache Metrics
| Metric | Type |
|---|---|
| valkey_cache_hit_total | Counter (hit/miss) |
| valkey_cache_read_duration_seconds | Histogram |
| valkey_pool_active_connections | Gauge (per shard) |
| valkey_health_node_status | Gauge (1=healthy) |
Grafana Dashboards
Compute Resources
- Per-pod CPU, memory, network I/O across namespace
K8s default dashboard
Namespace Consumption
- Total resource usage. Compare across teams.
K8s default dashboard
Worker Nodes
- Node-level CPU, memory, disk. Capacity planning.
K8s default dashboard
Pod Deep Dive
- Single pod: CPU, memory, network, filesystem
K8s default dashboard
Mosaic HTTP
- Request rate, error rate, P95, endpoints, status codes
mosaic_http_metrics.json
Mosaic Valkey
- Cache ops, hit/miss, R/W latency, pool health
mosaic_valkey_metrics.json
OpenTelemetry (Distributed Tracing)
# commons-py/observability/default_config.yaml
otlp:
endpoint: "observation:9060"
protocol: "grpc"
batch_processor:
max_export_batch_size: 20
max_queue_size: 2048
- Traces flow: Service → OTLP (gRPC) → Observation service (:9060) → TraceLens (Next.js trace analyzer)
- W3C TraceContext propagated across all services
08 Logging & Splunk
Centralized Logging
- All pod stdout/stderr is collected by the splunk-connect operator and forwarded to Splunk with cluster/pod/service metadata
- No extra config needed from workloads
How Logs Flow
Log Formats by Service
| Service | Library | Format |
|---|---|---|
| Agent Orchestrator | structlog | JSON with context |
| MCP Server | structlog | JSON with context |
| Conversation Server | SLF4J + MDC | Logback |
| Central Cache | SLF4J + MDC | Logback |
| Mosaic | SLF4J + MDC | Logback |
AO Log Prefixes
| Prefix | Meaning |
|---|---|
| [Task Execution] | Pipeline stage logs |
| [HTTP Request] | Outbound calls |
| [HTTP Stream] | SSE streaming |
| [ERROR_HANDLING] | Exception handling |
Splunk Queries
- AO logs in a cluster: index=cloudapps cluster_name=c003.bwi namespace=nextwave pod_name=agent-orchestrator*
- Errors only: index=cloudapps namespace=nextwave "ERROR" OR "Exception"
- Pipeline stage for a conversation: index=cloudapps namespace=nextwave "[Task Execution]" conversation_id="abc123"
- Pod crashes: index=cloudapps cluster_name=skuld004.ycg "OOMKilled" OR "CrashLoopBackOff"
09 Hermes (Kafka)
Hermes & Event Streaming
- Hermes is ServiceNow's Kafka-based event streaming platform
- The backbone for async communication — audit logs, analytics, and cross-service messaging
How Hermes Fits Into the Offglide Flow
- When Agent Orchestrator calls Mosaic to make an LLM request, Mosaic does three things:
- (1) masks PII like email addresses with tokens like [GAIC_N]
- (3) produces an encrypted audit log containing the full request, response, token count, latency, and caller identity
- That audit log is published to a Hermes Kafka topic
- A downstream consumer picks it up and writes it to S3 for long-term compliance storage
- This is all async — it doesn't slow down the LLM response
- Hermes also routes alert events from AlertManager through the TNG receiver to the Eyrie incident management system
What Hermes Handles in Offglide
- LLM Audit Logs — Mosaic sends every LLM call (encrypted) to S3 via Hermes
- Conversation Analytics — usage data, response quality metrics
- Event-Driven Communication — async events between services
- Alert Routing — AlertManager → TNG receiver → Hermes → Eyrie
Hermes is a shared SnowK8s service. Offglide services connect via the service mesh as producers/consumers.
Mosaic → Hermes Audit Flow
Local Development
- For local dev, Hermes runs as KRaft-mode Kafka (no ZooKeeper):
Hermes KRaft cluster:
1 controller node
3 broker nodes (ports 9093, 9094, 9095)
Mosaic config:
glideexport.hermes.supports-token=false
10 Alerting
Prometheus Alerts
- Active rules — Real PrometheusRule definitions from the Mosaic Helm chart
Prometheus (evaluates rules) → AlertManager (fires alert) → TNG Receiver (routes) → Eyrie / ServiceNow (event mgmt)
Service Alerts
| Alert | Condition | Severity |
|---|---|---|
| MosaicPodInCrashLoopBackOff | >1 restart in 15m | critical |
| MosaicPodRestartingFrequently | >2 restarts in 15m | warning |
| MosaicMultiplePodsRestarting | 2+ pods restarting | critical |
| MosaicNoPodsReady | Complete outage | critical |
| MosaicSelfTestFailure | Self-test fails | critical |
Valkey Alerts
| Alert | Condition | Severity |
|---|---|---|
| MosaicValkeyMultiplePodsRestarting | Multiple shards restarting | critical |
| MosaicValkeyNoPodsReady | Complete cache outage | critical |
| MosaicValkeyInstancesDown | Enabled shards not running | critical |
| MosaicHeimdallFailedToApply | Workload CR not applying | critical |
Configurable Thresholds
monitoring:
prometheusRule:
valkeyMemoryAlert:
warningThresholdPercent: 75 # warn at 75%
criticalThresholdPercent: 90 # critical at 90%
for: 5m # sustain 5 min before firing
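Rendered into a PrometheusRule, the warning threshold above might end up roughly like this (the metric name and expression are assumptions, not the chart's actual rule):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mosaic-valkey-memory          # illustrative name
spec:
  groups:
  - name: mosaic-valkey
    rules:
    - alert: MosaicValkeyMemoryWarning
      expr: valkey_memory_used_percent > 75   # hypothetical metric; the real expr comes from the chart
      for: 5m                                 # sustain 5 min before firing
      labels:
        severity: warning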
11 Security
Security & mTLS
- All inter-service communication is mutual TLS
- Certificates managed by Kaa (cert-manager + EJBCA)
Kaa Certificate Lifecycle
- cert-manager watches Certificate resources
- CSR sent to EJBCA CA
- EJBCA issues signed certificate
- Certificate stored as K8s Secret
- Pod mounts Secret at /certs
- Auto-renewal before expiry (sketched below)
Security Layers
- mTLS — Envoy sidecar on every pod
- Network Policies — Cilium deny-all + explicit allows
- RBAC — LDAP-backed access control
- DCPS — External secrets from DataCenter Password Store
- Non-root — All containers run as UID 1000
- Image Scanning — Anchore + Twistlock
- PII Masking — Mosaic strips PII before LLM calls
12 Environments
Environment Hierarchy
- From Docker Compose on your laptop to global production clusters
Local
- Docker Compose or Kind
- Mock Kaa certs. Low resources.
docker compose up -d --build
# UI: localhost:8060
# CS: localhost:8040
# AO: localhost:8050
Lab / Skuld
- Carthagelab/Skuld clusters
- Real Kaa certs. DCPS secrets
- Branch: main/develop
Production
- Skuld clusters. Multi-region DCs
- Full mTLS, Anchore required
- Branch: release/prod. Director approval.
Branch Mapping
| Branch | Environment | Clusters |
|---|---|---|
| release/prod | Production | All Skuld DCs |
| release/preview | Preview | Select Skuld DCs |
| main/develop | Lab / Skuld | Carthagelab, Skuld004 |
| Feature branches | Lab only | Carthagelab |
13 Operations
Operations & Troubleshooting
- Cluster access, essential commands, troubleshooting guides
Cluster Access
bssh sk8sops01.ycg0 → start_k8toolbox → k login c003.ycg → k9s
| k9s Key | Action |
|---|---|
| :ns | Browse namespaces |
| :po | List pods |
| l | View logs |
| s | Shell into pod |
| d | Describe resource |
Common kubectl
# Pods
kubectl -n nextwave get pods
kubectl -n nextwave logs -f ao-xxx
kubectl -n nextwave logs --previous ao-xxx
kubectl -n nextwave describe pod ao-xxx
kubectl -n nextwave exec -it ao-xxx -- /bin/bash
# Deployments
kubectl -n nextwave get deploy
kubectl -n nextwave rollout restart deploy/ao
kubectl -n nextwave rollout status deploy/ao
kubectl -n nextwave top pods
kubectl -n nextwave get events --sort-by='.lastTimestamp'
Troubleshooting
Pod in CrashLoopBackOff (common)
- Cause: App crashes on startup (bad env var, missing secret, OOM)
- kubectl describe pod — check Events
- kubectl logs --previous — last crash log
- Look for OOMKilled, missing env vars, TLS cert errors
- Check Splunk for the pod's last output
OOMKilled (common)
- Cause: Exceeded memory limit. K8s kills the pod immediately.
- Describe pod → look for Reason: OOMKilled
- Increase memory limit in Helm overlay (or CMDB for prod)
- For AO: check if LLM logging is enabled (high memory)
Heimdall deploy failed (critical)
- Cause: Helm render error, config merge issue, or K8s rejected manifest
- Check MosaicHeimdallFailedToApply alert
- Look at Heimdall controller logs
- Verify git ref in CMDB is valid
- Check YAML syntax in overlay values
- atomic: true = rolled back, previous version still running
Valkey connection issues (warning)
- Cause: TLS cert mismatch, shard down, or network policy blocking
- Check Valkey pods: kubectl get pods | grep valkey
- Check valkey_health_node_status in Grafana
- Verify CiliumNetworkPolicy allows service → valkey:6379
- Shell into the app pod and test: valkey-cli -h valkey-0 ping
TLS / mTLS certificate errors (warning)
- Cause: cert-manager failed to issue/renew, or wrong CA
- Check cert-manager: kubectl get certificate
- Describe the Certificate → look for Ready: True
- Verify the Secret exists: kubectl get secret
- For local: use Mock Kaa Toolkit certs
Setup Scripts (offglide-services-setup/)
| Script | Purpose |
|---|---|
| 1-clone-repositories.sh | Clone/update all service repos (SSH or HTTPS) |
| 2-build-and-deploy.sh | Build all with Docker Compose. --forceenv, --purge, --parallel |
| 3-validate-deployment.sh | Health check all services |
| 4-mosaic-gaic-deployment.sh | Local Mosaic Docker (setup/start/stop) |
| 5-update-repos.sh | Safe update: stash → pull → restore |
14 Glossary
Glossary
SnowK8s
- ServiceNow's internal K8s platform for shared services across 30+ data centers globally
Heimdall
- Deployment orchestrator. Merges config layers, renders Helm, applies to clusters. Never run Helm directly.
Bifrost
- Replicates code (GHE → GitLab) and container images across registries
Kaa
- Certificate management (cert-manager + EJBCA). Auto-provisions mTLS certs for all services.
Hermes
- Kafka-based event streaming. Audit logs, analytics, async messaging across Offglide.
Radium
- Per-cluster monitoring stack: Prometheus + Grafana + AlertManager
Valkey
- Redis-compatible in-memory store. Session state, cache, execution plans.
Mosaic (GAIC)
- LLM proxy with PII masking, multi-provider routing, audit logging via Hermes to S3
Agent Orchestrator
- Python/FastAPI brain. Runs 18-stage AI pipeline, makes LLM calls, manages multi-turn state.
Conversation Server
- Java/Spring Boot API gateway. Auth, session mgmt, SSE proxying.
MCP Server
- Tool gateway using Model Context Protocol. Executes tools on Glide REST APIs.
LBF Client
- React/TS chat UI. Streaming tokens, forms, citations. Web component distribution.
Cilium
- CNI plugin for networking + network policy enforcement using eBPF. VxLAN overlay mode.
DCPS
- DataCenter Password Store. External-secrets operator pulls into K8s Secrets.
Anchore
- Container vulnerability scanner. No critical/high CVEs before production.
ARK
- Architecture Review for Kubernetes. Required before onboarding to SnowK8s.
Skuld
- Production cluster family. Live customer traffic. Multi-region. Strict RBAC.
Ragnarok
- Sandbox cluster family. cluster-admin access. Safe experimentation.
DISHv2 / MIMIR
- Service discovery. Glide uses DISHv2 + MIMIR to resolve Offglide endpoints.
SSE
- Server-Sent Events. AO streams response tokens to client in real-time.
Execution Plan
- Multi-turn state tracker. READY → IN_PROGRESS → COMPLETED. Persisted in Valkey.
Inception
- Cluster bootstrap using Ansible + Kubespray. Creates K8s clusters from bare metal.