01 Kubernetes Fundamentals

Kubernetes Basics

What is Kubernetes?

Evolution: How We Got Here

  • Traditional: bare servers
  • Virtual Machines: a dedicated OS per VM
  • Containers: shared OS, lightweight
  • Kubernetes: orchestrates containers

How This Maps to Offglide

Offglide Team Declares:

  • "Keep 3 Agent Orchestrator pods running" = replica count
  • "Each AO pod needs 2 CPU, 1Gi memory" = resource requests/limits
  • "AO is ready when /health returns 200" = readiness probe
  • "Drain connections before killing AO pods" = preStop lifecycle hook
  • "Scale AO to 6 pods when CPU > 70%" = HPA autoscaling

Kubernetes Handles:

  • Places AO pods on nodes with enough CPU = scheduler
  • Remembers "3 AO replicas desired" even after restarts = etcd
  • Notices a worker node died, moves AO-2 elsewhere = self-healing
  • Routes agent-orchestrator:8050 to healthy pods = Service
  • Injects Mosaic TLS certs into AO pods at /certs = Secrets

Core Building Blocks

Container

  • A zip file of your app + everything it needs (runtime, libraries, config)
  • Built via a Dockerfile, runs the same everywhere — laptop, lab, production

In Offglide

  • AO container packages Python 3.13 + FastAPI + all dependencies
  • CS container packages Java 21 + Spring Boot
  • Each runs identically on your laptop, Carthagelab, or Skuld production

Pod

  • The smallest thing K8s can run — wraps 1+ containers with a shared IP
  • Crashes are auto-restarted; pods are ephemeral — they come and go

In Offglide

  • Each AO pod runs two containers: the AO app + an Envoy Proxy sidecar for mTLS
  • They share the same network — Envoy intercepts all traffic before it reaches AO
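A sketch of that two-container pod shape (image names and pod name illustrative; the real spec comes from the Helm chart):

apiVersion: v1
kind: Pod
metadata:
  name: agent-orchestrator-abc12      # pod names are generated; illustrative
spec:
  containers:
    - name: agent-orchestrator        # the app container
      image: registry.devsnc.com/team/agent-orchestrator:a1b2c3d4
      ports:
        - containerPort: 8050
    - name: envoy-proxy               # mTLS sidecar, same network namespace
      image: registry.devsnc.com/team/envoy-proxy:a1b2c3d4
  # Both containers share one pod IP, so Envoy can intercept traffic on localhost.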

Deployment

  • Tells K8s: "I want 3 copies of this pod always running"
  • If one dies, K8s spins up a replacement
  • Handles rolling updates for zero-downtime deploys

In Offglide

  • AO has replicas: 3 with HPA
  • If an AO pod OOMs during a heavy LLM call, K8s spins up a new one instantly
  • During deploys, pods are replaced one-by-one — users never see downtime
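The one-by-one replacement comes from the Deployment's update strategy; a typical fragment (values assumed, not taken from the NextWave chart):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never dip below the desired replica count
    maxSurge: 1         # start one new pod before retiring an old one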

Service (ClusterIP)

  • Pods get random IPs that change on restart
  • A Service gives your pods a stable DNS name so other services can always find them

In Offglide

  • CS calls AO using http://agent-orchestrator:8050
  • Never needs to know which pod IP or which of the 3 replicas handles the request
  • K8s load-balances automatically
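A sketch of the Service behind that DNS name (the selector label is assumed):

apiVersion: v1
kind: Service
metadata:
  name: agent-orchestrator      # becomes the DNS name agent-orchestrator:8050
  namespace: nextwave
spec:
  type: ClusterIP
  selector:
    app: agent-orchestrator     # label assumed; must match the pod labels
  ports:
    - port: 8050
      targetPort: 8050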

ConfigMap & Secret

  • Config injected at runtime — not baked into the image
  • ConfigMaps for regular settings, Secrets for passwords and TLS certs
  • Same image works in all environments

In Offglide

  • Same Mosaic Docker image runs in lab and prod
  • Lab gets lab TLS certs + lab S3 password from Secrets
  • Prod gets prod certs + prod S3 password — zero code changes
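A sketch of how the cert Secret reaches the pod (Secret name assumed for illustration):

containers:
  - name: agent-orchestrator
    volumeMounts:
      - name: tls-certs
        mountPath: /certs        # certs appear here at runtime, per environment
        readOnly: true
volumes:
  - name: tls-certs
    secret:
      secretName: mosaic-tls     # Secret name assumed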

Namespace

  • Like folders on your computer — keeps teams' resources separated
  • You can only see resources in your own namespace

In Offglide

  • Offglide services live in the nextwave namespace
  • Mosaic lives in the mosaic namespace
  • They can't accidentally interfere — Cilium network policies enforce this boundary

Node

  • A single machine (physical or VM) in the cluster
  • Worker nodes run your pods, each has fixed CPU + memory
  • K8s schedules pods onto nodes based on resource requests

Cluster

  • A set of machines working together
  • Control plane (brain): API server, scheduler, etcd
  • Worker nodes (muscle): run pods
  • In SnowK8s: 3 control nodes + many workers

Helm Chart

  • The package manager for K8s (like apt/brew/pip)
  • Templates your YAML with {{ .Values.x }} placeholders
  • One chart + different values per environment = done

In Offglide

  • NextWave Helm chart has {{ .Values.agentOrchestrator.replicaCount }}
  • In local values.yaml it's 1, in prod overlay it's 3
  • Same chart — Heimdall fills in the right number per cluster

Why Kubernetes?

Self-Healing

  • Pod crashes → K8s restarts it
  • Node dies → pods rescheduled to healthy nodes

Service Discovery

  • Stable DNS names for every service
  • Traffic balanced automatically during spikes

Zero-Downtime Deploys

  • Rolling updates replace pods one at a time
  • Users never see downtime

Autoscaling

  • HPA scales pods based on CPU/memory
  • More load = more pods automatically
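For AO this is expressed as an HPA; a sketch consistent with the numbers quoted earlier (3 baseline replicas, scale to 6 at 70% CPU):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-orchestrator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-orchestrator
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%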

Secrets Management

  • Passwords and certs stored securely
  • Injected at runtime — no hardcoding

Multi-Cloud

  • Works anywhere: bare metal, AWS, Azure, GCP
  • Same manifests everywhere
02 SnowK8s Platform

What is SnowK8s?

Why Offglide Needs SnowK8s

Why SnowK8s Exists

  • Faster rollouts — ship AI features weekly, not quarterly with Glide releases
  • Independent scaling — AI workloads scale separately from Glide instances
  • Standardization — all teams deploy the same way (Helm + Heimdall)
  • M&A integration — acquisitions running K8s can onboard to SnowK8s

Key Characteristics

  • 3-node control plane per cluster (VMware VMs) for HA quorum
  • Bare metal workers (not VMs) for performance
  • Cilium CNI for networking + network policies (eBPF-based)
  • Inception builds clusters (Ansible + Kubespray)
  • ADCv2 VIP as entry point for traffic routing
  • Multi-tenant — all clusters shared across teams

The SnowK8s Stack

Heimdall

  • Deployment orchestrator. Merges config from CMDB + Git + overlays, renders Helm charts, applies to clusters.

Bifrost

  • Replicates Git repos (GHE → GitLab) and container images across registries.

Kaa

  • Certificate management (cert-manager + EJBCA). Auto-provisions and renews mTLS certificates.

Radium

  • Monitoring stack: Prometheus + Grafana + AlertManager. Deployed per cluster.

Anchore

  • Container vulnerability scanning. No critical/high CVEs allowed in production.

DCPS

  • DataCenter Password Store. External-secrets operator pulls secrets into K8s.

Cilium

  • CNI plugin for pod networking + network policy enforcement using eBPF.

Hermes

  • Kafka-based event streaming. Audit logs, analytics, async messaging.

Cluster Families

Ragnarok (Sandbox)

  • Purpose: Experimentation & learning. cluster-admin access. Rebuilt regularly — safe to break things.
  • Clusters: ragnarok002.ifa, ragnarok002.ifb (4-6 worker nodes each)

Carthagelab (HA Lab)

  • Purpose: Pre-production testing. Mirrors prod config. Restricted access (no cluster-admin).
  • Clusters: carthagelab004/007 .dva/.dvb (5-6 workers each)

Skuld (Production)

  • Purpose: Live customer traffic. Multi-region. Director-level approval required. Full observability.
  • DC pairs: bwi/phx, aus/ord, gig/gru, ycg/ytz

Regional Deployment Model

Diagram summary — global SnowK8s platform:

  • AMER (Americas): bwi/phx, aus/ord, ycg/ytz DC pairs
  • EMEA (Europe): fra/ams DC pair
  • APAC (Asia-Pacific): tyo/syd DC pair
  • FED (Government/Regulated): dedicated clusters
  • An ADCv2 VIP load-balances each DC pair; its health check selects the active DC, and each pair provides HA failover

Workload Onboarding (ARK Process)

  1. Submit resource requirements (CPU, memory, storage per pod)
  2. Define scaling strategy (replicas, HPA thresholds)
  3. Specify network policies and security model
  4. Plan failover and HA strategy
  5. Get namespace and resource quotas allocated
Note: Resource quotas are enforced at the namespace level. Kyverno policies ensure all pods have resource requests/limits defined.
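A sketch of what such a namespace quota could look like (all numbers illustrative, not the actual nextwave allocation):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: nextwave-quota
  namespace: nextwave
spec:
  hard:
    requests.cpu: "40"       # illustrative
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "100"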
03 Offglide on Kubernetes

How Offglide Comes Together

How the Services Fit Together

Service-to-K8s Resource Map

| Service | Tech | Port | K8s Kind | Namespace | Role |
|---|---|---|---|---|---|
| LBF Client | React/TS | 8060 | Deployment | nextwave | Chat UI — captures user input, renders streaming response |
| Conversation Server | Java/Spring | 8040 | Deployment | nextwave | API gateway — validates auth, proxies SSE stream |
| Agent Orchestrator | Python/FastAPI | 8050 | Deployment + HPA | nextwave | AI brain — runs 18-stage pipeline, makes LLM calls |
| MCP Server | Python/FastAPI | 8030 | Deployment | nextwave | Tool execution — creates incidents, searches KB via Glide |
| Central Cache | Java/WebFlux | 8090 | Deployment | nextwave | Cache layer — tool defs, state, citations via Valkey |
| Mosaic (GAIC) | Java 21 | 18443 | Deployment | mosaic | LLM proxy — PII masking, audit to Hermes, multi-provider |
| Valkey | Redis-compat | 6379 | Deployment (per shard) | both | In-memory state — sessions, execution plans, cache backend |
| Envoy Proxy | Sidecar | — | Sidecar container | nextwave | mTLS enforcement — intercepts all service-to-service traffic |

The Cluster View (what k9s shows)

Diagram summary — SnowK8s cluster (c003.ycg):

  • namespace nextwave: Deployment agent-orchestrator (Python 3.13 / FastAPI, HPA 1-6 replicas, 2-6 CPU, 1-2Gi, each pod with an envoy sidecar); Deployment conversation-server (Java 17 / Spring Boot, 2 replicas, 100m-1 CPU, 512Mi-1Gi); MCP :8030, Cache :8090, LBF :8060; Valkey shards with PVCs (valkey-0, valkey-ro, :6379)
  • namespace mosaic: mosaic-1/2/3 (500m-4 CPU, 2-8Gi); Valkey: 6 shards, maxmemory 4GB each
  • Shared platform services: Hermes (Kafka), Prometheus (radium namespace), Heimdall (config controller), Envoy (mTLS)

Service Communication Flow

Diagram summary — request path:

  • User browser → LBF Client :8060 → Conversation Server :8040 (POST /turn, validate) → Agent Orchestrator :8050 (pipeline) → Mosaic :18443 (PII mask → prompt tokens → unmask) → LLM provider (GPT / Claude)
  • The refined response streams back as SSE CHUNKs, with CS acting as the SSE proxy
  • Side paths: MCP Server :8030 (tool execution), Cache :8090 + Valkey :6379 (state), and an audit-log fork from Mosaic to Hermes (Kafka → S3)

How a Message Flows (step by step)

User types in chat UI (LBF Client)

  • React sends POST to Conversation Server
  • SSE connection opened for streaming

CS validates & forwards to AO

  • Session auth, execution context created, request forwarded with SSE proxy

AO runs Planner1 (intent classification)

  • LLM call via Mosaic
  • Classifies: QUESTION_ANSWERING, TOOL_MATCH, SMALL_TALK, etc.

Pipeline routes based on classification

  • TOOL_MATCH → MCP Server
  • KB_SEARCH → Planner2
  • DIRECT → straight to output

Mosaic proxies LLM calls (PII masked)

  • PII tokens replaced
  • Audit log sent to Hermes → S3
  • Response PII restored

Output Refiner polishes response

  • Final LLM pass for quality, formatting, citations
  • Streams via SSE CHUNKs

Response arrives token-by-token

  • SSE CHUNKs → TEXT → ANNOTATION_ADDED (citations) → SYSTEM:"done"
  • State saved to Valkey
04 CI/CD Pipeline

Code to Running Pods

What Happens When You Push an AO Change

  1. Code push → GHE
  2. Jenkins CI → build + test
  3. Bifrost → replicate
  4. Anchore → CVE scan
  5. Heimdall → config merge
  6. Helm deploy → rolling update to K8s (pod1, pod2, pod3 replaced in turn)

Jenkins CI Stages

Prepare

  • Workspace cleanup, SCM checkout, install Java 21 / Python 3.13 + uv

Build

  • ./gradlew build (Java) or uv sync && uv build (Python)

Unit Tests — HARD GATE

  • Pipeline stops if any test fails
  • ./gradlew test / pytest

Integration Tests — HARD GATE

  • Service-level integration tests
  • Must pass to continue

SonarQube + Coverage

  • Code quality analysis and test coverage reporting

Build Docker Image

  • Tagged with git commit hash: registry.devsnc.com/team/service:a1b2c3d4
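In shell terms the tagging step might look like this (a sketch, not the actual Jenkinsfile):

GIT_SHA=$(git rev-parse --short=8 HEAD)          # e.g. a1b2c3d4
docker build -t registry.devsnc.com/team/service:${GIT_SHA} .
docker push registry.devsnc.com/team/service:${GIT_SHA}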

Anchore Security Scan

  • No critical/high CVEs allowed before promotion

Push to Registry

  • Snapshot → registry-snapshots
  • Promoted → registry-releases

Dockerfile Example

FROM registry.devsnc.com/eclipse-temurin:21-jdk

RUN apt-get update && apt-get install -y \
    vim-tiny nano jq less procps \
    && rm -rf /var/lib/apt/lists/*

ARG JAR_FILE=./build/libs/*SNAPSHOT.jar
WORKDIR /app
COPY ${JAR_FILE} app.jar
COPY config/ config/

RUN chown -R ubuntu:ubuntu /app
# non-root mandatory (Dockerfile comments must sit on their own line)
USER ubuntu

ENTRYPOINT ["sh", "-c", \
  "exec java $DEBUG_OPTS \
   -jar /app/app.jar"]
Image tag = git commit hash. You always know exactly what code is deployed.

Bifrost Replication

What Bifrost Does

  • Replicates from GHE (code.devsnc.com) to GitLab (gitlab.servicenow.net) and copies container images across registries
  • Required because Heimdall reads from GitLab
  • Code: GHE repo → GitLab mirror
  • Images: registry-snapshots → Bifrost prod registry
05 Helm & Heimdall

Helm Charts & Heimdall

How NextWave Uses This

NextWave Helm Chart

workload/charts/nextwave/
  Chart.yaml             # name: nextwave
  values.yaml            # defaults for ALL services
  templates/
    conversation-server/ # Deploy + Svc + ConfigMap
    agent-orchestrator/  # Deploy + HPA + Svc
    central-cache/
    mcp-server-og/
    envoy-proxy/         # mTLS sidecar
    valkey/              # Per-shard Deployments
    webClient/           # LBF Client
    monitoring/          # Grafana dashboards
    network-policies/    # CiliumNetworkPolicy

How Helm Templates Work

# Template (write once):
replicas: {{ .Values.ao.replicaCount }}
image: "{{ .Values.ao.image.tag }}"
cpu: {{ .Values.ao.resources.limits.cpu }}

# values.yaml (change per env):
ao:
  replicaCount: 3
  image:
    tag: a1b2c3d4
  resources:
    limits: { cpu: 6, memory: 2Gi }

# Rendered output (what K8s sees):
replicas: 3
image: "a1b2c3d4"
cpu: 6
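To preview what K8s will see, you can render locally with helm template (render-only, so it doesn't conflict with the Heimdall-only deploy rule; paths assumed from the layout above, and this skips Heimdall's git/CMDB layers):

helm template nextwave workload/charts/nextwave \
  -f overlays/nextwave/prod/values.yaml    # chart defaults merge in automatically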

Config Layering (Heimdall merges these, last wins)

| Priority | Source | Location | Example |
|---|---|---|---|
| 4 (highest) | CMDB Config | Workload Instance config tab | replicaCount, secrets, DNS |
| 3 | Git Config | config/prod/config.yaml | imageRegistry = Bifrost prod |
| 2 | Env Overlays | overlays/nextwave/prod/values.yaml | AO: cpu 6, Valkey: 24Gi |
| 1 (lowest) | Chart Defaults | charts/nextwave/values.yaml | cpu 100m, 1 replica |

Worked example: chart defaults set cpu: 100m, replicas: 1; the env overlay overrides cpu: 6; git config sets imageRegistry: bifrost-prod; the CMDB override sets replicas: 3. Merged result: cpu=6, replicas=3.

Heimdall Controller (runs in every cluster)

Sync workload instances from CMDB

  • Polls for changes to the Workload Instance records

Clone Git repo (ref from CMDB)

  • Uses the branch/tag from the Workload Instance record

Read heimdall.yaml → resolve environment

  • Determines overlays to apply: snc-prod uses [snc, snc-prod, bifrost-images]

Merge config layers

  • chart defaults → overlays → git config → CMDB
  • Last wins

Render Helm + diff against cluster

  • Only applies if changes detected
  • atomic: true = full rollback on failure

CMDB Workload Hierarchy

  • Workload: nextwave
  • Family: prod / preview / lab
  • Instance: skuld004.ycg
Important: 1 workload = 1 namespace. Sharing namespaces is NOT supported. Never run helm install or kubectl apply directly; all deploys go through Heimdall.
06 K8s Deployment Details

K8s Resource Details

Resource Allocation per Service

| Service | CPU Req | CPU Lim | Mem Req | Mem Lim | Replicas |
|---|---|---|---|---|---|
| Agent Orchestrator | 2 | 6 | 1Gi | 2Gi | 3 (HPA) |
| Conversation Server | 100m | 1 | 512Mi | 1Gi | 2 |
| Mosaic (default) | 500m | 1 | 2Gi | 4Gi | 3 |
| Mosaic (prod) | 500m | 4 | 2Gi | 8Gi | 3 |
| Valkey (Mosaic prod) | 1 | 4 | 6Gi | 8Gi | 6 shards |
Note: requests = "I need at least this" (K8s uses them for scheduling). limits = "never use more" (exceeding memory → OOMKilled; exceeding CPU → throttled).
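In pod-spec terms the Agent Orchestrator row reads as (a sketch of the fragment):

resources:
  requests:
    cpu: "2"          # scheduler guarantees at least this
    memory: 1Gi
  limits:
    cpu: "6"          # CPU beyond this is throttled
    memory: 2Gi       # memory beyond this → OOMKilled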

Health Probes (real deployment YAML)

Mosaic Service

startupProbe:                 # tolerates slow boot
  httpGet:
    path: /health
    port: 18443               # Mosaic's HTTPS port, per the service table
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 30        # 30 × 10s = 5 min max boot time

livenessProbe:                # is it alive?
  httpGet:
    path: /
    port: 18443
    scheme: HTTPS
  failureThreshold: 3         # 3 failures → container restarted

readinessProbe:               # ready for traffic?
  httpGet:
    path: /health
    port: 18443
    scheme: HTTPS
  failureThreshold: 3         # 3 failures → removed from load balancing

Pod Security & Lifecycle

securityContext:
  runAsNonRoot: true    # mandatory
  runAsUser: 1000
  runAsGroup: 1000

terminationGracePeriodSeconds: 120
lifecycle:
  preStop:              # zero dropped requests
    exec:
      command:
        - prestop-graceful-shutdown.sh
        # drain LB (30s) + wait for
        # in-flight requests (60s)

Network Policies (Cilium)

| Rule | From | To | Port |
|---|---|---|---|
| ADC Ingress | External (ADCv2) | Service pods | HTTPS |
| Service → Valkey | App pods | Valkey shards | 6379 |
| Service → S3 | App pods | MinIO / ext S3 | 443 |
| Prometheus scrape | radium namespace | App pods | HTTP |
| DNS | All pods | kube-dns | 53 |
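As a sketch, the Service → Valkey row might be expressed like this (both label selectors are assumed):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-apps-to-valkey
  namespace: nextwave
spec:
  endpointSelector:
    matchLabels:
      app: valkey                  # label assumed: the Valkey shard pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app.kubernetes.io/part-of: nextwave   # label assumed: app pods
      toPorts:
        - ports:
            - port: "6379"
              protocol: TCP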
07 Monitoring & Metrics

Monitoring & Grafana

How Offglide Is Monitored

Diagram summary: Prometheus (radium namespace) scrapes AO and Mosaic pods every 15s → Grafana dashboards visualize the metrics → AlertManager fires on threshold breaches → TNG / Eyrie event management.

How Prometheus Scrapes Offglide

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: mosaic-service
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: service
  podMetricsEndpoints:
    - path: /prometheus
      port: http
      interval: 15s       # scrape every 15 seconds

Metrics Collected

HTTP Metrics

| Metric | Type |
|---|---|
| http_server_requests_total | Counter (by method, uri, status) |
| http_server_requests_duration_seconds | Histogram (latency percentiles) |

Valkey Cache Metrics

| Metric | Type |
|---|---|
| valkey_cache_hit_total | Counter (hit/miss) |
| valkey_cache_read_duration_seconds | Histogram |
| valkey_pool_active_connections | Gauge (per shard) |
| valkey_health_node_status | Gauge (1 = healthy) |
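Dashboards build on these with PromQL along these lines (sketches; the result label on the hit counter is an assumption):

# P95 HTTP latency per endpoint, last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_server_requests_duration_seconds_bucket[5m])) by (le, uri))

# Valkey cache hit ratio
sum(rate(valkey_cache_hit_total{result="hit"}[5m]))
  / sum(rate(valkey_cache_hit_total[5m]))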

Grafana Dashboards

Compute Resources

  • Per-pod CPU, memory, network I/O across namespace
K8s default dashboard

Namespace Consumption

  • Total resource usage. Compare across teams.
K8s default dashboard

Worker Nodes

  • Node-level CPU, memory, disk. Capacity planning.
K8s default dashboard

Pod Deep Dive

  • Single pod: CPU, memory, network, filesystem
K8s default dashboard

Mosaic HTTP

  • Request rate, error rate, P95, endpoints, status codes
mosaic_http_metrics.json

Mosaic Valkey

  • Cache ops, hit/miss, R/W latency, pool health
mosaic_valkey_metrics.json

OpenTelemetry (Distributed Tracing)

# commons-py/observability/default_config.yaml
otlp:
  endpoint: "observation:9060"
  protocol: "grpc"
batch_processor:
  max_export_batch_size: 20
  max_queue_size: 2048
08 Logging & Splunk

Centralized Logging

How Logs Flow

Diagram summary: app pods write structlog JSON ({"level":"info","msg":"..."}) to stdout → kubelet captures it under /var/log/pods/ → splunk-connect collects and adds metadata (cluster_name, pod_name, namespace) → Splunk HEC → searchable in index=cloudapps within 30 seconds.

Log Formats by Service

| Service | Library | Format |
|---|---|---|
| Agent Orchestrator | structlog | JSON with context |
| MCP Server | structlog | JSON with context |
| Conversation Server | SLF4J + MDC | Logback |
| Central Cache | SLF4J + MDC | Logback |
| Mosaic | SLF4J + MDC | Logback |

AO Log Prefixes

| Prefix | Meaning |
|---|---|
| [Task Execution] | Pipeline stage logs |
| [HTTP Request] | Outbound calls |
| [HTTP Stream] | SSE streaming |
| [ERROR_HANDLING] | Exception handling |

Splunk Queries

  • AO logs in a cluster:
index=cloudapps cluster_name=c003.bwi
  namespace=nextwave pod_name=agent-orchestrator*
  • Errors only:
index=cloudapps namespace=nextwave
  "ERROR" OR "Exception"
  • Pipeline stage for a conversation:
index=cloudapps namespace=nextwave
  "[Task Execution]" conversation_id="abc123"
  • Pod crashes:
index=cloudapps cluster_name=skuld004.ycg
  "OOMKilled" OR "CrashLoopBackOff"
09 Hermes (Kafka)

Hermes & Event Streaming

How Hermes Fits Into the Offglide Flow

What Hermes Handles in Offglide

  • LLM Audit Logs — Mosaic sends every LLM call (encrypted) to S3 via Hermes
  • Conversation Analytics — usage data, response quality metrics
  • Event-Driven Communication — async events between services
  • Alert Routing — AlertManager → TNG receiver → Hermes → Eyrie
Hermes is a shared SnowK8s service. Offglide services connect via the service mesh as producers/consumers.

Mosaic → Hermes Audit Flow

Diagram summary: Agent Orchestrator :8050 sends the LLM call to Mosaic, which masks PII and forwards the sanitized request to the LLM provider (GPT / Claude); Mosaic forks an encrypted audit log to Hermes (Kafka), which lands in S3 for compliance.

Local Development

Hermes KRaft cluster:
  1 controller node
  3 broker nodes (ports 9093, 9094, 9095)

Mosaic config:
  glideexport.hermes.supports-token=false
10 Alerting

Prometheus Alerts

Prometheus (evaluates rules) → AlertManager (fires alert) → TNG Receiver (routes) → Eyrie / SN (event mgmt)

Service Alerts

| Alert | Condition | Severity |
|---|---|---|
| MosaicPodInCrashLoopBackOff | >1 restart in 15m | critical |
| MosaicPodRestartingFrequently | >2 restarts in 15m | warning |
| MosaicMultiplePodsRestarting | 2+ pods restarting | critical |
| MosaicNoPodsReady | Complete outage | critical |
| MosaicSelfTestFailure | Self-test fails | critical |

Valkey Alerts

| Alert | Condition | Severity |
|---|---|---|
| MosaicValkeyMultiplePodsRestarting | Multiple shards restarting | critical |
| MosaicValkeyNoPodsReady | Complete cache outage | critical |
| MosaicValkeyInstancesDown | Enabled shards not running | critical |
| MosaicHeimdallFailedToApply | Workload CR not applying | critical |

Configurable Thresholds

monitoring:
  prometheusRule:
    valkeyMemoryAlert:
      warningThresholdPercent: 75    # warn at 75%
      criticalThresholdPercent: 90   # critical at 90%
      for: 5m                        # sustain 5 min before firing
11 Security

Security & mTLS

Kaa Certificate Lifecycle

  1. cert-manager watches Certificate resources
  2. CSR sent to EJBCA CA
  3. EJBCA issues signed certificate
  4. Certificate stored as K8s Secret
  5. Pod mounts Secret at /certs
  6. Auto-renewal before expiry
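In cert-manager terms, step 1 watches Certificate resources shaped roughly like this (names, issuer, and durations are illustrative):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: agent-orchestrator-mtls
  namespace: nextwave
spec:
  secretName: agent-orchestrator-tls   # the Secret the pod mounts at /certs
  issuerRef:
    name: ejbca-issuer                 # issuer name assumed
    kind: ClusterIssuer
  dnsNames:
    - agent-orchestrator.nextwave.svc
  duration: 2160h                      # 90 days (illustrative)
  renewBefore: 360h                    # renew 15 days before expiry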

Security Layers

  • mTLS — Envoy sidecar on every pod
  • Network Policies — Cilium deny-all + explicit allows
  • RBAC — LDAP-backed access control
  • DCPS — External secrets from DataCenter Password Store
  • Non-root — All containers run as UID 1000
  • Image Scanning — Anchore + Twistlock
  • PII Masking — Mosaic strips PII before LLM calls
12 Environments

Environment Hierarchy

Local

  • Docker Compose or Kind
  • Mock Kaa certs. Low resources.
docker compose up -d --build
# UI: localhost:8060
# CS: localhost:8040
# AO: localhost:8050

Lab / Skuld

  • Carthagelab/Skuld clusters
  • Real Kaa certs. DCPS secrets
  • Branch: main/develop

Production

  • Skuld clusters. Multi-region DCs
  • Full mTLS, Anchore required
  • Branch: release/prod. Director approval.

Branch Mapping

BranchEnvironmentClusters
release/prodProductionAll Skuld DCs
release/previewPreviewSelect Skuld DCs
main/developLab / SkuldCarthagelab, Skuld004
Feature branchesLab onlyCarthagelab
13 Operations

Day-2 Operations

Cluster Access

  1. bssh sk8sops01.ycg0
  2. start_k8toolbox
  3. k login c003.ycg
  4. k9s

| k9s Key | Action |
|---|---|
| :ns | Browse namespaces |
| :po | List pods |
| l | View logs |
| s | Shell into pod |
| d | Describe resource |

Common kubectl

# Pods
kubectl -n nextwave get pods
kubectl -n nextwave logs -f ao-xxx
kubectl -n nextwave logs --previous ao-xxx
kubectl -n nextwave describe pod ao-xxx
kubectl -n nextwave exec -it ao-xxx -- /bin/bash
# Deployments
kubectl -n nextwave get deploy
kubectl -n nextwave rollout restart deploy/ao
kubectl -n nextwave rollout status deploy/ao
kubectl -n nextwave top pods
kubectl -n nextwave get events --sort-by='.lastTimestamp'

Troubleshooting

Pod in CrashLoopBackOff (common)

  • Cause: App crashes on startup (bad env var, missing secret, OOM)
  1. kubectl describe pod — check Events
  2. kubectl logs --previous — last crash log
  3. Look for OOMKilled, missing env vars, TLS cert errors
  4. Check Splunk for the pod's last output

OOMKilled (common)

  • Cause: Exceeded memory limit. K8s kills immediately.
  1. Describe pod → look for Reason: OOMKilled
  2. Check Grafana memory dashboard for usage trend
  3. Increase memory limit in Helm overlay (or CMDB for prod)
  4. For AO: check if LLM logging is enabled (high memory)

Heimdall deploy failed (critical)

  • Cause: Helm render error, config merge issue, or K8s rejected manifest
  1. Check MosaicHeimdallFailedToApply alert
  2. Look at Heimdall controller logs
  3. Verify git ref in CMDB is valid
  4. Check YAML syntax in overlay values
  5. atomic: true = rolled back, previous version still running

Valkey connection issues (warning)

  • Cause: TLS cert mismatch, shard down, or network policy blocking
  1. Check Valkey pods: kubectl get pods | grep valkey
  2. Check valkey_health_node_status in Grafana
  3. Verify CiliumNetworkPolicy allows service → valkey:6379
  4. Shell into app and test: valkey-cli -h valkey-0 ping

TLS / mTLS certificate errors (warning)

  • Cause: cert-manager failed to issue/renew, or wrong CA
  1. Check cert-manager: kubectl get certificate
  2. Describe Certificate → look for Ready: True
  3. Verify Secret exists: kubectl get secret
  4. For local: use Mock Kaa Toolkit certs

Setup Scripts (offglide-services-setup/)

| Script | Purpose |
|---|---|
| 1-clone-repositories.sh | Clone/update all service repos (SSH or HTTPS) |
| 2-build-and-deploy.sh | Build all with Docker Compose; flags: --forceenv, --purge, --parallel |
| 3-validate-deployment.sh | Health check all services |
| 4-mosaic-gaic-deployment.sh | Local Mosaic Docker (setup/start/stop) |
| 5-update-repos.sh | Safe update: stash → pull → restore |
14 Glossary

Glossary

SnowK8s

  • ServiceNow's internal K8s platform for shared services across 30+ data centers globally

Heimdall

  • Deployment orchestrator. Merges config layers, renders Helm, applies to clusters. Never run Helm directly.

Bifrost

  • Replicates code (GHE → GitLab) and container images across registries

Kaa

  • Certificate management (cert-manager + EJBCA). Auto-provisions mTLS certs for all services.

Hermes

  • Kafka-based event streaming. Audit logs, analytics, async messaging across Offglide.

Radium

  • Per-cluster monitoring stack: Prometheus + Grafana + AlertManager

Valkey

  • Redis-compatible in-memory store. Session state, cache, execution plans.

Mosaic (GAIC)

  • LLM proxy with PII masking, multi-provider routing, audit logging via Hermes to S3

Agent Orchestrator

  • Python/FastAPI brain. Runs 18-stage AI pipeline, makes LLM calls, manages multi-turn state.

Conversation Server

  • Java/Spring Boot API gateway. Auth, session mgmt, SSE proxying.

MCP Server

  • Tool gateway using Model Context Protocol. Executes tools on Glide REST APIs.

LBF Client

  • React/TS chat UI. Streaming tokens, forms, citations. Web component distribution.

Cilium

  • CNI plugin for networking + network policy enforcement using eBPF. VxLAN overlay mode.

DCPS

  • DataCenter Password Store. External-secrets operator pulls into K8s Secrets.

Anchore

  • Container vulnerability scanner. No critical/high CVEs before production.

ARK

  • Architecture Review for Kubernetes. Required before onboarding to SnowK8s.

Skuld

  • Production cluster family. Live customer traffic. Multi-region. Strict RBAC.

Ragnarok

  • Sandbox cluster family. cluster-admin access. Safe experimentation.

DISHv2 / MIMIR

  • Service discovery. Glide uses DISHv2 + MIMIR to resolve Offglide endpoints.

SSE

  • Server-Sent Events. AO streams response tokens to client in real-time.

Execution Plan

  • Multi-turn state tracker. READY → IN_PROGRESS → COMPLETED. Persisted in Valkey.

Inception

  • Cluster bootstrap using Ansible + Kubespray. Creates K8s clusters from bare metal.