Scaling OpenShift Monitoring with TypeScript, CDK8s, and ArgoCD

The Problem: Fragmented Monitoring

Monitoring in Kubernetes and OpenShift is widely recognized as critical. In practice, however, it often ends up fragmented, inconsistent, and hard to evolve. The larger and older your platform becomes, the worse the problem gets.

In my case, I was working with an OpenShift landscape that had grown over several years:

  • Multiple test, staging, and production clusters
  • Numerous teams and services, both old and new
  • A patchwork of monitoring approaches across environments

The consequences were clear:

  • Some services weren’t monitored at all
  • Others had alerts in one environment but not in another
  • Nobody could say with certainty whether monitoring was truly “complete”

This situation is unsustainable for any platform at scale.

The challenge of organically grown OpenShift landscapes

If you’ve been running OpenShift for a few years, this story will sound familiar. You start with a single cluster for development. Later, production clusters are added. Eventually, you’re operating half a dozen clusters, each with its own quirks.

Monitoring evolves piecemeal:

  • One team spins up Prometheus here, another adds a Grafana dashboard there
  • Rules and alerts are defined manually, often as YAML manifests
  • Configurations drift, coverage gaps appear, and fragile systems emerge

The result is a patchwork of monitoring rules that no one can fully trust. This is where treating monitoring as code changes the game.

The idea: monitoring as a library

Instead of managing monitoring manifests by hand, I built a TypeScript library on top of CDK8s (Cloud Development Kit for Kubernetes).

This library integrates directly with the Prometheus Operator in OpenShift/Kubernetes and provides a high-level abstraction for monitoring. Developers no longer need to learn PromQL or author PrometheusRule manifests. Instead, they simply declare which services should be monitored, and the library handles the rest.
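To make "declare, and the library handles the rest" more concrete, here is a minimal, dependency-free sketch. It does not use the real library or the cdk8s API: the MonitoredService shape and the toServiceMonitor helper are hypothetical, and the output is a plain object in the shape of a Prometheus Operator ServiceMonitor (monitoring.coreos.com/v1). It assumes the target Service carries an app=<name> label and exposes a named metrics port.

```typescript
// Hypothetical declaration a consumer writes; field names are illustrative only.
interface MonitoredService {
  name: string;         // service name, also used as the app label selector
  namespace: string;    // namespace the service runs in
  metricsPort: string;  // name of the Service port that exposes metrics
  metricsPath?: string; // defaults to /metrics
}

// Minimal shape of a Prometheus Operator ServiceMonitor (monitoring.coreos.com/v1).
interface ServiceMonitor {
  apiVersion: "monitoring.coreos.com/v1";
  kind: "ServiceMonitor";
  metadata: { name: string; namespace: string; labels: Record<string, string> };
  spec: {
    selector: { matchLabels: Record<string, string> };
    endpoints: { port: string; path: string; interval: string }[];
  };
}

// Expand the high-level declaration into a ServiceMonitor manifest.
// Assumes the target Service is labeled app=<name>; the 30s interval is an assumed default.
function toServiceMonitor(svc: MonitoredService): ServiceMonitor {
  return {
    apiVersion: "monitoring.coreos.com/v1",
    kind: "ServiceMonitor",
    metadata: {
      name: `${svc.name}-monitor`,
      namespace: svc.namespace,
      labels: { "app.kubernetes.io/managed-by": "monitoring-library" }, // hypothetical value
    },
    spec: {
      selector: { matchLabels: { app: svc.name } },
      endpoints: [{ port: svc.metricsPort, path: svc.metricsPath ?? "/metrics", interval: "30s" }],
    },
  };
}

const sm = toServiceMonitor({ name: "orders", namespace: "shop", metricsPort: "http-metrics" });
console.log(JSON.stringify(sm, null, 2));
```

The point of the design is that the consumer never sees the selector or endpoint structure; changing a default (for example the scrape interval) happens once in the library and reaches every service on the next synth.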

Here’s how it works:

  1. A consumer project imports the library and declares its monitoring needs.
  2. The CI/CD pipeline runs cdk8s synth, generating manifests such as PrometheusRules, ServiceMonitors, and PodMonitors.
  3. The pipeline commits these manifests back to Git.
  4. ArgoCD detects the changes and applies them to the appropriate OpenShift cluster.
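Step 2 is where manifests are written to disk: cdk8s synth writes the synthesized objects as YAML files, by default into a dist/ directory. The sketch below imitates that step without the cdk8s dependency and adds a per-stage output path, so that one ArgoCD Application can watch one directory per stage. The manifests/<stage>/<service>.k8s.yaml layout and all helper names are assumptions. Each document is emitted as JSON, which is valid YAML, to avoid pulling in a YAML library.

```typescript
// A generic Kubernetes object: only the fields this sketch relies on are typed.
interface K8sObject {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace?: string };
  [field: string]: unknown;
}

type Stage = "INT" | "SYT" | "PROD";

// Serialize several objects into one multi-document YAML string.
// JSON is valid YAML, so JSON.stringify avoids a YAML dependency here.
function toMultiDocYaml(objects: K8sObject[]): string {
  return objects.map((o) => JSON.stringify(o, null, 2)).join("\n---\n") + "\n";
}

// Hypothetical repository layout: one directory per stage, one file per service.
function outputPath(stage: Stage, service: string): string {
  return `manifests/${stage.toLowerCase()}/${service}.k8s.yaml`;
}

// Build the file map (path -> content) that the pipeline would commit back to Git.
function synthesize(stage: Stage, service: string, objects: K8sObject[]): Record<string, string> {
  const files: Record<string, string> = {};
  files[outputPath(stage, service)] = toMultiDocYaml(objects);
  return files;
}

const files = synthesize("PROD", "orders", [
  { apiVersion: "monitoring.coreos.com/v1", kind: "ServiceMonitor", metadata: { name: "orders-monitor" } },
  { apiVersion: "monitoring.coreos.com/v1", kind: "PrometheusRule", metadata: { name: "orders-rules" } },
]);
Object.keys(files).forEach((path) => console.log(`# ${path}\n${files[path]}`));
```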

Monitoring becomes pure code: versioned, reviewable, automated, and repeatable.

How It Works: From Commit to Cluster

Developer Workflow

For developers, onboarding monitoring is simple:

  1. Commit your intent
    • Declare which service should be monitored.
    • Specify the target stages (INT, SYT, PROD, etc.).
    • Configure alert channels, such as your team’s Slack channels.
    • Add any metadata your team requires, such as cluster name or environment.
  2. Push to GitLab
    • Once you push your changes, the pipeline takes over.

That’s it! Developers just commit TypeScript configuration alongside their service code.
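Because the intent is TypeScript rather than free-form YAML, mistakes can be caught before the pipeline synthesizes anything. The following is a sketch only: the MonitoringIntent shape, the set of required metadata keys, and the Slack channel convention are my assumptions, not the library's actual contract.

```typescript
// Stages named in the text; the exact set is an assumption of this sketch.
type Stage = "INT" | "SYT" | "PROD";

// Hypothetical shape of the intent a team commits alongside its service code.
interface MonitoringIntent {
  service: string;
  stages: Stage[];
  slackChannels: string[];
  metadata: Record<string, string>;
}

// Metadata keys this sketch treats as mandatory (illustrative, not the library's rules).
const REQUIRED_METADATA = ["cluster", "environment"];

// Return human-readable problems; an empty list means the intent is valid.
function validateIntent(intent: MonitoringIntent): string[] {
  const errors: string[] = [];
  // DNS-1123 label: lowercase alphanumerics and dashes, no leading/trailing dash.
  if (!/^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/.test(intent.service)) {
    errors.push(`service "${intent.service}" is not a valid Kubernetes name`);
  }
  if (intent.stages.length === 0) errors.push("at least one stage is required");
  intent.slackChannels.forEach((ch) => {
    if (ch.charAt(0) !== "#") errors.push(`Slack channel "${ch}" should start with #`);
  });
  REQUIRED_METADATA.forEach((key) => {
    if (!intent.metadata[key]) errors.push(`metadata key "${key}" is missing`);
  });
  return errors;
}

const ok: MonitoringIntent = {
  service: "orders",
  stages: ["INT", "PROD"],
  slackChannels: ["#team-orders-alerts"],
  metadata: { cluster: "ocp-prod-1", environment: "production" },
};
console.log(validateIntent(ok)); // prints []
console.log(validateIntent({ ...ok, service: "Orders_Svc", metadata: {} }));
```

The second call reports three problems: an invalid service name and two missing metadata keys.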


Behind the Scenes

  • Pipeline execution – the CI/CD pipeline invokes the TypeScript + CDK8s library.
  • YAML generation – manifests (PrometheusRules, PodMonitors, Alertmanager routes, etc.) are generated automatically.
  • GitOps handoff – the manifests are committed back to Git.
  • ArgoCD deployment – ArgoCD applies the configuration to the correct OpenShift stage.
  • Active monitoring – Prometheus and Alertmanager begin scraping and evaluating rules immediately.

Features at a Glance

🔹 Default Rules (applied to every declared service)

  • Unwanted Pod Existence – detect new services not yet monitored
  • Missing Metrics – flag services with no metrics exposed
  • Pod Pending – alert when pods stay pending too long
  • Pod Crashing – detect crash loops or frequent restarts
  • Pod Availability – ensure replicas meet availability targets
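As an illustration of what the default rules might expand to, the sketch below builds one PrometheusRule (monitoring.coreos.com/v1) per service. The expressions use standard kube-state-metrics series (kube_pod_info, kube_pod_status_phase, kube_pod_container_status_restarts_total, kube_deployment_status_replicas_available, kube_deployment_spec_replicas) plus Prometheus's up and absent(). The thresholds, for durations, alert names, severity, and the assumptions that pods are named <service>-..., the Deployment is named after the service, and the scrape job label equals the service name are mine, not the library's.

```typescript
interface AlertRule {
  alert: string;
  expr: string;
  for: string;
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

interface PrometheusRule {
  apiVersion: "monitoring.coreos.com/v1";
  kind: "PrometheusRule";
  metadata: { name: string; namespace: string };
  spec: { groups: { name: string; rules: AlertRule[] }[] };
}

// Build one alert; every default rule in this sketch is a warning with a summary.
function rule(alert: string, expr: string, forDuration: string, summary: string): AlertRule {
  return { alert, expr, for: forDuration, labels: { severity: "warning" }, annotations: { summary } };
}

// Expand the five default rules for one service. "declared" lists every declared
// service in the namespace so that unknown pods can be detected.
function defaultRules(service: string, namespace: string, declared: string[]): PrometheusRule {
  const pod = `namespace="${namespace}", pod=~"${service}-.*"`;
  const job = `namespace="${namespace}", job="${service}"`;
  const dep = `namespace="${namespace}", deployment="${service}"`;
  const known = declared.map((s) => `${s}-.*`).join("|");
  const rules: AlertRule[] = [
    rule("UnwantedPodExistence",
      `count by (pod) (kube_pod_info{namespace="${namespace}", pod!~"${known}"}) > 0`,
      "10m", "Pod belongs to no declared service"),
    rule("MissingMetrics", `absent(up{${job}}) or up{${job}} == 0`, "10m", `${service} exposes no metrics`),
    rule("PodPending", `sum by (pod) (kube_pod_status_phase{${pod}, phase="Pending"}) > 0`, "15m", "Pod stuck in Pending"),
    rule("PodCrashing", `increase(kube_pod_container_status_restarts_total{${pod}}[15m]) > 3`, "5m", "Pod restarts frequently"),
    rule("PodAvailability",
      `kube_deployment_status_replicas_available{${dep}} < kube_deployment_spec_replicas{${dep}}`,
      "10m", "Fewer replicas available than desired"),
  ];
  return {
    apiVersion: "monitoring.coreos.com/v1",
    kind: "PrometheusRule",
    metadata: { name: `${service}-default-rules`, namespace },
    spec: { groups: [{ name: `${service}.default`, rules }] },
  };
}

const pr = defaultRules("orders", "shop", ["orders", "billing"]);
console.log(JSON.stringify(pr, null, 2));
```

In a real setup the UnwantedPodExistence rule would probably be emitted once per namespace rather than once per service; it is kept here to show all five defaults in one place.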

🔹 Additional Rules (added dynamically on demand)

  • PVC Thresholds – monitor persistent volume usage
  • Kafka Rules – track consumer lag and broker health
  • HTTP Rules – check REST endpoints for availability and latency
  • Custom PromQL – wrap custom queries into standardized PrometheusRules
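Additional rules could be modeled as small functions that each return an extra alert rule, appended to a service's rule group on demand. The sketch below assumes that kubelet volume metrics (kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes) are available, that a Kafka exporter publishes kafka_consumergroup_lag, and that a Blackbox Exporter probe exposes probe_success under a job named <service>-http. All function names, thresholds, and the standardized label set are hypothetical.

```typescript
interface AlertRule {
  alert: string;
  expr: string;
  for: string;
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

// Apply the standardized structure every rule gets, built-in or custom (illustrative defaults).
function standardize(
  service: string,
  r: { alert: string; expr: string; for?: string; severity?: string; summary: string },
): AlertRule {
  return {
    alert: r.alert,
    expr: r.expr,
    for: r.for ?? "5m",
    labels: { severity: r.severity ?? "warning", service },
    annotations: { summary: r.summary },
  };
}

// PVC usage above a ratio, e.g. 0.8 for 80 percent.
function pvcThreshold(service: string, namespace: string, ratio: number): AlertRule {
  const ns = `namespace="${namespace}"`;
  return standardize(service, {
    alert: "PvcUsageHigh",
    expr: `kubelet_volume_stats_used_bytes{${ns}} / kubelet_volume_stats_capacity_bytes{${ns}} > ${ratio}`,
    summary: `PVC usage above ${Math.round(ratio * 100)}%`,
  });
}

// Assumes kafka_consumergroup_lag from a Kafka exporter is being scraped.
function kafkaLag(service: string, group: string, maxLag: number): AlertRule {
  return standardize(service, {
    alert: "KafkaConsumerLagHigh",
    expr: `sum by (consumergroup) (kafka_consumergroup_lag{consumergroup="${group}"}) > ${maxLag}`,
    for: "10m",
    summary: `Consumer group ${group} lags by more than ${maxLag} messages`,
  });
}

// Assumes a Blackbox Exporter probe whose job label is "<service>-http".
function httpAvailability(service: string): AlertRule {
  return standardize(service, {
    alert: "HttpEndpointDown",
    expr: `probe_success{job="${service}-http"} == 0`,
    severity: "critical",
    summary: `HTTP endpoint of ${service} is not reachable`,
  });
}

// Custom PromQL: the caller supplies the query, the library supplies the structure.
function customRule(service: string, alert: string, expr: string, summary: string): AlertRule {
  return standardize(service, { alert, expr, summary });
}

const extra: AlertRule[] = [
  pvcThreshold("orders", "shop", 0.8),
  kafkaLag("orders", "orders-consumer", 1000),
  httpAvailability("orders"),
  customRule("orders", "OrdersBacklog", "orders_open_total > 500", "Too many open orders"), // hypothetical metric
];
console.log(JSON.stringify(extra, null, 2));
```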

🔹 Alerting & Notification Integrations

  • Slack integration – default and team-specific channels
  • ITIL/Webhook integration – send alerts into ticketing or incident systems
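One way to express these integrations with the Prometheus Operator is an AlertmanagerConfig resource (monitoring.coreos.com/v1alpha1). The sketch below assumes a service label on every alert, a Secret named slack-webhook holding the Slack webhook URL, and a placeholder ticketing URL; none of these come from the library. Critical alerts go to the ticketing webhook and, because of continue: true, also reach the following Slack child route.

```typescript
interface Matcher { name: string; value: string; matchType: "=" | "!=" | "=~" | "!~" }
interface Route { receiver: string; matchers?: Matcher[]; continue?: boolean; groupBy?: string[]; routes?: Route[] }

interface AlertmanagerConfig {
  apiVersion: "monitoring.coreos.com/v1alpha1";
  kind: "AlertmanagerConfig";
  metadata: { name: string; namespace: string };
  spec: {
    route: Route;
    receivers: {
      name: string;
      slackConfigs?: { channel: string; apiURL: { name: string; key: string }; sendResolved: boolean }[];
      webhookConfigs?: { url: string; sendResolved: boolean }[];
    }[];
  };
}

// Route one service's alerts to its team Slack channel; critical alerts also open a ticket.
function alertRouting(service: string, namespace: string, slackChannel: string, ticketUrl: string): AlertmanagerConfig {
  const byService: Matcher = { name: "service", value: service, matchType: "=" };
  return {
    apiVersion: "monitoring.coreos.com/v1alpha1",
    kind: "AlertmanagerConfig",
    metadata: { name: `${service}-alerting`, namespace },
    spec: {
      route: {
        receiver: "team-slack", // fallback if no child route matches
        groupBy: ["alertname"],
        matchers: [byService],
        routes: [
          // continue: true lets a matching critical alert also reach the next sibling (Slack).
          { receiver: "itil-webhook", matchers: [{ name: "severity", value: "critical", matchType: "=" }], continue: true },
          { receiver: "team-slack" },
        ],
      },
      receivers: [
        {
          name: "team-slack",
          // The Slack webhook URL lives in a Secret; its name and key are assumptions.
          slackConfigs: [{ channel: slackChannel, apiURL: { name: "slack-webhook", key: "url" }, sendResolved: true }],
        },
        { name: "itil-webhook", webhookConfigs: [{ url: ticketUrl, sendResolved: false }] },
      ],
    },
  };
}

const amc = alertRouting("orders", "shop", "#team-orders-alerts", "https://itsm.example.com/hooks/alerts");
console.log(JSON.stringify(amc, null, 2));
```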

🔹 Ecosystem Integrations

  • ConfigMaps with Metadata
    • Automatically generated to provide metadata such as service names and environments.
    • Can be consumed by third-party tools like Grafana as input variables for dashboards.
  • RBAC Resources for Centralized Metrics
    • Generates RBAC policies so that service metrics can be collected securely.
    • Ensures all metrics flow into a central Thanos pool, enabling global queries and unified dashboards.
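A sketch of the two generated resource types, under explicit assumptions: the ConfigMap keys are illustrative, and the RBAC part assumes the common Prometheus Operator pattern in which a Prometheus instance in another namespace needs get/list/watch on services, endpoints, and pods to discover scrape targets. The service account name prometheus and the namespace central-monitoring are hypothetical placeholders, not the actual setup.

```typescript
// Metadata ConfigMap; its keys could back dashboard variables in a tool such as Grafana.
function metadataConfigMap(service: string, namespace: string, env: string, cluster: string) {
  return {
    apiVersion: "v1",
    kind: "ConfigMap",
    metadata: {
      name: `${service}-monitoring-metadata`,
      namespace,
      labels: { "app.kubernetes.io/part-of": "monitoring" },
    },
    data: { service, environment: env, cluster },
  };
}

// Role + RoleBinding that let a central Prometheus service account discover targets here.
// The service account name and its namespace are assumptions of this sketch.
function scrapeRbac(namespace: string, promSa = "prometheus", promNs = "central-monitoring") {
  const role = {
    apiVersion: "rbac.authorization.k8s.io/v1",
    kind: "Role",
    metadata: { name: "metrics-scraper", namespace },
    rules: [{ apiGroups: [""], resources: ["services", "endpoints", "pods"], verbs: ["get", "list", "watch"] }],
  };
  const binding = {
    apiVersion: "rbac.authorization.k8s.io/v1",
    kind: "RoleBinding",
    metadata: { name: "metrics-scraper", namespace },
    roleRef: { apiGroup: "rbac.authorization.k8s.io", kind: "Role", name: role.metadata.name },
    subjects: [{ kind: "ServiceAccount", name: promSa, namespace: promNs }],
  };
  return { role, binding };
}

const cm = metadataConfigMap("orders", "shop", "production", "ocp-prod-1");
const rbac = scrapeRbac("shop");
console.log(JSON.stringify([cm, rbac.role, rbac.binding], null, 2));
```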

🔹 Governance & Traceability

  • Version Annotations → Every generated resource is annotated with the library version that produced it, making it easy to identify outdated or deprecated rules.
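A minimal sketch of how version annotations enable governance: stamp every generated object, then list the ones produced by a version older than some minimum. The annotation key monitoring.example.com/library-version and the version numbers are hypothetical, and the comparison handles plain numeric major.minor.patch strings only.

```typescript
// Hypothetical annotation key and library version; neither is taken from the real library.
const VERSION_ANNOTATION = "monitoring.example.com/library-version";
const LIBRARY_VERSION = "2.3.1";

interface Manifest {
  apiVersion: string;
  kind: string;
  metadata: { name: string; annotations?: Record<string, string> };
}

// Stamp every manifest with the version of the library that generated it.
function annotateVersion(objects: Manifest[], version: string = LIBRARY_VERSION): Manifest[] {
  return objects.map((o) => ({
    ...o,
    metadata: { ...o.metadata, annotations: { ...(o.metadata.annotations || {}), [VERSION_ANNOTATION]: version } },
  }));
}

// Compare dotted numeric versions; returns a negative, zero, or positive number.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] || 0) - (pb[i] || 0);
    if (d !== 0) return d;
  }
  return 0;
}

// List resources that are unannotated or were generated by a version below minVersion.
function findOutdated(objects: Manifest[], minVersion: string): string[] {
  return objects
    .filter((o) => {
      const v = o.metadata.annotations && o.metadata.annotations[VERSION_ANNOTATION];
      return !v || compareVersions(v, minVersion) < 0;
    })
    .map((o) => `${o.kind}/${o.metadata.name}`);
}

const current = annotateVersion([{ apiVersion: "monitoring.coreos.com/v1", kind: "PrometheusRule", metadata: { name: "orders-default-rules" } }]);
const legacy = annotateVersion([{ apiVersion: "monitoring.coreos.com/v1", kind: "PrometheusRule", metadata: { name: "billing-default-rules" } }], "1.9.0");
console.log(findOutdated(current.concat(legacy), "2.0.0")); // lists only the billing rule
```

Comparing numerically rather than as strings matters: "2.10.0" is newer than "2.9.3", which a plain string comparison would get wrong.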

🔹 GitOps-Ready Workflow

  • Automated YAML generation via CDK8s.
  • GitOps pipeline integration: manifests are committed to Git.
  • ArgoCD deployment ensures monitoring resources are always in sync and self-healing.
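The "in sync and self-healing" behavior comes from Argo CD's automated sync policy: with selfHeal enabled, Argo CD reverts manual changes in the cluster, and with prune enabled it deletes resources that were removed from Git. The sketch below generates one Application (argoproj.io/v1alpha1) per stage. The cluster URLs, repository URL, the openshift-gitops namespace, the main branch, and the manifests/<stage> layout are assumptions for illustration.

```typescript
type Stage = "INT" | "SYT" | "PROD";

// Hypothetical API endpoints of the OpenShift cluster behind each stage.
const CLUSTERS: Record<Stage, string> = {
  INT: "https://api.ocp-int.example.com:6443",
  SYT: "https://api.ocp-syt.example.com:6443",
  PROD: "https://api.ocp-prod.example.com:6443",
};

// One Application per stage, pointing at the directory the pipeline commits to.
// prune deletes resources removed from Git; selfHeal reverts drift made in the cluster.
function monitoringApplication(stage: Stage, repoURL: string, namespace: string) {
  const dir = stage.toLowerCase();
  return {
    apiVersion: "argoproj.io/v1alpha1",
    kind: "Application",
    metadata: { name: `monitoring-${dir}`, namespace: "openshift-gitops" },
    spec: {
      project: "default",
      source: { repoURL, targetRevision: "main", path: `manifests/${dir}` },
      destination: { server: CLUSTERS[stage], namespace },
      syncPolicy: { automated: { prune: true, selfHeal: true } },
    },
  };
}

const apps = (["INT", "SYT", "PROD"] as Stage[]).map((s) =>
  monitoringApplication(s, "https://gitlab.example.com/platform/monitoring-manifests.git", "shop"));
console.log(JSON.stringify(apps, null, 2));
```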

End-to-End Picture

The following picture shows the end-to-end flow, from the consumer’s code to the deployed alert rule.

Picture 1

Why This Matters

Monitoring shouldn’t be a fragile patchwork of YAML files. It should be:

  • Declarative – defined as intent, not manual config
  • Automated – generated and deployed by pipelines
  • Consistent – applied uniformly across environments
  • Testable – validated like application code
  • Traceable – versioned and annotated for governance
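"Testable" can be taken literally: generated rules are plain data, so a CI job can check them before anything is committed. The following is a sketch of such a policy check; the required severity label, the summary annotation, and the duplicate-name rule are assumed policies, not the library's actual ones.

```typescript
// Minimal types for the parts of a PrometheusRule this check inspects.
interface Rule { alert: string; expr: string; labels?: Record<string, string>; annotations?: Record<string, string> }
interface RuleManifest { kind: string; metadata: { name: string }; spec: { groups: { name: string; rules: Rule[] }[] } }

// Policy checks a CI job could run against synthesized rules (assumed policy set).
function lintRules(m: RuleManifest): string[] {
  const problems: string[] = [];
  const seen: Record<string, boolean> = {};
  m.spec.groups.forEach((g) => {
    g.rules.forEach((r) => {
      const id = `${m.metadata.name}/${g.name}/${r.alert}`;
      if (seen[id]) problems.push(`${id}: duplicate alert name`);
      seen[id] = true;
      if (!r.labels || !r.labels.severity) problems.push(`${id}: missing severity label`);
      if (!r.annotations || !r.annotations.summary) problems.push(`${id}: missing summary annotation`);
      if (r.expr.trim() === "") problems.push(`${id}: empty expression`);
    });
  });
  return problems;
}

const sample: RuleManifest = {
  kind: "PrometheusRule",
  metadata: { name: "orders-default-rules" },
  spec: {
    groups: [{
      name: "orders.default",
      rules: [
        { alert: "PodCrashing", expr: "increase(kube_pod_container_status_restarts_total[15m]) > 3",
          labels: { severity: "warning" }, annotations: { summary: "Pod restarts frequently" } },
        { alert: "PodPending", expr: 'kube_pod_status_phase{phase="Pending"} > 0' },
      ],
    }],
  },
};

// In CI, a non-empty list would fail the pipeline before anything is committed.
const problems = lintRules(sample);
problems.forEach((p) => console.log(p));
```

Here the second rule is reported twice, once for the missing severity label and once for the missing summary.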

By combining TypeScript, CDK8s, and ArgoCD, we created a monitoring system that scales naturally with our OpenShift landscape instead of holding it back.

  • Define once in a library
  • Apply everywhere in a consistent way
  • Onboard new services effortlessly

Monitoring is no longer optional or an afterthought. It’s built in, standardized, and self-healing: a first-class citizen of the platform.

