DataObs — Cloud-Agnostic Data Observability Platform

Badges: CI · Release · Latest Release · GHCR · Trivy Security Scan · Security: Critical CVEs · Python · OpenTelemetry · License: MIT

DataObs is a production-ready blueprint and implementation starter for end-to-end observability across infrastructure, data pipelines, data quality/freshness/lineage, and business impact — built on OpenTelemetry with dual back-end support for Elasticsearch/Kibana and Grafana Cloud.


Overview

DataObs solves the "dark pipeline" problem: large-scale data platforms emit enormous volumes of telemetry but most of it is never correlated, analysed, or acted on in time to prevent data quality incidents.

This repository provides:

  • A 4-pillar observability model (Full-Stack, Pipeline, Data, Business) as both architectural guidance and runnable code
  • OpenTelemetry-native instrumentation for Python services, Apache Spark jobs, AWS Lambda, and EC2-hosted runtimes
  • A dual back-end strategy: Elasticsearch/Kibana for on-prem/hybrid, and Grafana Cloud (Tempo/Loki/Mimir) for SaaS
  • Alerting integrations for ServiceNow, PagerDuty, and Slack out of the box
  • A Helm chart and raw Kubernetes manifests for production deployment
  • Terraform modules for AWS infrastructure provisioning

Architecture

                        ┌─────────────────────────────────────────────────────┐
                        │                  Signal Sources                     │
                        │  Python Apps · Spark Jobs · Lambda · EC2 · K8s pods │
                        │  (OTel SDK — semconv resource attributes required)   │
                        └───────────────────────┬─────────────────────────────┘
                                                │  OTLP (gRPC / HTTP)
                        ┌───────────────────────▼─────────────────────────────┐
                        │         OTel Collector / Grafana Alloy               │
                        │  Receive → Semconv enrich → Normalize → Batch → Route│
                        │  spanmetrics connector (RED metrics from traces)      │
                        │  servicegraph connector (topology map)               │
                        └──┬──────────┬───────────────┬──────────┬────────────┘
                           │          │               │          │
          ┌────────────────▼──┐  ┌────▼────────┐  ┌──▼──────┐  ┌──▼──────────────┐
          │ Elasticsearch 9.x │  │  OpenSearch │  │   AMP   │  │  Grafana Cloud  │
          │ AIOps + ML        │  │ self / AWS  │  │(metrics)│  │ OTLP gateway    │
          │ ECS mapping       │  │ via OSIS or │  │         │  │ Tempo/Loki/Mimir│
          │ Kibana dashboards │  │ Data Prepper│  └────┬────┘  └──────────────────┘
          └──────────┬────────┘  └──────┬──────┘       │
                     │                  │         ┌─────▼──────────────────────────┐
                     │                  │         │   Amazon Managed Grafana       │
                     │                  └────────►│   Data sources: AMP + OSIS     │
                     │                            │   Service map · Anomaly detect  │
                     │                            │   Drilldown correlations        │
                     │                            └────────────────────────────────┘
                     │
          ┌──────────▼──────────────────────────────────────────────────────┐
          │                  DataObs Platform                                │
          │  Quality Engine · Freshness SLA · Lineage · Rules API           │
          └──────────────────────────────────┬──────────────────────────────┘
                                             │
          ┌──────────────────────────────────▼──────────────────────────────┐
          │              Alerting & ITSM                                     │
          │         ServiceNow · PagerDuty · Slack                          │
          └──────────────────────────────────────────────────────────────────┘

Backend selector

Set DATAOBS_BACKEND to activate the desired sink(s):

| Value | Metrics | Traces | Logs | Use case |
|---|---|---|---|---|
| elasticsearch | ES dataobs-metrics | ES dataobs-traces | ES dataobs-logs | Default: local / on-prem AIOps |
| opensearch | OpenSearch via Data Prepper | OpenSearch trace-analytics | OpenSearch logs | Self-managed OpenSearch |
| opensearch_aws | OSIS → OpenSearch domain | OSIS → OpenSearch domain | OSIS → OpenSearch domain | Amazon OpenSearch Service |
| aws_grafana | AMP → Amazon Managed Grafana | OSIS → OpenSearch → AMG | OSIS → OpenSearch → AMG | Full AWS managed stack |
| grafana_cloud | Grafana Cloud OTLP | Grafana Cloud OTLP | Grafana Cloud OTLP | Grafana Cloud SaaS |
| all | All of the above | All of the above | All of the above | Fan-out / migration |

See docs/aws-grafana-setup.md for the AWS provisioning walkthrough.
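
As a rough illustration of how a service might interpret DATAOBS_BACKEND, the sketch below maps each value from the table to a list of sink names. The helper `resolve_sinks`, the `BACKEND_SINKS` mapping, and the choice to fall back to `elasticsearch` when the variable is unset are assumptions made for this example, not code taken from the repository.

```python
import os

# Sink names mirror the values in the table above; the mapping itself is illustrative.
BACKEND_SINKS = {
    "elasticsearch": ["elasticsearch"],
    "opensearch": ["opensearch"],
    "opensearch_aws": ["opensearch_aws"],
    "aws_grafana": ["aws_grafana"],
    "grafana_cloud": ["grafana_cloud"],
}
# "all" fans out to every individual sink (the right-hand side is evaluated first,
# so "all" does not include itself).
BACKEND_SINKS["all"] = sorted(BACKEND_SINKS)


def resolve_sinks(env=None):
    """Return the sinks selected by DATAOBS_BACKEND (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: an unset variable falls back to the documented default.
    value = env.get("DATAOBS_BACKEND", "elasticsearch").strip().lower()
    if value not in BACKEND_SINKS:
        raise ValueError(f"Unsupported DATAOBS_BACKEND: {value!r}")
    return list(BACKEND_SINKS[value])
```

Failing fast on an unknown value keeps a typo from silently disabling every exporter.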

Signal flow for the Grafana Alloy integration:

Python Flask App
  └─► OTLP gRPC ──► Grafana Alloy pipeline
                        ├─ Resource detection (host.name, os.type)
                        ├─ Attribute enrichment / cleanup
                        ├─ Batch processor (512 records / 5 s)
                        └─► Grafana Cloud OTLP gateway
                                ├─► Tempo   (traces + service map)
                                ├─► Loki    (logs + trace correlation)
                                └─► Mimir   (metrics + exemplars)
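
The "512 records / 5 s" batch step in the flow above means a batch is sent when either limit is reached, whichever comes first. The class below is a minimal Python sketch of that size-or-age rule for illustration only. It is not how Alloy implements its batch processor, and unlike a real collector it only checks the age when `add` or `maybe_flush` is called rather than on a background timer. The names `Batcher`, `flush`, and `clock` are hypothetical.

```python
import time


class Batcher:
    """Flush when max_size records are buffered or the oldest is max_age seconds old."""

    def __init__(self, flush, max_size=512, max_age=5.0, clock=time.monotonic):
        self._flush = flush          # callable that receives one list of records
        self._max_size = max_size
        self._max_age = max_age
        self._clock = clock          # injectable so the timing rule can be tested
        self._buffer = []
        self._first_at = None        # arrival time of the oldest buffered record

    def add(self, record):
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(record)
        self.maybe_flush()

    def maybe_flush(self):
        if not self._buffer:
            return
        too_big = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
```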

Product Pillars

DataObs follows a 4-pillar model inspired by Datadog, Dynatrace, Monte Carlo, and Collibra:

| Pillar | What it covers | Key signals |
|---|---|---|
| Full-Stack Observability | Infrastructure, hosts, containers, K8s | CPU, memory, network, pod health |
| Pipeline Observability | Spark jobs, Kafka, Lambda, Glue | Records in/out, lag, duration, errors |
| Data Observability | Freshness, quality, schema, lineage | SLA breach, null rate, row count drift |
| Business Observability | KPIs, SLAs, revenue impact | Order value, conversion, anomaly alerts |

See the detailed model: docs/architecture/four-tower-model.md
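
To make two of the Data Observability signals concrete, here is a small sketch of a freshness SLA test and a row count drift ratio. Both functions are hypothetical and use only the standard library. The actual logic in src/quality/freshness.py and the row count check may differ.

```python
from datetime import datetime, timedelta, timezone


def freshness_breached(last_updated, max_age_hours, now=None):
    """True when the table's last update is older than its freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > timedelta(hours=max_age_hours)


def row_count_drift(previous, current):
    """Relative change in row count between two runs (0.2 means 20 percent)."""
    if previous == 0:
        # No baseline: any rows count as unbounded drift, none as no drift.
        return float("inf") if current else 0.0
    return abs(current - previous) / previous
```

A check would typically compare the drift ratio against a per-table threshold and raise an alert when it is exceeded.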


Repository Layout

DataObs separates production modules, proof-of-concept helpers, deployment assets, integrations, and documentation so new features have an obvious home. See the maintainer-focused repository structure and feature-development guide for extension points, quality-check registration, and issue-readiness notes.

DataObs/
├── .github/
│   └── workflows/
│       ├── ci.yml                        # CI: validate + test + build + Trivy scan
│       └── release.yml                   # Release: build & push versioned images to GHCR
│
├── src/
│   ├── api/
│   │   └── main.py                       # REST API — rules and lineage management
│   ├── core/
│   │   ├── pillars.py                    # 4-pillar model with maturity scoring
│   │   └── enterprise_blueprint.py       # Enterprise deployment blueprints
│   ├── quality/
│   │   ├── checks/                       # Pluggable data quality checks
│   │   │   ├── base.py
│   │   │   ├── null_check.py
│   │   │   ├── row_count_check.py
│   │   │   ├── schema_check.py
│   │   │   ├── uniqueness_check.py
│   │   │   └── value_range_check.py
│   │   ├── freshness.py                  # Freshness SLA enforcement
│   │   └── lineage.py                    # Data lineage graph + impact analysis
│   ├── alerting/
│   │   ├── servicenow.py                 # ServiceNow incident client
│   │   ├── pagerduty.py                  # PagerDuty Events API client
│   │   └── slack.py                      # Slack webhook alerting
│   └── analytics/
│       └── elasticsearch_ml.py           # ES native ML job helpers
│
├── config/
│   ├── dataobs.example.yaml              # Main platform configuration blueprint
│   ├── otel-collector-config.yaml        # OTel Collector pipelines and exporters
│   ├── otel-ec2-agent-config.yaml        # ADOT EC2 agent config
│   └── elasticsearch/
│       ├── server/                       # ILM, index templates, ingest pipelines
│       └── tenant/                       # Per-tenant OTel overlay configs
│
├── integrations/
│   └── grafana-alloy/                    # Grafana Alloy → Grafana Cloud stack
│       ├── app/                          # Sample Python app (OTel SDK instrumented)
│       ├── alloy/config.alloy            # Grafana Alloy River pipeline config
│       ├── grafana/provisioning/         # Datasources + dashboard JSON
│       ├── docs/GUIDE.md                 # 10-step setup guide
│       └── docker-compose.yml
│
├── k8s/                                  # Raw Kubernetes manifests
├── helm/dataobs/                         # Helm chart for production K8s deployment
│
├── infra/terraform/
│   ├── aws-ec2-otel-agent/               # EC2 OTel agent rollout via SSM
│   └── elasticsearch/                    # ES server and tenant Terraform modules
│
├── docs/
│   ├── architecture/                     # System diagrams and design decisions
│   ├── production/                       # Production runbooks
│   ├── integrations/                     # Integration guides (ServiceNow, etc.)
│   ├── towers/                           # Per-pillar deep-dives
│   └── wiki/                             # Operations wiki
│
├── tests/                                # pytest test suite
├── docker-compose.yml                    # Local full stack (ES + Kibana + OTel + DataObs)
├── Dockerfile.api
├── Dockerfile.quality
├── requirements.txt
└── README.md
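
The layout lists pluggable checks under src/quality/checks/ with a shared base.py. The sketch below shows one plausible shape for such a plugin interface: an abstract base class, a null-rate check, and a registry keyed by name. Every class, field, and the `CHECKS` registry are assumptions for illustration and are not copied from the repository.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


class BaseCheck(ABC):
    """Hypothetical common interface that every quality check implements."""

    name = "base"

    @abstractmethod
    def run(self, rows):
        """Evaluate an iterable of dict rows and return a CheckResult."""


class NullCheck(BaseCheck):
    name = "null_check"

    def __init__(self, column, max_null_rate=0.0):
        self.column = column
        self.max_null_rate = max_null_rate

    def run(self, rows):
        rows = list(rows)
        nulls = sum(1 for row in rows if row.get(self.column) is None)
        rate = nulls / len(rows) if rows else 0.0
        detail = f"null rate {rate:.2%} for column {self.column}"
        return CheckResult(self.name, rate <= self.max_null_rate, detail)


# Registry keyed by check name so new checks can be added without touching callers.
CHECKS = {NullCheck.name: NullCheck}
```

With this shape, adding a check means writing one subclass and registering it by name.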

Prerequisites

| Dependency | Version | Purpose |
|---|---|---|
| Python | 3.11+ | Runtime |
| Docker | 24+ | Local stack |
| Docker Compose | v2.20+ | Orchestration |
| Kubernetes | 1.27+ | Production deployment |
| Helm | 3.12+ | K8s package management |
| Terraform | 1.5+ | AWS infrastructure |
| Grafana Alloy | v1.8+ | OTel collector (Grafana Cloud integration) |

For the Grafana Cloud integration only:

  • A Grafana Cloud account (free tier works)
  • A service account token with the metrics:write, logs:write, and traces:write scopes

Quick Start

Local stack (Elasticsearch + Kibana + OTel Collector)

git clone https://github.com/Jagadeeshck/DataObs.git
cd DataObs

# Copy and populate configuration
cp config/dataobs.example.yaml config/dataobs.yaml
cp .env.example .env
# Edit .env: set ELASTIC_PASSWORD, KIBANA_PASSWORD, etc.

# Start full stack
docker compose up -d

# Verify
curl http://localhost:8080/health          # DataObs API
open http://localhost:5601                 # Kibana
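
Containers can take a while to become ready after docker compose up -d, so a single curl may fail at first. The following sketch polls a health URL until it returns HTTP 200. The helper names, the retry counts, and the assumption that /health answers with 200 when ready are illustrative rather than taken from the project.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code              # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None                  # connection refused, DNS failure, timeout


def wait_until_healthy(url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if probe(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


if __name__ == "__main__":
    print(wait_until_healthy("http://localhost:8080/health"))
```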

Grafana Cloud integration (Alloy + Tempo + Loki + Mimir)

cd integrations/grafana-alloy
cp .env.example .env
# Edit .env: GRAFANA_CLOUD_OTLP_ENDPOINT, GRAFANA_CLOUD_INSTANCE_ID, GRAFANA_CLOUD_API_KEY

docker compose --env-file .env up --build

The load generator fires traffic immediately. Traces, logs, and metrics appear in Grafana Cloud within ~90 seconds.

# Verify Alloy pipeline health
open http://localhost:12346          # Alloy UI — component graph

# Trigger a 4-span checkout trace
curl -X POST http://localhost:5001/checkout \
  -H "Content-Type: application/json" \
  -d '{"product_id":"P001","quantity":2}'

See integrations/grafana-alloy/docs/GUIDE.md for the complete 10-step walkthrough.


Troubleshooting

❌ Elasticsearch fails to start — "cannot upgrade a node from version [8.x] directly to version [9.x]"

Symptom

fatal exception while booting Elasticsearch
error.message: cannot upgrade a node from version [8.13.0] directly to version [9.x],
               upgrade to version [8.19.0] first.

Root cause

Elasticsearch stores its originating version in node metadata inside the esdata Docker volume. If you previously ran the stack with an older image (e.g. 8.13.0) and later pulled the current 9.x POC image, the new process reads the stale metadata and refuses to start. Elastic enforces a stepping-stone upgrade path: a node cannot jump directly from 8.x to 9.x and must first run the last minor release of 8.x (8.19.0). Because this is a local POC with no production data, the simplest fix is to delete the stale volume.

Fix — delete the stale esdata volume

# 1. Tear down all running containers
docker compose down

# 2. Confirm the exact volume name (usually prefixed with your folder name)
docker volume ls | grep esdata

# 3. Remove the stale volume
docker volume rm dataobs_esdata

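# 4. Ensure .env is populated (copy the template only if .env does not exist,
#    so existing credentials are not overwritten)
[ -f .env ] || cp .env.example .env
# Edit .env: set ELASTIC_PASSWORD and KIBANA_PASSWORD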

# 5. Re-start the stack from scratch
docker compose up -d

Verify the fix

# Watch ES boot — should report version 9.x with no errors
docker logs dataobs-es01 -f

# Confirm the running version
curl -s -u elastic:<your-password> http://localhost:9200 | jq .version.number
# Expected output: "9.4.2"

Kibana will be available at http://localhost:5601 once the es-setup init container finishes its one-shot password bootstrap, which satisfies the service_completed_successfully condition that the Kibana service waits on.

Note: This error can also appear if you restore an old esdata volume backup from a previous major version. The same fix applies — either delete the volume (POC) or perform the intermediate 8.19.0 upgrade step first (production).
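
The stepping-stone rule from the error message can be expressed as a small pre-flight check. The sketch below models only the 8.x to 9.x rule quoted above (pass through 8.19.0 first) and is not Elastic's full upgrade matrix. The function names are hypothetical.

```python
LAST_8X_MINOR = (8, 19, 0)  # stepping-stone release named in the error message


def parse_version(text):
    """Turn '8.13.0' into (8, 13, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def can_upgrade_directly(current, target):
    """True if a node on `current` may start on `target` without an intermediate step."""
    cur, tgt = parse_version(current), parse_version(target)
    if cur[0] == tgt[0]:
        return tgt >= cur                 # same major: only forward moves
    if cur[0] == 8 and tgt[0] == 9:
        return cur >= LAST_8X_MINOR       # must already be on the last 8.x minor
    return False                          # anything else is out of scope here
```

For example, `can_upgrade_directly("8.13.0", "9.4.2")` is False, which corresponds to the boot failure shown in the symptom.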


❌ es-setup container exits with non-zero code

Symptom: dataobs-es-setup exits with error Failed to set kibana_system password (HTTP 401).

Cause: ELASTIC_PASSWORD in .env does not match the password that was used when the esdata volume was first initialised.

Fix: Either update .env to match the original password, or delete the esdata volume (see above) and restart with a fresh password.


❌ Kibana shows "Kibana server is not ready yet"

Symptom: Browser shows the Kibana loading screen indefinitely.

Cause: Kibana depends on es-setup completing successfully. If es-setup failed, Kibana's depends_on: service_completed_successfully gate never opens.

Fix:

# Check es-setup logs first
docker logs dataobs-es-setup

# Then check Kibana logs
docker logs dataobs-kibana

Resolve any es-setup error first (see above), then restart:

docker compose restart kibana

Configuration Reference

config/dataobs.example.yaml

The main platform configuration. Copy to config/dataobs.yaml before starting.

| Section | Key fields | Description |
|---|---|---|
| elasticsearch | url, user, password, index_prefix | ES connection and index naming |
| otel | endpoint, service_name, environment | OTLP exporter settings |
| quality | checks, sla_thresholds, schedule | Quality check rules and SLA config |
| freshness | tables, max_age_hours | Per-table freshness SLA definitions |
| lineage | graph_index, impact_depth | Lineage graph settings |
| alerting.servicenow | instance, user, password, assignment_group | ServiceNow connection |
| alerting.pagerduty | routing_key, severity_map | PagerDuty routing |
| alerting.slack | webhook_url, channel | Slack notifications |
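
A quick sanity check before starting the stack can catch a half-edited config/dataobs.yaml. The sketch below validates an already-parsed configuration dictionary against a few of the sections listed above. It assumes each section is a flat mapping of the listed keys, which may not match the real file layout, and the `REQUIRED_FIELDS` subset and `missing_fields` helper are hypothetical.

```python
# Subset of the table above; extend as needed. The flat nesting is an assumption.
REQUIRED_FIELDS = {
    "elasticsearch": ["url", "user", "password", "index_prefix"],
    "otel": ["endpoint", "service_name", "environment"],
    "lineage": ["graph_index", "impact_depth"],
}


def missing_fields(config, required=REQUIRED_FIELDS):
    """Return dotted names such as 'otel.endpoint' for every absent key."""
    missing = []
    for section, keys in required.items():
        body = config.get(section) or {}   # a missing section counts as empty
        for key in keys:
            if key not in body:
                missing.append(f"{section}.{key}")
    return missing
```

An empty return value means every required key is present. The parsed dictionary could come from any YAML loader.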

Environment variables (.env)

| Variable | Required | Description |
|---|---|---|
| ELASTIC_PASSWORD | Yes | Elasticsearch elastic user password |
| KIBANA_PASSWORD | Yes | Kibana system user password |
| GRAFANA_CLOUD_OTLP_ENDPOINT | Alloy integration | Grafana Cloud OTLP gateway URL |
| GRAFANA_CLOUD_INSTANCE_ID | Alloy integration | Numeric Grafana Cloud stack ID |
| GRAFANA_CLOUD_API_KEY | Alloy integration | Service account token |
| OTEL_SERVICE_NAME | No | Override service name (default: dataobs-api) |
| DEPLOYMENT_ENVIRONMENT | No | development / staging / production |
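
One way a service could consume these variables is to load them once at startup and fail early if a required one is missing. The sketch below is an assumption about how that might look. The `Settings` dataclass and `load_settings` function are hypothetical, and the fallback to `development` for DEPLOYMENT_ENVIRONMENT is a guess, since the table does not state a default for it.

```python
import os
from dataclasses import dataclass

REQUIRED = ("ELASTIC_PASSWORD", "KIBANA_PASSWORD")


@dataclass(frozen=True)
class Settings:
    elastic_password: str
    kibana_password: str
    service_name: str
    environment: str


def load_settings(env=None):
    """Build Settings from environment variables, raising if required ones are unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return Settings(
        elastic_password=env["ELASTIC_PASSWORD"],
        kibana_password=env["KIBANA_PASSWORD"],
        service_name=env.get("OTEL_SERVICE_NAME", "dataobs-api"),
        # Assumed default; the table only lists the allowed values.
        environment=env.get("DEPLOYMENT_ENVIRONMENT", "development"),
    )
```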

Integrations

Grafana Alloy → Grafana Cloud

Full OTel signal pipeline: metrics, logs, and traces from a Python app → Grafana Alloy collector → Grafana Cloud (Tempo / Loki / Mimir).

Features:

  • Service map (automatic from span relationships)
  • Anomaly detection (Grafana Application Observability)
  • Trace → log → metric drilldown correlation
  • Auto-formatted Alloy River config (validated in CI with alloy fmt --test)
  • Load generator for immediate realistic signal volume

Port map (offset from root stack):

| Service | External port |
|---|---|
| Sample app | 5001 |
| Alloy gRPC | 4319 |
| Alloy HTTP | 4320 |
| Alloy UI | 12346 |

Location: integrations/grafana-alloy/

ServiceNow

Automatic incident creation from data quality failures. Supports priority mapping, assignment groups, and resolution callbacks.

Guide: docs/integrations/servicenow.md

PagerDuty

Events API v2 integration with severity mapping from DataObs pillar signals to PagerDuty urgency levels.
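
For orientation, an Events API v2 trigger event is a JSON document with a routing key, an event action, and a payload carrying summary, source, and severity. The sketch below builds such a document. The `SEVERITY_MAP` from DataObs levels to PagerDuty severities is a hypothetical example and is not the mapping shipped in src/alerting/pagerduty.py.

```python
# Hypothetical DataObs levels mapped to the severities PagerDuty accepts
# (critical, error, warning, info).
SEVERITY_MAP = {
    "critical": "critical",
    "high": "error",
    "medium": "warning",
    "low": "info",
}


def build_pagerduty_event(routing_key, summary, source, dataobs_severity):
    """Return the JSON body for an Events API v2 'trigger' event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            # Unknown levels fall back to "warning" rather than failing the alert.
            "severity": SEVERITY_MAP.get(dataobs_severity, "warning"),
        },
    }
```

The resulting dictionary would be serialised to JSON and sent with an HTTP POST by the alerting client.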

Slack

Webhook-based alerting with structured message blocks. Supports per-channel routing by pillar or severity.
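
A structured Slack message is a JSON body with a plain-text fallback plus a list of blocks. The sketch below shows one possible message shape and a simple per-pillar route lookup. The function names, the block layout, and the example webhook URLs are assumptions for illustration, not the format used by src/alerting/slack.py.

```python
def build_slack_message(pillar, severity, text):
    """Return a webhook payload with a fallback text and two Block Kit blocks."""
    return {
        # Shown in notifications and by clients that do not render blocks.
        "text": f"[{severity.upper()}] {pillar}: {text}",
        "blocks": [
            {"type": "header", "text": {"type": "plain_text", "text": f"{pillar} alert"}},
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"*Severity:* {severity}\n{text}"},
            },
        ],
    }


def pick_webhook(routes, pillar, default):
    """Choose the webhook URL for a pillar, falling back to a default channel."""
    return routes.get(pillar, default)
```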


Kubernetes & Helm

Raw manifests

kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/otel-collector.yaml
kubectl apply -f k8s/deployment-api.yaml
kubectl apply -f k8s/deployment-quality.yaml
kubectl apply -f k8s/service-api.yaml

Helm chart

# Install from the local chart directory (no helm repo add step is needed for a local path)
helm install dataobs ./helm/dataobs \
  --namespace dataobs --create-namespace \
  --set elasticsearch.url=http://elasticsearch:9200 \
  --set elasticsearch.password=<password> \
  --set otel.endpoint=http://alloy:4317

See helm/dataobs/README.md for all values.yaml options.

Grafana Alloy on Kubernetes

Replace the Docker Compose Alloy service with a Helm DaemonSet:

helm repo add grafana https://grafana.github.io/helm-charts && helm repo update

helm install alloy grafana/alloy \
  --namespace monitoring --create-namespace \
  --set-file alloy.configMap.content=integrations/grafana-alloy/alloy/config.alloy \
  --set "alloy.extraEnv[0].name=GRAFANA_CLOUD_OTLP_ENDPOINT" \
  --set "alloy.extraEnv[0].value=$GRAFANA_CLOUD_OTLP_ENDPOINT" \
  --set "alloy.extraEnv[1].name=GRAFANA_CLOUD_INSTANCE_ID" \
  --set "alloy.extraEnv[1].value=$GRAFANA_CLOUD_INSTANCE_ID" \
  --set "alloy.extraEnv[2].name=GRAFANA_CLOUD_API_KEY" \
  --set "alloy.extraEnv[2].value=$GRAFANA_CLOUD_API_KEY"

CI / GitHub Actions

The repo ships two workflows.

CI workflow (ci.yml)

Runs on every push and PR:

| Job | Trigger | What it does |
|---|---|---|
| validate-configs | All branches | alloy fmt --test, YAML lint, dashboard JSON lint, Python syntax check |
| test-python | After validate | pytest suite: 12 tests across quality checks, API, alerting, analytics |
| build-docker | After validate | Builds dataobs/api, dataobs/quality, dataobs/sample-python-app with GHA layer cache |
| security-scan | main only | Trivy CVE scan (JSON + table + SARIF) with auto-issue creation |

Release workflow (release.yml)

Triggered by a v*.*.* tag push (e.g. git tag v1.0.0 && git push --tags).

| Job | What it does |
|---|---|
| build-and-push (matrix × 3) | Builds and pushes dataobs-api, dataobs-quality, dataobs-sample-app to GHCR with semver tags + :latest |
| create-release | Generates a changelog from git log, creates a GitHub Release with pull instructions and Helm/K8s update notes |

Tags published per image:

ghcr.io/jagadeeshck/dataobs-api:1.2.3    # exact version
ghcr.io/jagadeeshck/dataobs-api:1.2      # minor alias
ghcr.io/jagadeeshck/dataobs-api:1        # major alias
ghcr.io/jagadeeshck/dataobs-api:latest   # always newest release

Images include SBOM attestations and SLSA build provenance.

Pre-release tags (e.g. v1.0.0-rc.1) are marked as pre-release automatically.
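
The tag scheme above can be derived mechanically from the pushed git tag. The sketch below illustrates that derivation. The `image_tags` helper is hypothetical, and publishing only the exact version for pre-release tags is an assumption, since the README does not say which image tags a pre-release receives.

```python
import re

# Matches tags such as v1.2.3 or v1.0.0-rc.1.
SEMVER_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z.-]+)?$")


def image_tags(git_tag, image="ghcr.io/jagadeeshck/dataobs-api"):
    """Return the image references to publish for a release tag."""
    match = SEMVER_TAG.match(git_tag)
    if not match:
        raise ValueError(f"Not a v*.*.* tag: {git_tag!r}")
    major, minor, patch, pre = match.groups()
    if pre:
        # Assumption: pre-releases do not move the minor, major, or latest aliases.
        return [f"{image}:{major}.{minor}.{patch}{pre}"]
    return [
        f"{image}:{major}.{minor}.{patch}",  # exact version
        f"{image}:{major}.{minor}",          # minor alias
        f"{image}:{major}",                  # major alias
        f"{image}:latest",                   # newest release
    ]
```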

To cut a release:

git tag v1.0.0 -m "Release v1.0.0"
git push origin v1.0.0

Security scan — auto-issue workflow

CRITICAL CVEs found?
  YES, open issue exists? → add comment with new CVE table + run link
  YES, no open issue?    → create issue: label security:critical-cve, assign actor
  NO CRITICALs           → auto-close open issue with resolution comment

The Security: Critical CVEs badge shows the live open-issue count.
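
The decision flow above reduces to a function of two inputs: how many CRITICAL CVEs the scan found and whether a tracking issue is already open. The sketch below encodes it. The function name and the returned action labels are hypothetical, and the "none" case (no CRITICALs and no open issue) is inferred, since the flow does not spell it out.

```python
def next_action(critical_count, open_issue_exists):
    """Map scan results to the issue action described in the flow above."""
    if critical_count > 0:
        # Existing issue: append a comment. Otherwise open a new labelled issue.
        return "comment" if open_issue_exists else "create"
    # No CRITICAL findings: close a lingering issue, or do nothing.
    return "close" if open_issue_exists else "none"
```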


Roadmap

In progress

  • Column distribution drift checks with dynamic thresholds
  • End-to-end integration tests for API + alert delivery paths
  • Elasticsearch-backed persistence for API (replace in-memory store)

Planned

  • Spark job instrumentation examples (PySpark + OTel SDK)
  • AWS Lambda OTel layer for serverless data pipelines
  • dbt integration — surface model run results as OTel spans
  • Monte Carlo-style automated anomaly detection on metric histograms
  • Multi-tenant config management with per-tenant Alloy overlays
  • Grafana dashboard provisioning via Terraform (Grafana provider)
  • GitHub Actions: auto-PR for Dependabot security fixes
  • OpenSearch migration path (config + index template equivalents)

Completed

  • 4-pillar observability model and core quality checks
  • Grafana Alloy → Grafana Cloud integration (metrics + logs + traces)
  • ServiceNow, PagerDuty, Slack alerting clients
  • Helm chart + raw K8s manifests
  • Terraform modules for AWS EC2 OTel agent rollout
  • CI pipeline: validate + test + Docker build + Trivy security scan
  • Auto-issue creation on CRITICAL CVEs with auto-close on resolution
  • README badges: CI status, Trivy scan, critical CVE count

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before opening a PR.

Quick contribution flow:

# 1. Fork and clone
git clone https://github.com/<your-username>/DataObs.git
cd DataObs

# 2. Create a feature branch
git checkout -b feat/your-feature-name

# 3. Install dependencies
pip install -r requirements.txt pytest

# 4. Make changes, run tests
python -m pytest tests/ -v

# 5. Open a PR against main

All PRs must pass the full CI pipeline (validate → test → build) before merge.


License

This project is licensed under the MIT License.

Configuration

See docs/production/configuration.md for mode-aware and production-safe settings.
