DataObs is a production-ready blueprint and implementation starter for end-to-end observability across infrastructure, data pipelines, data quality/freshness/lineage, and business impact — built on OpenTelemetry with dual back-end support for Elasticsearch/Kibana and Grafana Cloud.
- Overview
- Architecture
- Product Pillars
- Repository Layout
- Prerequisites
- Quick Start
- Troubleshooting
- Configuration Reference
- Integrations
- Kubernetes & Helm
- CI / GitHub Actions
- Roadmap
- Contributing
- License
DataObs solves the "dark pipeline" problem: large-scale data platforms emit enormous volumes of telemetry but most of it is never correlated, analysed, or acted on in time to prevent data quality incidents.
This repository provides:
- A 4-pillar observability model (Full-Stack, Pipeline, Data, Business) as both architectural guidance and runnable code
- OpenTelemetry-native instrumentation for Python services, Apache Spark jobs, AWS Lambda, and EC2-hosted runtimes
- A dual back-end strategy: Elasticsearch/Kibana for on-prem/hybrid, and Grafana Cloud (Tempo/Loki/Mimir) for SaaS
- Alerting integrations for ServiceNow, PagerDuty, and Slack out of the box
- A Helm chart and raw Kubernetes manifests for production deployment
- Terraform modules for AWS infrastructure provisioning
┌─────────────────────────────────────────────────────┐
│ Signal Sources │
│ Python Apps · Spark Jobs · Lambda · EC2 · K8s pods │
│ (OTel SDK — semconv resource attributes required) │
└───────────────────────┬─────────────────────────────┘
│ OTLP (gRPC / HTTP)
┌───────────────────────▼─────────────────────────────┐
│ OTel Collector / Grafana Alloy │
│ Receive → Semconv enrich → Normalize → Batch → Route│
│ spanmetrics connector (RED metrics from traces) │
│ servicegraph connector (topology map) │
└──┬──────────┬───────────────┬──────────┬────────────┘
│ │ │ │
┌────────────────▼──┐ ┌────▼────────┐ ┌──▼──────┐ ┌──▼──────────────┐
│ Elasticsearch 9.x │ │ OpenSearch │ │ AMP │ │ Grafana Cloud │
│ AIOps + ML │ │ self / AWS │ │(metrics)│ │ OTLP gateway │
│ ECS mapping │ │ via OSIS or │ │ │ │ Tempo/Loki/Mimir│
│ Kibana dashboards │ │ Data Prepper│ └────┬────┘ └──────────────────┘
└──────────┬────────┘ └──────┬──────┘ │
│ │ ┌─────▼──────────────────────────┐
│ │ │ Amazon Managed Grafana │
│ └────────►│ Data sources: AMP + OSIS │
│ │ Service map · Anomaly detect │
│ │ Drilldown correlations │
│ └────────────────────────────────┘
│
┌──────────▼──────────────────────────────────────────────────────┐
│ DataObs Platform │
│ Quality Engine · Freshness SLA · Lineage · Rules API │
└──────────────────────────────────┬──────────────────────────────┘
│
┌──────────────────────────────────▼──────────────────────────────┐
│ Alerting & ITSM │
│ ServiceNow · PagerDuty · Slack │
└──────────────────────────────────────────────────────────────────┘
Set DATAOBS_BACKEND to activate the desired sink(s):
| Value | Metrics | Traces | Logs | Use case |
|---|---|---|---|---|
| elasticsearch | ES dataobs-metrics | ES dataobs-traces | ES dataobs-logs | Default (local / on-prem AIOps) |
| opensearch | OpenSearch via Data Prepper | OpenSearch trace-analytics | OpenSearch logs | Self-managed OpenSearch |
| opensearch_aws | OSIS → OpenSearch domain | OSIS → OpenSearch domain | OSIS → OpenSearch domain | Amazon OpenSearch Service |
| aws_grafana | AMP → Amazon Managed Grafana | OSIS → OpenSearch → AMG | OSIS → OpenSearch → AMG | Full AWS managed stack |
| grafana_cloud | Grafana Cloud OTLP | Grafana Cloud OTLP | Grafana Cloud OTLP | Grafana Cloud SaaS |
| all | All of the above | All of the above | All of the above | Fan-out / migration |
See docs/aws-grafana-setup.md for the AWS provisioning walkthrough.
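The backend selection in the table can be pictured as a small lookup. The following is a minimal sketch, not the project's actual code: `resolve_backends` is a hypothetical helper, and the case-insensitive matching and the fallback to `elasticsearch` when the variable is unset are assumptions based on the "Default" row.

```python
import os

# Backend names copied from the DATAOBS_BACKEND table; "all" fans out to each one.
BACKENDS = ("elasticsearch", "opensearch", "opensearch_aws", "aws_grafana", "grafana_cloud")


def resolve_backends(value=None):
    """Return the sinks that a DATAOBS_BACKEND value activates (hypothetical helper)."""
    if value is None:
        # Assumption: an unset variable means the documented default backend.
        value = os.environ.get("DATAOBS_BACKEND", "elasticsearch")
    value = value.strip().lower()
    if value == "all":
        return list(BACKENDS)
    if value in BACKENDS:
        return [value]
    raise ValueError(f"unknown DATAOBS_BACKEND: {value!r}")
```

Rejecting unknown values early makes a typo in the variable surface at startup rather than as silently missing telemetry.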
Signal flow for the Grafana Alloy integration:
Python Flask App
└─► OTLP gRPC ──► Grafana Alloy pipeline
├─ Resource detection (host.name, os.type)
├─ Attribute enrichment / cleanup
├─ Batch processor (512 records / 5 s)
└─► Grafana Cloud OTLP gateway
├─► Tempo (traces + service map)
├─► Loki (logs + trace correlation)
└─► Mimir (metrics + exemplars)
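To illustrate the "512 records / 5 s" batching step in the flow above, here is a simplified size-or-age batcher. It is a sketch of the idea only, not Alloy's implementation: the `Batcher` class is hypothetical, and unlike a real batch processor it checks the age limit only when a new record arrives rather than on a background timer.

```python
import time


class Batcher:
    """Buffer records and release them when a size or age limit is reached."""

    def __init__(self, max_records=512, max_wait_s=5.0, clock=time.monotonic):
        self.max_records = max_records
        self.max_wait_s = max_wait_s
        self.clock = clock  # injectable so the age limit can be tested
        self._buffer = []
        self._started = None  # time the first record of the current batch arrived

    def add(self, record):
        """Add a record; return a full batch when a limit is hit, else None."""
        now = self.clock()
        if not self._buffer:
            self._started = now
        self._buffer.append(record)
        too_big = len(self._buffer) >= self.max_records
        too_old = now - self._started >= self.max_wait_s
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Return the buffered records and start a new empty batch."""
        batch, self._buffer = self._buffer, []
        self._started = None
        return batch
```

Batching trades a few seconds of latency for fewer, larger export requests to the OTLP gateway.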
DataObs follows a 4-pillar model inspired by Datadog, Dynatrace, Monte Carlo, and Collibra:
| Pillar | What it covers | Key signals |
|---|---|---|
| Full-Stack Observability | Infrastructure, hosts, containers, K8s | CPU, memory, network, pod health |
| Pipeline Observability | Spark jobs, Kafka, Lambda, Glue | Records in/out, lag, duration, errors |
| Data Observability | Freshness, quality, schema, lineage | SLA breach, null rate, row count drift |
| Business Observability | KPIs, SLAs, revenue impact | Order value, conversion, anomaly alerts |
See the detailed model: docs/architecture/four-tower-model.md
DataObs separates production modules, proof-of-concept helpers, deployment assets, integrations, and documentation so new features have an obvious home. See the maintainer-focused repository structure and feature-development guide for extension points, quality-check registration, and issue-readiness notes.
DataObs/
├── .github/
│ └── workflows/
│ ├── ci.yml # CI: validate + test + build + Trivy scan
│ └── release.yml # Release: build & push versioned images to GHCR
│
├── src/
│ ├── api/
│ │ └── main.py # REST API — rules and lineage management
│ ├── core/
│ │ ├── pillars.py # 4-pillar model with maturity scoring
│ │ └── enterprise_blueprint.py # Enterprise deployment blueprints
│ ├── quality/
│ │ ├── checks/ # Pluggable data quality checks
│ │ │ ├── base.py
│ │ │ ├── null_check.py
│ │ │ ├── row_count_check.py
│ │ │ ├── schema_check.py
│ │ │ ├── uniqueness_check.py
│ │ │ └── value_range_check.py
│ │ ├── freshness.py # Freshness SLA enforcement
│ │ └── lineage.py # Data lineage graph + impact analysis
│ ├── alerting/
│ │ ├── servicenow.py # ServiceNow incident client
│ │ ├── pagerduty.py # PagerDuty Events API client
│ │ └── slack.py # Slack webhook alerting
│ └── analytics/
│ └── elasticsearch_ml.py # ES native ML job helpers
│
├── config/
│ ├── dataobs.example.yaml # Main platform configuration blueprint
│ ├── otel-collector-config.yaml # OTel Collector pipelines and exporters
│ ├── otel-ec2-agent-config.yaml # ADOT EC2 agent config
│ └── elasticsearch/
│ ├── server/ # ILM, index templates, ingest pipelines
│ └── tenant/ # Per-tenant OTel overlay configs
│
├── integrations/
│ └── grafana-alloy/ # Grafana Alloy → Grafana Cloud stack
│ ├── app/ # Sample Python app (OTel SDK instrumented)
│ ├── alloy/config.alloy # Grafana Alloy River pipeline config
│ ├── grafana/provisioning/ # Datasources + dashboard JSON
│ ├── docs/GUIDE.md # 10-step setup guide
│ └── docker-compose.yml
│
├── k8s/ # Raw Kubernetes manifests
├── helm/dataobs/ # Helm chart for production K8s deployment
│
├── infra/terraform/
│ ├── aws-ec2-otel-agent/ # EC2 OTel agent rollout via SSM
│ └── elasticsearch/ # ES server and tenant Terraform modules
│
├── docs/
│ ├── architecture/ # System diagrams and design decisions
│ ├── production/ # Production runbooks
│ ├── integrations/ # Integration guides (ServiceNow, etc.)
│ ├── towers/ # Per-pillar deep-dives
│ └── wiki/ # Operations wiki
│
├── tests/ # pytest test suite
├── docker-compose.yml # Local full stack (ES + Kibana + OTel + DataObs)
├── Dockerfile.api
├── Dockerfile.quality
├── requirements.txt
└── README.md
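The layout lists pluggable checks under src/quality/checks/ with a shared base.py. The sketch below shows one way such a plugin shape could look; the class names, the `run` signature, and the `CheckResult` type are illustrative assumptions, not the repository's actual interfaces.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


class BaseCheck(ABC):
    """Hypothetical base class: every check takes rows and returns a CheckResult."""

    name = "base"

    @abstractmethod
    def run(self, rows):
        ...


class NullRateCheck(BaseCheck):
    """Fail when the share of None values in a column exceeds a threshold."""

    name = "null_rate"

    def __init__(self, column, max_null_rate=0.0):
        self.column = column
        self.max_null_rate = max_null_rate

    def run(self, rows):
        rows = list(rows)
        if not rows:
            return CheckResult(self.name, True, "no rows")
        nulls = sum(1 for row in rows if row.get(self.column) is None)
        rate = nulls / len(rows)
        detail = f"{self.column}: null rate {rate:.2%}"
        return CheckResult(self.name, rate <= self.max_null_rate, detail)
```

A new check would subclass `BaseCheck` in its own module, which keeps each rule independently testable.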
| Dependency | Version | Purpose |
|---|---|---|
| Python | 3.11+ | Runtime |
| Docker | 24+ | Local stack |
| Docker Compose | v2.20+ | Orchestration |
| Kubernetes | 1.27+ | Production deployment |
| Helm | 3.12+ | K8s package management |
| Terraform | 1.5+ | AWS infrastructure |
| Grafana Alloy | v1.8+ | OTel collector (Grafana Cloud integration) |
For the Grafana Cloud integration only:
- A Grafana Cloud account (free tier works)
- A service account token with metrics:write, logs:write, and traces:write scopes
git clone https://github.com/Jagadeeshck/DataObs.git
cd DataObs
# Copy and populate configuration
cp config/dataobs.example.yaml config/dataobs.yaml
cp .env.example .env
# Edit .env: set ELASTIC_PASSWORD, KIBANA_PASSWORD, etc.
# Start full stack
docker compose up -d
# Verify
curl http://localhost:8080/health # DataObs API
open http://localhost:5601 # Kibana
To run the Grafana Alloy → Grafana Cloud integration:
cd integrations/grafana-alloy
cp .env.example .env
# Edit .env: GRAFANA_CLOUD_OTLP_ENDPOINT, GRAFANA_CLOUD_INSTANCE_ID, GRAFANA_CLOUD_API_KEY
docker compose --env-file .env up --build
The load generator fires traffic immediately. Traces, logs, and metrics appear in Grafana Cloud within ~90 seconds.
# Verify Alloy pipeline health
open http://localhost:12346 # Alloy UI — component graph
# Trigger a 4-span checkout trace
curl -X POST http://localhost:5001/checkout \
-H "Content-Type: application/json" \
-d '{"product_id":"P001","quantity":2}'
See integrations/grafana-alloy/docs/GUIDE.md for the complete 10-step walkthrough.
❌ Elasticsearch fails to start — "cannot upgrade a node from version [8.x] directly to version [9.x]"
Symptom
fatal exception while booting Elasticsearch
error.message: cannot upgrade a node from version [8.13.0] directly to version [9.x],
upgrade to version [8.19.0] first.
Root cause
Elasticsearch stores its originating version in node metadata inside the esdata Docker volume. When you previously ran the stack with an older image (e.g. 8.13.0) and later pulled the current 9.x POC image, the new process reads the stale metadata and hard-blocks the start because Elastic enforces a mandatory stepping-stone upgrade path: you cannot skip directly from 8.x to 9.x — you must first pass through the last minor release of 8.x (8.19.0). Since this is a local POC with no production data, the simplest fix is to delete the stale volume.
Fix — delete the stale esdata volume
# 1. Tear down all running containers
docker compose down
# 2. Confirm the exact volume name (usually prefixed with your folder name)
docker volume ls | grep esdata
# 3. Remove the stale volume
docker volume rm dataobs_esdata
# 4. Ensure .env is populated
cp -n .env.example .env # -n: do not overwrite an existing .env
# Edit .env — set ELASTIC_PASSWORD and KIBANA_PASSWORD
# 5. Re-start the stack from scratch
docker compose up -d
Verify the fix
# Watch ES boot — should report version 9.x with no errors
docker logs dataobs-es01 -f
# Confirm the running version
curl -s -u elastic:<your-password> http://localhost:9200 | jq .version.number
# Expected output: "9.4.2"
Kibana will be available at http://localhost:5601 once the es-setup init container completes its one-shot password bootstrap and the service_completed_successfully health gate opens for the Kibana service.
Note: This error can also appear if you restore an old esdata volume backup from a previous major version. The same fix applies: either delete the volume (POC) or perform the intermediate 8.19.0 upgrade step first (production).
Symptom: dataobs-es-setup exits with error Failed to set kibana_system password (HTTP 401).
Cause: ELASTIC_PASSWORD in .env does not match the password that was used when the esdata volume was first initialised.
Fix: Either update .env to match the original password, or delete the esdata volume (see above) and restart with a fresh password.
Symptom: Browser shows the Kibana loading screen indefinitely.
Cause: Kibana depends on es-setup completing successfully. If es-setup failed, Kibana's depends_on: service_completed_successfully gate never opens.
Fix:
# Check es-setup logs first
docker logs dataobs-es-setup
# Then check Kibana logs
docker logs dataobs-kibana
Resolve any es-setup error first (see above), then restart:
docker compose restart kibana
config/dataobs.example.yaml is the main platform configuration. Copy it to config/dataobs.yaml before starting.
| Section | Key fields | Description |
|---|---|---|
| elasticsearch | url, user, password, index_prefix | ES connection and index naming |
| otel | endpoint, service_name, environment | OTLP exporter settings |
| quality | checks, sla_thresholds, schedule | Quality check rules and SLA config |
| freshness | tables, max_age_hours | Per-table freshness SLA definitions |
| lineage | graph_index, impact_depth | Lineage graph settings |
| alerting.servicenow | instance, user, password, assignment_group | ServiceNow connection |
| alerting.pagerduty | routing_key, severity_map | PagerDuty routing |
| alerting.slack | webhook_url, channel | Slack notifications |
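As an illustration of what a per-table freshness SLA built on max_age_hours might evaluate, here is a minimal sketch. The function name, the dictionary inputs, and the choice to treat a table with no recorded load as a breach are assumptions, not behaviour confirmed by the repository.

```python
from datetime import datetime, timedelta, timezone


def freshness_breaches(last_updated, max_age_hours, now=None):
    """Return the names of tables whose newest data is older than their SLA.

    last_updated:  {table: timezone-aware datetime of the latest load}
    max_age_hours: {table: allowed age in hours}
    """
    now = now or datetime.now(timezone.utc)
    breaches = []
    for table, limit in max_age_hours.items():
        seen = last_updated.get(table)
        # Assumption: a table with no recorded load counts as a breach.
        if seen is None or now - seen > timedelta(hours=limit):
            breaches.append(table)
    return sorted(breaches)
```

Passing `now` explicitly keeps the check deterministic in tests and lets a scheduler evaluate all tables against a single timestamp.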
| Variable | Required | Description |
|---|---|---|
| ELASTIC_PASSWORD | Yes | Elasticsearch elastic user password |
| KIBANA_PASSWORD | Yes | Kibana system user password |
| GRAFANA_CLOUD_OTLP_ENDPOINT | Alloy integration | Grafana Cloud OTLP gateway URL |
| GRAFANA_CLOUD_INSTANCE_ID | Alloy integration | Numeric Grafana Cloud stack ID |
| GRAFANA_CLOUD_API_KEY | Alloy integration | Service account token |
| OTEL_SERVICE_NAME | No | Override service name (default: dataobs-api) |
| DEPLOYMENT_ENVIRONMENT | No | development / staging / production |
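A service could read these variables along the following lines. This is a hedged sketch: `load_settings` is a hypothetical helper, and the `development` fallback for DEPLOYMENT_ENVIRONMENT is an assumption, since the table lists allowed values but no default.

```python
import os

REQUIRED = ("ELASTIC_PASSWORD", "KIBANA_PASSWORD")


def load_settings(env=None):
    """Read the variables from the table, failing fast on missing required ones."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return {
        "elastic_password": env["ELASTIC_PASSWORD"],
        "kibana_password": env["KIBANA_PASSWORD"],
        # Documented default from the table above.
        "service_name": env.get("OTEL_SERVICE_NAME", "dataobs-api"),
        # Assumed default; the table only lists the allowed values.
        "environment": env.get("DEPLOYMENT_ENVIRONMENT", "development"),
    }
```

Accepting a plain mapping instead of always reading `os.environ` makes the loader easy to exercise without touching the real environment.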
Full OTel signal pipeline: metrics, logs, and traces from a Python app → Grafana Alloy collector → Grafana Cloud (Tempo / Loki / Mimir).
Features:
- Service map (automatic from span relationships)
- Anomaly detection (Grafana Application Observability)
- Trace → log → metric drilldown correlation
- Auto-formatted Alloy River config (validated in CI with alloy fmt --test)
- Load generator for immediate realistic signal volume
Port map (offset from root stack):
| Service | External port |
|---|---|
| Sample app | 5001 |
| Alloy gRPC | 4319 |
| Alloy HTTP | 4320 |
| Alloy UI | 12346 |
Location: integrations/grafana-alloy/
ServiceNow: automatic incident creation from data quality failures. Supports priority mapping, assignment groups, and resolution callbacks.
Guide: docs/integrations/servicenow.md
PagerDuty: Events API v2 integration with severity mapping from DataObs pillar signals to PagerDuty urgency levels.
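The sketch below shows how a trigger event for the PagerDuty Events API v2 could be assembled. The payload fields (routing_key, event_action, payload.summary, payload.source, payload.severity) and the four severity values follow the public Events API v2 format; the signal names in `SEVERITY_MAP` and the `build_event` helper are hypothetical and stand in for whatever severity_map the configuration defines.

```python
PD_SEVERITIES = {"critical", "error", "warning", "info"}

# Hypothetical mapping from DataObs signal types to Events API v2 severities.
SEVERITY_MAP = {
    "sla_breach": "critical",
    "quality_failure": "error",
    "row_count_drift": "warning",
}


def build_event(routing_key, signal, summary, source, severity_map=SEVERITY_MAP):
    """Build the JSON body for a 'trigger' event; unmapped signals become 'info'."""
    severity = severity_map.get(signal, "info")
    if severity not in PD_SEVERITIES:
        raise ValueError(f"invalid PagerDuty severity: {severity!r}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            "custom_details": {"signal": signal},
        },
    }
```

The resulting dictionary would be sent as JSON in a POST to the Events API v2 enqueue endpoint; the network call is left out so the sketch stays side-effect free.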
Slack: webhook-based alerting with structured message blocks. Supports per-channel routing by pillar or severity.
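A structured Slack message and per-pillar routing might look like the following sketch. The block types (header, section, mrkdwn fields) are standard Slack Block Kit; the helper names and the idea of routing by a pillar-to-webhook dictionary are assumptions for illustration.

```python
def build_slack_message(pillar, severity, title, detail):
    """Build an incoming-webhook body with a plain-text fallback and Block Kit blocks."""
    return {
        "text": f"[{severity.upper()}] {title}",  # fallback for notifications
        "blocks": [
            {"type": "header", "text": {"type": "plain_text", "text": title}},
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Pillar:*\n{pillar}"},
                    {"type": "mrkdwn", "text": f"*Severity:*\n{severity}"},
                ],
            },
            {"type": "section", "text": {"type": "mrkdwn", "text": detail}},
        ],
    }


def pick_webhook(pillar, routes, default_url):
    """Choose a webhook URL by pillar, falling back to a default (hypothetical routing)."""
    return routes.get(pillar, default_url)
```

Because an incoming webhook is bound to one channel, routing by pillar in this sketch means holding one webhook URL per destination channel.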
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/otel-collector.yaml
kubectl apply -f k8s/deployment-api.yaml
kubectl apply -f k8s/deployment-quality.yaml
kubectl apply -f k8s/service-api.yaml
# Helm: the chart is installed from the local path below, so no helm repo add step is needed
helm install dataobs ./helm/dataobs \
--namespace dataobs --create-namespace \
--set elasticsearch.url=http://elasticsearch:9200 \
--set elasticsearch.password=<password> \
--set otel.endpoint=http://alloy:4317
See helm/dataobs/README.md for all values.yaml options.
Replace the Docker Compose Alloy service with a Helm DaemonSet:
helm repo add grafana https://grafana.github.io/helm-charts && helm repo update
helm install alloy grafana/alloy \
--namespace monitoring --create-namespace \
--set-file alloy.configMap.content=integrations/grafana-alloy/alloy/config.alloy \
--set env[0].name=GRAFANA_CLOUD_OTLP_ENDPOINT \
--set env[0].value="$GRAFANA_CLOUD_OTLP_ENDPOINT" \
--set env[1].name=GRAFANA_CLOUD_INSTANCE_ID \
--set env[1].value="$GRAFANA_CLOUD_INSTANCE_ID" \
--set env[2].name=GRAFANA_CLOUD_API_KEY \
--set env[2].value="$GRAFANA_CLOUD_API_KEY"
The repo ships two workflows.
ci.yml runs on every push and PR:
| Job | Trigger | What it does |
|---|---|---|
| validate-configs | All branches | alloy fmt --test, YAML lint, dashboard JSON lint, Python syntax check |
| test-python | After validate | pytest suite: 12 tests across quality checks, API, alerting, analytics |
| build-docker | After validate | Builds dataobs/api, dataobs/quality, dataobs/sample-python-app with GHA layer cache |
| security-scan | main only | Trivy CVE scan (JSON + table + SARIF) with auto-issue creation |
release.yml is triggered by a v*.*.* tag push (e.g. git tag v1.0.0 && git push --tags).
| Job | What it does |
|---|---|
| build-and-push (matrix × 3) | Builds and pushes dataobs-api, dataobs-quality, dataobs-sample-app to GHCR with semver tags + :latest |
| create-release | Generates a changelog from git log, creates a GitHub Release with pull instructions and Helm/K8s update notes |
Tags published per image:
ghcr.io/jagadeeshck/dataobs-api:1.2.3 # exact version
ghcr.io/jagadeeshck/dataobs-api:1.2 # minor alias
ghcr.io/jagadeeshck/dataobs-api:1 # major alias
ghcr.io/jagadeeshck/dataobs-api:latest # always newest release
Images include SBOM attestations and SLSA build provenance.
Pre-release tags (e.g. v1.0.0-rc.1) are marked as pre-release automatically.
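The tag aliases listed above can be derived from the git tag mechanically. The sketch below is an illustration rather than the workflow's actual logic: `image_tags` is a hypothetical helper, and publishing only the exact version for pre-release tags (no minor, major, or latest aliases) is an assumption.

```python
import re

# vMAJOR.MINOR.PATCH with an optional pre-release suffix such as -rc.1
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z.-]+)?$")


def image_tags(git_tag, image="ghcr.io/jagadeeshck/dataobs-api"):
    """Return the image references to publish for a release tag."""
    match = TAG_RE.match(git_tag)
    if not match:
        raise ValueError(f"not a v*.*.* tag: {git_tag!r}")
    major, minor, patch, pre = match.groups()
    version = f"{major}.{minor}.{patch}{pre or ''}"
    if pre:
        # Assumption: pre-releases do not move the floating aliases.
        return [f"{image}:{version}"]
    return [
        f"{image}:{version}",
        f"{image}:{major}.{minor}",
        f"{image}:{major}",
        f"{image}:latest",
    ]
```

For example, `image_tags("v1.2.3")` yields the four references shown in the list above.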
To cut a release:
git tag v1.0.0 -m "Release v1.0.0"
git push origin v1.0.0
Auto-issue logic for the Trivy security scan:
CRITICAL CVEs found?
YES, open issue exists? → add comment with new CVE table + run link
YES, no open issue? → create issue: label security:critical-cve, assign actor
NO CRITICALs → auto-close open issue with resolution comment
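The decision flow above reduces to a small function of two inputs. This is a sketch of the logic only; the action names are illustrative labels, and the "none" case (no CRITICAL findings and no open issue) is inferred, since the flow does not spell it out.

```python
def cve_issue_action(critical_count, open_issue_exists):
    """Map scan results to the issue action described in the flow above."""
    if critical_count > 0:
        # Findings present: update the tracking issue or open a new one.
        return "comment" if open_issue_exists else "create"
    # No findings: close a stale tracking issue if one is open (else nothing to do).
    return "close" if open_issue_exists else "none"
```

Keeping a single tracking issue and commenting on it avoids opening a new issue on every scan that still finds CRITICAL CVEs.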
The badge shows the live open-issue count.
Planned:
- Column distribution drift checks with dynamic thresholds
- End-to-end integration tests for API + alert delivery paths
- Elasticsearch-backed persistence for API (replace in-memory store)
- Spark job instrumentation examples (PySpark + OTel SDK)
- AWS Lambda OTel layer for serverless data pipelines
- dbt integration — surface model run results as OTel spans
- Monte Carlo-style automated anomaly detection on metric histograms
- Multi-tenant config management with per-tenant Alloy overlays
- Grafana dashboard provisioning via Terraform (Grafana provider)
- GitHub Actions: auto-PR for Dependabot security fixes
- OpenSearch migration path (config + index template equivalents)
Completed:
- 4-pillar observability model and core quality checks
- Grafana Alloy → Grafana Cloud integration (metrics + logs + traces)
- ServiceNow, PagerDuty, Slack alerting clients
- Helm chart + raw K8s manifests
- Terraform modules for AWS EC2 OTel agent rollout
- CI pipeline: validate + test + Docker build + Trivy security scan
- Auto-issue creation on CRITICAL CVEs with auto-close on resolution
- README badges: CI status, Trivy scan, critical CVE count
Contributions are welcome. Please read CONTRIBUTING.md before opening a PR.
Quick contribution flow:
# 1. Fork and clone
git clone https://github.com/<your-username>/DataObs.git
cd DataObs
# 2. Create a feature branch
git checkout -b feat/your-feature-name
# 3. Install dependencies
pip install -r requirements.txt pytest
# 4. Make changes, run tests
python -m pytest tests/ -v
# 5. Open a PR against main
All PRs must pass the full CI pipeline (validate → test → build) before merge.
This project is licensed under the MIT License.
See docs/production/configuration.md for mode-aware and production-safe settings.