A production-grade data pipeline for processing and analyzing user clickstream data at scale.
graph TD
A[Kafka Producer] --> B[Kafka]
B --> C[Spark Streaming]
C --> D[Delta Lake]
C --> E[Hudi]
D --> F[Trino]
E --> F
F --> G[dbt]
G --> H[Redshift]
- Apache Kafka: Real-time event ingestion
- Apache Spark Structured Streaming: Stream processing
- Delta Lake & Apache Hudi: Dual-format data lake storage
- Hive Metastore + Trino: Table catalog (Hive Metastore) and SQL query engine (Trino)
- Airflow: Pipeline orchestration
- dbt: Data transformation, data quality validation, and schema checks
- Redshift: Downstream BI integration
- Terraform: AWS infrastructure provisioning
.
├── terraform/ # Infrastructure as Code
│ ├── modules/ # Reusable Terraform modules
│ └── environments/ # Environment-specific configurations
├── src/ # Source code
│ ├── producer/ # Kafka producer
│ ├── streaming/ # Spark streaming jobs
│ └── batch/ # Batch processing jobs
├── dbt/ # Data transformation
│ ├── models/ # dbt models
│ ├── tests/ # Data quality tests
│ └── analyses/ # Ad-hoc analyses
├── airflow/ # Pipeline orchestration
│ └── dags/ # Airflow DAGs
└── config/ # Configuration files
- AWS Account with appropriate permissions
- Terraform v1.0+
- Python 3.8+
- Apache Spark 3.3+
- Apache Kafka 2.8+
- dbt 1.0+
- Airflow 2.0+
- Initialize Terraform:
cd terraform/environments/dev
terraform init
- Apply infrastructure:
terraform apply
- Start Kafka:
./scripts/start_kafka.sh
- Start the producer:
python src/producer/producer.py
- Start Spark streaming:
spark-submit src/streaming/streaming_job.py
- Run batch processing:
spark-submit src/batch/batch_job.py
- Run dbt tests:
cd dbt
dbt test
- Ingestion: Kafka producer generates synthetic clickstream events
- Processing: Spark Structured Streaming job processes events in real time
- Storage: Data is stored in both Delta Lake and Hudi formats
- Transformation: dbt models transform and validate the data
- Analytics: Trino provides SQL access for analytics
- BI: Aggregated data is synced to Redshift for BI tools
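
The ingestion step can be illustrated with a minimal sketch of a synthetic event generator. The field names, event types, and page paths below are assumptions for illustration; the actual schema used by `src/producer/producer.py` may differ. The sketch uses only the standard library and stops at producing the serialized message bytes.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical event vocabulary; the real producer's schema may differ.
EVENT_TYPES = ["page_view", "click", "add_to_cart", "purchase"]
PAGES = ["/", "/search", "/product", "/cart", "/checkout"]


def make_event(rng: random.Random) -> dict:
    """Build one synthetic clickstream event as a plain dict."""
    return {
        # Derive the UUID from the seeded RNG so IDs are reproducible.
        "event_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "user_id": f"user-{rng.randint(1, 1000)}",
        "event_type": rng.choice(EVENT_TYPES),
        "page": rng.choice(PAGES),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }


def serialize(event: dict) -> bytes:
    """Encode an event as UTF-8 JSON, a common format for Kafka message values."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the generated IDs and choices repeat
    for _ in range(3):
        print(serialize(make_event(rng)).decode("utf-8"))
```

With a Kafka client such as kafka-python, the bytes returned by `serialize` could be passed as the message value, for example `KafkaProducer.send(topic, value=...)`; the topic name is left open here because the source does not specify one.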
- Schema validation
- Data completeness checks
- Duplicate detection
- Anomaly detection
- Freshness monitoring
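
In this project the checks above are run as dbt tests. As a rough illustration of the logic behind four of them (schema validation, completeness, duplicate detection, and freshness), here is a plain-Python sketch. The required fields and the helper names are hypothetical, and anomaly detection is not covered.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {"event_id": str, "user_id": str, "event_type": str, "event_time": str}


def schema_errors(record: dict) -> list:
    """Return schema problems for one record: missing fields or wrong types."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors


def completeness(records: list, field: str) -> float:
    """Fraction of records where `field` is present and neither None nor empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_ids(records: list, key: str = "event_id") -> set:
    """Return the key values that occur in more than one record."""
    seen, dupes = set(), set()
    for r in records:
        value = r.get(key)
        if value is None:
            continue  # a missing key is a schema issue, not a duplicate
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def is_fresh(latest_event_time: str, now: datetime, max_age: timedelta) -> bool:
    """True if the newest event is no older than `max_age`.

    Expects an ISO 8601 timestamp with an explicit offset such as +00:00,
    and a timezone-aware `now`.
    """
    return now - datetime.fromisoformat(latest_event_time) <= max_age
```

Each function returns a value rather than raising, so a caller can decide whether a failed check should block the pipeline or only raise an alert.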
- Kafka: Optimize partition count based on throughput
- Kafka: Configure appropriate retention policies
- Kafka: Monitor consumer lag
- Spark: Tune executor memory and cores
- Spark: Optimize shuffle partitions
- Spark: Configure appropriate batch intervals
- Delta Lake: Run regular compaction
- Delta Lake: Use Z-ordering for query performance
- Delta Lake: Vacuum old files
- Hudi: Configure appropriate compaction strategy
- Hudi: Optimize upsert performance
- Hudi: Manage file sizes
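
Two of the tuning items above, sizing the Kafka partition count from throughput and monitoring consumer lag, come down to simple arithmetic. The sketch below uses a common rule of thumb for partition count, max(T/P, T/C), and the usual definition of lag as log end offset minus committed offset. The function names and the offset dictionaries are hypothetical; in practice the throughput figures would come from benchmarks and the offsets from a Kafka client or monitoring tool.

```python
import math


def suggested_partitions(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """Estimate a partition count from throughput measurements.

    Rule of thumb: max(T/P, T/C), where T is the target throughput and
    P and C are the measured per-partition producer and consumer throughput.
    """
    if min(target_mb_s, producer_mb_s, consumer_mb_s) <= 0:
        raise ValueError("throughput values must be positive")
    return max(
        math.ceil(target_mb_s / producer_mb_s),
        math.ceil(target_mb_s / consumer_mb_s),
    )


def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus last committed offset.

    Assumption: a partition with no committed offset is treated as if the
    consumer were at offset 0.
    """
    return {p: end - committed_offsets.get(p, 0) for p, end in end_offsets.items()}


if __name__ == "__main__":
    # Target 100 MB/s; one partition sustains 10 MB/s produce, 20 MB/s consume.
    print(suggested_partitions(100, 10, 20))
    print(consumer_lag({0: 100, 1: 50}, {0: 90}))
```

The result is a starting point only: the estimate ignores key skew and broker limits, so it is worth validating against observed lag after deployment.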
- Airflow task status and logs
- Spark UI for job monitoring
- Kafka metrics
- dbt test results
- Data quality dashboards
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For support, please open an issue in the GitHub repository or contact the maintainers.