In this project, I designed and built an end-to-end real-time data pipeline that monitors earthquakes from the USGS Earthquake API. The pipeline extracts and loads data into a MySQL database, captures changes with the Debezium MySQL CDC connector, streams those changes into a Kafka topic, sinks them into PostgreSQL, and visualizes the results in Grafana Cloud. The workflow is automated with Apache Airflow to run hourly.
You can access the GitHub repository here
Project Workflow Chart
This pipeline ensures that earthquake data is extracted, streamed, stored, and visualized in near real-time, fully automated and orchestrated with Apache Airflow to run hourly.
Project Walkthrough
1. Earthquake Data Extraction from the USGS Earthquake API
The pipeline begins by calling the USGS Earthquake GeoJSON API, which provides up-to-date global earthquake data. The script:
Pulls earthquake events for the past 24 hours
Extracts key fields: magnitude, place, time, alert, coordinates, and more
Triggers an email alert to a Gmail address when the quake magnitude is ≥ 4.5
This extraction script is written in Python using requests for the API calls, smtplib for sending the alert emails via SMTP, and sqlalchemy with pymysql for writing to MySQL.
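To make the flow concrete, here is a minimal sketch of what that extraction step could look like. The staging table name, column names, credentials, and email addresses are illustrative assumptions, not the exact values used in the project:

```python
# Sketch of the extraction step (illustrative only: credentials, table and
# column names are assumptions, not the project's exact values).
import smtplib
from email.message import EmailMessage

import requests
from sqlalchemy import create_engine, text

# USGS summary feed covering the past 24 hours of earthquakes
USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"


def send_alert(quake):
    """Email an alert via Gmail's SMTP server for significant quakes."""
    msg = EmailMessage()
    msg["Subject"] = f"Earthquake alert: M{quake['magnitude']} near {quake['place']}"
    msg["From"] = "sender@gmail.com"       # assumed sender address
    msg["To"] = "recipient@gmail.com"      # assumed recipient address
    msg.set_content(str(quake))
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login("sender@gmail.com", "app-password")  # Gmail app password assumed
        smtp.send_message(msg)


def extract_and_load():
    features = requests.get(USGS_FEED, timeout=30).json()["features"]
    rows = []
    for f in features:
        props, (lon, lat, depth) = f["properties"], f["geometry"]["coordinates"]
        quake = {
            "event_id": f["id"], "magnitude": props["mag"], "place": props["place"],
            "time_ms": props["time"], "updated_ms": props["updated"],
            "alert": props["alert"], "longitude": lon, "latitude": lat,
            "depth": depth, "url": props["url"], "status": props["status"],
        }
        rows.append(quake)
        if quake["magnitude"] is not None and quake["magnitude"] >= 4.5:
            send_alert(quake)

    # Load into the MySQL staging table (connection string is an assumption).
    engine = create_engine("mysql+pymysql://user:password@localhost:3306/earthquakes")
    insert_stmt = text("""
        INSERT INTO earthquake_data
            (event_id, magnitude, place, time_ms, updated_ms, alert,
             longitude, latitude, depth, url, status)
        VALUES
            (:event_id, :magnitude, :place, :time_ms, :updated_ms, :alert,
             :longitude, :latitude, :depth, :url, :status)
        ON DUPLICATE KEY UPDATE
            magnitude = VALUES(magnitude), updated_ms = VALUES(updated_ms),
            alert = VALUES(alert), status = VALUES(status)
    """)
    with engine.begin() as conn:
        conn.execute(insert_stmt, rows)


if __name__ == "__main__":
    extract_and_load()
```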
2. Loading into MySQL Database
The extracted data is then cleaned and loaded into a MySQL database table called earthquake_data. This serves as the staging area where new and changed data is logged before flowing downstream.
The schema includes the following fields (an illustrative table definition is sketched after this list):
Event ID
Place and ___location coordinates
Magnitude and depth
Alert level
Timestamps (time, updated)
USGS metadata like URL, IDs, and status
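For reference, an illustrative DDL for this staging table might look like the following. Column names and types are assumptions inferred from the field list above, not the project's exact schema:

```python
# Illustrative DDL for the MySQL staging table (names/types are assumptions).
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost:3306/earthquakes")

ddl = text("""
    CREATE TABLE IF NOT EXISTS earthquake_data (
        event_id    VARCHAR(64) PRIMARY KEY,   -- USGS event ID
        place       VARCHAR(255),
        longitude   DOUBLE,
        latitude    DOUBLE,
        depth       DOUBLE,                    -- km
        magnitude   DOUBLE,
        alert       VARCHAR(16),               -- green / yellow / orange / red
        time_ms     BIGINT,                    -- event time (epoch ms)
        updated_ms  BIGINT,                    -- last update (epoch ms)
        url         VARCHAR(512),
        status      VARCHAR(32)                -- e.g. automatic / reviewed
    )
""")

with engine.begin() as conn:
    conn.execute(ddl)
```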
Below is a snapshot of the data in a MySQL database:
3. Change Data Capture with MySQL Debezium Connector
To stream changes in the database in real time, the Debezium MySQL CDC Connector on Confluent Cloud listens to the binlog (binary log) of the MySQL database. It captures:
New earthquake events
Updates (e.g., revised magnitudes or alert levels)
Deletions
These changes are published to a Kafka topic in JSON_SR (JSON Schema Registry) format. This matters because the downstream Postgres Sink connector only accepts JSON_SR input, not plain JSON.
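As a rough illustration, the managed connector is driven by a small set of key/value properties. The sketch below shows the general shape of that configuration as a Python dict; the exact property names depend on the connector version on Confluent Cloud, and all values here are placeholders:

```python
# Illustrative config for the fully managed Debezium MySQL CDC source connector
# on Confluent Cloud. Property names follow Confluent's managed-connector docs
# but vary by connector version -- treat everything here as a placeholder.
import json

mysql_cdc_source = {
    "name": "earthquake-mysql-cdc",
    "connector.class": "MySqlCdcSource",
    "kafka.auth.mode": "KAFKA_API_KEY",
    "kafka.api.key": "<KAFKA_API_KEY>",
    "kafka.api.secret": "<KAFKA_API_SECRET>",
    "database.hostname": "<mysql-host>",
    "database.port": "3306",
    "database.user": "<cdc-user>",
    "database.password": "<cdc-password>",
    "database.server.name": "earthquakes",            # used as the topic prefix
    "table.include.list": "earthquakes.earthquake_data",
    "output.data.format": "JSON_SR",                   # required by the Postgres sink
    "tasks.max": "1",
}

# Written to a file so it can be submitted through the Confluent Cloud UI or CLI.
with open("mysql_cdc_source.json", "w") as f:
    json.dump(mysql_cdc_source, f, indent=2)
```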
Below is a snapshot of the MySQL CDC connector.
4. Kafka Streams Earthquake Events to a PostgreSQL Sink
Next, the PostgreSQL Sink Connector subscribes to the same Kafka topic. It:
Reads the after field from the change event
Deserializes the JSON_SR format
Inserts or updates rows in a PostgreSQL table used for analytics
TIP: You must configure the connector to match table naming formats and use primary key fields to support upserts.
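A corresponding sketch for the sink side is shown below, again with placeholder values and property names that should be checked against the deployed connector version; topic and table names follow the assumptions in the earlier sketches:

```python
# Illustrative config for the managed Postgres Sink connector on Confluent
# Cloud. All values are placeholders; verify property names against the docs
# for the connector version you deploy.
import json

postgres_sink = {
    "name": "earthquake-postgres-sink",
    "connector.class": "PostgresSink",
    "kafka.auth.mode": "KAFKA_API_KEY",
    "kafka.api.key": "<KAFKA_API_KEY>",
    "kafka.api.secret": "<KAFKA_API_SECRET>",
    "topics": "earthquakes.earthquakes.earthquake_data",  # assumed CDC topic name
    "input.data.format": "JSON_SR",          # must match the CDC connector's output
    "connection.host": "<postgres-host>",
    "connection.port": "5432",
    "connection.user": "<postgres-user>",
    "connection.password": "<postgres-password>",
    "db.name": "earthquakes",
    "table.name.format": "earthquake_data",  # keep naming consistent (see TIP above)
    "insert.mode": "UPSERT",                 # update existing rows on key match
    "pk.mode": "record_key",
    "pk.fields": "event_id",
    "auto.create": "true",
    "tasks.max": "1",
}

with open("postgres_sink.json", "w") as f:
    json.dump(postgres_sink, f, indent=2)
```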
Below is a snapshot of the Postgres Sink Connector and the data in the Postgres sink:
5. Real-Time Analytics & Visualization in Grafana
The final PostgreSQL database now holds real-time earthquake records, ready to be queried.
Using Grafana Cloud, several dashboards were created (an example of the underlying query is sketched after this list), including:
Quakes Per Minute – a time series of seismic activity
Alerts by Magnitude – categorizes quakes into green/yellow/orange/red alert levels
Interactive World Map – plots quakes by longitude/latitude using heat maps or markers
Top 5 Quake Locations – ranked by frequency
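Each panel is backed by a SQL query against the Postgres sink table. As an illustration, the aggregation behind a "Quakes Per Minute" style panel could look like the following; it is wrapped in Python here only for demonstration (in Grafana the SQL goes straight into the panel's query editor), and the table/column names follow the assumed schema from the earlier sketches:

```python
# Sketch of the aggregation behind a "Quakes Per Minute" style panel.
# Connection string and schema are assumptions carried over from earlier sketches.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/earthquakes")

quakes_per_minute = text("""
    SELECT
        date_trunc('minute', to_timestamp(time_ms / 1000.0)) AS minute,
        count(*)                                             AS quakes
    FROM earthquake_data
    GROUP BY 1
    ORDER BY 1
""")

with engine.connect() as conn:
    for minute, quakes in conn.execute(quakes_per_minute):
        print(minute, quakes)
```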
If you want to interact with the dashboard, follow this link
Below is a snapshot of the dashboard on Grafana Cloud
6. Automated Hourly Execution with Apache Airflow
The entire process is scheduled and monitored using Apache Airflow:
A DAG (Directed Acyclic Graph) defines the workflow
Runs every hour to ensure near real-time data ingestion, and sends an email notification whether each DAG run succeeds or fails
Airflow provides visibility into task failures, retries, and pipeline health via its web UI.
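A minimal sketch of such an hourly DAG is shown below. The task names, the imported callable, and the notification address are assumptions; the real DAG lives in the project repository:

```python
# Minimal sketch of the hourly pipeline DAG (names and addresses are assumptions).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

# Hypothetical module containing the extraction function from step 1
from extract_earthquakes import extract_and_load

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["recipient@gmail.com"],   # assumed notification address
    "email_on_failure": True,           # alert when a run fails
    "email_on_retry": False,
}

with DAG(
    dag_id="earthquake_pipeline",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",        # near real-time ingestion
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_and_load_earthquakes",
        python_callable=extract_and_load,
    )

    notify_success = EmailOperator(
        task_id="notify_success",
        to="recipient@gmail.com",
        subject="Earthquake pipeline run succeeded",
        html_content="The hourly earthquake DAG completed successfully.",
    )

    extract >> notify_success
```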
Below is a successful DAG run for the pipeline and email alert task.
Why are the connectors running on Confluent Cloud?
Running the Debezium CDC connector and the Postgres sink connector on Confluent Cloud offers many advantages over managing connectors manually on self-managed Kafka. The table below summarizes what Confluent Cloud provides compared with a self-managed setup.
| Advantage | Confluent Cloud | Self-Managed Kafka |
|---|---|---|
| Infrastructure Management | Fully managed by Confluent, no need to manage servers | Requires manual setup, scaling, and maintenance |
| Connector Management | Easy UI for deploying and managing connectors | Manual deployment and monitoring required |
| Scalability | Seamless automatic scaling based on workloads | Must manually configure and monitor scaling |
| Security | Built-in enterprise-grade security (encryption, ACLs) | Responsible for securing infrastructure yourself |
| Schema Registry | Managed Schema Registry with simple integration | Needs separate setup and maintenance |
| Monitoring & Observability | Rich monitoring dashboards and alerting included | Need to set up your own monitoring stack |
| Upgrades & Patching | Handled automatically by Confluent | You are responsible for patching/upgrading |
| High Availability & Reliability | SLA-backed uptime with high availability | Need to architect HA setups manually |
| Faster Time to Market | Quick setup, lets you focus on building data pipelines | Slower due to infrastructure and connector setup |
| Support | Access to Confluent support and managed services | Community support or paid enterprise support |
Conclusion
This project enhanced my hands-on skills in API data extraction, real-time streaming with Kafka, and Change Data Capture (CDC) using Debezium. I gained practical experience building end-to-end data pipelines, automating workflows with Airflow, and creating real-time dashboards in Grafana. Overall, it was a great opportunity to apply data engineering concepts to a real-world monitoring use case.
If you have any questions, don't hesitate to reach out. Please leave a like, comment and follow for more informative data engineering content!