In this project, I designed and built an end-to-end real-time data pipeline that monitors earthquakes from the USGS Earthquake API. The pipeline extracts and loads data into a MySQL database, captures changes with the Debezium MySQL CDC connector, streams those changes into a Kafka topic, sinks them into PostgreSQL, and visualizes the results in Grafana Cloud. The workflow is automated with Apache Airflow to run hourly.
You can access the GitHub repository here
Project Workflow Chart
This pipeline ensures that earthquake data is extracted, streamed, stored, and visualized in near real-time, fully automated and orchestrated with Apache Airflow to run hourly.
Project Walkthrough
1. Earthquake Data Extraction from the USGS Earthquake API
The pipeline begins by calling the USGS Earthquake GeoJSON API, which provides up-to-date global earthquake data. The script:
Pulls earthquake events for the past 24 hours
Extracts key fields: magnitude, place, time, alert, coordinates, and more
Triggers an email alert to a Gmail address when the quake magnitude is ≥ 4.5
This extraction script is written in Python using requests for the API calls, smtplib for sending the alert emails via SMTP, and sqlalchemy with pymysql for writing to MySQL.
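To make the flow concrete, here is a minimal sketch of what that extraction step could look like. The staging table name, column names, credentials, and email addresses are illustrative assumptions, not the exact values used in the project:

```python
# Sketch of the extraction step (illustrative only: credentials, table and
# column names are assumptions, not the project's exact values).
import smtplib
from email.message import EmailMessage

import requests
from sqlalchemy import create_engine, text

# USGS summary feed covering the past 24 hours of earthquakes
USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"


def send_alert(quake):
    """Email an alert via Gmail's SMTP server for significant quakes."""
    msg = EmailMessage()
    msg["Subject"] = f"Earthquake alert: M{quake['magnitude']} near {quake['place']}"
    msg["From"] = "sender@gmail.com"       # assumed sender address
    msg["To"] = "recipient@gmail.com"      # assumed recipient address
    msg.set_content(str(quake))
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login("sender@gmail.com", "app-password")  # Gmail app password assumed
        smtp.send_message(msg)


def extract_and_load():
    features = requests.get(USGS_FEED, timeout=30).json()["features"]
    rows = []
    for f in features:
        props, (lon, lat, depth) = f["properties"], f["geometry"]["coordinates"]
        quake = {
            "event_id": f["id"], "magnitude": props["mag"], "place": props["place"],
            "time_ms": props["time"], "updated_ms": props["updated"],
            "alert": props["alert"], "longitude": lon, "latitude": lat,
            "depth": depth, "url": props["url"], "status": props["status"],
        }
        rows.append(quake)
        if quake["magnitude"] is not None and quake["magnitude"] >= 4.5:
            send_alert(quake)

    # Load into the MySQL staging table (connection string is an assumption).
    engine = create_engine("mysql+pymysql://user:password@localhost:3306/earthquakes")
    insert_stmt = text("""
        INSERT INTO earthquake_data
            (event_id, magnitude, place, time_ms, updated_ms, alert,
             longitude, latitude, depth, url, status)
        VALUES
            (:event_id, :magnitude, :place, :time_ms, :updated_ms, :alert,
             :longitude, :latitude, :depth, :url, :status)
        ON DUPLICATE KEY UPDATE
            magnitude = VALUES(magnitude), updated_ms = VALUES(updated_ms),
            alert = VALUES(alert), status = VALUES(status)
    """)
    with engine.begin() as conn:
        conn.execute(insert_stmt, rows)


if __name__ == "__main__":
    extract_and_load()
```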
2. Loading into MySQL Database
The extracted data is then cleaned and loaded into a MySQL database table called earthquake_data. This serves as the staging area where new and changed data is logged before flowing downstream.
The schema includes the following fields (an illustrative table definition is sketched after this list):
Event ID
Place and ___location coordinates
Magnitude and depth
Alert level
Timestamps (time, updated)
USGS metadata like URL, IDs, and status
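For reference, an illustrative DDL for this staging table might look like the following. Column names and types are assumptions inferred from the field list above, not the project's exact schema:

```python
# Illustrative DDL for the MySQL staging table (names/types are assumptions).
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost:3306/earthquakes")

ddl = text("""
    CREATE TABLE IF NOT EXISTS earthquake_data (
        event_id    VARCHAR(64) PRIMARY KEY,   -- USGS event ID
        place       VARCHAR(255),
        longitude   DOUBLE,
        latitude    DOUBLE,
        depth       DOUBLE,                    -- km
        magnitude   DOUBLE,
        alert       VARCHAR(16),               -- green / yellow / orange / red
        time_ms     BIGINT,                    -- event time (epoch ms)
        updated_ms  BIGINT,                    -- last update (epoch ms)
        url         VARCHAR(512),
        status      VARCHAR(32)                -- e.g. automatic / reviewed
    )
""")

with engine.begin() as conn:
    conn.execute(ddl)
```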
Below is a snapshot of the data in a MySQL database:
3. Change Data Capture with MySQL Debezium Connector
To stream changes in the database in real time, the Debezium MySQL CDC Connector on Confluent Cloud listens to the binlog (binary log) of the MySQL database. It captures:
New earthquake events
Updates (e.g., revised magnitudes or alert levels)
Deletions
These changes are published to a Kafka topic in JSON_SR (JSON Schema Registry) format. This matters because the downstream Postgres Sink connector only accepts JSON_SR input, not plain JSON.
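As a rough illustration, the managed connector is driven by a small set of key/value properties. The sketch below shows the general shape of that configuration as a Python dict; the exact property names depend on the connector version on Confluent Cloud, and all values here are placeholders:

```python
# Illustrative config for the fully managed Debezium MySQL CDC source connector
# on Confluent Cloud. Property names follow Confluent's managed-connector docs
# but vary by connector version -- treat everything here as a placeholder.
import json

mysql_cdc_source = {
    "name": "earthquake-mysql-cdc",
    "connector.class": "MySqlCdcSource",
    "kafka.auth.mode": "KAFKA_API_KEY",
    "kafka.api.key": "<KAFKA_API_KEY>",
    "kafka.api.secret": "<KAFKA_API_SECRET>",
    "database.hostname": "<mysql-host>",
    "database.port": "3306",
    "database.user": "<cdc-user>",
    "database.password": "<cdc-password>",
    "database.server.name": "earthquakes",            # used as the topic prefix
    "table.include.list": "earthquakes.earthquake_data",
    "output.data.format": "JSON_SR",                   # required by the Postgres sink
    "tasks.max": "1",
}

# Written to a file so it can be submitted through the Confluent Cloud UI or CLI.
with open("mysql_cdc_source.json", "w") as f:
    json.dump(mysql_cdc_source, f, indent=2)
```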
Below is a snapshot of the MySQL CDC connector.
4. Kafka Streams Earthquake Events to a PostgreSQL Sink
Next, the PostgreSQL Sink Connector subscribes to the same Kafka topic. It:
Reads the after field from the change event
Deserializes the JSON_SR format
Inserts or updates rows in a PostgreSQL table used for analytics
TIP: You must configure the connector to match table naming formats and use primary key fields to support upserts.
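A corresponding sketch for the sink side is shown below, again with placeholder values and property names that should be checked against the deployed connector version; topic and table names follow the assumptions in the earlier sketches:

```python
# Illustrative config for the managed Postgres Sink connector on Confluent
# Cloud. All values are placeholders; verify property names against the docs
# for the connector version you deploy.
import json

postgres_sink = {
    "name": "earthquake-postgres-sink",
    "connector.class": "PostgresSink",
    "kafka.auth.mode": "KAFKA_API_KEY",
    "kafka.api.key": "<KAFKA_API_KEY>",
    "kafka.api.secret": "<KAFKA_API_SECRET>",
    "topics": "earthquakes.earthquakes.earthquake_data",  # assumed CDC topic name
    "input.data.format": "JSON_SR",          # must match the CDC connector's output
    "connection.host": "<postgres-host>",
    "connection.port": "5432",
    "connection.user": "<postgres-user>",
    "connection.password": "<postgres-password>",
    "db.name": "earthquakes",
    "table.name.format": "earthquake_data",  # keep naming consistent (see TIP above)
    "insert.mode": "UPSERT",                 # update existing rows on key match
    "pk.mode": "record_key",
    "pk.fields": "event_id",
    "auto.create": "true",
    "tasks.max": "1",
}

with open("postgres_sink.json", "w") as f:
    json.dump(postgres_sink, f, indent=2)
```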
Below is a snapshot of the Postgres Sink Connector and the data in the Postgres sink:
5. Real-Time Analytics & Visualization in Grafana
The final PostgreSQL database now holds real-time earthquake records, ready to be queried.
Using Grafana Cloud, several dashboards were created (an example of the underlying query is sketched after this list), including:
Quakes Per Minute – a time series of seismic activity
Alerts by Magnitude – categorizes quakes into green/yellow/orange/red alert levels
Interactive World Map – plots quakes by longitude/latitude using heat maps or markers
Top 5 Quake Locations – ranked by frequency
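Each panel is backed by a SQL query against the Postgres sink table. As an illustration, the aggregation behind a "Quakes Per Minute" style panel could look like the following; it is wrapped in Python here only for demonstration (in Grafana the SQL goes straight into the panel's query editor), and the table/column names follow the assumed schema from the earlier sketches:

```python
# Sketch of the aggregation behind a "Quakes Per Minute" style panel.
# Connection string and schema are assumptions carried over from earlier sketches.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/earthquakes")

quakes_per_minute = text("""
    SELECT
        date_trunc('minute', to_timestamp(time_ms / 1000.0)) AS minute,
        count(*)                                             AS quakes
    FROM earthquake_data
    GROUP BY 1
    ORDER BY 1
""")

with engine.connect() as conn:
    for minute, quakes in conn.execute(quakes_per_minute):
        print(minute, quakes)
```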
If you want to interact with the dashboard, follow this link
Below is a snapshot of the dashboard on Grafana Cloud
6. Automated Hourly Execution with Apache Airflow
The entire process is scheduled and monitored using Apache Airflow:
A DAG (Directed Acyclic Graph) defines the workflow
Runs every hour to ensure near real-time data ingestion, and sends an email notification whether each DAG run succeeds or fails
Airflow provides visibility into task failures, retries, and pipeline health via its web UI.
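A minimal sketch of such an hourly DAG is shown below. The task names, the imported callable, and the notification address are assumptions; the real DAG lives in the project repository:

```python
# Minimal sketch of the hourly pipeline DAG (names and addresses are assumptions).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

# Hypothetical module containing the extraction function from step 1
from extract_earthquakes import extract_and_load

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["recipient@gmail.com"],   # assumed notification address
    "email_on_failure": True,           # alert when a run fails
    "email_on_retry": False,
}

with DAG(
    dag_id="earthquake_pipeline",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",        # near real-time ingestion
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_and_load_earthquakes",
        python_callable=extract_and_load,
    )

    notify_success = EmailOperator(
        task_id="notify_success",
        to="recipient@gmail.com",
        subject="Earthquake pipeline run succeeded",
        html_content="The hourly earthquake DAG completed successfully.",
    )

    extract >> notify_success
```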
Below is a successful DAG run for the pipeline and email alert task.
Why are the connectors running on Confluent Cloud?
Running the Debezium CDC connector and the Postgres sink connector on Confluent Cloud offers many advantages over managing connectors manually on self-managed Kafka. The table below summarizes what Confluent Cloud provides compared with a self-managed setup.
| Advantage | Confluent Cloud | Self-Managed Kafka |
|---|---|---|
| Infrastructure Management | Fully managed by Confluent, no need to manage servers | Requires manual setup, scaling, and maintenance |
| Connector Management | Easy UI for deploying and managing connectors | Manual deployment and monitoring required |
| Scalability | Seamless automatic scaling based on workloads | Must manually configure and monitor scaling |
| Security | Built-in enterprise-grade security (encryption, ACLs) | Responsible for securing infrastructure yourself |
| Schema Registry | Managed Schema Registry with simple integration | Needs separate setup and maintenance |
| Monitoring & Observability | Rich monitoring dashboards and alerting included | Need to set up your own monitoring stack |
| Upgrades & Patching | Handled automatically by Confluent | You are responsible for patching/upgrading |
| High Availability & Reliability | SLA-backed uptime with high availability | Need to architect HA setups manually |
| Faster Time to Market | Quick setup, lets you focus on building data pipelines | Slower due to infrastructure and connector setup |
| Support | Access to Confluent support and managed services | Community support or paid enterprise support |
Conclusion
This project enhanced my hands-on skills in API data extraction, real-time streaming with Kafka, and Change Data Capture (CDC) using Debezium. I gained practical experience building end-to-end data pipelines, automating workflows with Airflow, and creating real-time dashboards in Grafana. Overall, it was a great opportunity to apply data engineering concepts to a real-world monitoring use case.
If you have any questions, don't hesitate to reach out. Please leave a like, comment and follow for more informative data engineering content!