This repository contains a complete Kafka and Spark-based streaming pipeline for real-time fraud detection. The pipeline simulates realistic user behavior and transaction events and analyzes them to detect fraudulent activities immediately.
Purpose: The Spark streaming notebook (Fraud_Data_Spark_Streaming.ipynb) processes streaming data with Apache Spark Structured Streaming to identify fraudulent activity in real time.
Key Features:
- Structured data ingestion with explicit schema definitions.
- Real-time data processing integrated with Kafka.
- Robust configuration with memory optimization and fault tolerance.
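
To make the features above concrete, here is a minimal sketch of how a Kafka stream might be read with an explicit schema and a checkpoint for fault tolerance. It is not taken from the notebook: the field names, the topic name `transactions`, the broker address, the checkpoint path, and the helper names `kafka_source_options`, `read_transactions`, and `start_console_query` are all assumptions made for illustration.

```python
# Sketch only: field names, topic, broker address, and paths are assumptions,
# not values taken from the repository's notebook.

# Explicit schema written as a DDL string; from_json accepts this form.
TRANSACTION_SCHEMA = "user_id STRING, amount DOUBLE, event_time TIMESTAMP"


def kafka_source_options(bootstrap_servers, topic):
    """Return the options passed to Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def read_transactions(spark, bootstrap_servers="localhost:9092", topic="transactions"):
    """Return a streaming DataFrame of parsed transaction events."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql.functions import col, from_json

    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as bytes in the `value` column,
    # so cast it to a string before parsing the JSON against the schema.
    parsed = raw.select(
        from_json(col("value").cast("string"), TRANSACTION_SCHEMA).alias("event")
    )
    return parsed.select("event.*")


def start_console_query(events, checkpoint_dir="/tmp/fraud_checkpoint"):
    """Print parsed events; the checkpoint lets the query resume after a restart."""
    return (
        events.writeStream.format("console")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

With an active `SparkSession` named `spark`, `start_console_query(read_transactions(spark)).awaitTermination()` would run the stream. Spark's Kafka source ships separately from core Spark, so the `spark-sql-kafka-0-10` package has to be available to the session.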
Purpose: The producer notebook (Fraud_Detection_Producer.ipynb) simulates realistic data streams representing user browsing behavior and transactions, and publishes them to Kafka topics.
Key Features:
- Kafka producer for continuous, scalable data streaming.
- Random batch sizes to imitate real-world user interaction variability.
- Timestamped data events for realistic simulation.
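
The producer features above can be sketched roughly as follows. This is an illustrative sketch, not the notebook's code: it assumes the `kafka-python` client, and the event fields, the topic name `transactions`, the broker address, and the helper names `make_event`, `make_batch`, and `run_producer` are hypothetical.

```python
import json
import random
import time
from datetime import datetime, timezone


def make_event(rng):
    """Build one timestamped transaction event (hypothetical fields)."""
    return {
        "user_id": f"user_{rng.randint(1, 1000)}",
        "amount": round(rng.uniform(1.0, 500.0), 2),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }


def make_batch(rng, min_size=1, max_size=20):
    """Build a batch whose size is drawn at random from [min_size, max_size]."""
    return [make_event(rng) for _ in range(rng.randint(min_size, max_size))]


def run_producer(topic="transactions", bootstrap_servers="localhost:9092",
                 batches=10, pause_seconds=1.0):
    """Publish a fixed number of random-sized batches to a Kafka topic."""
    # Imported here so the generators above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        # Serialize each event dict to UTF-8 encoded JSON.
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    rng = random.Random()
    try:
        for _ in range(batches):
            for event in make_batch(rng):
                producer.send(topic, value=event)
            producer.flush()  # block until the buffered events have been sent
            time.sleep(pause_seconds)
    finally:
        producer.close()
```

Keeping event generation separate from the Kafka client means the simulated data can be inspected or tested without a running broker; `make_batch(random.Random(42))` returns a list of plain dictionaries.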
Workflow:
- Data Simulation: The producer notebook simulates real-world data and publishes it to Kafka.
- Streaming Analysis: Spark Structured Streaming consumes these Kafka streams and processes them to detect potential fraud in real time.
Use Cases:
- Real-Time Fraud Detection
- Behavioral and Transactional Analytics
- System Performance Testing and Validation
Requirements:
- Kafka
- Apache Spark (PySpark)
- Python 3
Usage:
- Start Kafka and Spark services.
- Run Fraud_Detection_Producer.ipynb to initiate data streaming.
- Run Fraud_Data_Spark_Streaming.ipynb to process data in real time.
Contributions are welcome. Please submit pull requests for new features or improvements.
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.