A high-performance computer vision system that performs real-time image segmentation using K-Means clustering with multiple parallel backends for optimal performance across different hardware configurations.
Real-time parallel K-means clustering: original vs segmented frames with dynamic K-value control via slider
- ๐ Real-Time Performance: Up to 55+ FPS with CUDA backend on live webcam feeds
- ๐ง Multiple Parallel Backends: Sequential, Multi-threaded, MPI, and CUDA implementations
- ๐ณ RCC Tree Optimization: Recursive Cached Coreset tree for efficient streaming segmentation
- ๐ฏ 5D Feature Space: Combines color (BGR) and spatial (x,y) features for coherent segmentation
- โก Dynamic Backend Switching: Switch between backends in real-time with keyboard shortcuts
- ๐ Performance Monitoring: Live FPS tracking with min/max statistics
- ๐ผ๏ธ Interactive Controls: Adjustable K-value slider and side-by-side visualization
This project implements an advanced K-Means clustering system optimized for real-time image segmentation with multiple parallelization strategies:
- Core Algorithm: K-Means clustering adapted for image segmentation
- Feature Engineering: 5D vectors combining color similarity and spatial proximity
- Coreset Sampling: Reduces computational complexity from O(nยทkยทt) to O(sยทkยทt), where n = total pixels, s = coreset size (s โช n)
-
RCC Tree Structure: Maintains
$(1 \pm \epsilon)$ -approximation with bounded memory - Hardware Optimization: Leverages multi-core CPUs, distributed systems, and GPUs
Backend | K=2 (Min FPS) | K=2 (Max FPS) | K=12 (Min FPS) | K=12 (Max FPS) | Performance Ratio |
---|---|---|---|---|---|
Sequential | 15 | 17 | 5 | 6 | 1.0ร (baseline) |
MPI | 13 | 31 | 10 | 13 | 1.8ร average |
Multi-threaded | 14 | 44 | 10 | 22 | 2.4ร average |
CUDA | 14 | 55 | 15 | 44 | 3.2ร average |
Performance Improvement Factor (vs Sequential):
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ CUDA: โโโโโโโโโโโโโโโโ 3.2ร โ
โ Multi-thread: โโโโโโโโโโโโ 2.4ร โ
โ MPI: โโโโโโโ 1.8ร โ
โ Sequential: โโโโ 1.0ร (baseline) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Component | Complexity | Notes |
---|---|---|
Sequential K-Means | Baseline implementation | |
Coreset K-Means | Reduced complexity with |
|
RCC Tree Insertion | Streaming update per frame | |
RCC Tree Merging | Weighted merge operation | |
GPU Assignment | Massive parallelization |
- Strategy: Row-based work distribution across CPU threads
- Synchronization: Lock-free design with barrier synchronization
- Memory Safety: Read-only sharing with exclusive write regions
- Best For: Multi-core desktop systems
- Strategy: Process-level distribution with thread-level parallelization
- Communication: MPI_Bcast for centers, MPI_Gatherv for results
- Hybrid Approach: Combines inter-process and intra-process parallelism
- Best For: HPC clusters and large distributed systems
- Strategy: One thread per pixel for maximum throughput
- Memory Management: Efficient host-device transfers
- Synchronization: cudaDeviceSynchronize() for completion barriers
- Best For: High-throughput applications with GPU hardware
# System Requirements
- C++17 compatible compiler (GCC/MSVC/Clang)
- CMake 3.18+
- OpenCV 4.0+
- CUDA Toolkit (for GPU backend)
- MPI implementation (for distributed backend)
# Clone the repository
git clone https://github.com/dosqas/realtime-parallel-kmeans-segmentation.git
cd realtime-parallel-kmeans-segmentation
# Create build directory
mkdir build && cd build
# Configure with CMake
cmake ..
# Build the project
cmake --build . --config Release
# Run the application
./realtime_segmentation # Linux/Mac
# or
realtime_segmentation.exe # Windows
Key | Action |
---|---|
'1' | Switch to Sequential backend |
'2' | Switch to MPI backend |
'3' | Switch to Multi-threaded backend |
'4' | Switch to CUDA backend |
ESC | Exit application |
K Slider | Adjust cluster count (2-12) |
// Default configuration values
const int k_min = 1; // Minimum clusters
const int k_max = 12; // Maximum clusters
const int sample_size = 2000; // Coreset size
const float color_scale = 1.0f; // Color feature scaling
const float spatial_scale = 0.5f; // Spatial feature scaling
const int max_levels = 8; // Maximum tree height
const int default_sample = 2000; // Default coreset size
FPS Range | Classification | User Experience | Recommended Backends |
---|---|---|---|
30+ FPS | Excellent | Smooth real-time | CUDA, Multi-threaded (Kโค8) |
20-30 FPS | Good | Acceptable real-time | Multi-threaded, MPI (Kโค6) |
10-20 FPS | Fair | Noticeable lag | All backends (Kโค4) |
<10 FPS | Poor | Choppy playback | Sequential only (high K) |
- Real-time video segmentation for streaming applications
- Live broadcast effects and background replacement
- Content creation and video editing workflows
- Baseline implementation for segmentation algorithms
- Performance benchmarking across different hardware
- Educational demonstrations of parallel computing concepts
- Real-time analysis of medical imagery
- Interactive segmentation for diagnostic applications
- High-throughput batch processing of medical data
- Real-time augmented reality applications
- Interactive art installations
- Gaming and entertainment systems
// In clustering_backends.hpp
enum Backend {
BACKEND_SEQ = 0,
BACKEND_CUDA = 1,
BACKEND_THR = 2,
BACKEND_MPI = 3,
BACKEND_CUSTOM = 4 // Your custom backend
};
// Implement your backend function
cv::Mat segmentFrameWithKMeans_custom(
const cv::Mat& frame, int k, int sample_size,
float color_scale, float spatial_scale);
// Adjust coreset sampling for speed vs quality trade-off
const int fast_sample_size = 1000; // Faster, lower quality
const int quality_sample_size = 5000; // Slower, higher quality
// Modify feature scaling for different segmentation characteristics
const float color_emphasis = 2.0f; // Emphasize color similarity
const float spatial_emphasis = 0.1f; // De-emphasize spatial proximity
realtime-parallel-kmeans-segmentation/
โโโ ๐ include/ # Header files
โ โโโ clustering.hpp # Main clustering interface
โ โโโ clustering_backends.hpp # Backend implementations
โ โโโ coreset.hpp # Coreset data structures
โ โโโ rcc.hpp # RCC tree implementation
โ โโโ utils.hpp # Utility functions
โ โโโ video_io.hpp # Video I/O interface
โโโ ๐ src/ # Source files
โ โโโ ๐ clustering/ # Backend implementations
โ โ โโโ clustering_cuda.cu # CUDA GPU backend
โ โ โโโ clustering_entry.cpp # Backend dispatcher
โ โ โโโ clustering_mpi.cpp # MPI distributed backend
โ โ โโโ clustering_seq.cpp # Sequential CPU backend
โ โ โโโ clustering_thr.cpp # Multi-threaded backend
โ โโโ coreset.cpp # Coreset algorithms
โ โโโ main.cpp # Application entry point
โ โโโ rcc.cpp # RCC tree implementation
โ โโโ utils.cpp # Utility functions
โ โโโ video_io.cpp # Video I/O implementation
โโโ ๐ docs/ # Documentation
โ โโโ project__demo.gif # Program demonstration GIF
โโโ ๐ docs/ # Documentation
โ โโโ algorithms.md # Algorithm descriptions
โ โโโ parallelization.md # Synchronization details
โ โโโ performance.md # Performance analysis
โโโ ๐ tests/ # Test files
โ โโโ test_clustering.cpp # Clustering tests
โ โโโ test_coreset.cpp # Coreset tests
โ โโโ test_rcc_.cpp # RCC tree tests
โ โโโ test_utils.cpp # Utility tests
โ โโโ test_video_io_.cpp # Video I/O tests
โโโ CMakeLists.txt # Build configuration
โโโ LICENSE # MIT License
โโโ README.md # This file
The RCC tree enables efficient streaming K-means by:
- Leaf Insertion: New frame coresets inserted with carry propagation
- Node Merging: Weighted coreset combination with bounded size
- Root Computation: Dynamic merging of all levels for comprehensive representation
- Memory Bounds: Tree height limited to prevent unbounded growth
- Multi-threaded: Lock-free design with const references and exclusive write regions
- MPI: Collective operations (MPI_Bcast, MPI_Gatherv) with hybrid OpenMP parallelization
- CUDA: Host-device synchronization with cudaDeviceSynchronize() barriers
- Memory Requirements: CUDA backend requires sufficient GPU memory for large images
- Network Dependency: MPI performance varies with network latency and bandwidth
- K-value Scaling: All backends show performance degradation with very high cluster counts
- Hardware Specific: Optimal performance depends on specific hardware configuration
- Adaptive Coreset Sizing: Dynamic adjustment based on image complexity
- Additional Color Spaces: Support for HSV, LAB, and other color representations
- Temporal Coherence: Frame-to-frame consistency improvements
- Mobile Optimization: ARM NEON and mobile GPU backend support
- Cloud Integration: Distributed processing across cloud instances
- OpenCV Team โ For comprehensive computer vision library and excellent documentation
- NVIDIA CUDA โ For GPU computing platform and development tools
- Open MPI Project โ For high-performance message passing interface
- CMake Community โ For cross-platform build system
- Research Community โ For foundational work on coreset algorithms and RCC trees
- Feldman, D., Schmidt, M., & Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
- Bachem, O., Lucic, M., & Krause, A. (2017). Practical coreset constructions for machine learning
This project is licensed under the MIT License. See the LICENSE file for details.
Questions, feedback, or ideas? Reach out anytime at sebastian.soptelea@proton.me.