Only the extraordinary can beget the extraordinary
Lavoisier is a high-performance computing solution for mass spectrometry-based metabolomics data analysis pipelines. It combines traditional numerical methods with advanced visualization and AI-driven analytics to provide comprehensive insights from high-volume MS data.
Lavoisier features a metacognitive orchestration layer that coordinates two main pipelines:
- Numerical Analysis Pipeline: uses established computational methods for ion spectra extraction and annotates ion peaks through database searches, fragmentation rules, and natural language processing.
- Visual Analysis Pipeline: converts spectra into video format and applies computer vision methods for annotation.
The orchestration layer manages workflow execution, resource allocation, and integrates LLM-powered intelligence for analysis and decision-making.
```
┌────────────────────────────────────────────────────────────────┐
│                  Metacognitive Orchestration                   │
│                                                                │
│  ┌──────────────────────┐          ┌───────────────────────┐   │
│  │                      │          │                       │   │
│  │  Numerical Pipeline  │◄────────►│    Visual Pipeline    │   │
│  │                      │          │                       │   │
│  └──────────────────────┘          └───────────────────────┘   │
│             ▲                                  ▲               │
│             │                                  │               │
│             ▼                                  ▼               │
│  ┌──────────────────────┐          ┌───────────────────────┐   │
│  │                      │          │                       │   │
│  │   Model Repository   │◄────────►│    LLM Integration    │   │
│  │                      │          │                       │   │
│  └──────────────────────┘          └───────────────────────┘   │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
Lavoisier provides a high-performance command-line interface (CLI) for seamless interaction with all system components:
- Built with modern CLI frameworks for visually pleasing, intuitive interaction
- Color-coded outputs, progress indicators, and interactive components
- Command completions and contextual help
- Workflow management and pipeline orchestration
- Integrated with LLM assistants for natural language interaction
- Configuration management and parameter customization
- Results visualization and reporting
The numerical pipeline processes raw mass spectrometry data through a distributed computing architecture designed to handle large-scale MS datasets:
- Extracts MS1 and MS2 spectra from mzML files
- Performs intensity thresholding (MS1: 1000.0, MS2: 100.0 by default)
- Applies m/z tolerance filtering (0.01 Da default)
- Handles retention time alignment (0.5 min tolerance)
- Multi-database annotation system integrating multiple complementary resources
- Spectral matching against libraries (MassBank, METLIN, MzCloud, in-house)
- Accurate mass search across HMDB, LipidMaps, KEGG, and PubChem
- Fragmentation tree generation for structural elucidation
- Pathway integration with KEGG and HumanCyc databases
- Multi-component confidence scoring system for reliable identifications
- Deep learning models for MS/MS prediction and spectral interpretation
- Transfer learning from large-scale metabolomics datasets
- Model serialization for all analytical outputs
- Automated hyperparameter optimization
- Utilizes Ray for parallel processing
- Implements Dask for large dataset handling
- Automatic resource management based on system capabilities
- Dynamic workload distribution across available cores
- Efficient data storage using Zarr format
- Compressed data storage with LZ4 compression
- Parallel I/O operations for improved performance
- Hierarchical data organization
- Automatic chunk size optimization
- Memory-efficient processing
- Progress tracking and reporting
- Comprehensive error handling and logging
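The intensity-thresholding and m/z-tolerance steps above can be sketched as plain array operations. This is a minimal illustration, not Lavoisier's actual API: `threshold_spectrum` and `match_mz` are hypothetical helper names, and the defaults mirror the documented values (MS1 threshold 1000.0, 0.01 Da tolerance).

```python
import numpy as np

def threshold_spectrum(mz, intensity, min_intensity=1000.0):
    """Drop peaks below an intensity threshold (MS1 default: 1000.0)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    keep = intensity >= min_intensity
    return mz[keep], intensity[keep]

def match_mz(query_mz, reference_mz, tol=0.01):
    """For each query m/z, return the index of the nearest reference peak
    within `tol` Da (0.01 Da default), or -1 if none is close enough."""
    query_mz = np.asarray(query_mz, dtype=float)
    reference_mz = np.sort(np.asarray(reference_mz, dtype=float))
    idx = np.clip(np.searchsorted(reference_mz, query_mz),
                  1, len(reference_mz) - 1)
    # Pick the nearer of the two neighbouring reference peaks.
    left, right = reference_mz[idx - 1], reference_mz[idx]
    nearest = np.where(query_mz - left < right - query_mz, idx - 1, idx)
    dist = np.abs(reference_mz[nearest] - query_mz)
    return np.where(dist <= tol, nearest, -1)
```

The same pattern extends naturally to MS2 (threshold 100.0) and to retention-time alignment with a coarser tolerance.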
The visualization pipeline transforms processed MS data into interpretable visual formats:
- MS image database creation and management
- Feature extraction from spectral data
- Resolution-specific image generation (default 1024x1024)
- Feature dimension handling (128-dimensional by default)
- Creates time-series visualizations of MS data
- Generates analysis videos showing spectral changes
- Supports multiple visualization formats
- Custom color mapping and scaling
- Combines multiple spectra into cohesive visualizations
- Temporal alignment of spectral data
- Metadata integration into visualizations
- Batch processing capabilities
- High-resolution image generation
- Video compilation of spectral changes
- Interactive visualization options
- Multiple export formats support
Lavoisier integrates commercial and open-source LLMs to enhance analytical capabilities and enable continuous learning:
- Natural language interface through CLI
- Context-aware analytical assistance
- Automated report generation
- Expert knowledge integration
- Integration with Claude, GPT, and other commercial LLMs
- Local models via Ollama for offline processing
- Numerical model API endpoints for LLM queries
- Pipeline result interpretation
- Feedback loop capturing new analytical results
- Incremental model updates via train-evaluate cycles
- Knowledge distillation from commercial LLMs to local models
- Versioned model repository with performance tracking
- Auto-generated queries of increasing complexity
- Integration of numerical model outputs with LLM knowledge
- Comparative analysis between numeric and visual pipelines
- Knowledge extraction and synthesis
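A local-model query via Ollama might look like the following sketch. The `/api/generate` endpoint and payload shape follow Ollama's REST API; `build_interpretation_prompt` is a hypothetical helper illustrating how pipeline results could feed an LLM.

```python
import json
import urllib.request

def build_interpretation_prompt(compound, score, fragments):
    """Hypothetical helper: turn pipeline output into an LLM prompt."""
    frags = ", ".join(f"{mz:.4f}" for mz in fragments)
    return (
        f"The numerical pipeline annotated a peak as {compound} "
        f"(confidence {score:.2f}) with fragment ions at m/z {frags}. "
        "Briefly assess whether this fragmentation pattern is consistent."
    )

def ask_ollama(prompt, model="llama3", host="http://localhost:11434"):
    """Send a non-streaming generate request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server):
# print(ask_ollama(build_interpretation_prompt("glucose", 0.97,
#                                              [85.0290, 127.0390])))
```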
Lavoisier incorporates domain-specific models for advanced analysis tasks:
- BioMedLM integration for biomedical text analysis and generation
- Context-aware analysis of mass spectrometry data
- Biological pathway interpretation and metabolite identification
- Custom prompting templates for different analytical tasks
- SciBERT model for scientific literature processing and embedding
- Multiple pooling strategies for optimal text representation
- Similarity-based search across scientific documents
- Batch processing of large text collections
- PubMedBERT-NER-Chemical for extracting chemical compounds from text
- Identification and normalization of chemical nomenclature
- Entity replacement for text preprocessing
- High-precision extraction with confidence scoring
- InstaNovo model for de novo peptide sequencing
- Integration of proteomics and metabolomics data
- Cross-modal analysis for comprehensive biomolecule profiling
- Advanced protein identification workflows
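Entity replacement for text preprocessing can be illustrated with a toy synonym table. In the actual pipeline the recognized spans would come from PubMedBERT-NER-Chemical rather than a hand-written dictionary; this sketch only shows the normalization step.

```python
import re

# Toy normalization table; real spans would come from the NER model.
SYNONYMS = {
    "dextrose": "glucose",
    "d-glucose": "glucose",
    "vitamin c": "ascorbic acid",
}

def normalize_chemicals(text, synonyms=SYNONYMS):
    """Replace recognized chemical synonyms with a canonical name."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, synonyms)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: synonyms[m.group(0).lower()], text)
```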
- Processing speeds: Up to 1000 spectra/second (hardware dependent)
- Memory efficiency: Streaming processing for large datasets
- Scalability: Automatic adjustment to available resources
- Parallel processing: Multi-core utilization
- Input formats: mzML (primary), with extensible format support
- Output formats: Zarr, HDF5, video (MP4), images (PNG/JPEG)
- Data volumes: Capable of handling datasets >100GB
- Batch processing: Multiple file handling
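Streaming, memory-efficient processing reduces to chunk-wise iteration: a reduction is carried across chunks so no more than one chunk is resident at a time. The helper names here are hypothetical, for illustration only.

```python
import numpy as np

def chunked(array, size):
    """Yield fixed-size views over a large array (last chunk may be short)."""
    for start in range(0, len(array), size):
        yield array[start:start + size]

def streaming_max_intensity(chunks):
    """Reduce an arbitrarily large intensity stream chunk by chunk,
    never holding more than one chunk in memory."""
    best = -np.inf
    for chunk in chunks:
        best = max(best, float(np.max(chunk)))
    return best
```

The same pattern underlies the Zarr-backed storage: chunked, compressed arrays are read and reduced piecewise rather than loaded whole.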
- Multi-tiered annotation combining spectral matching and accurate mass search
- Integrated pathway analysis for biological context
- Confidence scoring system weighing multiple evidence sources
- Parallelized database searches for rapid compound identification
- Isotope pattern matching and fragmentation prediction
- RT prediction for additional identification confidence
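Accurate mass search can be sketched as a tolerance lookup. The three-compound table below is purely illustrative (real searches query HMDB, LipidMaps, KEGG, and PubChem), and the 5 ppm window is an assumed parameter, not a documented default.

```python
# Tiny illustrative monoisotopic-mass table (neutral masses, Da).
MASSES = {
    "glucose": 180.06339,
    "lactate": 90.03169,
    "citrate": 192.02700,
}

def accurate_mass_search(observed_mass, ppm=5.0, table=MASSES):
    """Return candidate compounds whose monoisotopic mass lies within
    `ppm` parts-per-million of the observed neutral mass."""
    hits = []
    for name, mass in table.items():
        if abs(observed_mass - mass) / mass * 1e6 <= ppm:
            hits.append(name)
    return hits
```

In the full system each hit would then be weighed against spectral-match, isotope-pattern, and RT evidence to produce the composite confidence score.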
- Automated validation checks
- Signal-to-noise ratio monitoring
- Quality metrics reporting
- Error detection and handling
- Peak detection and quantification
- Retention time alignment
- Mass accuracy verification
- Intensity normalization
- Protein identification workflows
- Peptide quantification
- Post-translational modification analysis
- Comparative proteomics studies
- De novo peptide sequencing with InstaNovo integration
- Cross-analysis of proteomics and metabolomics datasets
- Protein-metabolite interaction mapping
- Metabolite profiling
- Pathway analysis
- Biomarker discovery
- Time-series metabolomics
- Instrument performance monitoring
- Method validation
- Batch effect detection
- System suitability testing
- Scientific presentation
- Publication-quality figures
- Time-course analysis
- Comparative analysis visualization
Our comprehensive validation demonstrates the effectiveness of Lavoisier's dual-pipeline approach through rigorous statistical analysis and performance metrics:
- Feature Extraction Accuracy: 0.989 similarity score between pipelines, with complementarity index of 0.961
- Vision Pipeline Robustness: 0.954 stability score against noise/perturbations
- Annotation Performance: Numerical pipeline achieves perfect accuracy (1.0) for known compounds
- Temporal Consistency: 0.936 consistency score for time-series analysis
- Anomaly Detection: Low anomaly score of 0.02, indicating reliable performance
Full scan mass spectrum showing the comprehensive metabolite profile with high mass accuracy and resolution
MS/MS fragmentation pattern analysis for glucose, demonstrating detailed structural elucidation
Comparison of feature extraction between numerical and visual pipelines, showing high concordance and complementarity
The following video demonstrates our novel computer vision approach to mass spectrometry analysis:
The video above shows real-time mass spectrometry data analysis through our computer vision pipeline, demonstrating:
- Real-time conversion of mass spectra to visual patterns
- Dynamic feature detection and tracking across time
- Metabolite intensity changes visualized as flowing patterns
- Structural similarities highlighted through visual clustering
- Pattern changes detected and analyzed in real-time
The visual pipeline represents a groundbreaking approach to mass spectrometry data analysis through computer vision techniques. This section details the mathematical foundations and implementation of the method demonstrated in the video above.
- Spectrum-to-Image Transformation: the conversion of mass spectra to visual representations follows:
F(m/z, I) → R^(n×n)
where:
- m/z ∈ R^k: mass-to-charge ratio vector
- I ∈ R^k: intensity vector
- n: resolution dimension (default: 1024)
The transformation is defined by:
P(x,y) = G(σ) * ∑[δ(x - φ(m/z)) · ψ(I)]
where:
- P(x,y): pixel intensity at coordinates (x,y)
- G(σ): Gaussian kernel with σ=1
- φ: m/z mapping function to x-coordinate
- ψ: intensity scaling function (log1p transform)
- δ: Dirac delta function
- Temporal Integration: sequential frames are processed using a sliding-window approach:
B_t = {F_i | i ∈ [t-w, t]}
where:
- B_t: frame buffer at time t
- w: window size (default: 30 frames)
- F_i: transformed frame at time i
- Scale-Invariant Feature Transform (SIFT)
  - Keypoint detection using DoG (Difference of Gaussians)
  - Local extrema detection in scale space
  - Keypoint localization and filtering
  - Orientation assignment
  - Feature descriptor generation
- Temporal Pattern Analysis
  - Optical flow computation using the Farneback method
  - Flow magnitude and direction analysis:

    M(x,y) = √(fx² + fy²),   θ(x,y) = arctan(fy/fx)
where:
- M: flow magnitude
- θ: flow direction
- fx, fy: flow vectors
- Feature Correlation: temporal patterns are analyzed using frame-to-frame correlation:
C(i,j) = corr(F_i, F_j)
where C(i,j) is the correlation coefficient between frames i and j.
- Significant Movement Detection: features are tracked using a statistical threshold:
T = μ(M) + 2σ(M)
where:
- T: movement threshold
- μ(M): mean flow magnitude
- σ(M): standard deviation of flow magnitude
- Resolution and Parameters
  - Frame resolution: 1024×1024 pixels
  - Feature vector dimension: 128
  - Gaussian blur σ: 1.0
  - Frame rate: 30 fps
  - Window size: 30 frames
- Processing Pipeline
  a. Raw spectrum acquisition
  b. m/z and intensity normalization
  c. Coordinate mapping
  d. Gaussian smoothing
  e. Feature detection
  f. Temporal integration
  g. Video generation
- Quality Metrics
  - Structural Similarity Index (SSIM)
  - Peak Signal-to-Noise Ratio (PSNR)
  - Feature stability across frames
  - Temporal consistency measures
This novel approach enables:
- Real-time visualization of spectral changes
- Pattern detection in complex MS data
- Intuitive interpretation of metabolomic profiles
- Enhanced feature detection through computer vision
- Temporal analysis of metabolite dynamics
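The movement threshold T = μ(M) + 2σ(M) is straightforward to apply once a flow field is available. In this sketch the field is synthetic; the pipeline would obtain fx and fy from Farneback optical flow instead.

```python
import numpy as np

def significant_movement_mask(fx, fy):
    """Flag pixels whose flow magnitude exceeds the mean + 2·std threshold."""
    magnitude = np.sqrt(fx**2 + fy**2)                   # M(x, y)
    threshold = magnitude.mean() + 2 * magnitude.std()   # T = μ(M) + 2σ(M)
    return magnitude > threshold

# Synthetic flow field: quiet background with one fast-moving region.
rng = np.random.default_rng(0)
fx = rng.normal(0.0, 0.1, size=(64, 64))
fy = rng.normal(0.0, 0.1, size=(64, 64))
fx[30:34, 30:34] += 5.0   # injected strong motion
mask = significant_movement_mask(fx, fy)
```

Only the injected 4×4 region exceeds the adaptive threshold here, which is the behavior the statistical cutoff is designed to give on mostly static frames.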
The system generates comprehensive analytical outputs organized in:
- Time Series Analysis (time_series/)
  - Chromatographic peak tracking
  - Retention time alignment
  - Intensity variation monitoring
- Feature Analysis (feature_analysis/)
  - Principal component analysis
  - Feature clustering
  - Pattern recognition results
- Interactive Dashboards (interactive_dashboards/)
  - Real-time data exploration
  - Dynamic filtering capabilities
  - Interactive peak annotation
- Publication Quality Figures (publication_figures/)
  - High-resolution spectral plots
  - Statistical analysis visualizations
  - Comparative analysis figures
The dual-pipeline approach shows strong synergistic effects:
- Feature Comparison: Multiple validation scores [1.0, 0.999, 0.999, 0.999, 0.932, 1.0] across different aspects
- Vision Analysis: Robust performance in both noise resistance (0.914) and temporal analysis (0.936)
- Annotation Synergy: While numerical pipeline excels in accuracy, visual pipeline provides complementary insights
For detailed information about our validation approach and complete results, please refer to:
- Visualization Documentation - Comprehensive analysis framework
- validation_results/ - Raw validation data and metrics
- validation_visualizations/ - Interactive visualizations and temporal analysis
- assets/analytical_visualizations/ - Detailed analytical outputs
lavoisier/
├── pyproject.toml # Project metadata and dependencies
├── LICENSE # Project license
├── README.md # This file
├── docs/ # Documentation
│ ├── user_guide.md # User documentation
│ └── developer_guide.md # Developer documentation
├── lavoisier/ # Main package
│ ├── __init__.py # Package initialization
│ ├── cli/ # Command-line interface
│ │ ├── __init__.py
│ │ ├── app.py # CLI application entry point
│ │ ├── commands/ # CLI command implementations
│ │ └── ui/ # Terminal UI components
│ ├── core/ # Core functionality
│ │ ├── __init__.py
│ │ ├── metacognition.py # Orchestration layer
│ │ ├── config.py # Configuration management
│ │ ├── logging.py # Logging utilities
│ │ └── ml/ # Machine learning components
│ │ ├── __init__.py
│ │ ├── models.py # ML model implementations
│ │ └── MSAnnotator.py # MS2 annotation engine
│ ├── numerical/ # Numerical pipeline
│ │ ├── __init__.py
│ │ ├── processing.py # Data processing functions
│ │ ├── pipeline.py # Main pipeline implementation
│ │ ├── ms1.py # MS1 spectra analysis
│ │ ├── ms2.py # MS2 spectra analysis
│ │ ├── ml/ # Machine learning components
│ │ │ ├── __init__.py
│ │ │ ├── models.py # ML model definitions
│ │ │ └── training.py # Training utilities
│ │ ├── distributed/ # Distributed computing
│ │ │ ├── __init__.py
│ │ │ ├── ray_utils.py # Ray integration
│ │ │ └── dask_utils.py # Dask integration
│ │ └── io/ # Input/output operations
│ │ ├── __init__.py
│ │ ├── readers.py # File format readers
│ │ └── writers.py # File format writers
│ ├── visual/ # Visual pipeline
│ │ ├── __init__.py
│ │ ├── conversion.py # Spectra to visual conversion
│ │ ├── processing.py # Visual processing
│ │ ├── video.py # Video generation
│ │ └── analysis.py # Visual analysis
│ ├── llm/ # LLM integration
│ │ ├── __init__.py
│ │ ├── api.py # API for LLM communication
│ │ ├── ollama.py # Ollama integration
│ │ ├── commercial.py # Commercial LLM integrations
│ │ └── query_gen.py # Query generation
│ ├── models/ # Model repository
│ │ ├── __init__.py
│ │ ├── repository.py # Model management
│ │ ├── distillation.py # Knowledge distillation
│ │ └── versioning.py # Model versioning
│ └── utils/ # Utility functions
│ ├── __init__.py
│ ├── helpers.py # General helpers
│ └── validation.py # Validation utilities
├── tests/ # Tests
│ ├── __init__.py
│ ├── test_numerical.py
│ ├── test_visual.py
│ ├── test_llm.py
│ └── test_cli.py
└── examples/ # Example workflows
├── basic_analysis.py
├── distributed_processing.py
├── llm_assisted_analysis.py
└── visual_analysis.py
```bash
pip install lavoisier
```
For development installation:
```bash
git clone https://github.com/username/lavoisier.git
cd lavoisier
pip install -e ".[dev]"
```
Process a single MS file:
```bash
lavoisier process --input sample.mzML --output results/
```
Run with LLM assistance:
```bash
lavoisier analyze --input sample.mzML --llm-assist
```