tomarovsky · Zubiriguitar · Nov 10, 2024 · Mar 16, 2025 · Mar 20, 2025 · Mar 21, 2025
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,3 @@
+{
+    "editor.renderWhitespace": "all"
+}
diff --git a/README.md b/README.md
@@ -3,83 +3,124 @@
 [![Snakemake](https://img.shields.io/badge/snakemake-<8.0-brightgreen.svg)](https://snakemake.github.io)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
-## Description
+# BuscoClade Pipeline
 
-Pipeline to construct species phylogenies using [BUSCO](https://busco.ezlab.org/).
+**BuscoClade** is a **Snakemake**-based workflow that constructs species phylogenies using *BUSCO* (Benchmarking Universal Single-Copy Orthologs). It runs a sequence of analysis stages (input preparation, BUSCO on each genome, multiple-sequence alignment (MSA), trimming, and tree building) to produce a final phylogenetic tree and related visualizations. BuscoClade is organized as modular Snakemake rules, each corresponding to one step (e.g. alignment, filtering, tree inference), and Snakemake infers the dependencies between rules from their input and output files. With Snakemake and Conda, the pipeline is reproducible and can run locally or on an HPC cluster.
 
-![Workflow scheme](./workflow.png)
+Key components and stages of the pipeline include:
 
-- Alignment: [PRANK](http://wasabiapp.org/software/prank/), [MAFFT](https://mafft.cbrc.jp/alignment/software/).
-- Trimming: [GBlocks](https://academic.oup.com/mbe/article/17/4/540/1127654), [TrimAl](http://trimal.cgenomics.org/).
-- Phylogenetic tree constraction: [IQTree](http://www.iqtree.org/), [MrBayes](https://nbisweden.github.io/MrBayes/), [ASTRAL III](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2129-y), [RapidNJ](https://birc.au.dk/software/rapidnj), [PHYLIP](https://phylipweb.github.io/phylip/).
-- Visualization: [Etetoolkit](http://etetoolkit.org/), [Matplotlib](https://matplotlib.org/stable/).
+* **Input Preparation:** Genome assemblies (FASTA files) and optionally variant data (VCF files) are placed in designated input directories.
+* **[BUSCO](https://busco.ezlab.org/) Analysis:** Each genome is analyzed with BUSCO to identify conserved single-copy orthologous genes. BuscoClade can use either the *MetaEuk* or *AUGUSTUS* gene predictor (configurable). The pipeline collects single-copy BUSCO gene IDs that are common across all species.
+* **Sequence Extraction:** BUSCO outputs are parsed to extract the DNA or protein sequences of the common single-copy genes from each genome.
+* **Alignment:** Extracted gene sequences are aligned across species. The pipeline supports [PRANK](http://wasabiapp.org/software/prank/) or [MAFFT](https://mafft.cbrc.jp/alignment/software/) for alignment; users select by configuration (e.g. `dna_alignment: 'prank'` or `'mafft'`).
+* **Filtering/Trimming:** Alignments are cleaned of poorly aligned or gap-rich regions. BuscoClade supports [GBlocks](https://academic.oup.com/mbe/article/17/4/540/1127654) or [TrimAl](http://trimal.cgenomics.org/) for trimming (configurable). This reduces noise before tree inference.
+* **Concatenation:** Individual gene alignments are concatenated into a supermatrix (combined alignment). A concatenated alignment file (FASTA, PHYLIP, etc.) is produced for downstream analyses.
+* **Phylogenetic Inference:** The pipeline can infer phylogenies using multiple methods:
 
-## Usage
+  * **[IQ-TREE](http://www.iqtree.org/) (Maximum Likelihood):** Fast ML tree reconstruction with model testing and bootstrapping.
+  * **[ASTRAL III](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2129-y) (Coalescent):** Builds a species tree from gene trees (produced by IQ-TREE) using the ASTRAL algorithm for coalescent-based inference.
+  * **[RapidNJ](https://birc.au.dk/software/rapidnj) (Neighbor Joining):** Quick distance-based tree (with bootstraps) from the concatenated alignment.
+  * **[PHYLIP](https://phylipweb.github.io/phylip/) (Neighbor Joining/UPGMA):** Tree building via the classic PHYLIP package.
+  * **[RAxML-NG](https://github.com/amkozlov/raxml-ng) (Maximum Likelihood):** Alternative ML tree builder.
+  * **[MrBayes](https://nbisweden.github.io/MrBayes/) (Bayesian MCMC):** Optional Bayesian inference (requires more time).
+* **VCF-Based Alternative:** Instead of BUSCO, BuscoClade can operate on aligned variant data. Provided multi-sample VCFs can be split per sample and converted to a phylogenetic alignment (via [vcf2phylip](https://github.com/edgardomortiz/vcf2phylip)), enabling phylogeny from SNP data.
+* **Visualization:** Final trees are visualized ([Etetoolkit](http://etetoolkit.org/)) and saved as image files. The pipeline also produces summary plots such as BUSCO completeness histograms and charts of ortholog presence using [Matplotlib](https://matplotlib.org/stable/) scripts.
 
-### Step 1. Deploy workflow
 
-To use this workflow, you can either download and extract the [latest release](https://github.com/tomarovsky/BuscoClade/releases) or clone the repository:
 
-```
-git clone https://github.com/tomarovsky/BuscoClade.git
-```
 
-### Step 2. Add species genomes
+## Quick Start Guide
 
-Place your unpacked FASTA genome assemblies into the `genomes/` directory. Keep in mind that the file prefixes will influence the output phylogeny. Ensure that your files have a `.fasta` extension.
+Follow these steps to run BuscoClade on your data. This guide assumes basic familiarity with the command line.
 
-### Step 3. Configure workflow
+1. **Clone the Repository:**
 
-To set up the workflow, modify `config/default.yaml`. I recommend to copy config gile and do all modifications in this copy. Some of the options (all nonested options from default.yaml) could also be set via command line using `--config` flag. Sections of config file:
+   ```bash
+   git clone https://github.com/tomarovsky/BuscoClade.git
+   cd BuscoClade
+   ```
 
-- **Pipeline Configuration:**
-This section outlines the workflow. By default, it includes alignments and following filtration of nucleotide sequences, and all tools for phylogeny reconstruction, except for MrBayes (it is recommended to run the GPU compiled version separately). To disable a tool, set its value to `False` or comment out the corresponding line.
+2. **Install Requirements:**
 
-  **NB!** When constructing a phylogeny using the Neighbor-Joining (NJ) method with PHYLIP, ensure that the first 10 characters of each species name are unique and distinct from one another.
+   BuscoClade relies on Snakemake and various bioinformatics tools. It is recommended to use Conda to install dependencies. You can either install Snakemake and tools globally or let Snakemake create environments. For example:
 
-- **Tool Parameters:**
-Specify parameters for each tool. To perform BUSCO, it is important to specify:
-  - `busco_dataset_path`: Download the BUSCO dataset beforehand and specify its path here.
-  - `busco_params`: Use the `--offline` flag and the `--download_path` parameter, indicating the path to the `busco_downloads/` directory.
+   ```bash
+   # (Optional) Create a new Conda environment with Snakemake
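+   # Snakemake is distributed through the bioconda channel; the <8 pin matches the badge above
+   conda create -n buscoclade_env -c conda-forge -c bioconda "snakemake<8" python=3.8 -y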
+   conda activate buscoclade_env
 
-- **Directory structure:**
-Define output file structure in the `results/` directory. It is recommended to leave it unchanged.
+   # Or install Snakemake and Mamba/Conda globally as needed
+   ```
 
-- **Resources:**
-Specify Slurm queue, threads, memory, and runtime for each tool.
+3. **Prepare Input Data:**
 
-### Step 4. Execute workflow
+   * Create the input directory structure: put your genome FASTA files in `input/genomes/`. Each file should have a `.fasta` extension. For example:
 
-For a dry run:
+     ```
+     input/genomes/
+         species_1_.fasta
+         ...
+         species_n_.fasta
+     ```
+   * (Optional) If you have variant data instead of raw genomes, place gzip-compressed VCF files (`.vcf.gz`) in `input/vcf_reconstruct/*any_folder*`, together with the reference FASTA file if you use GATK. The pipeline will split multi-sample VCFs into per-sample VCFs and reconstruct alignments from them. Genome FASTA input can also be combined with VCF input:
+
+     ```
+     input/genomes/
+         species_1_.fasta
+         ...
+         species_n_.fasta
+     input/vcf_reconstruct/*folder_1*
+         reference_1.fasta
+         vcf_file_1.vcf.gz
+         ...
+         vcf_file_m.vcf.gz
+     input/vcf_reconstruct/*folder_k*
+         reference_k.fasta
+         vcf_file_1.vcf.gz
+         ...
+         vcf_file_i.vcf.gz
+     ```
 
-```
-snakemake --profile profile/slurm/ --configfile config/default.yaml --dry-run
-```
+   * If you are using vcf2phylip, only a single VCF file can be placed in `input/vcf_reconstruct/*any_folder*`.
 
-Snakemake will print all the rules that will be executed. Remove `--dry-run` to initiate the actual run.
+4. **Download BUSCO Datasets:**
 
-**How to run the workflow if I have completed BUSCOs?**
+   BuscoClade requires a BUSCO lineage dataset. Download the appropriate BUSCO dataset (e.g. `saccharomycetes_odb10`) from the BUSCO website and note its path. Update the `busco_dataset_path` in the config (see next step).
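+
+   As a sketch of the matching config entries, assuming the dataset was downloaded into a `busco_downloads/` directory: the paths below are placeholders, and the `busco_params` key with `--offline` and `--download_path` is taken from the earlier version of this README, so check both keys against your `config/default.yaml`.
+
+   ```yaml
+   # Placeholder paths; point these at your own BUSCO download location.
+   busco_dataset_path: "/path/to/busco_downloads/lineages/saccharomycetes_odb10/"
+   # --offline stops BUSCO from trying to fetch datasets at run time.
+   busco_params: "--offline --download_path /path/to/busco_downloads/"
+   ```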
 
-First, move the genome assemblies to the ` genomes/` directory or create empty files with corresponding names. Then, create a `results/busco/` directory and move the BUSCO output directories into it. Note that BUSCO output must be formatted. Thus, for `Ailurus_fulgens.fasta` BUSCO output should look like this:
+5. **Configure the Workflow:**
 
-```
-results/
-    busco/
-        Ailurus_fulgens/
-            busco_sequences/
-                fragmented_busco_sequences/
-                multi_copy_busco_sequences/
-                single_copy_busco_sequences/
-            hmmer_output/
-            logs/
-            metaeuk_output/
-            full_table_Ailurus_fulgens.tsv
-            missing_busco_list_Ailurus_fulgens.tsv
-            short_summary_Ailurus_fulgens.txt
-            short_summary.json
-            short_summary.specific.mammalia_odb10.Ailurus_fulgens.json
-            short_summary.specific.mammalia_odb10.Ailurus_fulgens.txt
-```
+   * Open `config/default.yaml` in an editor. Adjust the parameters as needed (see next section for details). At minimum, set:
+
+     ```yaml
+     genome_dir: "input/genomes/"
+     busco_dataset_path: "/path/to/busco/saccharomycetes_odb10/"
+     output_dir: "results/"
+     ```
+   * You can leave many defaults unchanged on first run. The default config enables DNA alignment with PRANK, trimming with Gblocks, and tree inference with IQ-TREE, ASTRAL, RapidNJ, PHYLIP, and RAxML.
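+
+   For instance, switching the aligner from PRANK to MAFFT might look like the fragment below. The `dna_alignment` key is the one mentioned in the alignment stage above; treat the exact key name and accepted values as assumptions and confirm them in `config/default.yaml`.
+
+   ```yaml
+   # Use MAFFT instead of the default PRANK for DNA alignments.
+   dna_alignment: 'mafft'
+   ```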
+
+6. **Run the Pipeline:**
+
+   Use Snakemake to execute the workflow. For a local run (no cluster), a simple command is:
+
+   ```bash
+   snakemake --cores 4 --configfile config/default.yaml
+   ```
+
+   This tells Snakemake to use at most 4 CPU cores. To have Snakemake create and use the Conda environments specified by the rules, add the `--use-conda` flag, which helps reproducibility. Snakemake will execute the pipeline steps in dependency order.
+
+   If you have a SLURM cluster, you can use the provided profile:
+
+   ```bash
+   snakemake --profile profile/slurm --configfile config/default.yaml
+   ```
+
+   This submits jobs to SLURM using the settings in `profile/slurm/config.yaml`. Adjust that file (queues, time limits, etc.) if needed for your cluster.
+
+7. **Inspect Results:**
+
+   After completion, see the `results/` directory (or the directory you set). It will contain subdirectories for alignments, phylogenetic trees, and summary outputs (see **Output Description** below).
+
+Throughout execution, Snakemake will produce logs in the `logs/` directory (and cluster logs in `cluster_logs/`). To preview the jobs that would run without executing them, add `--dry-run` to the command. You can rerun specific steps by adding `-R <rule_name>`, which forces the named rules to run again.
 
 ## Contact