Commit 234959c

Merge pull request #20 from shenwei356/0.8.0
0.8.0
2 parents 285d955 + ee933dd commit 234959c

56 files changed, 138 additions and 94 deletions

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
 # Changelog
 
-### v0.8.0 - 2025-xx-xx
+### v0.8.0 - 2025-09-10
+
+No changes to the index format (see [Index format changelog](https://bioinf.shenwei.me/LexicMap/tutorials/index/#index-format-changelog)).
 
 - New commands:
 - **`lexicmap utils merge-search-results`: Merge a query's search results from multiple indexes**.

README.md

Lines changed: 12 additions & 14 deletions
@@ -1,4 +1,4 @@
-## <a href="https://bioinf.shenwei.me/LexicMap"><img src="logo.svg" width="30"/></a> LexicMap: efficient sequence alignment against millions of prokaryotic genomes
+## <a href="https://bioinf.shenwei.me/LexicMap"><img src="logo.svg" width="36"/></a> LexicMap: efficient sequence alignment against millions of prokaryotic genomes
 
 [![Latest Version](https://img.shields.io/github/release/shenwei356/LexicMap.svg?style=flat?maxAge=86400)](https://github.com/shenwei356/LexicMap/releases)
 [![Anaconda Cloud](https://anaconda.org/bioconda/lexicmap/badges/version.svg)](https://anaconda.org/bioconda/lexicmap)
@@ -11,16 +11,14 @@ Documents: https://bioinf.shenwei.me/LexicMap
 
 For the latest features and improvements, please download the [pre-release binaries](https://github.com/shenwei356/LexicMap/issues/10).
 
-Preprint:
+Please cite:
 
 > Wei Shen, John A. Lees, Zamin Iqbal.
-> (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes.
-> bioRxiv. [https://doi.org/10.1101/2024.08.30.610459](https://doi.org/10.1101/2024.08.30.610459)
+> (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap.
+> Nature Biotechnology. [https://doi.org/10.1038/s41587-025-02812-8](https://doi.org/10.1038/s41587-025-02812-8)
 
 ## Table of contents
 
-- [ LexicMap: efficient sequence alignment against millions of prokaryotic genomes](#-lexicmap-efficient-sequence-alignment-against-millions-of-prokaryotic-genomes)
-- [Table of contents](#table-of-contents)
 - [Features](#features)
 - [Introduction](#introduction)
 - [Quick start](#quick-start)
@@ -37,19 +35,19 @@ Preprint:
 ## Features
 
 1. **The accuracy of LexicMap is comparable with Blastn, MMseqs2, and Minimap2**. It
-- performs **base-level alignment**, with `qcovGnm`, `qcovHSP`, `pident`, `evalue` and `bitscore` returned,
+- **performs base-level alignment**, with `qcovGnm`, `qcovHSP`, `pident`, `evalue` and `bitscore` returned,
 both in TSV and pairwise alignment format ([output format](https://bioinf.shenwei.me/LexicMap/tutorials/search/#output)).
 - provides a genome-wide query coverage metric (`qcovGnm`),
 which enables accurate interpretation of search results - particularly for [circular queries (such as plasmid, virus, and mtDNA)](https://bioinf.shenwei.me/LexicMap/tutorials/search/#searching-with-plasmids-or-other-longer-queries)
 against both complete and fragmented assemblies.
-- returns all possible matches, including multiple copies of a gene in a genome.
+- **returns all possible matches**, including multiple copies of a gene in a genome.
 1. **The alignment is fast and memory-efficient, scalable to up to millions of prokaryotic genomes**.
 1. LexicMap is **easy to [install](http://bioinf.shenwei.me/LexicMap/installation/),
 we provide [binary files](https://github.com/shenwei356/LexicMap/releases/)** with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs).
 2. LexicMap is **easy to use** (see [tutorials](http://bioinf.shenwei.me/LexicMap/tutorials/index/), [usages](http://bioinf.shenwei.me/LexicMap/usage/lexicmap/), and [FAQs](https://bioinf.shenwei.me/LexicMap/faqs/)).
 - [Database building](https://bioinf.shenwei.me/LexicMap/tutorials/index/) requires only a simple command, accepting input from files, a file list, or even a directory.
 - [Sequence searching](https://bioinf.shenwei.me/LexicMap/tutorials/search/) supports limiting search by TaxId(s), provides a progress bar.
-- [Several utility commands](https://bioinf.shenwei.me/LexicMap/usage/utils/) are available to resume unfinished indexing, and explore the index data, extract indexed subsequences.
+- [Several utility commands](https://bioinf.shenwei.me/LexicMap/usage/utils/) are available to resume unfinished indexing, explore the index data, merge search results, extract matched subsequences and more.
 
 ## Introduction
 
@@ -76,7 +74,7 @@ However, given the increasing rate at which genomes are sequenced, **existing to
 1. LexicMap enables efficient indexing and searching of both RefSeq+GenBank and the [AllTheBacteria](https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1) datasets (**2.3 and 1.9 million prokaryotic assemblies** respectively).
 1. When searching in all **2,340,672 Genbank+Refseq prokaryotic genomes**, *Blastn is unable to run with this dataset on common servers as it requires >2000 GB RAM*. (see [performance](#performance)).
 
-**With LexicMap v0.7.0** (48 CPUs),
+**With LexicMap v0.7.0** (48 CPUs, indexes and queries in HDDs),
 
 |Query               |Genome hits|Genome hits<br/>(high-similarity)|Genome hits<br/>(medium-similarity)|Genome hits<br/>(low-similarity)|Time       |RAM     |
 |:-------------------|----------:|--------------------------------:|----------------------------------:|-------------------------------:|----------:|-------:|
@@ -90,8 +88,8 @@ However, given the increasing rate at which genomes are sequenced, **existing to
 1. Only the best alignment of a genome is used to evaluate alignment similarity:
 - high-similarity: (a) qcov >= 90% (genes) or 70% (plasmids), (b) pident>=90%.
 - medium-similarity: (a) not belong to high-similarity, (b) qcov >= 50% (genes) or 30% (plasmids), (c) pident>=80%.
-- low-similarity: left.
-1. The search time varies in different computing environments and mainly depends on the I/O speed.
+- low-similarity: the remaining.
+1. The search time varies in different computing environments and mainly depends on the I/O speed and the number of threads.
 
 
 More documents: https://bioinf.shenwei.me/LexicMap.
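Reviewer note: the similarity tiers described in the README note in this hunk amount to a small decision rule applied to the best alignment of each genome. The sketch below is only an illustration of that rule, not LexicMap code: the function name `classify_hit` and the `query_type` parameter are hypothetical, and the README does not say which coverage column (`qcovGnm` or `qcovHSP`) the `qcov` thresholds refer to, so the caller has to choose.

```python
def classify_hit(qcov: float, pident: float, query_type: str = "gene") -> str:
    """Classify the best alignment of a genome into a similarity tier.

    qcov and pident are percentages (0-100).
    query_type is "gene" or "plasmid" (hypothetical labels for this sketch).
    """
    # Coverage thresholds per query type, as stated in the README note.
    high_qcov = {"gene": 90.0, "plasmid": 70.0}
    medium_qcov = {"gene": 50.0, "plasmid": 30.0}

    if query_type not in high_qcov:
        raise ValueError(f"unknown query type: {query_type!r}")

    # high-similarity: enough coverage and pident >= 90%
    if qcov >= high_qcov[query_type] and pident >= 90.0:
        return "high"
    # medium-similarity: not high, lower coverage bar, pident >= 80%
    if qcov >= medium_qcov[query_type] and pident >= 80.0:
        return "medium"
    # low-similarity: everything that remains
    return "low"
```

For example, a gene hit with 75% coverage and 95% identity falls into the medium tier, while the same values for a plasmid query count as high.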
@@ -238,8 +236,8 @@ See the [paper](#citation) for details.
 ## Citation
 
 > Wei Shen, John A. Lees, Zamin Iqbal.
-> (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes.
-> bioRxiv. [https://doi.org/10.1101/2024.08.30.610459](https://doi.org/10.1101/2024.08.30.610459)
+> (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap.
+> Nature Biotechnology. [https://doi.org/10.1038/s41587-025-02812-8](https://doi.org/10.1038/s41587-025-02812-8)
 
 ## Limitations
 
docs/content/_index.md

Lines changed: 4 additions & 1 deletion
@@ -56,6 +56,9 @@ Step 2: searching
 
 lexicmap search -d db.lmi q.fasta -o r.tsv
 
+[10 utility commands](https://bioinf.shenwei.me/LexicMap/usage/utils/) are available
+to explore the index data, merge search results, extract matched subsequences and more.
+
 {{< button size="small" relref="tutorials/index" >}}Tutorials{{< /button >}}
 {{< button size="small" relref="usage/lexicmap" >}}Usages{{< /button >}}
 {{< button size="small" relref="faqs" >}}FAQs{{< /button >}}
@@ -64,7 +67,7 @@ Step 2: searching
 
 ### Accurate and efficient alignment
 
-Using LexicMap to align in the whole **2,340,672** Genbank+Refseq prokaryotic genomes with 48 CPUs.
+Using LexicMap v0.7.0 to align against the whole **2,340,672** Genbank+Refseq prokaryotic genomes with 48 CPUs.
 
 |Query            |Genome hits|Time   |RAM(GB)|
 |:----------------|----------:|------:|------:|

docs/content/installation/_index.md

Lines changed: 6 additions & 6 deletions
@@ -36,8 +36,8 @@ Linux and MacOS (both x86 and arm CPUs) are supported.
 
 |OS     |Arch      |File, 中国镜像                                                                                                                                                                                                                  |
 |:------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|Linux  |**64-bit**|[**lexicmap_linux_amd64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_linux_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_linux_amd64.tar.gz)                  |
-|Linux  |arm64     |[**lexicmap_linux_arm64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_linux_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_linux_arm64.tar.gz)                  |
+|Linux  |**64-bit**|[**lexicmap_linux_amd64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.8.0/lexicmap_linux_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_linux_amd64.tar.gz)                  |
+|Linux  |arm64     |[**lexicmap_linux_arm64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.8.0/lexicmap_linux_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_linux_arm64.tar.gz)                  |
 
 2. Decompress it:
 
@@ -70,8 +70,8 @@ Linux and MacOS (both x86 and arm CPUs) are supported.
 
 |OS     |Arch      |File, 中国镜像                                                                                                                                                                                                                  |
 |:------|:---------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-|macOS  |64-bit|[**lexicmap_darwin_amd64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_darwin_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_darwin_amd64.tar.gz)              |
-|macOS  |**arm64**     |[**lexicmap_darwin_arm64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_darwin_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_darwin_arm64.tar.gz)              |
+|macOS  |64-bit    |[**lexicmap_darwin_amd64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.8.0/lexicmap_darwin_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_darwin_amd64.tar.gz)              |
+|macOS  |**arm64** |[**lexicmap_darwin_arm64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.8.0/lexicmap_darwin_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_darwin_arm64.tar.gz)              |
 
 2. Copy it to any directory in the environment variable `PATH`:
 
@@ -96,7 +96,7 @@ Linux and MacOS (both x86 and arm CPUs) are supported.
 
 |OS     |Arch      |File, 中国镜像                                                                                                                                                                                                                  |
 |:------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|FreeBSD|**64-bit**|[**lexicmap_freebsd_amd64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_freebsd_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_freebsd_amd64.tar.gz)            |
+|FreeBSD|**64-bit**|[**lexicmap_freebsd_amd64.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.8.0/lexicmap_freebsd_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_freebsd_amd64.tar.gz)            |
 
 {{< /tab >}}
 
@@ -108,7 +108,7 @@ Linux and MacOS (both x86 and arm CPUs) are supported.
 
 |OS     |Arch      |File, 中国镜像                                                                                                                                                                                                                  |
 |:------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|Windows|**64-bit**|[**lexicmap_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_windows_amd64.exe.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_windows_amd64.exe.tar.gz)|
+|Windows|**64-bit**|[**lexicmap_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/LexicMap/releases/download/v0.8.0/lexicmap_windows_amd64.exe.tar.gz), [中国镜像](http://app.shenwei.me/data/lexicmap/lexicmap_windows_amd64.exe.tar.gz)|
 
 
 2. Decompress it.

docs/content/introduction/_index.md

Lines changed: 12 additions & 13 deletions
@@ -11,19 +11,18 @@ weight: 10
 
 LexicMap is a **nucleotide sequence alignment** tool for efficiently querying **gene, plasmid, viral, or long-read sequences (>150 bp)** against up to **millions of prokaryotic genomes**.
 
+Source code: https://github.com/shenwei356/LexicMap
 
 For the latest features and improvements, please download the [pre-release binaries](https://github.com/shenwei356/LexicMap/issues/10).
 
-Preprint:
+Please cite:
 
 > Wei Shen, John A. Lees, Zamin Iqbal.
-> (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes.
-> bioRxiv. [https://doi.org/10.1101/2024.08.30.610459](https://doi.org/10.1101/2024.08.30.610459)
-
+> (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap.
+> Nature Biotechnology. [https://doi.org/10.1038/s41587-025-02812-8](https://doi.org/10.1038/s41587-025-02812-8)
 
 ## Table of contents
 
-- [Table of contents](#table-of-contents)
 - [Features](#features)
 - [Introduction](#introduction)
 - [Quick start](#quick-start)
@@ -40,19 +39,19 @@ Preprint:
 ## Features
 
 1. **The accuracy of LexicMap is comparable with Blastn, MMseqs2, and Minimap2**. It
-- performs **base-level alignment**, with `qcovGnm`, `qcovHSP`, `pident`, `evalue` and `bitscore` returned,
+- **performs base-level alignment**, with `qcovGnm`, `qcovHSP`, `pident`, `evalue` and `bitscore` returned,
 both in TSV and pairwise alignment format ([output format](https://bioinf.shenwei.me/LexicMap/tutorials/search/#output)).
 - provides a genome-wide query coverage metric (`qcovGnm`),
 which enables accurate interpretation of search results - particularly for [circular queries (such as plasmid, virus, and mtDNA)](https://bioinf.shenwei.me/LexicMap/tutorials/search/#searching-with-plasmids-or-other-longer-queries)
 against both complete and fragmented assemblies.
-- returns all possible matches, including multiple copies of a gene in a genome.
+- **returns all possible matches**, including multiple copies of a gene in a genome.
 1. **The alignment is fast and memory-efficient, scalable to up to millions of prokaryotic genomes**.
 1. LexicMap is **easy to [install](http://bioinf.shenwei.me/LexicMap/installation/),
 we provide [binary files](https://github.com/shenwei356/LexicMap/releases/)** with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs).
 2. LexicMap is **easy to use** (see [tutorials](http://bioinf.shenwei.me/LexicMap/tutorials/index/), [usages](http://bioinf.shenwei.me/LexicMap/usage/lexicmap/), and [FAQs](https://bioinf.shenwei.me/LexicMap/faqs/)).
 - [Database building](https://bioinf.shenwei.me/LexicMap/tutorials/index/) requires only a simple command, accepting input from files, a file list, or even a directory.
 - [Sequence searching](https://bioinf.shenwei.me/LexicMap/tutorials/search/) supports limiting search by TaxId(s), provides a progress bar.
-- [Several utility commands](https://bioinf.shenwei.me/LexicMap/usage/utils/) are available to resume unfinished indexing, and explore the index data, extract indexed subsequences.
+- [Several utility commands](https://bioinf.shenwei.me/LexicMap/usage/utils/) are available to resume unfinished indexing, explore the index data, merge search results, extract matched subsequences and more.
 
 ## Introduction
 
@@ -79,7 +78,7 @@ However, given the increasing rate at which genomes are sequenced, **existing to
 1. LexicMap enables efficient indexing and searching of both RefSeq+GenBank and the [AllTheBacteria](https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1) datasets (**2.3 and 1.9 million prokaryotic assemblies** respectively).
 1. When searching in all **2,340,672 Genbank+Refseq prokaryotic genomes**, *Blastn is unable to run with this dataset on common servers as it requires >2000 GB RAM*. (see [performance](#performance)).
 
-**With LexicMap v0.7.0** (48 CPUs),
+**With LexicMap v0.7.0** (48 CPUs, indexes and queries in HDDs),
 
 |Query               |Genome hits|Genome hits<br/>(high-similarity)|Genome hits<br/>(medium-similarity)|Genome hits<br/>(low-similarity)|Time       |RAM     |
 |:-------------------|----------:|--------------------------------:|----------------------------------:|-------------------------------:|----------:|-------:|
@@ -93,8 +92,8 @@ However, given the increasing rate at which genomes are sequenced, **existing to
 1. Only the best alignment of a genome is used to evaluate alignment similarity:
 - high-similarity: (a) qcov >= 90% (genes) or 70% (plasmids), (b) pident>=90%.
 - medium-similarity: (a) not belong to high-similarity, (b) qcov >= 50% (genes) or 30% (plasmids), (c) pident>=80%.
-- low-similarity: left.
-1. The search time varies in different computing environments and mainly depends on the I/O speed.
+- low-similarity: the remaining.
+1. The search time varies in different computing environments and mainly depends on the I/O speed and the number of threads.
 
 ## Quick start
 
@@ -238,8 +237,8 @@ See the [paper](#citation) for details.
 ## Citation
 
 > Wei Shen, John A. Lees, Zamin Iqbal.
-> (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes.
-> bioRxiv. [https://doi.org/10.1101/2024.08.30.610459](https://doi.org/10.1101/2024.08.30.610459)
+> (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap.
+> Nature Biotechnology. [https://doi.org/10.1038/s41587-025-02812-8](https://doi.org/10.1038/s41587-025-02812-8)
 
 ## Limitations
 