Metadata-Version: 2.4
Name: spestimator
Version: 0.3.0.233
Summary: A tool to predict bacterial species from fasta files using RefSeq 16S.
Author-email: Erin Young <eriny@utah.gov>
License: MIT License
        
        Copyright (c) 2024 Young
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/erinyoung/Spestimator
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: ncbi-datasets-pyclient
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

<div align="center">
  <img src="assets/logo.png" width="200" alt="Spestimator Logo">
  <h1>Spestimator</h1>
</div>

![PyPI - Version](https://img.shields.io/pypi/v/Spestimator)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/Spestimator)
![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/erinyoung/Spestimator/test.yml?label=tests)
![GitHub License](https://img.shields.io/github/license/erinyoung/Spestimator)

**Species Estimator & Genome Retriever**

Species + Estimator = Spestimator

Spestimator is a lightweight Python command-line tool designed to quickly identify bacterial species from input FASTA sequences (contigs or reads) and automatically retrieve their corresponding reference genomes from NCBI RefSeq.


## Why Spestimator?

I needed a tool that could:
1. Quickly estimate a range of potential reference genomes for an input sample.
2. Use a database maintained by a trustworthy, authoritative group (NCBI RefSeq) rather than maintaining a custom, static database that gets outdated.

---

## Installation

### From Pip/From Source
Spestimator requires [NCBI BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html#downloadblastdata) to be installed and available in your system path.

```bash
sudo apt-get install ncbi-blast+
```

```bash
# installation from pypi
pip install spestimator
```

```bash
# from source
git clone https://github.com/erinyoung/Spestimator.git
cd Spestimator
pip install .
```

### From Conda (includes blast)
```bash
conda install -c bioconda spestimator
```




---

## Quick Start

Identify organisms in your FASTA file(s)

```bash
spestimator -i *.fasta -o results.csv
```

Identify organisms and download the matched RefSeq genomes to a folder:

```bash
spestimator -i *.fasta -o results.csv -d genomes_dir/
```

### Updating the database

RefSeq updates quarterly. There is an attempt to keep this package on a similar update schedule, but this may not be feasible. To get a current version of the database.

```bash
# downloads https://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Bacteria/bacteria.16SrRNA.fna.gz for blast database
# uses eutils to get taxids for blast database (NCBI api-key will speed this step up)
# downloads https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt for refseq accesions
spestimator --update-db --db-dir database --api-key NCBI_API_KEY
```
This database can then be used by

```bash
spestimator -i *.fasta -o results.csv -d genomes_dir/ --db-dir database
```

---

## Usage & Options

```
usage: spestimator [-h] [-v] [-i INPUT [INPUT ...]] [-o OUTPUT] [-d [DIR]] [--db-dir DB_DIR] [--db-name DB_NAME] [-u] [--api-key API_KEY] [-t THREADS]
                   [--max-target-seqs MAX_TARGET_SEQS] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--min-hits MIN_HITS]
                   [--min-alignment-len MIN_ALIGNMENT_LEN] [--top-k-taxa TOP_K_TAXA]

Spestimator: Predict bacterial TaxIDs from 16S and download genomes.

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i, --input INPUT [INPUT ...]
                        Input FASTA files
  -o, --output OUTPUT   Output CSV file
  -d, --download-genomes [DIR]
                        Download found genomes. Defaults to 'genomes/' if flag is used without a path.
  --db-dir DB_DIR       Override path to BLAST database directory
  --db-name DB_NAME     Custom name for the database to appear in results (Default: DB filename)
  -u, --update-db       Download database and generate metadata
  --api-key API_KEY     NCBI API Key (Speeds up metadata generation)
  -t, --threads THREADS
                        BLAST threads

Filtering Options:
  --max-target-seqs MAX_TARGET_SEQS
                        BLAST: Hits to keep per read (Default: 10)
  --min-identity MIN_IDENTITY
                        Filter: Minimum Percent Identity (0-100). Default: 90.0
  --min-coverage MIN_COVERAGE
                        Filter: Minimum Query Coverage (0-100). Default: 0.0
  --min-hits MIN_HITS   Filter: Minimum reads required to report an organism
  --min-alignment-len MIN_ALIGNMENT_LEN
                        Filter: Minimum Alignment Length in bp (Default: 0/No Filter)
  --top-k-taxa TOP_K_TAXA
                        Report: Only keep the top K unique organisms per file (Default: 10)
```

## How It Works
- Runs [blastn](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html#downloadblastdata) of your input against the 16S database in the repo.
- Filters hits based on identity and coverage.
- Aggregates hits to identify the most likely species present.
- (Optional): Uses [ncbi-datasets-pyclient](https://github.com/misialq/ncbi-datasets-pyclient) to download the reference genome assembly (`GCF_xxxx`) associated with the identified species.

---

## Output Format

The results CSV contains the following columns:

- **organism:** The clean species name (e.g., *E. coli*).
- **taxid:** NCBI Taxonomy ID for the organism.
- **species_taxid:** NCBI Taxonomy ID for the species of the organism.
- **refseq_assembly_accession:** The Assembly ID (e.g., `GCF_000005845.2`) used for downloading.
- **sacc:** The specific 16S sequence accession hit.
- **count:** Number of input sequences matching this organism.
- **avg_pident:** Average percent identity of the matches.
- **avg_bitscore:** A metric combining match quality and length.

## Example

```bash
spestimator -i tests/sample_positive.fasta
```

The results should look like this

```
input file,sacc,organism,refseq_assembly_accession,taxid,species_taxid,count,total_bitscore,avg_bitscore,avg_pident,max_pident,avg_qcov,best_evalue
sample_positive.fasta,NR_112088,Streptococcus pyogenes,GCF_900475035.1,1314,1314,1,2772,2772.0,100.0,100.0,33.25947263461113,0.0
sample_positive.fasta,NR_115729,Streptococcus canis,GCF_900636575.1,1329,1329,1,2545,2545.0,98.085,98.085,32.395302459561265,0.0
sample_positive.fasta,NR_024633,Streptococcus canis,GCF_900636575.1,1329,1329,1,2536,2536.0,97.202,97.202,33.25947263461113,0.0
sample_positive.fasta,NR_037101,Streptococcus urinalis,GCF_900636885.1,149016,149016,1,2534,2534.0,97.135,97.135,33.25947263461113,0.0
sample_positive.fasta,NR_043661,Streptococcus dysgalactiae subsp. equisimilis,GCF_016128095.1,119602,1334,1,2534,2534.0,97.443,97.443,32.92709949036118,0.0
sample_positive.fasta,NR_025148,Streptococcus iniae,GCF_000831485.1,1346,1346,1,2532,2532.0,96.422,96.422,34.05716818081099,0.0
sample_positive.fasta,NR_115802,Streptococcus ictaluri 707-05,GCF_000188015.2,764299,380397,1,2523,2523.0,97.627,97.627,32.68335918457789,0.0
sample_positive.fasta,NR_178901,Streptococcus penaeicida,GCF_053204055.1,1765960,1765960,1,2514,2514.0,96.217,96.217,33.96853534234434,0.0
sample_positive.fasta,NR_199923,Streptococcus tangpeifui,GCF_044503215.1,2709400,2709400,1,2494,2494.0,95.858,95.858,34.23443385774429,0.0
sample_positive.fasta,NR_040821,Streptococcus agalactiae ATCC 13813,GCF_001552035.1,888745,1311,1,2484,2484.0,96.536,96.536,33.25947263461113,0.0
```

## Disclaimer: NCBI Datasets

Spestimator relies on [ncbi-datasets-pyclient](https://github.com/misialq/ncbi-datasets-pyclient) to retrieve genomes. This library updates frequently, as does the underlying [datasets](https://github.com/ncbi/datasets) tool. If you encounter unexpected errors during genome download (e.g., `ApiException` or connection drops), please [submit an issue](https://github.com/erinyoung/Spestimator/issues) and use the accessions in the "refseq_accession" column to download with [datasets](https://github.com/ncbi/datasets) separately.

```bash
# download datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets && chmod +x datasets

# get a list of accessions
cut -f 4 -d , results.csv | grep GCF > id_list.txt

# use the list of accessions to download genomes
./datasets download genome accession --inputfile id_list.txt --filename ncbi_dataset.zip

# decompress file and use for additional analysis
unzip ncbi_dataset.zip
```

## AI Usage

Spestimator began as a set of custom shell and Python scripts used for ad-hoc analysis. To make these tools more reliable and accessible, Google's Gemini was used to accelerate the transition into a Python package. Gemini assisted in modularizing the codebase, replacing parsing logic with Pandas operations, and implementing a "mocked" testing suite that allows for safe CI/CD without hitting NCBI servers. Gemini also created the logo.
