Metadata-Version: 2.4
Name: babappalign
Version: 1.4.0
Summary: Embedding-first deep learning multiple sequence alignment engine with affine-gap DP
Author: Krishnendu Sinha
License-Expression: MIT
Project-URL: Homepage, https://github.com/sinhakrishnendu/BABAPPAlign
Project-URL: Repository, https://github.com/sinhakrishnendu/BABAPPAlign
Project-URL: Issues, https://github.com/sinhakrishnendu/BABAPPAlign/issues
Project-URL: DOI, https://doi.org/10.5281/zenodo.17934124
Keywords: bioinformatics,multiple-sequence-alignment,protein-alignment,deep-learning,esm
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: <3.13,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: pandas
Requires-Dist: biopython
Requires-Dist: tqdm
Requires-Dist: torch>=1.12
Requires-Dist: transformers>=4.30
Requires-Dist: fair-esm
Dynamic: license-file

# BABAPPAlign

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17934124.svg)](https://doi.org/10.5281/zenodo.17934124)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.19034335.svg)](https://doi.org/10.5281/zenodo.19034335)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18053201.svg)](https://doi.org/10.5281/zenodo.18053201)

## Overview

BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine
for protein and coding nucleotide sequences.

It integrates pretrained protein language model embeddings with a learned neural
residue–residue scoring function within a classical, exact affine-gap dynamic
programming framework (Gotoh).

Current release: 1.4.0.

Version 1.4.0 adds automatic accelerator selection across CUDA, Apple Silicon
Metal/MPS, and CPU. BABAPPAlign now probes the available backends at runtime and
selects the first usable device in that order, falling back to CPU.

Native codon alignment mode, introduced in v1.2.0, allows direct CDS alignment
without requiring external PAL2NAL.

BABAPPAlign is fully functional on CPU-only systems.
CUDA and Apple Silicon Metal/MPS acceleration are optional and affect performance only, not correctness.

---

## Key Features

- Progressive multiple sequence alignment (MSA)
- Strict learned residue–residue scoring model (BABAPPAScore)
- Pretrained protein language model residue embeddings
- Column-aware profile scoring
- True affine-gap dynamic programming (Gotoh algorithm)
- Exact dynamic programming (no heuristics inside DP)
- Neural inference performed outside DP recursion
- Native codon alignment mode (CDS → translate → back-map)
- Automatic frame validation in codon mode
- CPU-only compatible
- Automatic `auto` device selection: CUDA → Apple Metal/MPS → CPU
- Optional manual device override with `--device {auto,cpu,cuda,mps}`
- Mandatory `babappascore.pt` model loading (no model override)
- Reproducible and Zenodo-backed model distribution

---

## Installation

Install from PyPI:

    pip install babappalign

BABAPPAlign remains fully functional on CPU-only systems.
If CUDA or Apple Silicon Metal/MPS support is available through PyTorch,
BABAPPAlign can use it automatically.

---

## Quick Start

### Protein alignment (default)

    babappalign input.fasta

Output:

    input.protein.aln.fasta

---

### Codon alignment

    babappalign cds.fasta --mode codon

Outputs:

    cds.protein.aln.fasta
    cds.codon.aln.fasta

No `-o` option is required.
Output filenames are generated automatically.

---

### Interactive mode (`--i`)

    babappalign --i

Prompts:

    Sequence FASTA file:
    Mode [protein/codon] (default: protein):

The scorer is always the required `babappascore.pt` model.

Without `--i`, BABAPPAlign runs in normal static CLI mode and expects
the FASTA path directly in the command line.

---

## Codon Mode Details

When `--mode codon` is enabled:

1. CDS sequences are validated:
   - Length divisible by 3
   - No internal stop codons
   - Valid nucleotide alphabet

2. Sequences are translated to protein.

3. Alignment is performed in protein space using the learned neural scoring model.

4. Aligned proteins are back-mapped to codon alignment (PAL2NAL-style logic).

Gap penalties are automatically scaled in codon mode for biological consistency.

No external PAL2NAL dependency is required.
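
The validate, translate, and back-map steps above can be illustrated with a minimal sketch. This is not BABAPPAlign's internal code: the function names `translate_cds` and `back_map` are hypothetical, the alphabet is restricted to `ACGT`, the standard genetic code is assumed, and a terminal stop codon is assumed to be dropped before alignment.

```python
from itertools import product

# Standard genetic code in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(cds):
    """Validate a CDS and return (protein, list_of_codons)."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not divisible by 3")
    if set(cds) - set("ACGT"):
        raise ValueError("CDS contains characters outside ACGT")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Assumption: a terminal stop codon is tolerated and removed.
    if codons and CODON_TABLE[codons[-1]] == "*":
        codons.pop()
    protein = "".join(CODON_TABLE[c] for c in codons)
    if "*" in protein:
        raise ValueError("CDS contains an internal stop codon")
    return protein, codons


def back_map(aligned_protein, codons):
    """Replace each residue with its codon and each gap with '---'."""
    remaining = iter(codons)
    return "".join("---" if ch == "-" else next(remaining)
                   for ch in aligned_protein)
```

For example, `translate_cds("ATGGCCTAA")` yields the protein `MA`, and back-mapping the aligned protein `M-A` gives the codon row `ATG---GCC`.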

---

## How BABAPPAlign Works

1. Residue Embedding  
   Protein sequences are converted into residue-level embeddings using a pretrained
   protein language model.

2. Learned Residue Scoring  
   Residue compatibility is evaluated using a pretrained neural scoring model
   (BABAPPAScore), replacing traditional substitution matrices.

3. Progressive Alignment  
   Sequences are progressively aligned using exact affine-gap dynamic programming
   (Gotoh). Neural inference is performed outside the DP recursion to preserve
   correctness.

The progressive ordering is a computational heuristic and is not interpreted
as a phylogeny.
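
To illustrate what "neural inference outside the DP recursion" means in practice, the sketch below fills a full residue-by-residue score matrix before any alignment step runs. It is an assumption-laden illustration: `score_matrix` is a hypothetical name, and cosine similarity is only a stand-in for the learned BABAPPAScore model, whose actual interface is not described here.

```python
import math


def cosine(u, v):
    """Stand-in scorer: cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_matrix(emb_x, emb_y, scorer=cosine):
    """Score every residue pair once, up front.

    emb_x and emb_y are lists of per-residue embedding vectors. The
    returned matrix has len(emb_x) rows and len(emb_y) columns, and the
    DP stage only reads from it.
    """
    return [[scorer(u, v) for v in emb_y] for u in emb_x]
```

Because the matrix is fixed before the recursion starts, the dynamic programming stage stays an exact optimization over known scores regardless of which scorer produced them.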

---

## Alignment Core Integrity

The alignment engine uses:

- Three-state affine-gap DP (M, Ix, Iy)
- Explicit traceback matrices
- Exact dynamic programming
- No heuristic shortcuts inside recursion
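
A generic three-state Gotoh recursion with explicit traceback can be sketched as follows. This is a textbook pairwise version written for illustration, not BABAPPAlign's implementation: the function name `gotoh`, the gap values, and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` are all assumptions.

```python
NEG_INF = float("-inf")
M, IX, IY = 0, 1, 2  # match state, gap in y, gap in x


def gotoh(score, gap_open=10.0, gap_extend=0.5):
    """Global affine-gap alignment over a precomputed score matrix.

    score[i][j] is the score for pairing x[i] with y[j]. Returns
    (best_score, path), where path holds (i, j) for aligned pairs,
    (i, None) for x[i] against a gap, and (None, j) for y[j] against a gap.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    D = [[[NEG_INF] * (m + 1) for _ in range(n + 1)] for _ in range(3)]
    T = [[[M] * (m + 1) for _ in range(n + 1)] for _ in range(3)]
    D[M][0][0] = 0.0
    for i in range(1, n + 1):  # leading gap in y
        D[IX][i][0] = -gap_open - (i - 1) * gap_extend
        T[IX][i][0] = IX
    for j in range(1, m + 1):  # leading gap in x
        D[IY][0][j] = -gap_open - (j - 1) * gap_extend
        T[IY][0][j] = IY

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # M: pair x[i-1] with y[j-1], coming from any state.
            cand = [D[k][i - 1][j - 1] for k in range(3)]
            best = max(range(3), key=cand.__getitem__)
            D[M][i][j] = cand[best] + score[i - 1][j - 1]
            T[M][i][j] = best
            # Ix: x[i-1] against a gap; extending costs less than opening.
            cand = [D[M][i - 1][j] - gap_open,
                    D[IX][i - 1][j] - gap_extend,
                    D[IY][i - 1][j] - gap_open]
            best = max(range(3), key=cand.__getitem__)
            D[IX][i][j], T[IX][i][j] = cand[best], best
            # Iy: y[j-1] against a gap.
            cand = [D[M][i][j - 1] - gap_open,
                    D[IX][i][j - 1] - gap_open,
                    D[IY][i][j - 1] - gap_extend]
            best = max(range(3), key=cand.__getitem__)
            D[IY][i][j], T[IY][i][j] = cand[best], best

    # Traceback from the best final state using the stored pointers.
    state = max(range(3), key=lambda k: D[k][n][m])
    best_score = D[state][n][m]
    i, j, path = n, m, []
    while i > 0 or j > 0:
        prev = T[state][i][j]
        if state == M:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif state == IX:
            path.append((i - 1, None))
            i -= 1
        else:
            path.append((None, j - 1))
            j -= 1
        state = prev
    path.reverse()
    return best_score, path
```

The function only reads from `score`, so any scorer can be evaluated once beforehand without affecting the exactness of the recursion.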

Version 1.4.0 does not modify the affine-gap DP alignment core.
The release changes hardware selection and packaging behavior only.
Scientific reproducibility from earlier versions is preserved.

---

## Model Weights (Required)

BABAPPAlign requires a trained neural residue-level scoring model (BABAPPAScore),
distributed separately via Zenodo.

Concept DOI (all versions):

    https://doi.org/10.5281/zenodo.18053200

Download model:

    mkdir -p ~/.cache/babappalign/models

    wget https://zenodo.org/record/18053201/files/babappascore.pt \
      -O ~/.cache/babappalign/models/babappascore.pt

BABAPPAlign always loads:

    ~/.cache/babappalign/models/babappascore.pt

If this file is missing, the CLI exits explicitly with a `[FATAL]` error.
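
A pre-flight check for the weights file, useful in pipelines that call BABAPPAlign, might look like the sketch below. The helper names `model_path` and `require_model` are hypothetical, and the error text is illustrative; only the file location and the `[FATAL]` prefix come from the description above.

```python
import sys
from pathlib import Path

# Location documented above, relative to the user's home directory.
MODEL_RELATIVE = Path(".cache") / "babappalign" / "models" / "babappascore.pt"


def model_path(home=None):
    """Return the expected weights path under the given (or current) home."""
    base = Path(home) if home is not None else Path.home()
    return base / MODEL_RELATIVE


def require_model(path):
    """Return the path if the file exists, otherwise exit with a message."""
    if not Path(path).is_file():
        # Illustrative message; the real CLI wording may differ.
        sys.exit(f"[FATAL] model weights not found: {path}")
    return Path(path)
```

Calling `require_model(model_path())` before launching a batch of alignments surfaces a missing download early instead of on the first job.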

---

## CPU and Accelerator Execution

BABAPPAlign produces identical alignments on CPU, CUDA, and Apple Silicon Metal/MPS.
Hardware acceleration affects performance only.

The default device is `auto`. In this mode BABAPPAlign checks backends in order:

1. CUDA, if PyTorch reports it available and a small runtime tensor probe succeeds
2. Apple Silicon Metal/MPS, if PyTorch reports it available and the runtime probe succeeds
3. CPU fallback

If `--device cuda` or `--device mps` is requested but the backend is unavailable
or fails the runtime probe, BABAPPAlign falls back to CPU with a warning.
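
The selection order and fallback described above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: `select_device` and `torch_probe` are hypothetical names, and the probe here is simply a tiny tensor addition on the candidate backend.

```python
import warnings


def torch_probe(name):
    """Return True if PyTorch reports the backend and a tiny op succeeds."""
    try:
        import torch
        if name == "cuda" and not torch.cuda.is_available():
            return False
        if name == "mps":
            mps = getattr(torch.backends, "mps", None)
            if mps is None or not mps.is_available():
                return False
        x = torch.ones(2, device=name)
        return float((x + x).sum().item()) == 4.0
    except Exception:
        # Any import or runtime failure counts as "backend unusable".
        return False


def select_device(requested="auto", probe=torch_probe):
    """Resolve 'auto', 'cpu', 'cuda', or 'mps' to a usable device name."""
    if requested not in {"auto", "cpu", "cuda", "mps"}:
        raise ValueError(f"unknown device: {requested}")
    if requested == "cpu":
        return "cpu"
    candidates = ("cuda", "mps") if requested == "auto" else (requested,)
    for name in candidates:
        if probe(name):
            return name
    if requested != "auto":
        warnings.warn(f"{requested} is unavailable; falling back to CPU")
    return "cpu"
```

Passing the probe as a parameter keeps the ordering logic testable on machines without any accelerator.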

| Component                  | CPU    | CUDA   | Metal/MPS |
|----------------------------|--------|--------|-----------|
| Progressive alignment (DP) | Yes    | Yes    | Yes       |
| Learned scoring            | Yes    | Yes    | Yes       |
| Embedding generation       | Slower | Faster | Faster    |

Examples:

    babappalign input.fasta
    babappalign input.fasta --device auto
    babappalign input.fasta --device mps
    babappalign input.fasta --device cuda

---

## Input Requirements

Protein mode:
- Protein FASTA sequences

Codon mode:
- CDS nucleotide FASTA sequences
- Length divisible by 3
- No internal stop codons

There are no strict limits on the number or length of sequences;
runtime depends on hardware.

---

## Command Line Interface

    babappalign --help

Key options:

    --i                   interactive mode
    --mode {protein,codon}
    --gap-open FLOAT
    --gap-extend FLOAT
    --device {auto,cpu,cuda,mps}

Output filenames are generated automatically.
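
For scripted runs, the options above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of the package: `build_command`, `expected_outputs`, and `run` are illustrative names, and the output naming rule is inferred only from the Quick Start examples (input stem plus `.protein.aln.fasta` or `.codon.aln.fasta`), so the directory the files land in is not assumed.

```python
import subprocess
from pathlib import Path

MODES = ("protein", "codon")
DEVICES = ("auto", "cpu", "cuda", "mps")


def build_command(fasta, mode="protein", device="auto",
                  gap_open=None, gap_extend=None):
    """Build the argument list for a non-interactive babappalign call."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    if device not in DEVICES:
        raise ValueError(f"device must be one of {DEVICES}")
    cmd = ["babappalign", str(fasta), "--mode", mode, "--device", device]
    if gap_open is not None:
        cmd += ["--gap-open", str(gap_open)]
    if gap_extend is not None:
        cmd += ["--gap-extend", str(gap_extend)]
    return cmd


def expected_outputs(fasta, mode="protein"):
    """Guess output file names from the input stem (inferred convention)."""
    stem = Path(fasta).stem
    names = [f"{stem}.protein.aln.fasta"]
    if mode == "codon":
        names.append(f"{stem}.codon.aln.fasta")
    return names


def run(fasta, **options):
    """Run babappalign and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(fasta, **options), check=True)
```

A call such as `run("cds.fasta", mode="codon")` requires `babappalign` to be installed and on `PATH`.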

---

## License

MIT License. See LICENSE file.

---

## Citation

If this software contributes to your research, please cite:

Krishnendu Sinha, BABAPPAlign: A Multiple Sequence Alignment Engine with a Learned Residue-Level Scoring Function, Bioinformatics, 2026, btag189, https://doi.org/10.1093/bioinformatics/btag189

Preprint (bioRxiv): [http://biorxiv.org/content/early/2025/12/29/2025.12.26.696577.abstract](http://biorxiv.org/content/early/2025/12/29/2025.12.26.696577.abstract)

---

## Author

Krishnendu Sinha
https://github.com/sinhakrishnendu/BABAPPAlign
