Metadata-Version: 2.4
Name: famus
Version: 0.1.1
Summary: Functional Annotation Method Using Siamese neural networks (FAMUS)
Home-page: https://github.com/burstein-lab/famus
Author: Guy Shur
Author-email: guyshur@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.11,<3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.0,>=1.26.4
Requires-Dist: pandas>=2.2.3
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: biopython>=1.76
Requires-Dist: tqdm>=4.66.2
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: pyyaml>=5.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# FAMUS: Functional Annotation Method Using Supervised contrastive learning

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/famus/README.html)

FAMUS has a web interface available at:  
https://app.famus.bursteinlab.org  

<p align="center">
<img src="https://famus-6e94e.web.app/superfamus.svg" width="96"></p>


FAMUS is a supervised contrastive (SupCon) learning-based framework that annotates protein sequences with function. Input sequences are transformed into numeric vectors with pre-trained neural networks tailored to individual protein family databases, and then compared to the sequences of those databases to find the closest match.

This repository (or the famus conda package) can also be used to train a model for any protein database, using one fasta file per protein family and, preferably, a large number of negative examples (sequences not belonging to any family in the given database).

We provide one main module for training and one for classification, each of which automatically takes care of all relevant steps of training and/or inference. If interrupted, running these modules again with the same parameters will attempt to resume from where the program stopped. For this reason, do not rename or remove files from the directories the modules use if you intend to restart an interrupted pipeline with the same data directory, as the names of some files are hardcoded into the program.

## Installation

### With Conda

To install with conda, first create a new conda environment:

`conda create -n famus -c conda-forge -c bioconda famus`

Activate the environment with `conda activate famus` and install the correct pytorch version for your environment from the pytorch website via pip: https://pytorch.org/get-started/locally/

If you wish to use Weights & Biases (wandb) logging during training, install wandb with `pip install wandb`.

Using famus tools requires that the conda environment is activated. You can verify that the installation was successful by running `famus-train -h`.

### From source

Alternatively, you can install famus from source. Note that FAMUS currently supports Python >=3.11 and <3.13. First, clone the repository:

`git clone https://github.com/burstein-lab/famus.git`


Create and activate a new conda or pip virtual environment, then install the required python packages with:
`pip install -r requirements.txt`

FAMUS has five dependencies (other than python and pip) that need to be installed separately if installing from source:

- PyTorch - follow the instructions at https://pytorch.org/get-started/locally/
- mmseqs2
- seqkit
- hmmer
- mafft

Make sure that the executables for `mmseqs2`, `seqkit`, `hmmsearch` from hmmer, and `mafft` are all in your PATH variable. They can also be installed via conda:
`conda install -c conda-forge -c bioconda hmmer mafft seqkit mmseqs2`
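To confirm the external dependencies are reachable before running FAMUS, a quick check like the following can help (a sketch; the executable names match those listed above):

```shell
# Report which of FAMUS's external dependencies are on PATH.
for tool in mmseqs seqkit hmmsearch mafft; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool ($(command -v "$tool"))"
  else
    echo "missing: $tool"
  fi
done
```

Any tool reported as missing should be installed (e.g. via the conda command above) before proceeding.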

### Downloading pre-trained models

If you plan on using the pre-trained models, you will need to download them from Zenodo (https://zenodo.org/uploads/14941373). The available pre-trained models are:  

```
kegg_comprehensive
kegg_light
interpro_comprehensive
interpro_light
orthodb_comprehensive
orthodb_light
eggnog_comprehensive
eggnog_light
```

To easily download pre-trained models, we provide a command line tool called `famus-install` (for conda); source code users can use the module `famus.cli.install_models`. This will download a large number of profile HMMs, so make sure you have enough disk space (several GBs, depending on the models you download).

If installed with conda, run `famus-install --models <comma-separated list of model names> --models-dir <path to models directory>`. For example, to download the comprehensive KEGG and light InterPro models to famus_models, run:
`famus-install --models kegg_comprehensive,interpro_light --models-dir famus_models`. If using the source code, run `python -m famus.cli.install_models` from the root directory instead of `famus-install`. See details below for a comprehensive list of command line arguments.

Python data is downloaded as JSON for security reasons. After running this command, it is recommended (but optional) to convert the downloaded JSON data to pickle format for faster data loading. This can be done with one of the following commands:
 - conda: `famus-convert-sdf --models-dir <path to models directory>`
 - source code: `python -m famus.cli.convert_sdf --models-dir <path to models directory>`

## Configuration and priority of parameters

Most FAMUS tools expect parameters as either command line arguments or in a given (optional) YAML-formatted configuration file. An example configuration file - `example_cfg.yaml` - can be found in the root directory of this repository and contains an example for each parameter it can override. Config files don't have to include all parameters, just the ones you want to override. Any parameter not specified will use a default value. For example, the default path to save/load models is ~/.famus/models/, and should be overridden if you want to use a different path. The order of priority is as follows:

1. Command line arguments
2. Configuration file parameters (if provided as a command line argument and the relevant parameter is specified there)
3. Default parameters (running `famus-defaults` (conda) or `python -m famus.config` (source code) will print the default parameters to the console)

See sections below for details on command line arguments for each tool, and the end of this README for a comprehensive list of configuration parameters.
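As an illustration, a minimal configuration file overriding a few defaults might look like the following. The key names here are assumed to mirror the command line flags described in this README; see `example_cfg.yaml` in the repository root for the authoritative parameter names, and the values below are purely illustrative:

```yaml
# Illustrative FAMUS config: only the parameters you want to override.
n_processes: 8
device: cpu
models_dir: /data/famus_models
model_type: light
models:
  - kegg
  - interpro
```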

## Classifying sequences

To classify sequences, you will need a fasta format file of sequences to classify.

The main tool for classification is `famus-classify` for conda and `famus.cli.classify` for source code users.
Usage:
 - conda: `famus-classify [options] <input_fasta_file_path> <output_dir>`
 - source code: `python -m famus.cli.classify [options] <input_fasta_file_path> <output_dir>`

Main command line arguments for `famus-classify` (unused arguments will be read from config or set to default values):
- input_fasta_file_path - the path of the sequences to classify. (required)
- output_dir - the directory to save the results to. (required)
- --config - path to configuration file.
- --n-processes - number of cpu cores to use.
- --device - cpu/cuda
- --models - space-separated list of model names to use. Note that models must be explicitly chosen via the command line argument or config file, otherwise no classification will be performed.
- --models-dir - directory where the models are installed.
- --model-type - comprehensive/light - type of model to use (light may be slightly less accurate but significantly faster).
- --load-sdf-from-pickle - loads training data from pickle instead of json. Only usable on downloaded / trained models after running `famus-convert-sdf` for conda users or `python -m famus.cli.convert_sdf` for source code users.
- --no-log - do not create a log file.
- --log-dir - directory to save the log file to.
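Putting the arguments above together, a typical invocation might look like this (the input path, output directory, and model choices are illustrative; the models must already be installed with `famus-install`):

```shell
# Classify proteins.fasta with the light KEGG and InterPro models on 8 cores.
famus-classify \
    --models kegg interpro \
    --model-type light \
    --models-dir famus_models \
    --n-processes 8 \
    --device cpu \
    proteins.fasta results/
```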

## Training a model

Training a model on a large database can take a long time and be computationally expensive. It is recommended to become familiar with the options in the configuration file before starting training.

The main tools for training are `famus-train` for conda, and `famus.cli.train` for source code users.

Usage:
 - conda: `famus-train [options] <input_fasta_dir_path>`
 - source code: `python -m famus.cli.train [options] <input_fasta_dir_path>`

**Important notes:**
 - Every file name in the input directory **must** end in .fasta, and no file may be named unknown.fasta (`unknown` is reserved for unknown sequences).
 - It is recommended to provide a fasta file of unknown sequences (sequences not belonging to any family in the database) as negative examples for training. This will reduce false positives during classification.

Main command line arguments for `famus-train` (unused arguments will be read from config or set to default values):
- input_fasta_dir_path - the path of the directory holding fasta files where each file defines a protein family (required).
- --config - path to configuration file.
- --create-subclusters / --no-create-subclusters - whether to create a comprehensive (True) or light (False) model. Comprehensive models cluster protein families into sub-families, which increases accuracy but also training and classification time.
- --model-name - optional name for the model. If not specified, the input directory base name will be used. Can't be set using the config file and must be provided as a command line argument.
- --unknown-sequences-fasta-path - fasta file with sequences of unknown function as negative examples for the model. Optional but recommended. Can't be set using the config file and must be provided as a command line argument.
- --n-processes - number of CPU cores to use.
- --num-epochs - number of epochs to train the model for.
- --batches-per-epoch - number of batches per epoch.
- --stop-before-training - calling this module with --stop-before-training will exit before starting to train the model (useful for things like preprocessing in a high-CPU environment and then training the model in a different environment with CUDA).
- --device - cpu/cuda.
- --chunksize - reduce if GPU RAM becomes an issue when calculating threshold using GPU.
- --overwrite-checkpoint - whether to overwrite existing checkpoints if resuming training.
- --continue-from-checkpoint - whether to continue training from the last checkpoint if one exists.
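A training run combining the arguments above might look like the following sketch. The directory layout and file names are illustrative; the flags are those documented in this README:

```shell
# Expected input layout: one fasta file per family, plus optional unknowns.
#   families/
#     PF00001.fasta
#     PF00002.fasta
#   unknowns.fasta
#
# Train a light model (no subclusters) named my_model on 16 cores with CUDA.
famus-train \
    --model-name my_model \
    --unknown-sequences-fasta-path unknowns.fasta \
    --no-create-subclusters \
    --n-processes 16 \
    --device cuda \
    families/
```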

## Comprehensive list of configuration parameters

### Training and classification parameters:
- --n-processes: number of processes to use for parallelization during preprocessing and cpu-based training and classification
- --user-device: 'cpu' or 'cuda' - the device to use for training and classification. Classification with GPU is only marginally faster within HPC environments.
- --no-log: do not create a log file.
- --log-dir: directory to save the log file to.
- --models-dir: directory where the models are installed.

### Classification-specific parameters:

- --model-type: 'comprehensive' or 'light' - type of model to use for classification (light may be slightly less accurate but significantly faster).
- --models: a space-separated list of protein family databases to use for training or classification. The available pretrained models are: kegg, interpro, orthodb, eggnog, each in both comprehensive and light variants. Classification will use all models specified here for the specified model type. Note that models must be explicitly chosen via the command line argument or config file, otherwise no classification will be performed. This is to prevent unintended classification with large models, particularly in shared environments.
- --chunksize: positive integer - the number of sequences to process (load to GPU) in each batch during classification - decrease if running out of GPU RAM.
- --load-sdf-from-pickle: whether to load training data from pickle files instead of json files. Makes classification preprocessing slightly faster. Only usable after running `famus-convert-sdf` for conda users or `python -m famus.cli.convert_sdf` for source code users. Recommended if using models repeatedly.  
 

### Training-specific parameters:

- --batch-size: positive integer - the batch size to use for training.
- --num-epochs: positive integer - the number of epochs to train for.
- --create-subclusters/--no-create-subclusters: whether to create a comprehensive or light model. Comprehensive models cluster protein families into sub-families, which increases accuracy but also training and classification time.
- --processes-per-mmseqs-job: positive integer - the number of processes to use for each mmseqs job during preprocessing. Higher values will work faster for fewer but bigger protein families, lower values will work faster for many small protein families.
- --sampled-sequences-per-subcluster: 'use_all' or positive integer - the number of sequences to sample from each subcluster during preprocessing. If 'use_all', all sequences will be used. These sequences will be used to train the model and as positive examples during classification. Decrease this value to reduce preprocessing time and space usage; increase it to improve training data variety.
- --fraction-of-sampled-unknown-sequences: 'use_all', 'do_not_use', or 0 <= float <= 1.0 - the fraction of unknown sequences to sample relative to the number of labeled sequences that were sampled (e.g., 1.0 will sample up to the same number of unknown sequences as total labeled sequences). If 'use_all', all unknown sequences will be used (not recommended if the number of unknown sequences is much higher than the number of labeled sequences). These sequences will be used as negative examples during training and classification.
- --samples-profiles-product-limit: positive integer - if the number of protein families (in light models) or sub-families (in comprehensive models) times the number of sampled sequences per subcluster exceeds this limit, the number of sampled sequences per subcluster will be reduced to stay below the limit. This is to avoid extremely long processing times.
- --mmseqs-cluster-coverage: float between 0 and 1 - mmseqs clustering coverage parameter during deduplication of protein families. Higher values will de-duplicate less aggressively.
- --mmseqs-cluster-identity: float between 0 and 1 - mmseqs clustering identity parameter during deduplication of protein families. Higher values will de-duplicate less aggressively.
- --mmseqs-coverage-subclusters: float between 0 and 1 - mmseqs coverage parameter during creation of subclusters within protein families. Higher values will create more and smaller subclusters.
- --stop-before-training: if set to True, will exit before starting to train the model (useful for things like preprocessing in a high-CPU environment and then training the model in a different environment with CUDA).
- --log-to-wandb: whether to log training metrics to Weights & Biases.
- --wandb-project-name: name of the Weights & Biases project to log to if --log-to-wandb is set.
- --wandb-api-key-path: path to a text file containing the Weights & Biases API key if --log-to-wandb is set.
- --overwrite-checkpoint: whether to overwrite existing checkpoints if resuming training.
- --continue-from-checkpoint: whether to continue training from the last checkpoint if one exists.



