Metadata-Version: 2.4
Name: atol-bpa-datamapper
Version: 0.2.0
Summary: Map data from the BPA data portal for AToL's Genome Engine
Author-email: Tom Harrop <tharrop@unimelb.edu.au>
License: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/tomharrop/atol-bpa-datamapper
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Operating System :: POSIX :: Linux
Classifier: Private :: Do Not Upload
Classifier: Programming Language :: Python :: 3.12
Requires-Python: <3.15,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ckanapi>=4.8
Requires-Dist: jsonlines>=4.0.0
Requires-Dist: scikit-bio>=0.6.3
Dynamic: license-file

# atol-bpa-datamapper

Map data from the BPA data portal for AToL's Genome Engine.

The pipeline consists of three main steps:
1. **filter-packages**: Filter packages based on controlled vocabularies
2. **map-metadata**: Map BPA metadata to [AToL's metadata
   schema](https://docs.google.com/spreadsheets/d/1ml5hASZ-qlAuuTrwHeGzNVqqe1mXsmmoDTekd6d9pto)
3. **transform-data**: Extract unique samples and organisms and track their relationships to BPA packages

## Installation

The
[BioContainer](https://quay.io/repository/biocontainers/atol-bpa-datamapper?tab=tags)
is the only supported method of running `atol-bpa-datamapper`.

For example, with Apptainer/Singularity (pick a tag from the list linked above):

```bash
apptainer exec \
  docker://quay.io/biocontainers/atol-bpa-datamapper:0.1.2--pyhdfd78af_0 \
  filter-packages
```

### Via `pip` or `conda`

Local installation isn't supported, but the package can be installed with
`pip` from this repo, or with `conda` from
[bioconda](https://anaconda.org/bioconda/atol-bpa-datamapper).


## Reference data

### `map-metadata`

- The `nodes` and `names` files are `nodes.dmp` and `names.dmp` from NCBI's
  [new_taxdump](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/).
- `taxids_to_busco_dataset_mapping` is the `mapping_taxids-busco_dataset_name`
  file from
  [busco-data.ezlab.org/v5/data/placement_files](https://busco-data.ezlab.org/v5/data/placement_files/).
- `taxids_to_augustus_dataset_mapping` is a mapping of Augustus training
  datasets to NCBI TaxID, [provided in this
  repo](dev/resources/taxid_to_augustus_dataset.tsv).
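
For orientation, here is a minimal sketch of how the taxdump files are laid out. Fields are separated by `\t|\t` and each row ends with `\t|`; the first two columns of `nodes.dmp` are the TaxID and its parent TaxID. The function names (`parse_dmp_line`, `read_parents`, `lineage`) are hypothetical and are not part of `atol-bpa-datamapper`, whose own taxonomy handling may differ.

```python
def parse_dmp_line(line):
    """Split one taxdump row into its stripped fields."""
    # Rows end with "\t|" and fields are separated by "\t|\t".
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def read_parents(nodes_lines):
    """Map each TaxID to its parent TaxID (first two columns of nodes.dmp)."""
    parents = {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(taxid, parents):
    """Walk parent links up to the root, which is listed as its own parent."""
    path = [taxid]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path


# Three synthetic rows in the nodes.dmp layout (real rows have more columns).
rows = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2759\t|\t131567\t|\tsuperkingdom\t|\n",
]
parents = read_parents(rows)
print(lineage(2759, parents))  # [2759, 131567, 1]
```

To read a real file, pass an open file handle for `nodes.dmp` to `read_parents` in place of the `rows` list.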


## Usage

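The input to the pipeline is gzip-compressed jsonlines data (`jsonlines.gz`), as output by the `ckanapi search datasets` command.

Each step writes gzip-compressed jsonlines data, unless `--dry-run` is set, in which case the output is uncompressed jsonlines.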
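
To inspect the input or output files by hand, something like the following may help. It is a minimal sketch that assumes gzip compression and one JSON object per line, and it uses only the standard library rather than the `jsonlines` package this project depends on. The helper names `read_jsonl_gz` and `write_jsonl_gz` are hypothetical.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one decoded JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def write_jsonl_gz(path, records):
    """Write an iterable of JSON-serialisable records, one per line."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

For example, `next(read_jsonl_gz("packages.jsonl.gz"))` would return the first package as a `dict`, where `packages.jsonl.gz` is a placeholder file name.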

See [`dev/scripts/test_commands.sh`](dev/scripts/test_commands.sh) for an example.
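
Because each command reads stdin and writes stdout by default, the three steps can in principle be chained. The sketch below is one way to do that from Python. It assumes the commands are on `PATH`, that each step accepts the previous step's default output on stdin, and that the mapping files bundled with the package are used when `-f`, `-r` and `-v` are omitted. `build_commands` and `run_pipeline` are hypothetical helpers, and only flags shown in the usage text of this README are used.

```python
import subprocess


def build_commands(nodes, names):
    """Return one argv list per pipeline step, in execution order."""
    return [
        ["filter-packages"],
        ["map-metadata", "--nodes", nodes, "--names", names],
        ["transform-data"],
    ]


def run_pipeline(input_path, output_path, nodes, names):
    """Chain the steps so that each one reads the previous step's stdout."""
    commands = build_commands(nodes, names)
    with open(input_path, "rb") as source, open(output_path, "wb") as sink:
        procs = []
        stream = source
        for i, argv in enumerate(commands):
            last = i == len(commands) - 1
            proc = subprocess.Popen(
                argv, stdin=stream, stdout=sink if last else subprocess.PIPE
            )
            if stream is not source:
                # Close our copy of the pipe so upstream sees a broken pipe
                # if the downstream process exits early.
                stream.close()
            stream = proc.stdout
            procs.append(proc)
        for argv, proc in zip(commands, procs):
            if proc.wait() != 0:
                raise RuntimeError(f"{argv[0]} exited with {proc.returncode}")
```

Writing each step's output to a file with `-o` instead makes it easier to inspect intermediate results and to add the optional counter and log files.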


### filter-packages

```
usage: filter-packages [-h] [-i INPUT] [-o OUTPUT] [-f PACKAGE_FIELD_MAPPING_FILE] [-r RESOURCE_FIELD_MAPPING_FILE] [-v VALUE_MAPPING_FILE] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-n] [--raw_field_usage RAW_FIELD_USAGE] [--bpa_field_usage BPA_FIELD_USAGE] [--bpa_value_usage BPA_VALUE_USAGE] [--decision_log DECISION_LOG]

Filter packages from jsonlines.gz

options:
  -h, --help            show this help message and exit

Input:
  -i INPUT, --input INPUT
                        Input file (default: stdin)

Output:
  -o OUTPUT, --output OUTPUT
                        Output file (default: stdout)

General options:
  -f PACKAGE_FIELD_MAPPING_FILE, --package_field_mapping_file PACKAGE_FIELD_MAPPING_FILE
                        Package-level field mapping file in json.
  -r RESOURCE_FIELD_MAPPING_FILE, --resource_field_mapping_file RESOURCE_FIELD_MAPPING_FILE
                        Resource-level field mapping file in json.
  -v VALUE_MAPPING_FILE, --value_mapping_file VALUE_MAPPING_FILE
                        Value mapping file in json.
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: INFO)
  -n, --dry-run         Test mode. Output will be uncompressed jsonlines.

Counters:
  --raw_field_usage RAW_FIELD_USAGE
                        File for field usage counts in the raw data
  --bpa_field_usage BPA_FIELD_USAGE
                        File for BPA field usage counts
  --bpa_value_usage BPA_VALUE_USAGE
                        File for BPA value usage counts

Filtering options:
  --decision_log DECISION_LOG
                        Compressed CSV file to record the filtering decisions for each package
```

### map-metadata

```
usage: map-metadata [-h] [-i INPUT] [-o OUTPUT] [-f PACKAGE_FIELD_MAPPING_FILE] [-r RESOURCE_FIELD_MAPPING_FILE] [-v VALUE_MAPPING_FILE] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-n] [--raw_field_usage RAW_FIELD_USAGE] [--raw_value_usage RAW_VALUE_USAGE] [--mapped_field_usage MAPPED_FIELD_USAGE] [--mapped_value_usage MAPPED_VALUE_USAGE]
                    [--unused_field_counts UNUSED_FIELD_COUNTS] [--mapping_log MAPPING_LOG] [--sanitization_changes SANITIZATION_CHANGES] --nodes NODES --names NAMES [--grouping_log GROUPING_LOG] [--grouped_packages GROUPED_PACKAGES] [--cache_dir CACHE_DIR]

Map metadata in filtered jsonlines.gz

options:
  -h, --help            show this help message and exit

Input:
  -i INPUT, --input INPUT
                        Input file (default: stdin)
  --nodes NODES         NCBI nodes.dmp file from taxdump
  --names NAMES         NCBI names.dmp file from taxdump

Output:
  -o OUTPUT, --output OUTPUT
                        Output file (default: stdout)

General options:
  -f PACKAGE_FIELD_MAPPING_FILE, --package_field_mapping_file PACKAGE_FIELD_MAPPING_FILE
                        Package-level field mapping file in json.
  -r RESOURCE_FIELD_MAPPING_FILE, --resource_field_mapping_file RESOURCE_FIELD_MAPPING_FILE
                        Resource-level field mapping file in json.
  -v VALUE_MAPPING_FILE, --value_mapping_file VALUE_MAPPING_FILE
                        Value mapping file in json.
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: INFO)
  -n, --dry-run         Test mode. Output will be uncompressed jsonlines.
  --cache_dir CACHE_DIR
                        Directory to cache the NCBI taxonomy after processing

Counters:
  --raw_field_usage RAW_FIELD_USAGE
                        File for field usage counts in the raw data
  --raw_value_usage RAW_VALUE_USAGE
                        File for value usage counts in the raw data
  --mapped_field_usage MAPPED_FIELD_USAGE
                        File for counts of how many times each BPA field was mapped to an AToL field
  --mapped_value_usage MAPPED_VALUE_USAGE
                        File for counts of the values mapped from BPA fields to AToL fields
  --unused_field_counts UNUSED_FIELD_COUNTS
                        File for counts of fields in the BPA data that weren't used

Mapping options:
  --mapping_log MAPPING_LOG
                        Compressed CSV file to record the mapping used for each package
  --sanitization_changes SANITIZATION_CHANGES
                        File to record the sanitization changes made during mapping
  --grouping_log GROUPING_LOG
                        Compressed CSV file to record derived organism info for each package
  --grouped_packages GROUPED_PACKAGES
                        JSON file of Package IDs grouped by organism grouping_key
```

### transform-data

```
usage: transform-data [-h] [-i INPUT] [-o OUTPUT] [-f PACKAGE_FIELD_MAPPING_FILE] [-r RESOURCE_FIELD_MAPPING_FILE] [-v VALUE_MAPPING_FILE] [-s SANITIZATION_CONFIG_FILE] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                      [-n] [--sample_conflicts SAMPLE_CONFLICTS] [--sample_package_map SAMPLE_PACKAGE_MAP] [--transformation_changes TRANSFORMATION_CHANGES] [--unique_organisms UNIQUE_ORGANISMS]
                      [--organism_conflicts ORGANISM_CONFLICTS] [--organism_package_map ORGANISM_PACKAGE_MAP] [--sample_ignored_fields SAMPLE_IGNORED_FIELDS] [--organism_ignored_fields ORGANISM_IGNORED_FIELDS]
                      [--experiments_output EXPERIMENTS_OUTPUT]

Transform mapped metadata to extract unique samples

options:
  -h, --help            show this help message and exit

Input:
  -i INPUT, --input INPUT
                        Input file (default: stdin)

Output:
  -o OUTPUT, --output OUTPUT
                        Output file (default: stdout)

General options:
  -f PACKAGE_FIELD_MAPPING_FILE, --package_field_mapping_file PACKAGE_FIELD_MAPPING_FILE
                        Package-level field mapping file in json.
  -r RESOURCE_FIELD_MAPPING_FILE, --resource_field_mapping_file RESOURCE_FIELD_MAPPING_FILE
                        Resource-level field mapping file in json.
  -v VALUE_MAPPING_FILE, --value_mapping_file VALUE_MAPPING_FILE
                        Value mapping file in json.
  -s SANITIZATION_CONFIG_FILE, --sanitization_config_file SANITIZATION_CONFIG_FILE
                        Sanitization configuration file in json.
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: INFO)
  -n, --dry-run         Test mode. Output will be uncompressed jsonlines.

Transform options:
  --sample_conflicts SAMPLE_CONFLICTS
                        File to record conflicts between samples with the same bpa_sample_id
  --sample_package_map SAMPLE_PACKAGE_MAP
                        File to record which packages relate to each unique sample
  --transformation_changes TRANSFORMATION_CHANGES
                        File to record the transformation changes made during sample merging
  --unique_organisms UNIQUE_ORGANISMS
                        File to record unique organisms extracted from the data
  --organism_conflicts ORGANISM_CONFLICTS
                        File to record conflicts between organisms with the same organism_grouping_key
  --organism_package_map ORGANISM_PACKAGE_MAP
                        File to record which packages relate to each unique organism
  --sample_ignored_fields SAMPLE_IGNORED_FIELDS
                        Comma-separated list of sample fields to ignore when determining uniqueness. Conflicts in these fields will still be reported but won't prevent inclusion in the unique samples list.
  --organism_ignored_fields ORGANISM_IGNORED_FIELDS
                        Comma-separated list of organism fields to ignore when determining uniqueness. Conflicts in these fields will still be reported but won't prevent inclusion in the unique organisms list.
  --experiments_output EXPERIMENTS_OUTPUT
                        File to record extracted experiments data
```
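
The effect of `--sample_ignored_fields` can be illustrated with a small sketch. This is one possible reading of the help text above, not the package's implementation: records are grouped by `bpa_sample_id`, every differing field is reported as a conflict, and a sample is kept as unique only if all of its conflicts are in ignored fields. The function name `find_unique_samples` is hypothetical, and keeping the first record of each group is an assumption.

```python
from collections import defaultdict


def find_unique_samples(samples, id_field="bpa_sample_id", ignored_fields=()):
    """Return (unique, conflicts), both keyed by sample ID."""
    ignored = set(ignored_fields)
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample[id_field]].append(sample)

    unique, conflicts = {}, {}
    for sample_id, records in grouped.items():
        all_fields = {field for record in records for field in record}
        differing = {}
        for field in sorted(all_fields):
            # Collect the distinct values seen for this field, in order.
            values = []
            for record in records:
                value = record.get(field)
                if value not in values:
                    values.append(value)
            if len(values) > 1:
                differing[field] = values
        if differing:
            # Conflicts are always reported, ignored or not.
            conflicts[sample_id] = differing
        if not set(differing) - ignored:
            # Assumption: keep the first record seen for the sample.
            unique[sample_id] = records[0]
    return unique, conflicts


# Synthetic records: s1 differs only in "note", s2 differs in "species".
samples = [
    {"bpa_sample_id": "s1", "species": "A", "note": "x"},
    {"bpa_sample_id": "s1", "species": "A", "note": "y"},
    {"bpa_sample_id": "s2", "species": "B"},
    {"bpa_sample_id": "s2", "species": "C"},
]
unique, conflicts = find_unique_samples(samples, ignored_fields=["note"])
print(sorted(unique))     # ['s1']
print(sorted(conflicts))  # ['s1', 's2']
```

With no ignored fields, neither sample would be kept as unique, because `s1` would then conflict on `note`.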

## Deployment

The package comes with metadata mapping specifications in
[`src/atol_bpa_datamapper/config`](src/atol_bpa_datamapper/config). The field
mapping spec can be generated from [AToL's metadata
schema](https://docs.google.com/spreadsheets/d/1ml5hASZ-qlAuuTrwHeGzNVqqe1mXsmmoDTekd6d9pto)
using the script at
[`dev/scripts/read_atol_schemas.py`](dev/scripts/read_atol_schemas.py).
