# Machine Learning-Assisted Genomic Profiling of BCG vs. Wild-Type Mycobacterium bovis

## Description

This repository contains code for differentiating BCG vaccine strains from non-BCG wild-type Mycobacterium bovis using machine learning. The pipeline implements random forest and 1D CNN classifiers trained on genomic features extracted from 72 bacterial genomes.

## Dataset Information

### Source

- 72 complete genome assemblies from NCBI:
  - 28 BCG vaccine strains
  - 44 non-BCG wild-type M. bovis
- Accession numbers: see Supplementary Excel File 1

### Characteristics

- Assembly quality: ≤200 contigs/genome, N50 ≥50 kb
- Validation: ANI ≥98% against reference genomes
- Exclusion criteria: Contamination <1%, assembly gaps <5%

## Code Information

### Main Scripts

1. `preprocessing_pipeline.sh` - deltaBS workflow for data preparation
2. `random_forest_classifier.R` - R script for RF model development
3. `cnn_classifier.py` - Python CNN implementation

### Key Features

- deltaBS-based feature extraction
- Iterative feature selection with permutation testing
- CNN architecture with dropout regularization (0.3-0.5)

## Usage Instructions

### Step-by-Step Implementation

1. Clone repository:

   ```bash
   git clone https://gist.github.com/szypanther/ffa7e7a6d869020cc53eb809e4794f0d
   ```

2. Install dependencies (see Requirements section)
3. Run preprocessing:

   ```bash
   bash preprocessing_pipeline.sh
   ```

4. Execute classifiers:

   ```bash
   Rscript random_forest_classifier.R
   python cnn_classifier.py
   ```

## Requirements

### Software

- NumPy version: 1.21.6
- Joblib version: 1.3.2
- R ≥4.1.2
- Python ≥3.8
- deltaBS

### Libraries

```plaintext
R: randomForest, caret
Python: TensorFlow 2.15, scikit-learn 1.2, pandas 1.5
Bioinformatics: Prokka 1.14.6, Roary 3.13.0, HMMER 3.3.2
```

## Methodology

### Workflow Overview

1. Genome annotation (Prokka)
2. Pan-genome analysis (Roary)
3. Bitscore calculation (deltaBS)
4. Feature selection:
   - RF: OOB error minimization
   - CNN: Gradient activation mapping
5. Model training:
   - RF: 10,000 trees, mtry=n/10
   - CNN: Two Conv1D layers (32/64 filters)

## Citations

If using this work, please cite:

```bibtex
@article{shi2025ml_bcg,
  title={Machine Learning-Assisted genomic profiling to identify differences between BCG vaccine strains and non-BCG wild-type Mycobacterium bovis},
  author={Shi, Yunyun and others},
  journal={PeerJ},
  year={2025}
}
```

## License

This code is made available under the MIT License. Contributions via pull requests are welcome.

## Contact

Zhiyong Shen (szypanther@gmail.com)
School of Basic Medical Sciences
Hunan University of Medicine
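
## Example: Checking Assembly Quality Thresholds

The assembly-quality thresholds listed under Dataset Information (≤200 contigs/genome, N50 ≥50 kb) can be illustrated with a short, standard-library-only sketch. This is not part of the repository's scripts: the function names `n50` and `passes_assembly_qc` are hypothetical, the input is assumed to be a plain list of contig lengths in base pairs, and 50 kb is assumed to mean 50,000 bp.

```python
from typing import Iterable, List

# Thresholds taken from the "Characteristics" list in this README.
MAX_CONTIGS = 200      # at most 200 contigs per genome
MIN_N50_BP = 50_000    # N50 of at least 50 kb (assumed to mean 50,000 bp)


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly.

    N50 is the length L of the contig at which the running total of
    contig lengths, taken from longest to shortest, first reaches at
    least half of the total assembly length.
    """
    lengths: List[int] = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return lengths[-1]  # not reached for non-empty input


def passes_assembly_qc(contig_lengths: Iterable[int]) -> bool:
    """Hypothetical helper: apply both assembly-quality thresholds."""
    lengths = list(contig_lengths)
    return len(lengths) <= MAX_CONTIGS and n50(lengths) >= MIN_N50_BP
```

For example, `passes_assembly_qc([100_000, 60_000, 40_000])` passes both checks, while an assembly of 300 contigs fails on the contig count regardless of its N50. The ANI, contamination, and gap criteria require external tools and are not covered by this sketch.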