Efficient Parkinson's Disease classification from speech with filter-based feature selection and Genetic Algorithm–Bayesian Optimization ensemble integration.


Description
-----------
This repository contains the source code, data processing steps, and model training routines for the study titled:

"Efficient Parkinson's Disease classification from speech with filter-based feature selection and Genetic Algorithm–Bayesian Optimization ensemble integration."


This project introduces a modular and computationally efficient framework for the classification of Parkinson’s Disease (PD) using vocal features. The approach focuses on integrating selective feature optimization methods with ensemble learning strategies to develop a lightweight, scalable, and interpretable diagnostic support system.

The framework includes:
	• Extraction and processing of vocal features from multiple domains, including TQWT, MFCC, and baseline dysphonia measures.
	• Application of multiple filter-based feature selection techniques, including Mutual Information, F-score, Chi-square, and their derived ensemble variants (hybrid voting, score fusion, classwise intersection).
	• Evaluation of selected features using two classifiers: k-Nearest Neighbors (kNN) and Support Vector Machines (SVM) within a stratified 10-fold cross-validation setup.
	• Integration of classifier predictions using three ensemble strategies: (i) Iterative Majority Voting, (ii) Genetic Algorithm-based model selection, and (iii) Bayesian Optimization-based weighted voting.
	• A hybrid two-stage ensemble integration scheme that combines GA-based subset selection with BO-driven weighting to optimize predictive fusion.
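
As an illustration of the hybrid-voting variant listed above, the sketch below scores features with Mutual Information, F-score, and Chi-square using scikit-learn, then keeps the features that at least two of the three filters rank in their top k. The voting rule, the values of `k` and `min_votes`, the helper names, and the synthetic data are assumptions made for illustration; the notebook may define the ensemble variants differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler


def top_k_mask(scores, k):
    """Boolean mask marking the k highest-scoring features."""
    scores = np.where(np.isnan(scores), -np.inf, scores)  # NaN scores rank last
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask


def vote_select(X, y, k=10, min_votes=2, seed=0):
    """Indices of features ranked in the top k by at least `min_votes` filters."""
    X_pos = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs
    mi_scores = mutual_info_classif(X, y, random_state=seed)
    f_scores, _ = f_classif(X, y)
    chi_scores, _ = chi2(X_pos, y)
    votes = sum(
        top_k_mask(s, k).astype(int) for s in (mi_scores, f_scores, chi_scores)
    )
    return np.flatnonzero(votes >= min_votes)


# Synthetic stand-in for the vocal feature matrix (assumption, not the real data).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)
selected = vote_select(X, y, k=10, min_votes=2)
print("selected feature indices:", selected)
```

Raising `min_votes` to 3 turns the vote into a strict intersection of the three top-k lists, which yields a smaller and more conservative subset.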

The implementation is structured for reproducibility and is designed to be applicable in real-world and resource-constrained settings, providing a foundation for voice-based, AI-assisted diagnostic tools for Parkinson’s disease.


Dataset Information
-------------------
The experiments were conducted on a publicly available Parkinson’s Disease vocal dataset from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/dataset/470/parkinson+s+disease+classification

It contains:
- **Baseline dysphonia features** (e.g., jitter, shimmer, HNR)
- **Mel-Frequency Cepstral Coefficients (MFCCs)**
- **Tunable Q-Factor Wavelet Transform (TQWT) features**

Each instance represents vocal measurements of individuals diagnosed with Parkinson’s disease or healthy controls.

The dataset was standardized and split using stratified 10-fold cross-validation.
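
A minimal sketch of this evaluation setup follows, assuming scikit-learn and synthetic data in place of the UCI file. It places the scaler inside a pipeline so that the standardization statistics are computed from the training folds only; whether the notebook standardizes per fold or once on the full dataset is not stated here, so treat that choice, the hyperparameters, and the random seeds as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the vocal feature matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.25, 0.75], random_state=0
)

# Stratification keeps the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Illustrative hyperparameters, not the values used in the study.
models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

results = {}
for name, model in models.items():
    # The scaler is refit on the training part of each fold.
    results[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {results[name].mean():.3f}")
```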


Code Information
----------------
The main code is implemented in the following notebook:

- parkinson_ga+bo_ensemble.ipynb, which contains the full pipeline:
  - Feature selection methods (Mutual Information, F-score, Chi-square, filter-ensemble variants)
  - Classifier training (SVM, kNN)
  - GA-based model selection
  - BO-based ensemble weighting
  - Hybrid strategy (**GA–BO Ensemble**) combining subset selection and probabilistic weighting
  - Final evaluation and accuracy report
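
The GA-based model selection step listed above can be pictured as a search over binary masks, where each bit decides whether one base model takes part in the vote. The sketch below is a small hand-written genetic algorithm using only NumPy and the standard library; the operators (truncation selection, one-point crossover, bit-flip mutation), the parameter values, and the simulated base-model predictions are all assumptions for illustration and are not taken from the notebook.

```python
import random

import numpy as np


def majority_vote(preds, mask):
    """Majority vote over the rows of `preds` switched on by the 0/1 `mask`."""
    chosen = preds[np.asarray(mask, dtype=bool)]
    # Ties (possible with an even number of voters) resolve to class 1.
    return (chosen.mean(axis=0) >= 0.5).astype(int)


def fitness(mask, preds, y):
    """Accuracy of the majority vote; an empty ensemble scores 0."""
    if not any(mask):
        return 0.0
    return float((majority_vote(preds, mask) == y).mean())


def ga_select(preds, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Evolve a binary mask over base models that maximizes vote accuracy."""
    rng = random.Random(seed)
    n = preds.shape[0]
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, preds, y), reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, preds, y))


# Eight simulated base models: each flips the true label at its own error rate.
data_rng = np.random.default_rng(0)
y = data_rng.integers(0, 2, size=200)
error_rates = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
preds = np.array(
    [np.where(data_rng.random(y.size) < e, 1 - y, y) for e in error_rates]
)

best = ga_select(preds, y)
print("selected models:", best, "vote accuracy:", fitness(best, preds, y))
```

In a real run, `preds` would hold out-of-fold predictions from the trained kNN and SVM models, so that the fitness is not measured on data the models were fitted to.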


Requirements
------------
Install the following Python dependencies before running the notebook:

pip install numpy pandas scikit-learn bayesian-optimization matplotlib seaborn

Python version: 3.8+


Usage Instructions
------------------
1. Open the parkinson_ga+bo_ensemble.ipynb notebook using Google Colab or Jupyter Lab.

2. Download the Parkinson’s Disease Classification dataset from the UCI Machine Learning Repository (see Dataset Information) and place it in the same directory as the notebook.

3. Execute the cells in sequence to:
	• Load and preprocess the dataset
	• Perform feature selection
	• Train base classifiers (SVM and kNN)
	• Apply Genetic Algorithm (GA) for model subset selection
	• Apply Bayesian Optimization (BO) for classifier weight tuning
	• Apply hybrid strategy (**GA–BO Ensemble**)
	• Generate final ensemble predictions and evaluate performance
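
For the loading step, a sketch along the following lines may help when adapting the notebook to a local copy of the data. It assumes pandas (not listed in a specific version here), a CSV file with a label column named `class`, and an `id` column identifying the speaker; those column names, the `header_row` parameter, and the helper name `load_features` are assumptions, so check them against the downloaded file. The demo reads a tiny in-memory table so that it runs without the download.

```python
import io

import pandas as pd


def load_features(source, label_col="class", drop_cols=("id",), header_row=0):
    """Return (X, y) from a CSV path or file-like object.

    `header_row` is the zero-based row holding the column names; raise it
    if the file starts with extra descriptive rows.
    """
    df = pd.read_csv(source, header=header_row)
    y = df[label_col].astype(int)
    extra = [c for c in drop_cols if c in df.columns]  # ignore absent columns
    X = df.drop(columns=[label_col, *extra])
    return X, y


# Tiny in-memory stand-in for the real file (made-up values, assumption).
demo_csv = io.StringIO(
    "id,jitter,shimmer,class\n"
    "0,0.004,0.03,1\n"
    "0,0.005,0.04,1\n"
    "1,0.002,0.01,0\n"
)
X, y = load_features(demo_csv)
print(X.shape, y.tolist())
```

With the real file, the call would take the path to the downloaded CSV instead of `demo_csv`.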


Methodology Summary
-------------------
The study includes:

- Filter-based feature selection using MI, F-score, and Chi-square
- Class-wise fusion and ensemble filtering strategies
- Model training with kNN and SVM
- Ensemble strategies:
  - Iterative Majority Voting (IMV)
  - Genetic Algorithm (GA)
  - Bayesian Optimization (BO)
  - Proposed GA–BO Ensemble (best-performing strategy, 96.4% accuracy)
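
To make the weighted-voting idea concrete, the sketch below defines a soft vote in which each model's predicted probability for the positive class is scaled by a weight, and then searches for weights that maximize accuracy. Note that the search here is plain random sampling, used as a dependency-free stand-in for Bayesian Optimization; in the study's setting, the same accuracy function would serve as the objective handed to the BO routine. The function names, the number of trials, and the simulated probabilities are assumptions for illustration.

```python
import numpy as np


def weighted_vote(probas, weights):
    """Soft vote: weight each model's P(class=1), then threshold at 0.5."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    return (w @ probas >= 0.5).astype(int)


def search_weights(probas, y, n_trials=500, seed=0):
    """Random search over weight vectors (a simple stand-in for BO)."""
    rng = np.random.default_rng(seed)
    n_models = probas.shape[0]
    best_w = np.full(n_models, 1.0 / n_models)  # start from equal weights
    best_acc = float((weighted_vote(probas, best_w) == y).mean())
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_models))  # random point on the simplex
        acc = float((weighted_vote(probas, w) == y).mean())
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Three simulated models with increasingly noisy probability estimates.
data_rng = np.random.default_rng(1)
y = data_rng.integers(0, 2, size=300)
noise_levels = [0.3, 0.5, 0.8]
probas = np.array(
    [np.clip(y + data_rng.normal(0, s, y.size), 0, 1) for s in noise_levels]
)

equal_acc = float((weighted_vote(probas, np.full(3, 1.0 / 3)) == y).mean())
weights, acc = search_weights(probas, y)
print("equal-weight accuracy:", equal_acc)
print("searched weights:", np.round(weights, 3), "accuracy:", acc)
```

Because the search starts from equal weights and only accepts improvements, the returned accuracy is never below the equal-weight baseline on the data it was tuned on; a held-out fold is still needed to judge whether the gain generalizes.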

Evaluation method:
- Stratified 10-Fold Cross-Validation
- Metrics: Accuracy, F1-score, MCC
- Code is fully reproducible and all configurations are documented in the notebook
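
For reference, the three reported metrics can all be derived from the binary confusion matrix. The short sketch below computes them in plain Python on a made-up set of ten predictions; the function name and the example labels are illustrative assumptions, and the notebook may instead rely on library implementations.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, F1-score, and MCC for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Made-up labels: 4 true positives, 4 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.8, F1 0.8, MCC 0.6
```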


Citation
--------
If you use this code or dataset in your work, please cite our paper (once published):

Gündüz, H. (2025). Efficient Parkinson's Disease classification from speech with filter-based feature selection and Genetic Algorithm–Bayesian Optimization ensemble integration. PeerJ Computer Science.

License
-------
This project is released under the MIT License.