**From Prediction to Trust: Enhancing Deep Learning Models for Insurance Fraud Detection through Uncertainty Quantification**

**Description**

This project provides a comprehensive Python implementation for auto insurance fraud detection using deep neural networks enriched with uncertainty quantification and interpretability techniques. The code trains and evaluates three advanced modeling approaches – Monte Carlo Dropout (MCD), Deep Ensembles, and Ensemble Monte Carlo Dropout (EMCD) – on an imbalanced insurance claims dataset. It demonstrates how data resampling (Random Over-Sampling and SMOTE) can improve fraud detection for the minority class, and quantifies the impact of resampling on model uncertainty. To make the model's decisions more transparent, the script also integrates explainability tools (SHAP and LIME) that show feature importance for individual predictions. The output includes high-resolution visualizations and detailed performance metrics, aiding reproducibility and providing insight into model confidence and behavior.

**Dataset**

* **Source:** Real-world auto insurance claim dataset originally provided by Oracle (publicly available as the "fraud_oracle" dataset on platforms such as Kaggle and Figshare). The code downloads the CSV from a GitHub repository.
* **Records:** 15,420 insurance claims spanning two years in the U.S., of which 923 are labeled as fraudulent and 14,497 as legitimate (highly imbalanced, ~6% fraud).
* **Features:** 33 original columns per claim (policy and vehicle details, claim characteristics, and the fraud label). Prior to modeling, 4 identifier/redundant columns (RepNumber, PolicyNumber, Age, Year) are removed, leaving the fraud label plus 28 predictor features used for training. Categorical features (e.g. Month, DayOfWeek) are ordinal-encoded with a meaningful ordering, and numeric features are imputed and scaled (StandardScaler) as part of preprocessing.

**Software Dependencies**

To run the code, you will need Python 3.x and the following libraries/packages installed:

* **Data Manipulation:** pandas, numpy
* **Machine Learning:** scikit-learn, imbalanced-learn (for RandomOverSampler and SMOTE)
* **Deep Learning:** tensorflow (Keras API, tested with TensorFlow 2.x), keras-tuner (for optional hyperparameter tuning)
* **Explainability:** shap, lime
* **Visualization:** matplotlib, seaborn, plotly, bokeh
* **Additional (Optional):** xgboost, lightgbm, h2o (used in the notebook for optional model comparisons or ensemble methods)
* **Utility:** gdown (used to download the dataset)

**Note:** The code was originally developed in a Jupyter/Colab notebook environment, so some interactive visualization libraries (Plotly, Bokeh) and JS-based outputs (SHAP force plots) are best viewed in Jupyter. However, the core training and evaluation will run in any Python environment given the above dependencies.

**Installation**

You can install the required packages using pip. For example:

**pip install pandas numpy scikit-learn imbalanced-learn tensorflow keras-tuner shap lime matplotlib seaborn plotly bokeh xgboost lightgbm h2o gdown**

It is recommended to use a virtual environment (or conda environment) to avoid version conflicts. Ensure that the installed TensorFlow version is compatible with your Python version (the code is tested with TensorFlow 2.x). If you intend to use GPU acceleration for training, install the GPU-enabled TensorFlow build and the necessary CUDA drivers; the code also runs on CPU – it was tested on a 13th-gen Intel Core i7 CPU without a GPU.
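For orientation, obtaining and loading the dataset can be sketched as follows. This is a minimal sketch, not the script's exact code: the download URL is a placeholder for the link embedded in the script, and the target column name FraudFound_P follows the public fraud_oracle CSV.

```python
import os
import pandas as pd

CSV_PATH = "fraud_oracle.csv"

if not os.path.exists(CSV_PATH):
    # The script fetches the file with gdown on first run; the URL here is a placeholder.
    import gdown
    gdown.download("https://example.com/fraud_oracle.csv", CSV_PATH, quiet=False)

df = pd.read_csv(CSV_PATH)
print(df.shape)                             # expected: (15420, 33)
print(df["FraudFound_P"].value_counts())    # ~14,497 legitimate vs. 923 fraudulent claims

# Drop the identifier/redundant columns named in the Dataset section.
df = df.drop(columns=["RepNumber", "PolicyNumber", "Age", "Year"])
```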
**Running the Code**

* **Obtain the Dataset:** The script automatically downloads the dataset file (fraud_oracle.csv) via gdown from a GitHub repository, so make sure you have an internet connection the first time you run it. If you prefer, you can manually download the dataset (e.g., from Kaggle's "Vehicle Insurance Claim Fraud Detection" dataset) and place fraud_oracle.csv in the working directory – the code will read the local file if it exists.
* **Prepare the Environment:** Install the dependencies as above. Running the project in a Jupyter Notebook or Google Colab is recommended for the best experience, since some output (such as SHAP force plots) is interactive. If running as a standalone script (paper_of_peerj_code_.py), ensure the environment is properly set up and note that interactive visuals may not render – static plots will still be generated.
* **Execute the Script/Notebook:** If using Jupyter/Colab, open the notebook (or import the .py script) and run the cells sequentially. If using the Python script directly, run it from the command line: python paper_of_peerj_code_.py. The code prints progress and results to the console and displays plots. Training the neural networks (especially the 50-member ensemble and the 300 Monte Carlo dropout iterations) may take some time, but it is feasible on a standard PC.
* **Reproducing Results:** The random seeds for NumPy and TensorFlow are fixed (np.random.seed(42) and tf.random.set_seed(42)) to ensure reproducibility of results, as in the sketch below. By running the code as provided, you should obtain the same performance metrics and figures as reported in the associated research manuscript. The final outputs include printed metrics and a series of figures (described below).
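For reference, the seed setup looks roughly like this (a sketch; the exact placement within paper_of_peerj_code_.py may differ):

```python
# Fix the random seeds before any data splitting or model training so that
# repeated runs reproduce the reported metrics and figures.
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)
```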
**Project Structure and Features**

* **Data Preprocessing:** The script handles all preprocessing automatically. It splits the data into a training set (70%) and a testing set (30%). Categorical features are transformed via a ColumnTransformer – time-related features (such as Month and DayOfWeek) are ordinal-encoded using a custom logical order, the remaining categorical features are ordinal-encoded in arbitrary order, and numeric features (e.g. Deductible) are imputed for missing values and scaled. The preprocessing pipeline prints the encoding mappings for transparency; no manual preprocessing is required from the user.
* **Model Training:** Three neural-network-based models are trained: (1) a single model with Monte Carlo Dropout (MCD) enabled, (2) a Deep Ensemble of 50 separately trained models, and (3) an Ensemble Monte Carlo Dropout (EMCD) model combining both approaches. All models share a common base architecture: a fully connected feedforward network with 28 input features, two hidden layers (64 and 32 neurons by default, ReLU activation), and a single sigmoid output neuron for binary fraud classification. Dropout layers (rate 0.25) are inserted after each hidden layer in the MCD and EMCD models. Training runs for 100 epochs using the Adam optimizer (learning rate 0.001) and binary cross-entropy loss. The code ensures reproducibility and prints training progress. (A minimal sketch of this architecture and the Monte Carlo sampling loop appears after this list.)
* **Handling Class Imbalance:** The dataset is highly imbalanced in favor of legitimate claims. The code addresses this by rebalancing the training data in two ways: Random Over-Sampling (ROS), which duplicates minority-class examples, and SMOTE, which synthetically generates minority examples. The script trains and evaluates each model under three scenarios: the original imbalanced training set, an oversampled training set (ROS), and a SMOTE-augmented training set. This lets you compare how resampling affects the model's performance and uncertainty. The class distributions before and after resampling are printed for verification.
* **Evaluation Procedure:** After training, each model is evaluated on the held-out test set (which remains imbalanced to reflect real-world conditions). Rather than using the default 0.5 probability cutoff, the code determines an optimal classification threshold by sweeping through candidate thresholds and finding the point where accuracy, sensitivity, and specificity are most balanced (i.e., the variance among them is minimized). Using this chosen threshold (in our experiments, a threshold around ~0.07 was optimal due to the class imbalance), the model's probabilistic outputs are converted into binary fraud/no-fraud predictions. The script then computes a range of performance metrics on the test predictions, including Accuracy, Sensitivity (recall for the fraud class), Specificity (recall for the non-fraud class), Precision, and F1-score. These metrics are printed, or can be retrieved, to assess the model's effectiveness in catching frauds versus raising false alarms.
* **Uncertainty Quantification:** A key feature of this code is the measurement of prediction uncertainty. For the MCD model, the script uses Monte Carlo Dropout sampling – it performs T = 300 forward passes with dropout enabled on each test sample. This yields a distribution of predicted probabilities for each instance; the mean of the 300 samples is taken as the final prediction and their standard deviation is recorded as the model's uncertainty for that prediction. For the Deep Ensemble, the script collects predictions from 50 independently trained networks; the mean and standard deviation of these ensemble predictions provide the analogous prediction and uncertainty measures. The EMCD model combines both: it aggregates predictions across multiple networks and multiple dropout iterations (capturing both model and sampling uncertainty). The code then analyzes how often the models are "certain and correct" versus "uncertain or wrong" using an Uncertainty Confusion Matrix framework. Each test prediction is assigned to one of four categories – correct & certain, incorrect & uncertain, incorrect & certain, and correct & uncertain – based on whether the prediction was correct and whether its uncertainty exceeds a given threshold. From these counts, uncertainty-specific metrics are calculated: Uncertainty Sensitivity (the fraction of wrong predictions that the model flagged as uncertain), Uncertainty Specificity (the fraction of correct predictions that were flagged as certain), Uncertainty Precision (the proportion of predictions flagged as uncertain that were actually wrong), and Uncertainty Accuracy (the overall fraction of cases where the model's correctness aligned with its certainty/uncertainty flag). These metrics provide a quantitative evaluation of the model's ability to know what it doesn't know. (See the sketch after this list for how these metrics are computed.)
* **Interpretability (SHAP & LIME):** To interpret the model's decisions, the code uses SHAP (SHapley Additive exPlanations) values. After training, a SHAP explainer is created (e.g., a TensorFlow DeepExplainer or a KernelExplainer wrapped around the neural network) to compute feature attributions for the predictions. The script includes examples of generating SHAP force plots for individual predictions – these visualizations highlight which features contributed most to classifying a specific claim as fraudulent or not. The shap.initjs() call initializes the JavaScript output needed for these plots (which display in a Jupyter notebook). The LIME library is also imported and could be used to produce local interpretability explanations, though the provided code focuses on SHAP. These tools help users and analysts understand why the model made a particular prediction, increasing trust by linking domain features (such as claim amount or accident day) to the prediction outcome.
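The following is a minimal sketch – not the authors' exact code – of the MCD pieces described above: the base architecture, the 300-pass Monte Carlo Dropout sampling, and the uncertainty confusion-matrix metrics. Variable names such as X_train, y_train, X_test, and y_test stand in for the preprocessed arrays, and the uncertainty threshold in the usage comments is illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_mcd_model(n_features=28, dropout_rate=0.25):
    """Base network: 64/32 ReLU hidden layers with dropout after each, sigmoid output."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

def mc_dropout_predict(model, X, n_samples=300):
    """T stochastic forward passes with dropout active; returns per-instance mean and std."""
    preds = np.stack([model(X, training=True).numpy().ravel()
                      for _ in range(n_samples)])        # shape: (T, n_test)
    return preds.mean(axis=0), preds.std(axis=0)

def uncertainty_confusion_metrics(y_true, y_pred, std, unc_threshold):
    """USen / USpe / UPre / UAcc as defined in the Uncertainty Quantification item above."""
    correct = np.asarray(y_pred) == np.asarray(y_true)
    uncertain = np.asarray(std) > unc_threshold
    tu = np.sum(~correct & uncertain)    # incorrect and flagged uncertain
    tc = np.sum(correct & ~uncertain)    # correct and flagged certain
    fu = np.sum(correct & uncertain)     # correct but flagged uncertain
    fc = np.sum(~correct & ~uncertain)   # incorrect but flagged certain
    return {"USen": tu / max(tu + fc, 1),
            "USpe": tc / max(tc + fu, 1),
            "UPre": tu / max(tu + fu, 1),
            "UAcc": (tu + tc) / len(correct)}

# Illustrative usage (X_train, y_train, X_test, y_test are the preprocessed arrays):
# model = build_mcd_model()
# model.fit(X_train, y_train, epochs=100, verbose=0)
# mean_prob, std = mc_dropout_predict(model, X_test, n_samples=300)
# y_pred = (mean_prob >= 0.07).astype(int)   # ~0.07: the threshold found by the sweep
# print(uncertainty_confusion_metrics(y_test, y_pred, std, unc_threshold=0.1))
```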
**Output and Visualization**

When you run the code to completion, it generates extensive output to help evaluate and understand the models:

**Performance Metrics:** The console shows the train/test split sizes and the class distributions after resampling. After model evaluation, it prints metrics such as accuracy, sensitivity (recall), specificity, and precision for each model under each data scenario (Original, ROS, SMOTE). This includes the optimal threshold chosen for classification and, where applicable, a classification report for each case.

**Uncertainty Metrics:** The code also outputs (or plots) the uncertainty confusion matrix results. For example, it may display how many test cases fell into each category (certain-correct, certain-incorrect, etc.) and the computed values of USen, USpe, UPre, and UAcc, giving a sense of how well calibrated the model's confidence is.

**Plots and Figures:** A variety of figures are generated to visualize model predictions and uncertainties:

* **Predicted Probability Distributions:** Histograms (and/or violin plots) of predicted fraud probabilities for the test set, separated by true class. These illustrate how well the model separates fraudulent from legitimate claims in terms of predicted-score distribution.
* **Monte Carlo Sample Histograms:** For a given test instance, the script plots a histogram of the 300 Monte Carlo dropout sample predictions, overlaying the mean prediction and indicating the true class. This shows the spread of predictions due to dropout stochasticity. Similar plots are produced for instances in the oversampled and SMOTE scenarios, to visualize how uncertainty increases with resampling.
* **Threshold vs. Metrics Curve:** A line chart generated by sweeping the classification threshold from 0 to 1 and plotting accuracy, sensitivity, and specificity. The point where these curves converge (i.e., where the variance among them is minimal) is marked as the chosen optimal threshold. This visualization helps explain why a threshold much lower than 0.5 is used, given the class imbalance. (A sketch of the sweep follows at the end of this section.)
* **2D Uncertainty Scatter Plot:** A 2D density or scatter plot (e.g., a KDE plot) of predicted probability versus uncertainty (standard deviation) for all test predictions, with points color-coded by whether the prediction was correct or incorrect. This provides an at-a-glance view of the relationship between model confidence and accuracy – ideally, incorrect predictions cluster at high uncertainty.
* **SHAP Explanation Plots:** For interpretability, the script can display SHAP force plots or bar charts for sample predictions. In a force plot, features pushing the prediction toward fraud or non-fraud are visualized along with their contribution values. These plots open in the Jupyter notebook interface (they require an interactive view) and are useful for case studies – for example, explaining why the model flagged a particular claim as fraud (perhaps due to an unusual claim amount, policy history, etc.).

All figures are either displayed inline (if using a notebook) or saved to files (e.g., high-resolution PNG images) as indicated in the code. For instance, the combined uncertainty vs. probability KDE plot is saved as class_UNCERTAILY.png in the working directory. You can refer to these outputs to verify the results reported in the paper and to gain intuition about model behavior under different training conditions.
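For reference, the threshold sweep behind the Threshold vs. Metrics Curve can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and y_test and mean_prob are assumed to come from the evaluation step.

```python
import numpy as np

def choose_balanced_threshold(y_true, probs, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the cutoff where accuracy, sensitivity, and specificity are most balanced."""
    best_t, best_var = 0.5, np.inf
    for t in grid:
        pred = (probs >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        acc = (tp + tn) / len(y_true)
        sen = tp / max(tp + fn, 1)    # recall on the fraud class
        spe = tn / max(tn + fp, 1)    # recall on the legitimate class
        var = np.var([acc, sen, spe])
        if var < best_var:            # smallest spread among the three metrics
            best_t, best_var = t, var
    return best_t

# threshold = choose_balanced_threshold(np.asarray(y_test), mean_prob)
# (a value around ~0.07 is expected for this imbalanced dataset, per the text above)
```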
**Reproducing and Modifying**

The code is modular and heavily commented, so it is straightforward to adjust parameters or extend the analysis. You can change the random seed for a different train/test split, experiment with different numbers of Monte Carlo iterations or ensemble members, or plug in a different dataset. The keras-tuner import indicates that hyperparameter tuning can be incorporated if desired. Additionally, you may use the imported H2O, XGBoost, and LightGBM packages to compare the deep-learning approaches with tree-based models – such comparisons were optional in this project, but the environment is set up for them. Feel free to use this script as a template for uncertainty-aware fraud detection, or to integrate parts of it (such as the uncertainty evaluation framework sketched earlier, or the SHAP analysis sketched below) into your own projects. The combination of resampling strategies, ensemble methods, and uncertainty metrics showcased here can serve as a guide for building more trustworthy AI systems in fraud detection and beyond.
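As a starting point for such reuse, here is a minimal sketch of a SHAP analysis with KernelExplainer (one of the explainer options mentioned above). The background/sample sizes and the variable names (model, X_train, X_test, feature_names) are illustrative placeholders rather than the authors' exact settings.

```python
import shap

shap.initjs()  # enable the JS-based force plots inside a Jupyter notebook

# A small background sample keeps KernelExplainer tractable (size is illustrative).
background = shap.sample(X_train, 100)
predict_fn = lambda data: model.predict(data).ravel()   # predicted fraud probability
explainer = shap.KernelExplainer(predict_fn, background)

X_explain = X_test[:50]                                  # explain a handful of test claims
shap_values = explainer.shap_values(X_explain)

# Global view: which features drive the predicted fraud probability overall.
shap.summary_plot(shap_values, X_explain, feature_names=feature_names)

# Local view: force plot for a single claim (renders interactively in Jupyter).
shap.force_plot(explainer.expected_value, shap_values[0], X_explain[0],
                feature_names=feature_names)
```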