INSTRUCTIONS FOR REPLICATION OF RESULTS

Paper: Prediabetes Risk Classification Algorithm via Carotid Bodies and K-means Clustering Technique
Authors: Rafael F. Pinheiro, Maria P. Guarino, Marlene Lages and Rui Fonseca Pinto

The aim of this work is not to provide complete PRCA software or computer code, but to present the concept and give the scientific community the key parameters needed for replication. To this end, the raw data are provided together with base codes in Python (for obtaining clusters via the K-means method) and Matlab (for data balancing). Some organized data, derived from the raw data, are also provided as input for the codes. The following details explain how to run the code and apply Algorithms 1, 2 and 3 suggested in this work:

1. A Python base code for obtaining the clusters via the K-means method is provided in "code_clusters_CBmeter.ipynb". It returns the complete time series, i.e. from minute 1 to minute 80. Algorithm 1 is a suggestion for how to use this base code to obtain all the clusters (that is, HR, RR and RRxHR) and to extract minutes 11, 12 and 13. As an example of input data for "code_clusters_CBmeter.ipynb", the CSV file "control_brutos_HR.csv" is provided. It contains the selected HR time series of the 25 control volunteers (similar to Table 1, but for control volunteers), with the time in the first column. Using this input data, one obtains the HR clusters for the 25 control volunteers.

2. Algorithm 2 requires the clusters produced by "code_clusters_CBmeter.ipynb", which can be obtained by following Algorithm 1. Algorithm 2 is suggested for performing the steps needed to calculate SL, W, the Score Matrix, and the Maximum Risk.
Also in Algorithm 2, for this work, the number of cluster names (n) is 5 and the number of variables (m) is 3 (HR, RR, RRxHR). Because n and m are parameters, the procedure scales to a larger number of cluster types and variables.

3. The processes presented in Algorithms 1 and 2 set up the mechanism for building and training the PRCA. Once this is done, the PRCA operates as suggested by Algorithm 3 in conjunction with Algorithm 1, returning the risk for a given patient whose HR and RR information must be entered using a D matrix. The $\Psi$ matrix and the maximum risk number ($\chi$) are provided by Algorithm 2, which needs to run again, in conjunction with Algorithm 1, only if a new training process is required (for example, if more samples become available for training). Note that for every new patient whose risk of diabetes is to be checked, it is sufficient to run Algorithm 3.

4. The validation process is done separately via 4-fold cross validation (see the Performance and Validation section in the paper) and uses the Matlab codes "ADASYN.m" and "CBadasyn.m" for data augmentation and to balance the data between the control volunteers and the volunteers with (pre)diabetes. As an example, the file "HRADA.mat" is provided as input data for the HR variable. The same should be done with the RR and RRxHR data. Once the clusters have been obtained, validation is carried out using the data from the 50 volunteers. Note that only the 8 real volunteers with (pre)diabetes were used to build and train the algorithm.
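
As a rough illustration of the flow described in item 1 (read a CSV with the time in the first column, cluster the series with K-means, then extract minutes 11, 12 and 13), a minimal standard-library sketch is given below. It is not the code in "code_clusters_CBmeter.ipynb". The column names "time" and "V1", the synthetic 20-minute series, the helper names read_column and kmeans_1d, the choice to cluster one volunteer's values along time, and the use of k = 2 instead of the 5 cluster names used in this work are all assumptions made only to keep the example small.

```python
import csv
import io
import random


def read_column(csv_text, column):
    """Read one volunteer column from CSV text whose first column is 'time' (assumed layout)."""
    times, values = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        times.append(int(row["time"]))
        values.append(float(row[column]))
    return times, values


def kmeans_1d(values, k, n_iter=100, seed=0):
    """Plain Lloyd's K-means on scalar values; returns one label per value and the centroids."""
    rng = random.Random(seed)
    # Start from k distinct observed values so that no two centroids coincide.
    centroids = rng.sample(sorted(set(values)), k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: (v - centroids[c]) ** 2) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Synthetic stand-in for an input file: baseline near 60, a rise at minutes 11 to 13.
csv_text = "time,V1\n" + "\n".join(
    f"{m},{(100 + m - 11) if 11 <= m <= 13 else 60 + m % 3}" for m in range(1, 21)
)

times, hr = read_column(csv_text, "V1")
labels, centroids = kmeans_1d(hr, k=2)

# Clusters are returned for the whole series; minutes 11, 12 and 13 are then extracted.
label_at = dict(zip(times, labels))
selected = {m: label_at[m] for m in (11, 12, 13)}
print(selected)
```

The cluster indices themselves are arbitrary (they depend on the initial centroids), so only the grouping is meaningful: here minutes 11 to 13 fall in one cluster and the baseline minutes in the other.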