
Evaluation of performances of algorithms

We selected the seven filter-based feature selection algorithms most often used in radiomics studies. These filters can be grouped into three categories: those from the statistical field, including the Pearson correlation coefficient (abbreviated as Pearson in the manuscript) and the Spearman correlation coefficient (Spearman); those based on random forests, including Random Forest Variable Importance (RfVarImp) and Random Forest Permutation Importance (RfPerImp); and those based on information theory, including Joint Mutual Information (JMI), Joint Mutual Information Maximization (JMIM) and Minimum-Redundancy-Maximum-Relevance (MRMR).

These methods rank features, and the top-ranked features are then kept for modeling. Three numbers of selected features were investigated in this study: 10, 20 and 30.
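As an illustration of this ranking step, here is a minimal sketch of a statistical filter using the pandas API cited in the implementation section below; the names X, y and k are placeholders, not the authors' code:

```python
import pandas as pd

def pearson_filter(X: pd.DataFrame, y: pd.Series, k: int = 10) -> list:
    """Rank features by |Pearson correlation| with the outcome and keep the top k."""
    # Absolute correlation of every feature with the binary outcome
    scores = X.corrwith(y, method="pearson").abs()
    # Keep the names of the k best-ranked features
    return scores.sort_values(ascending=False).head(k).index.tolist()

# The Spearman filter only changes the method argument:
# X.corrwith(y, method="spearman")
```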

Moreover, in order to estimate the impact of the feature selection step, two non-informative feature selection algorithms were used as benchmarks: no selection, which keeps all features (All), and a random selection of a given number of features (Random).
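The two benchmarks require no scoring at all; a possible sketch, again with placeholder names:

```python
import numpy as np

def select_all(X):
    # "All": keep every feature, i.e. no selection at all
    return list(X.columns)

def select_random(X, k, seed=0):
    # "Random": draw k feature names uniformly without replacement
    rng = np.random.default_rng(seed)
    return list(rng.choice(X.columns, size=k, replace=False))
```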

Fourteen machine-learning or statistical binary classifiers were tested, among those most often used in radiomics studies: K-Nearest Neighbors (KNN); five linear models including Linear Regression (Lr), three penalized linear regressions (Lasso Penalized Linear Regression (LrL1), Ridge Penalized Linear Regression (LrL2) and Elastic-net Linear Regression (LrElasticNet)) and Linear Discriminant Analysis (LDA); Random Forest (RF); AdaBoost and XGBoost; three support vector classifiers including Linear Support Vector Classifier (LinearSVC), Polynomial Support Vector Classifier (PolySVC) and Radial Support Vector Classifier (RSVC); and two Bayesian classifiers including Binomial Naive Bayes (BNB) and Gaussian Naive Bayes (GNB).
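Most of these classifiers map onto scikit-learn and xgboost estimators, the libraries cited in the implementation section below. The sketch instantiates a subset with default settings only as placeholders (the actual hyperparameters are tuned later); mapping the penalized linear models to penalized logistic regressions and BNB to BernoulliNB is an assumption, since the paper does not list the exact classes used:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from xgboost import XGBClassifier

classifiers = {
    "KNN": KNeighborsClassifier(),
    # Penalized linear models, shown here as penalized logistic regressions (assumption)
    "LrL1": LogisticRegression(penalty="l1", solver="liblinear"),
    "LrL2": LogisticRegression(penalty="l2"),
    "LrElasticNet": LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5),
    "LDA": LinearDiscriminantAnalysis(),
    "RF": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(),
    "LinearSVC": LinearSVC(),
    "PolySVC": SVC(kernel="poly", probability=True),
    "RSVC": SVC(kernel="rbf", probability=True),
    "BNB": BernoulliNB(),  # "Binomial" Naive Bayes mapped to BernoulliNB (assumption)
    "GNB": GaussianNB(),
}
```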

In order to estimate the performance of each of the 126 combinations of the nine feature selection algorithms with the fourteen classification algorithms, each combination was trained using a grid-search and nested cross-validation strategy15 as follows.

First, datasets were randomly split into three folds, stratified on the diagnostic value so that each fold had the same diagnostic distribution as the population of interest. Each fold was used in turn as the test set while the two remaining folds were used as training and cross-validation sets.

Ten-fold cross-validation and grid-search were used on the training set to tune the hyperparameters maximizing the area under the receiver operating characteristic curve (AUC). The best hyperparameters were then used to train the model on the whole training set.
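A condensed sketch of this nested procedure, assuming X is a feature DataFrame and y a binary label Series; the pipeline and parameter grid shown are illustrative placeholders rather than the authors' configuration:

```python
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative pipeline and grid; the study loops over every feature
# selection / classifier combination with its own hyperparameter grid.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # stratified outer folds
for train_idx, test_idx in outer.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Inner loop: ten-fold grid-search maximizing ROC AUC
    search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=10)
    search.fit(X_train, y_train)  # refits the best model on the whole training set

    best = search.best_estimator_
    auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
```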

To take overfitting into account, the metric used was the test-fold AUC penalized by the absolute value of the difference between the test-fold and train-fold AUCs:

$$\text{AUC}_{\text{Cross-Validation}} = \text{AUC}_{\text{Test-Fold}} - \left| \text{AUC}_{\text{Test-Fold}} - \text{AUC}_{\text{Train-Fold}} \right|$$
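In code, this penalty amounts to a single line; the worked example shows how an overfit model is pulled down:

```python
def penalized_auc(auc_test: float, auc_train: float) -> float:
    # AUC_Cross-Validation = AUC_Test-Fold - |AUC_Test-Fold - AUC_Train-Fold|
    return auc_test - abs(auc_test - auc_train)

# An overfit model (train AUC 0.95, test AUC 0.80) is penalized down to about 0.65
print(penalized_auc(0.80, 0.95))
```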

This procedure was repeated for each of the ten datasets, for three different train-test splits and the three different numbers of selected features.

Each combination of algorithms yielded 90 (3 × 3 × 10) AUCs, apart from combinations using the All feature selection, which were associated with only 30 AUCs because no number of selected features applies to them, and combinations using the Random feature selection, which was repeated three times and therefore yielded 270 AUCs. Hence, in total, (7 × 14 × 90) + (14 × 30) + (14 × 270) = 13,020 AUCs were calculated.

Multifactor ANalysis of VAriance (ANOVA) was used to quantify the variability of the AUC associated with the following factors: dataset, feature selection algorithm, classifier algorithm, number of features, train-test split, imaging modality, and the interactions classifier / dataset, classifier / feature selection, dataset / feature selection, and classifier / feature selection / dataset. The proportion of variance explained was used to quantify the impact of each factor or interaction. Results are given as frequency (proportion (%)) or range (minimum value; maximum value).
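The paper does not name an ANOVA implementation; one possible sketch uses statsmodels on an assumed `results` DataFrame holding one row per computed AUC, with the proportion of variance explained by each term taken as its sum of squares over the total:

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

# `results` is assumed to hold one row per computed AUC, with the factors as columns
model = ols(
    "auc ~ C(dataset) + C(feature_selection) + C(classifier) + C(n_features)"
    " + C(split) + C(modality)"
    " + C(classifier):C(dataset) + C(classifier):C(feature_selection)"
    " + C(dataset):C(feature_selection)"
    " + C(classifier):C(feature_selection):C(dataset)",
    data=results,
).fit()

anova = sm.stats.anova_lm(model, typ=2)
# Proportion of variance explained by each factor / interaction
anova["prop_var"] = anova["sum_sq"] / anova["sum_sq"].sum()
```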

For each feature selection, classifier, dataset and train-test split, the median AUC, 1st quartile (Q1) and 3rd quartile (Q3) were computed. Box-plots were used to visualize the results.

In addition, for feature selection algorithms and classifiers, a Friedman test16 followed by post-hoc pairwise Nemenyi-Friedman tests was used to compare the median AUCs of the algorithms.
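The paper does not specify how these tests were run; one option is scipy's Friedman test together with the Nemenyi-Friedman post-hoc test from the scikit-posthocs package, applied to an assumed `median_aucs` DataFrame with one column per algorithm:

```python
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# `median_aucs`: one row per dataset (block) and one column per algorithm being compared
stat, p_value = friedmanchisquare(*(median_aucs[col] for col in median_aucs.columns))

# Pairwise Nemenyi-Friedman post-hoc comparisons (matrix of p-values)
pairwise_p = sp.posthoc_nemenyi_friedman(median_aucs.values)
```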

Heatmaps were generated to illustrate the results for each feature selection and classifier combination.
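Such a heatmap can be produced, for example, from the same assumed `results` DataFrame with pandas and seaborn, neither of which the paper claims for this step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Median AUC for every feature selection / classifier combination
pivot = results.pivot_table(index="feature_selection", columns="classifier",
                            values="auc", aggfunc="median")
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="viridis")
plt.show()
```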

All the algorithms were implemented in Python (version 3.8.8). Pearson and Spearman correlations were computed using Pandas (1.2.4), the XGBoost algorithm using xgboost (1.5) and JMI, JMIM and MRMR algorithms using MIFS. All other algorithms were implemented using the scikit-learn library (version 0.24.1). Data were standardized by centering and scaling using scikit-learn StandardScaler.
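A minimal sketch of this standardization step; fitting the scaler on the training folds only and reusing its statistics on the test fold is a standard precaution and an assumption here, as the paper does not detail it:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # center and scale with training-fold statistics
X_test_std = scaler.transform(X_test)        # apply the same statistics to the test fold
```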
