Archive for the ‘Machine Learning’ Category

Machine learning to improve prognosis prediction of EHCC | JHC – Dove Medical Press

Introduction

Hepatocellular carcinoma (HCC), the fourth leading cause of cancer-related death worldwide, typically occurs in patients with chronic liver disease and is an aggressive disease with dismal prognosis.1 Over the past decades, improved surveillance programs and imaging techniques have led to early HCC (EHCC) diagnosis in 40–50% of patients, at a stage amenable to potentially curative therapies: resection, transplantation or ablation.2,3 Generally, EHCC is expected to have an excellent outcome after radical therapies. Since total hepatectomy eliminates both the diseased liver and the tumor, liver transplantation (LT) offers the highest chance of cure, with a survival of up to 70% at 10 years in selected cases, and remains the best treatment for EHCC.4 Unfortunately, the critical shortage of donor organs represents the main limitation of LT and results in long waiting times.

According to clinical practice guidelines, liver resection (LR) is the recommended first-line option for patients with EHCC and preserved liver function, although ablation is an alternative treatment modality.3,5,6 The prognosis following LR may vary even among patients with EHCC, and two competing causes of death (tumor recurrence and liver dysfunction) both influence survival.7 Several HCC staging systems have been proposed to pair prognostic prediction with treatment allocation; however, these proposals, such as the Barcelona Clinic Liver Cancer (BCLC) staging, China Liver Cancer (CNLC) staging, Hong Kong Liver Cancer (HKLC) staging and Cancer of the Liver Italian Program (CLIP) score, are not derived from surgically managed patients, except for the American Joint Committee on Cancer (AJCC) system and Japan Integrated Staging (JIS) score, and therefore exhibit modest prognostic accuracy for resected cases.6–9 A few prognostic models have been developed based on readily available patient and tumor characteristics; however, they are by nature outmoded and rigid tools because all determinants were examined with conventional statistical methods (ie, Cox proportional hazards regression) and assigned fixed weights.8,10 Hence, new strategies to improve outcome prediction and treatment selection are warranted for EHCC patients.

Machine learning (ML), a subfield of artificial intelligence, leverages algorithmic methods that enable computers to learn from large-scale, heterogeneous datasets and execute a specific task without predefined rules.11 ML solutions such as the gradient boosting machine (GBM) have outperformed regression modelling in a variety of clinical situations (eg, diagnosis and prognosis).11–13 Nevertheless, the benefit of ML in predicting the prognosis of patients with resected EHCC has yet to be fully explored. Accordingly, we assembled a large, international cohort of EHCC patients to design and evaluate an ML-based model for survival prediction, and to compare its performance with existing prognostic systems.

Patients with EHCC, defined as tumor ≤5 cm without evidence of extrahepatic disease or major vascular invasion,14 were retrospectively screened from two sources: (1) Medicare patients treated with surgical therapy (LR or LT) in the Surveillance, Epidemiology, and End Results (SEER) Program, a population-based database in the United States, between 2004 and 2015; (2) consecutive patients treated with LR at two high-volume hepatobiliary centers in China (First Affiliated Hospital of Nanjing Medical University and Wuxi People's Hospital) between 2006 and 2016. The inclusion criteria were (1) adult patients aged ≥20 years; (2) histology-confirmed HCC (International Classification of Diseases for Oncology, Third Edition, histology codes 8170 to 8175 for HCC and site code C22.0 for liver);15 (3) complete survival data and a survival of ≥1 month. The exclusion criteria were (a) missing information on the type of surgical procedure; (b) another malignant primary tumor prior to HCC diagnosis; (c) unknown cause of death. The patient selection process is summarized in the flow chart of Figure 1. The study protocol was approved by the Institutional Review Boards of the First Affiliated Hospital of Nanjing Medical University and Wuxi People's Hospital. Written informed consent was waived because retrospective anonymized data were analyzed. De-identified information was used to protect patient data confidentiality. This study was conducted in accordance with the Declaration of Helsinki.

Figure 1 Analytical framework for survival prediction. (A) Flow diagram of the study cohort details. (B) A machine learning pipeline to train, validate and test the model.

The endpoint selected to develop the ML-based model was disease-specific survival (DSS), defined as the time from the date of surgery to the date of death from disease (tumor relapse or liver dysfunction). All deaths from any other cause were counted as non-disease-specific and censored at the date of the last follow-up. The follow-up protocol for the Chinese cohort included physical examination, laboratory evaluation and dynamic CT or MRI of the chest and abdomen every 3 months during the first 2 years and every 6 months thereafter. The follow-up was terminated on August 15, 2020.

Electronic and paper medical records were reviewed in detail; all pertinent demographic and clinicopathologic data were abstracted on a standardized template. The following characteristics of interest were ascertained at the time of enrollment: age, gender, race, year of diagnosis, alpha-fetoprotein level, use of neoadjuvant therapy, tumor size, tumor number, vascular invasion, histological grade, liver fibrosis score, and type of surgery.

We deployed GBM, a decision-tree-based ML algorithm that has gained popularity because of its performance and interpretability, to aggregate baseline risk factors and predict the likelihood of survival using the R package gbm. The GBM algorithm16 assembles multiple base learners in a step-wise fashion, with each successive learner fitting the residuals left over from previous learners to improve model performance: (1) F(x) = Σ_{m=1}^{M} β_m h(x; a_m), where h(x; a_m) is a base learner, typically a decision tree; (2) F_m(x) = F_{m−1}(x) + β_m h(x; a_m), where a_m denotes the optimized parameters of each base learner and β_m is the weight of each base learner in the model. Each base learner may use different variables; variables with higher relative importance are utilized in more decision trees and earlier in the boosting algorithm. The model was trained using stratified 3×3-fold nested cross-validation (3 outer folds and 3 inner folds) on the training/validation cohort; a grid search of optimal hyper-parameter settings was run using the R package mlr. Figure 1 shows the ML workflow schematically.
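As a rough illustration of this setup (a minimal sketch, not the authors' released code; the data frame `train`, the survival columns `dss_months`/`dss_event` and the covariate names are hypothetical), a Cox-type GBM with the hyperparameters described in this study could be fit in R with the gbm package along these lines; the full nested cross-validation and mlr grid search are omitted here:

```r
# Minimal sketch: gradient boosting for censored survival with the gbm package.
# Column names are hypothetical placeholders for the covariates listed above.
library(survival)
library(gbm)

set.seed(2021)
fit <- gbm(
  Surv(dss_months, dss_event) ~ age + race + afp + tumor_size + multifocal +
    vascular_invasion + grade + fibrosis_score,
  data = train,
  distribution = "coxph",   # Cox partial-likelihood loss for censored outcomes
  n.trees = 2000,           # number of sequential base learners
  interaction.depth = 3,    # tree depth, i.e. up to 3-way interactions
  shrinkage = 0.01,         # learning rate
  n.minobsinnode = 5,       # minimum observations per terminal node
  cv.folds = 3              # inner cross-validation folds
)

summary(fit, n.trees = 2000)                            # relative influence of covariates
risk_score <- predict(fit, newdata = train, n.trees = 2000)  # per-patient log-relative risk
```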

Model discrimination was quantified using Harrell's C-statistic, and 95% confidence intervals [CIs] were assessed by bootstrapping. Calibration plots were used to assess the model fit. Decision curve analysis was used to determine the clinical net benefit associated with the adoption of the model.17
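For context, a bootstrap estimate of Harrell's C-statistic of this kind might look like the sketch below, reusing the hypothetical `train` data frame and `risk_score` from the previous example (the authors' exact bootstrap procedure is not shown in the text):

```r
# Sketch of a bootstrap 95% CI for Harrell's C-statistic.
library(survival)
library(Hmisc)

c_stat <- function(d, scores) {
  # rcorr.cens expects higher values to mean better outcome, so negate the risk score
  unname(rcorr.cens(-scores, Surv(d$dss_months, d$dss_event))["C Index"])
}

set.seed(2021)
boot_c <- replicate(1000, {
  idx <- sample(nrow(train), replace = TRUE)
  c_stat(train[idx, ], risk_score[idx])
})
c_stat(train, risk_score)            # point estimate
quantile(boot_c, c(0.025, 0.975))    # bootstrap 95% CI
```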

Differences between groups were tested using the χ2 test for categorical variables and the Mann–Whitney U-test for continuous variables. Survival probabilities were assessed using the Kaplan–Meier method and compared with the log-rank test. The optimal cutoffs of the GBM predictions, used to stratify patients into low, intermediate, or high risk for disease-specific death, were determined with X-tile software version 3.6.1 (Yale University School of Medicine, New Haven, CT).18 Propensity score matching (PSM) was used to balance the LR and LT groups for EHCC in the SEER cohort using 1:1 nearest-neighbor matching with a fixed caliper width of 0.02. Cases (LR) and controls (LT) were matched on all baseline characteristics other than type of surgery using the R package MatchIt. All analyses were conducted using R software version 3.4.4 (www.r-project.org). Statistical significance was set at P<0.05; all tests were two-sided.
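A 1:1 nearest-neighbor match with a 0.02 caliper of the kind described here can be expressed with MatchIt roughly as follows (a sketch only; the data frame `seer`, the treatment indicator `lt` and the covariate names are hypothetical, and the scale of the caliper is an assumption):

```r
# Sketch of the propensity score matching step.
library(MatchIt)

m <- matchit(
  lt ~ age + gender + race + afp + tumor_size + multifocal +
    vascular_invasion + grade + fibrosis_score,
  data = seer,
  method = "nearest",   # 1:1 nearest-neighbor matching on the propensity score
  ratio = 1,
  caliper = 0.02        # fixed caliper width of 0.02, as reported
)

summary(m)                 # covariate balance before and after matching
matched <- match.data(m)   # matched cohort for downstream survival analysis
```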

A total of 2778 EHCC patients (2082 males and 696 females; median age, 60 years; interquartile range [IQR], 54–67 years) treated with LR were identified and divided into 1899 for the training/validation (SEER) cohort and 879 for the test (Chinese) cohort. Patient characteristics of the training/validation and test cohorts are summarized in Table 1. There were 625 disease-related deaths recorded (censored, 67.1%) during a median (IQR) follow-up time of 44.0 (26.0–74.0) months in the SEER cohort, and 258 deaths were recorded (censored, 70.6%) during a median (IQR) follow-up of 52.5 (35.8–76.0) months in the Chinese cohort. Baseline characteristics and post-resection survival differed between the cohorts.

Table 1 Baseline Characteristics in the Training/Validation and Test Cohorts

We investigated 12 potential model covariates using the GBM algorithm. According to the results of nested cross-validation, we utilized 2000 decision trees sequentially, with at least 5 observations in the terminal nodes of the trees; the decision tree depth was optimized at 3, corresponding to 3-way interactions, and the learning rate was optimized at 0.01. Covariates with a relative influence greater than 5 (age, race, alpha-fetoprotein level, tumor size, multifocality, vascular invasion, histological grade and fibrosis score) were integrated into the final model developed to predict DSS (Figure 2A and B).
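Continuing the earlier hypothetical gbm sketch, the covariate screen described here (relative influence greater than 5) would amount to something like the following; the `fit` object and column names are the assumed ones from that sketch:

```r
# Keep covariates whose relative influence exceeds 5 and build the formula
# for the final model on this reduced covariate set.
ri <- summary(fit, n.trees = 2000, plotit = FALSE)   # data frame with var and rel.inf
keep <- as.character(ri$var[ri$rel.inf > 5])
keep

final_formula <- as.formula(
  paste("Surv(dss_months, dss_event) ~", paste(keep, collapse = " + "))
)
```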

Figure 2 Overview of the machine-learning-based model. (A) Relative importance of the variables included in the model. (B) Illustrative example of the gradient boosting machine (GBM). GBM builds the model by combining predictions from a massive number of decision-tree base learners (stumps) in a step-wise fashion. The GBM output is calculated by adding up the predictions attached to the terminal nodes of all 2000 decision trees that the patient traverses. (C) Performance of the GBM model as compared with that of American Joint Committee on Cancer (AJCC) staging in the internal validation group. (D) Online model deployment based on GBM output.

The final GBM model demonstrated good discriminatory ability in predicting post-resection survival specific for EHCC, with a C-statistic of 0.738 (95% CI 0.717–0.758), and outperformed the 7th and 8th editions of the AJCC staging system (P<0.001) in the training/validation cohort (Table 2). The internal validation group comprised the 3×3-fold nested cross-validation of the final model in the training cohort, with 211 patients in each fold. For the composite outcome, the GBM model yielded a median C-statistic of 0.727 (95% CI 0.706–0.761) and performed better than the AJCC staging systems (P<0.05) in the internal validation group (Figure 2C). In the test cohort, the GBM model provided a C-statistic of 0.721 (95% CI 0.689–0.752) in predicting DSS after resection of EHCC and was clearly superior to the AJCC, BCLC, CNLC, HKLC, CLIP and JIS systems (P<0.05). Note that prediction scores differed between the training/validation and test sets (P<0.001) (Figure S1). The discriminatory performance of the ML-based model exceeded that of the AJCC staging systems even in sub-cohorts stratified by covariate integrity (complete/missing) (Table S1). Furthermore, the GBM model exhibited greater ability to discriminate survival probabilities than simple prognostic strategies, such as multifocal EHCC with vascular invasion indicating a dismal prognosis following LR, in sub-cohorts with complete strategy-related information (P<0.001) (Table S2).

Table 2 Performance of GBM Model and Staging Systems

Calibration plots showed excellent agreement between model-predicted and observed survival in both the training/validation and test cohorts (Figure S2A and B). Decision curve analysis demonstrated that the GBM model provided better clinical utility for EHCC in designing clinical trials than the "treat all" or "treat none" strategies across the majority of the range of reasonable threshold probabilities (Figure S2C and D). The model is publicly accessible on GitHub (https://github.com/radgrady/EHCC_GBM), with an app (https://mlehcc.shinyapps.io/EHCC_App/) that provides survival estimates at the individual patient level (Figure 2D).

We utilized X-tile analysis to generate two optimal cut-off values (6.35 and 5.32 in GBM predictions, Figure S3) that separated EHCC patients into 3 strata with markedly different probabilities of post-resection survival in the training/validation cohort: low risk (760 [40.0%]; 10-year DSS, 75.6%), intermediate risk (948 [49.9%]; 10-year DSS, 41.8%), and high risk (191 [10.1%]; 10-year DSS, 5.7%) (P<0.001). In the test cohort, the aforementioned 3 prognostic strata defined by the GBM model were confirmed: low risk (634 [72.1%]; 10-year DSS, 69.0%), intermediate risk (194 [22.1%]; 10-year DSS, 37.9%), and high risk (51 [5.8%]; 10-year DSS, 4.7%) (P<0.001) (Table 3). Visual inspection of the survival curves again revealed that, compared with the 8th edition AJCC criteria, the GBM model provided better prognostic stratification in both the training/validation and test cohorts (Figure 3). Differences in baseline patient characteristics according to the risk groups defined by the GBM model are summarized in Table S3.
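A sketch of how the reported cut-offs could translate into risk groups and survival comparisons is shown below; the cut-off values are taken from the text, while the direction of the score (higher = higher risk), the `train` data frame, `risk_score` and the column names are assumptions carried over from the earlier sketches:

```r
# Stratify patients by the GBM prediction using the reported X-tile cut-offs,
# then compare disease-specific survival across the three strata.
library(survival)

cuts <- sort(c(5.32, 6.35))   # cut-off values reported in the text
train$risk_group <- cut(risk_score,
                        breaks = c(-Inf, cuts, Inf),
                        labels = c("low", "intermediate", "high"))

km <- survfit(Surv(dss_months, dss_event) ~ risk_group, data = train)
summary(km, times = 120)                                           # 10-year DSS per stratum
survdiff(Surv(dss_months, dss_event) ~ risk_group, data = train)   # log-rank test
```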

Table 3 Disease-Specific Survival According to Risk Stratification

Figure 3 Kaplan-Meier survival plots demonstrating disparities between groups. Disease-specific survival stratified by the 8th edition of the American Joint Committee on Cancer T stage and the machine-learning model in the training/validation (A and C) and the test (B and D) cohort.

We also gathered data on 2124 EHCC patients (1671 males and 453 females; median age, 58 years; IQR, 53–62 years) treated with LT from the SEER-Medicare database. SEER data demonstrated that considerable differences existed between the LR (n=1899) and LT (n=2124) cohorts in all listed clinical variables except alpha-fetoprotein level (Table S4). On initial analysis, we found a remarkable survival benefit of LT over LR for patients with EHCC (hazard ratio [HR] 0.342, 95% CI 0.300–0.389, P<0.001), which was further confirmed in a well-matched cohort of 1892 patients produced by PSM (HR 0.342, 95% CI 0.285–0.410, P<0.001). Although a trend toward higher survival probability was observed after 5 years in the LT cohort, no statistically significant difference in DSS was observed when compared with the low-risk LR cohort (HR 0.850, 95% CI 0.679–1.064, P=0.138). After PSM, 420 patients in the LT cohort were matched to 420 patients in the low-risk LR cohort; the trend toward improved survival remained after 5 years in the matched LT cohort, while the matched comparison also yielded no significant survival difference (HR 0.802, 95% CI 0.561–1.145, P=0.226) (Figure 4). By contrast, when compared with intermediate- and high-risk patients treated with LR, remarkable survival benefits were observed in patients treated with LT both before and after PSM (P<0.001) (Table S5).
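The hazard ratios quoted here correspond to a Cox model on the treatment indicator; on the matched data from the earlier hypothetical MatchIt sketch this might look roughly like the following (object and column names are assumptions, not the study's code):

```r
# Sketch of the LT-versus-LR comparison on the matched cohort.
library(survival)

cox <- coxph(Surv(dss_months, dss_event) ~ lt, data = matched)
summary(cox)   # exp(coef) gives the hazard ratio with its 95% CI
```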

Figure 4 Comparison of survival after resection versus transplantation before and after propensity score matching in SEER-Medicare database. (A) Kaplan–Meier curves for different risk groups stratified by the model in the SEER resection cohort (n=1899) and patients in the SEER transplantation cohort (n=2124). (B) Kaplan–Meier curves for low-risk patients treated with resection and patients treated with transplantation in the propensity score-matched cohort (n=840).

In this study involving over 2700 EHCC patients treated with resection, a gradient-boosting ML model was trained, validated and tested to predict post-resection survival. Our results demonstrate that this ML model utilized readily available clinical information, such as age, race, alpha-fetoprotein level, tumor size and number, vascular invasion, histological grade and fibrosis score, and provided real-time, accurate prognosis prediction (C-statistic >0.72) that outperformed traditional staging systems. Among the model covariates, tumor-related characteristics, such as size, multifocality and vascular invasion, as well as liver cirrhosis, are known risk factors for poor survival following resection of HCC.7–10 In addition, multiple population-based studies have shown racial and age differences in survival of HCC.19,20 Therefore, our ML model is a valid and reliable tool for estimating the prognosis of EHCC patients. To our knowledge, this study represents the first application of a state-of-the-art ML survival prediction algorithm in EHCC based on large-scale, heterogeneous datasets.

In the SEER cohort, the 10-year survival rate of EHCC after LR was around 50%, which seemed acceptable but was remarkably lower than that after LT (around 80%). No adjuvant therapies are able to prevent tumor relapse and cirrhosis progression; however, patients with a dismal prognosis should be considered candidates for clinical trials of adjuvant therapy.7 Salvage LT has also become a highly applicable strategy to alleviate both graft shortage and waitlist dropout, with excellent outcomes comparable to those of upfront LT.1,5 A priority policy, defined as enlistment of patients at high mortality risk before disease progression, was then implemented to improve the transplantability rate.21 Promisingly, our ML tool may help clinicians better identify EHCC patients who are at high risk of disease-related death, engage them in clinical trials, and meet priority enlistment policies. Specifically, the GBM model identified 10% of EHCC patients who suffered an extremely dismal prognosis following LR in this study. Given the small proportion and the survival benefit, we advocate pre-emptive enlistment of the high-risk subset for salvage LT after LR to avoid the later emergence of advanced disease (ie, tumor recurrence and liver decompensation) ultimately leading to death. Moreover, 40% of EHCC patients were at intermediate risk of disease-related death; adjuvant treatments that target HCC and cirrhosis are desirable for this group. In turn, nearly half of EHCC patients were categorized as low risk by the GBM model. The low-risk subset achieves satisfactory long-term survival after LR and may not require adjuvant therapy. We note that the DSS curves separate after 5 years for low-risk patients treated with LR as compared with patients treated with upfront LT, and thus long-lasting surveillance should be maintained.

Prior efforts to improve prognostic prediction of EHCC have mostly relied on tissue-based or imaging-assisted quantification of research biomarkers.9,22 However, a more accurate, yet more complex, prognosis estimate does not necessarily make a better clinical tool. Parametric regression models are ubiquitous in clinical research because of their simplicity and interpretability; however, regression analysis can be performed only on complete cases.23 Moreover, regression modeling strategies assume that relationships among input variables are linear and homogeneous, whereas complicated interactions exist between predictors.24,25 Decision-tree-based methods represent a large family of ML algorithms and can reveal complex non-linear relationships between covariates. The GBM algorithm has been widely applied in big data analysis and is consistently utilized by the top performers of ML predictive modelling competitions.14,26 The GBM algorithm uses a boosting procedure to combine stumps of a massive number of decision-tree base learners, which is similar to the clinical decision-making process of aggregating consultations from multiple specialists, each of whom looks at the case in a slightly different way. Thus, our GBM model integrates interpretability directly to mitigate the interpretability concern. Compared with other tree-based ensemble methods such as random forest, the GBM algorithm also has built-in functionality to handle missing values, which permits utilizing data from, and assigning classifications to, all observations in the cohort without the need to impute data. We applied a nested cross-validation scheme for hyperparameter tuning in GBM because it prevents information leaking between observations used for training and validating the model, and it estimates the external test error of the given algorithm on unseen datasets more accurately by averaging performance metrics across folds.27 Comparable discriminatory ability in the training/validation cohort, the test cohort and sub-cohorts from different clinical scenarios suggests good reproducibility and reliability of the proposed GBM model.

Our study has several limitations that warrant attention. First, all the presented analyses are retrospective; prospective validation of the ML model in different populations is warranted prior to routine use in clinical practice. Second, the study cohort included population-based cancer registries with limited information regarding patient and tumor characteristics; unavailable confounders, such as biochemical parameters, surgical margin status and recurrence treatment modality, could not be adjusted for in the modeling. Third, the SEER-Medicare database contains a considerable amount of missing data for several important clinical variables, such as fibrosis score. Indeed, missing data are an unavoidable feature of all clinical and population-based databases; however, improper handling of missing data, such as simply excluding incomplete cases, can introduce considerable bias, as previously noted across numerous cancer types.28 We therefore contend that integrating missingness into our GBM model suggests good transferability to future clinical practice.

In conclusion, the ML approach is both feasible and accurate, and it offers a novel way to analyze survival outcomes in clinical scenarios. Our results suggest that a GBM model trained on readily available clinical data provides good performance, better than that of staging systems, in predicting prognosis. Although several issues, such as prospective validation and ethical challenges, must be addressed prior to its widespread use, such an automated tool may complement existing prognostic sources and lead to better personalized treatment for patients with resected EHCC.

EHCC, early hepatocellular carcinoma; LT, liver transplantation; LR, liver resection; BCLC, Barcelona Clinic Liver Cancer; CNLC, China Liver Cancer; HKLC, Hong Kong Liver Cancer; CLIP, Cancer of the Liver Italian Program; AJCC, American Joint Committee on Cancer; ML, machine learning; GBM, gradient boosting machine; SEER, Surveillance, Epidemiology, and End Results; DSS, disease-specific survival; PSM, propensity score matching; IQR, interquartile range.

Data for model training and validation, as well as R code, are available on GitHub (https://github.com/radgrady/EHCC_GBM). Test data are available from the corresponding author (Xue-Hao Wang) on reasonable request.

The study protocol was approved by the Institutional Review Boards of the First Affiliated Hospital of Nanjing Medical University and Wuxi People's Hospital. Written informed consent was waived because retrospective anonymized data were analyzed. De-identified information was used to protect patient data confidentiality.

This study was supported by the Key Program of the National Natural Science Foundation of China (31930020) and the National Natural Science Foundation of China (81530048, 81470901, 81670570).

The authors declare no potential conflicts of interest.

1. Yang JD, Hainaut P, Gores GJ, Amadou A, Plymoth A, Roberts LR. A global view of hepatocellular carcinoma: trends, risk, prevention and management. Nat Rev Gastroenterol Hepatol. 2019;16(10):589–604. doi:10.1038/s41575-019-0186-y

2. Llovet JM, Montal R, Sia D, Finn RS. Molecular therapies and precision medicine for hepatocellular carcinoma. Nat Rev Clin Oncol. 2018;15(10):599–616. doi:10.1038/s41571-018-0073-4

3. European Association for the Study of the Liver. EASL clinical practice guidelines: management of hepatocellular carcinoma. J Hepatol. 2018;69(1):182–236. doi:10.1016/j.jhep.2018.03.019

4. Pinna AD, Yang T, Mazzaferro V, et al. Liver transplantation and hepatic resection can achieve cure for hepatocellular carcinoma. Ann Surg. 2018;268(5):868–875. doi:10.1097/SLA.0000000000002889

5. Marrero JA, Kulik LM, Sirlin CB, et al. Diagnosis, staging, and management of hepatocellular carcinoma: 2018 practice guidance by the American Association for the Study of Liver Diseases. Hepatology. 2018;68(2):723–750. doi:10.1002/hep.29913

6. Zhou J, Sun H, Wang Z, et al. Guidelines for the diagnosis and treatment of hepatocellular carcinoma (2019 edition). Liver Cancer. 2020;9(6):682–720. doi:10.1159/000509424

7. Villanueva A. Hepatocellular carcinoma. N Engl J Med. 2019;380(15):1450–1462. doi:10.1056/NEJMra1713263

8. Chan AWH, Zhong J, Berhane S, et al. Development of pre- and post-operative models to predict early recurrence of hepatocellular carcinoma after surgical resection. J Hepatol. 2018;69(6):1284–1293. doi:10.1016/j.jhep.2018.08.027

9. Ji GW, Zhu FP, Xu Q, et al. Radiomic features at contrast-enhanced CT predict recurrence in early stage hepatocellular carcinoma: a multi-institutional study. Radiology. 2020;294(3):568–579. doi:10.1148/radiol.2020191470

10. Shim JH, Jun MJ, Han S, et al. Prognostic nomograms for prediction of recurrence and survival after curative liver resection for hepatocellular carcinoma. Ann Surg. 2015;261(5):939–946. doi:10.1097/SLA.0000000000000747

11. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–1930. doi:10.1161/CIRCULATIONAHA.115.001593

12. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380(14):1347–1358. doi:10.1056/NEJMra1814259

13. Eaton JE, Vesterhus M, McCauley BM, et al. Primary sclerosing cholangitis risk estimate tool (PREsTo) predicts outcomes of the disease: a derivation and validation study using machine learning. Hepatology. 2020;71(1):214–224. doi:10.1002/hep.30085

14. Nathan H, Hyder O, Mayo SC, et al. Surgical therapy for early hepatocellular carcinoma in the modern era: a 10-year SEER-Medicare analysis. Ann Surg. 2013;258(6):1022–1027. doi:10.1097/SLA.0b013e31827da749

15. Fritz AG. International Classification of Diseases for Oncology: ICD-O. 3rd ed. Geneva, Switzerland: World Health Organization; 2000.

16. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232. doi:10.1214/aos/1013203451

17. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–574. doi:10.1177/0272989X06295361

18. Camp RL, Dolled-Filhart M, Rimm DL. X-tile: a new bio-informatics tool for biomarker assessment and outcome-based cut-point optimization. Clin Cancer Res. 2004;10(21):7252–7259. doi:10.1158/1078-0432.CCR-04-0713

19. Altekruse SF, Henley SJ, Cucinelli JE, McGlynn KA. Changing hepatocellular carcinoma incidence and liver cancer mortality rates in the United States. Am J Gastroenterol. 2014;109(4):542–553. doi:10.1038/ajg.2014.11

20. Dasari BV, Kamarajah SK, Hodson J, et al. Development and validation of a risk score to predict the overall survival following surgical resection of hepatocellular carcinoma in non-cirrhotic liver. HPB (Oxford). 2020;22(3):383–390. doi:10.1016/j.hpb.2019.07.007

21. Ferrer-Fábrega J, Forner A, Liccioni A, et al. Prospective validation of ab initio liver transplantation in hepatocellular carcinoma upon detection of risk factors for recurrence after resection. Hepatology. 2016;63(3):839–849. doi:10.1002/hep.28339

22. Qiu J, Peng B, Tang Y, et al. CpG methylation signature predicts recurrence in early-stage hepatocellular carcinoma: results from a multicenter study. J Clin Oncol. 2017;35(7):734–742. doi:10.1200/JCO.2016.68.2153

23. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393. doi:10.1136/bmj.b2393

24. Loftus TJ, Tighe PJ, Filiberto AC, et al. Artificial intelligence and surgical decision-making. JAMA Surg. 2020;155(2):148–158. doi:10.1001/jamasurg.2019.4917

25. Shindoh J, Andreou A, Aloia TA, et al. Microvascular invasion does not predict long-term survival in hepatocellular carcinoma up to 2 cm: reappraisal of the staging system for solitary tumors. Ann Surg Oncol. 2013;20(4):1223–1229. doi:10.1245/s10434-012-2739-y

26. Bibault JE, Chang DT, Xing L. Development and validation of a model to predict survival in colorectal cancer using a gradient-boosted machine. Gut. 2021;70(5):884–889. doi:10.1136/gutjnl-2020-321799

27. Maros ME, Capper D, Jones DTW, et al. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nat Protoc. 2020;15(2):479–512. doi:10.1038/s41596-019-0251-6

28. Jeong CW, Washington SL 3rd, Herlemann A, Gomez SL, Carroll PR, Cooperberg MR. The new Surveillance, Epidemiology, and End Results prostate with watchful waiting database: opportunities and limitations. Eur Urol. 2020;78(3):335–344. doi:10.1016/j.eururo.2020.01.009


Which Industries are Hiring AI and Machine Learning Roles? – Dice Insights

Companies everywhere are pouring resources into artificial intelligence (A.I.) and machine learning (ML) initiatives. Many technologists believe that apps smartened with A.I. and ML tools will eventually offer better customer personalization; managers hope that A.I. will lead to better data analysis, which in turn will power better business strategies.

But which industries are actually hiring A.I. specialists? If you answer that question, it might give you a better idea of where those resources are being deployed. Fortunately, CompTIA's latest Tech Jobs Report offers a breakdown of A.I. hiring, using data from Burning Glass, which collects and analyzes millions of job postings from across the country.

Perhaps it's no surprise that manufacturing tops this list; after all, manufacturers have been steadily automating their production processes for years, and it stands to reason that they would turn to A.I. and ML to streamline things even more. In theory, A.I. will also help manufacturers do everything from reducing downtime to improving supply chains, although it may take some time to get the models right.

The presence of healthcare, banking, and public administration likewise seems logical. "These three industries have the money to invest in A.I. and ML right now and have the greatest opportunity to see the investment pay off, fast," Gus Walker, director of product at Veritone, an A.I. tech company based in Costa Mesa, California, told Dice late last year. "That being said, the pandemic has caused industries hit the hardest to take a step back and look at how they can leverage AI and ML to rebuild or adjust in the new normal."

Compared to overall tech hiring, the number of A.I.-related job postings is still relatively small. Right now, mastering and deploying A.I. and machine learning is something of a specialist industry; but as these technologies become more commodified, and companies develop tools that allow more employees to integrate A.I. and ML into their projects, the number of job postings for A.I. and ML positions could increase over the next several years. Indeed, one IDC report from 2020 found three-quarters of commercial enterprise applications could lean on A.I. in some way by 2021.

It's also worth examining where all that A.I. hiring is taking place; it's interesting that Washington DC tops this particular list, with New York City a close second; Silicon Valley and Seattle, the nation's other big tech hubs, are somewhat further behind, at least for the moment. Washington DC is notable not only for federal government hiring, but also for the growing presence of companies such as Amazon that hunger for talent skilled in artificial intelligence.

Jobs that leverage artificial intelligence are potentially lucrative, with a current median salary (according to Burning Glass) of $105,000. It's also a skill-set that more technologists may need to become familiar with, especially managers and executives. "A.I. is not going to replace managers but managers that use A.I. will replace those that do not," Rob Thomas, senior vice president of IBM's cloud and data platform, recently told CNBC. If you mention A.I. or ML on your resume and applications, make sure you know your stuff before the job interview; chances are good you'll be tested on it.



New Type of Machine Learning Aids Earthquake Risk Prediction – UT News – UT News | The University of Texas at Austin

AUSTIN, Texas – Our homes and offices are only as solid as the ground beneath them. When that solid ground turns to liquid, as sometimes happens during earthquakes, it can topple buildings and bridges. This phenomenon is known as liquefaction, and it was a major feature of the 2011 earthquake in Christchurch, New Zealand, a magnitude 6.3 quake that killed 185 people and destroyed thousands of homes.

An upside of the Christchurch quake was that it was one of the most well-documented in history. Because New Zealand is seismically active, the city had numerous sensors for monitoring earthquakes. Post-event reconnaissance provided a wealth of additional data on how the soil responded across the city.

Two researchers from The University of Texas at Austin developed a machine learning model that predicted the amount of lateral movement that occurred when the Christchurch earthquake caused soil to lose its strength and shift relative to its surroundings.

The results were published online in Earthquake Spectra in April 2021.

"It's one of the first machine learning studies in our area of geotechnical engineering," said postdoctoral researcher Maria Giovanna Durante, a Marie Skłodowska-Curie fellow previously at UT Austin. "It's an enormous amount of data for our field. If we have thousands of data points, maybe we can find a trend."

Durante and Ellen Rathje, the Janet S. Cockrell Centennial Chair in Engineering at UT Austin and the principal investigator for the National Science Foundation-funded DesignSafe cyberinfrastructure, first used a Random Forest approach with a binary classification to forecast whether lateral spreading movements occurred at a specific location. They then applied a multiclass classification approach to predict the amount of displacement, from none to more than 1 meter.

"It was important to select specific input features that go with the phenomenon we study," Durante said. "We're not using the model as a black box; we're trying to integrate our scientific knowledge as much as possible."

Durante and Rathje trained the model using data related to the peak ground shaking experienced (a trigger for liquefaction), the depth of the water table, the topographic slope, and other factors. In total, more than 7,000 data points from a small area of the city were used as training data – a great improvement, as previous geotechnical machine learning studies had used only 200 data points.
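As a minimal sketch of the two-stage approach described above (not the researchers' actual code; the data frame `sites`, the displacement classes and the feature names are hypothetical placeholders for the inputs the article lists), a random forest workflow might look like this:

```r
# Stage 1: binary classification of whether lateral spreading occurred at a site.
# Stage 2: multiclass prediction of the displacement category (e.g. none to >1 m).
library(randomForest)

set.seed(42)
rf_occurrence <- randomForest(
  factor(spread_occurred) ~ pga + water_table_depth + slope,
  data = sites, ntree = 500
)

rf_amount <- randomForest(
  factor(displacement_class) ~ pga + water_table_depth + slope,
  data = sites, ntree = 500
)

# Predict over a citywide grid of locations with the same feature columns
# (the `grid` data frame is a hypothetical stand-in for the 2.5 million sites).
grid$spread_pred <- predict(rf_occurrence, newdata = grid)
grid$amount_pred <- predict(rf_amount, newdata = grid)
```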

They tested their model citywide on 2.5 million sites around the epicenter of the earthquake to determine the displacement. Their model predicted whether liquefaction occurred with 80% accuracy; it was 70% accurate at determining the amount of displacement.

The researchers used the Frontera supercomputer at the Texas Advanced Computing Center (TACC), one of the world's fastest, to train and test the model. TACC is a key partner on the DesignSafe project, providing computing resources, software and storage to the natural hazards engineering community.

Access to Frontera provided Durante and Rathje machine learning capabilities on a scale previously unavailable to the field. Deriving the final machine learning model required testing 2,400 possible models.

"It would have taken years to do this research anywhere else," Durante said. "If you want to run a parametric study or do a comprehensive analysis, you need to have computational power."


Code^Shift Lab Aims To Confront Bias In AI, Machine Learning – Texas A&M Today – Texas A&M University Today

As machines increasingly make high-risk decisions, a new lab at Texas A&M aims to reduce bias in artificial intelligence and machine learning.


The algorithms underpinning artificial intelligence and machine learning increasingly influence our daily lives. They can decide everything from which video we're recommended to watch next on YouTube to who should be arrested based on facial recognition software.

But the data used to train these systems often replicate the harmful social biases of the engineers who build them. Eliminating this bias from technology is the focus of Code^Shift, a new data science lab at Texas A&M University that brings together faculty members and researchers from a variety of disciplines across campus.

It's an increasingly critical initiative, said Lab Director Srividya Ramasubramanian, as more of the world becomes automated. Machines, rather than humans, are making many of the decisions around us, including some that are high-risk.

"Code^Shift tries to shift our thinking about the world of code or coding in terms of how we can be thinking of data more broadly in terms of equity, social healing, inclusive futures and transformation," said Ramasubramanian, professor of communication in the College of Liberal Arts. "A lot of trauma and a lot of violence has been caused, including by media and technologies, and first we need to acknowledge that, and then work toward reparations and a space of healing individually and collectively."

Bias in artificial intelligence can have major impacts. In just one recent example, a man has sued the Detroit Police Department after he was arrested and jailed for shoplifting after being falsely identified by the department's facial recognition technology. The American Civil Liberties Union calls it the first case of its kind in the United States.

Code^Shift will attempt to confront this issue using a collaborative research model that includes Texas A&M experts in social science, data science, engineering and several other disciplines. Ramasubramanian said eight different colleges are represented, and more than 100 people attended the lab's virtual launch last month.

Experts will work together on research, grant proposals and raising awareness in the broader public of the issue of bias in machine learning and artificial intelligence. Curriculum may also be developed to educate professionals in the tech industry, such as workshops and short courses on anti-racism literacy, gender studies and other topics that are sometimes not covered in STEM fields.

The lab's name references coding, which is foundational to today's digital world. It's also a play on code-switching: the way people change the languages they use or how they express themselves in conversation depending on the context.

As an immigrant, Ramasubramanian says she's familiar with living in two worlds. She offers several examples of computer-based biases she's encountered in everyday life, including an experience attempting to wash her hands in an airport bathroom.

Standing at the sink, Ramasubramanian recalls, she held her hands under the faucet. As she moved them back and forth and the taps stayed dry, she realized that the sensors used to turn the water on could not recognize her hands. It was the same case with the soap dispenser.

"It was something I never thought much about, but later on I was reading an article about this topic that said many people with darker skin tones were not recognized by many systems," she said.

Similarly, when Ramasubramanian began to work remotely during the COVID-19 pandemic, she noticed that her skin and hair color made her disappear against the virtual Zoom backgrounds. Voice recognition software she attempted to use for dictation could not understand her accent.

"The system is treating me as the other and different in many, many ways," she said. "And in return, there are serious consequences of who feels excluded, and that's not being captured."

Co-director Lu Tang, an assistant professor in the College of Liberal Arts who examines health disparity in underserved populations, says her research shows that Black patients, for example, must have much more severe symptoms than non-Black patients in order to be assigned certain diagnoses in computer software used in hospitals.

She said this is just one instance of the disparities embedded in technology. Tang's research also focuses on how machine learning algorithms used on social media platforms are more likely to expose people to misinformation about health.

"If I inhabit a social media space where a lot of my friends hold certain erroneous attitudes about things like vaccines or COVID-19, I will repeatedly be exposed to the same information without being exposed to different information," she said.

Tang also is interested in what she calls the "filter bubble": the phenomenon where an algorithm leads a user on TikTok, YouTube or other platforms based on content they've watched in the past or what other people with similar viewing behaviors are watching at that moment. Watching just one video containing vaccine misinformation could prompt the algorithm to continue recommending similar videos. Tang said the filter bubble is another added layer that influences the content that people are exposed to.

"I think to really understand this society and how we are living today, we as social scientists and humanities scholars need to acknowledge and understand the way computers are influencing the way society is run today," Tang said. "I feel like working with computer science engineers is a way for us to combine our strengths to understand a lot of the problems we have in this society."

Computer Science and Engineering Assistant Professor Theodora Chaspari, another co-director of Code^Shift, agrees that minds from different disciplines are needed to design better systems.

To build an inclusive system, she said, engineers need to include representative data from all populations and social groups. This could help facial recognition algorithms better recognize faces of all races, she said, because a system cannot really identify a face until it has seen many, many faces. But engineers may not understand more subtle sources of bias, she said, which is why social and life sciences experts are needed to help with the thoughtful design of more equitable algorithms.

"The goal of Code^Shift is to help bridge the gap between systems and people," Chaspari said. The lab will do this by raising awareness through not only research, but education.

"We're trying to teach our students about fairness and bias in engineering and artificial intelligence," Chaspari said. "They're pretty new concepts, but are very important for the new, young engineers who will come in the next years."

So far, Code^Shift has held small group discussions on topics like climate justice, patient justice, gender equity and LGBTQ issues. A recent workshop focused on health equity and the ways in which big data and machine learning can be used to take into account social structures and inequalities.

Ramasubramanian said a full grant proposal to the Texas A&M Institute of Data Science Thematic Data Science Labs Program is also being developed. The labs directors hope to connect with more colleges and make information accessible to more people.

They say collaboration is critical to the initiative. The people who create algorithms often come from small groups, Ramasubramanian said, and are not necessarily collaborating with social scientists. Code^Shift asks for more accountability in how systems are created: who has access to the data, who's deciding how to use it, and how is it being shared?

Texas A&M is home to some of the world's top data scientists, Ramasubramanian said, making it an important place to have conversations about difficult topics like data equity.

"To me, we should also be leaders in thinking about the ethical, social, health and other impacts of data," she said.

To join the Code^Shift mailing list or learn more about collaborating with the lab, contact Ramasubramanian at srivi@tamu.edu.


Is Machine Learning The Key To Unlocking Gen Z Engagement? A Discussion With Jonathan Jadali Of Ascend – Forbes

Jonathan Jadali, Founder and CEO of Ascend

The jury is still out on what makes Gen Zers tick, but while the research is still ongoing there is much evidence to suggest that a marketing strategy utilizing machine learning is exponentially more effective with the next generation.

One thing is abundantly clear to every marketer worth his salt: Gen Z customers are "ninja-level" efficient at swatting away regular ads and pop-ups. They are strongly immune to hard sales and obvious sales content.

Despite all the difficulties that marketers are facing in reaching a wide Gen Z audience, Jonathan Jadali, CEO and founder of Ascend Agency, has found great success in leading Gen Z-focused startups to victory in this marketing struggle.

So what makes the typical Gen Z customer tick and how can businesses and startups build a brand that is appealing to them, utilizing cutting edge technologies?

Jadali shares the ways in which he has used a data and machine-learning strategy in getting many of his clients from obscurity to domination of the Gen Z market.

Content, as they say, is king, but the wrong kind of content isn't even fit to be a pawn in this game. To get startups headed in the right direction, Jonathan often helps direct his clients at Ascend Agency on creating the right type of content for the right type of client.

While most brands are focused on putting out well-curated video and image content in a bid to drive engagement on their social media platforms, Jadali advises that this might not be the best way to go if Gen Zers are your target audience.

The ideal Gen Z customer thrives on spontaneous and messy content. As Jadali states, "Gen Z customers are all about being real; they connect well with unfiltered and unedited content because it tends to feel less salesy than others."

For instance, a makeup brand is better off posting a video of a makeup session, in front of a cluttered vanity table, than a photoshoot with a perfectly made-up face.

This is important to keep in mind when implementing any machine learning into your marketing strategy. Whether you are creating a chatbot or building a data-driven marketing campaign, it's important that your system learns to be imperfect.

When AI or Machine Learning is used in marketing, sometimes it can come off as, well, robotic. Gen Z will be an important moment for machine learning marketing as it will help us get closer to contextual AI - machines that more accurately predict and reflect human behavior.

Gen Z wants to see the messiness of life and its process reflected in your content. Brands that do this, are the brands that they are drawn to and often build loyalty for.

How does it look? How effective is it? How satisfying is your service? All these are valid marketing questions and things that in the past had been asked by your millennial customer base.

According to Jadali, these questions do not matter nearly as much to a Gen Z audience.

Clearly, customers want products that work and businesses that deliver, but with a Gen Z audience, that doesn't seem to be the right way to lead in marketing to them.

Having worked with both Fortune 500 companies and smaller startups alike in the last 3 years since Ascend Agency launched, Jadali is fairly certain that Gen Z customers are way more attracted to how your business makes them feel.

This is where machine learning can really come in handy. Understanding your customers' moods and habits can help you tap into what makes them feel great about themselves and the products in their lives.

Gen Z customers are tired of hearing about how amazing your product is; businesses have been hyping up their products for as long as businesses have existed, and Gen Zers aren't having any more of it. In Jonathan's words, "Sell experiences, not products, and your products will head out of your door as well."

According to Mention, 25% of what you sell is your product. The additional 75% is the intangible feeling that comes with said product.

"What dominant feeling do you want to evoke with your content?" This question, popularly asked at the Ascend Agency office, is one that has helped brands build consistency in their content style and delivery and that has brought in Gen Z customers in droves.

This question can be answered through aggregated customer data that helps you better understand the emotions from brands that they also engage with.

Red Bull is a great example of a brand that utilizes data and machine learning in this manner. Their video content covers high-risk sports like skydiving and bungee jumping. From customer data processed by predictive analytics and machine learning systems, the dominant feeling Red Bull chose to evoke is one of courage and strength.

What is yours: happiness, reflection, or prestige? The sooner you can answer that, the sooner you can get your Gen Z audience to really pay attention. Machine learning can help you answer this question faster and more accurately.

Did you know that once an influencer's followership crosses the 100k mark, their engagement drops drastically? When did you last get an Instagram reply from Selena Gomez or Cristiano Ronaldo? Never, I presume. I will get back to this point in a bit.

While guest posting and proper ad placement might still work rather well for Millennials, social media is clearly the major frontier for Gen Zers. This is why influencer marketing has risen to the fore in the last 6 years.

However, nothing is more important to this generation than being seen and heard. This is why Gen Z customers rate a brands authenticity by how well the brands engage with them online.

"If a customer posts a tweet asking you for information or laying down a complaint, the first thing to do is to respond publicly before directing them to their inbox, as opposed to solely responding to them privately. If they send in a review, respond and thank them. Call them by name, engage with them personally in a way that doesn't feel rehearsed," says Jadali.

It goes without saying that brands should be more intentional with engaging their Gen Z audience personally. However, this is hard to scale.

Machine learning is helping brands go beyond the typical automated response we often see in DM and SMS replies. As this technology becomes more advanced, you will be able to engage with hundreds of thousands of customers at once at a deeply personal level.

Micro-influencers drive 60% higher engagement levels and 22.2% more weekly conversions, and they are considerably cheaper. However, their secret sauce is the fact that they are still able to engage with their followers directly far more than celebrities like Cristiano Ronaldo or Selena Gomez ever can.

Soon, machine learning will allow for this type of personal engagement at scale. It will also allow for small brands and businesses to authentically engage with customers without having to spend hours of their day on replies and comments.

As Jadali explains, "The Gen Z audience is sensitive, intuitive and versatile; reaching them is not rocket science, it is not science at all, it is an art. It is something that anyone can master, wield and utilize."

Gen Z will help push Machine Learning to become more human, more perfectly imperfect in its responses, and move us closer to contextual AI in marketing and online content.
