EXPERIMENTAL AND CLINICAL TRANSPLANTATION

Key words : Heart transplantation, Lung transplantation, Machine learning, Patient Outcomes, Predictive Models

Introduction
Solid-organ transplant procedures have resulted in substantial changes for public health, especially for heart or lung disease patients who are in a critical condition or on transplant wait lists. Cases of heart or lung disease have been increasing globally because of unhealthy lifestyle habits,¹ such as smoking, alcohol consumption, diet poor in fibers,² and sedentary living. Presently, heart or lung transplantation is the last resort for patients with end-stage chronic diseases, but the quality of life and the survival rates have been low because of the many health complications, such as infections, comorbidities,^3,4 or even medical errors.^5,6

Over the past decades, various artificial intelligence (AI) models have been implemented in transplantations, but different concerns about the complexity and doubtful efficacy of AI models have been expressed, resulting in unsafe conclusions. A need has emerged for the evaluation of the performance of AI models, focusing on health outcomes of transplant recipients, as well as on the investigation of the systematic biases that affect the findings of research. In this systematic review, we analyzed data related to performance of AI models to predict posttransplant health outcomes and to present the risks and ethical concerns that come up from the use of AI in transplant procedures.

Materials and Methods
We conducted this systematic review following the PRISMA 2020 guidelines for systematic reviews⁷ to gather information and data from existing AI and transplant-related publications (Figure 1). We thoroughly researched databases and online libraries (PubMed, Scopus, and ScienceDirect), searching for studies using certain keywords (artificial intelligence OR machine learning OR deep learning OR neural networks AND transplantation OR transplant AND heart OR lung OR lungs) and certain eligibility criteria (PICOTSS framework), that were published after 2000 and written in English language. We limited the systematic review to cohort-retrospective studies with a population of adult subjects, over the age of 18 years, who have already had transplant procedures. Included studies mentioned the following posttransplant health outcomes: mortality (at 30 days, 1 month, and 1, 3, and 5 years), survival, graft dysfunction, or graft failure. Included studies presented performance discrimination, namely, area under the receiver operating characteristic curve (AUROC) and calibration metrics. Targeted data should have been collected in clinical settings (hospitals or primary care settings) or patient homes, before, during, or after transplantation, with no comparators due to the complex functions. In addition, we conducted risk of bias assessment to evaluate the domains of sampling, selection, reporting, or publication bias, by using the prediction model risk of bias assessment tool (PROBAST), which is used in prediction-related studies.⁸

Results
Selection and description of studies
Our initial number of included studies was 122. After several estimations and screenings and following the PRISMA 2020 guidelines for conducting systematic reviews, we reached a final number of 15 studies (Table 1)^9-23: 12 referred to heart transplantations (385060 total heart transplant patients) and 3 referred to lung transplantations (57984 total lung transplant patients).

All studies retrospectively collected data that derived from either large or small patient registries^11,15 (n = 2 studies) and transplant databases^9-14,16-23 (n = 13 studies), such as the United Network for Organ Sharing (UNOS). Patient sample size ranged from 381 to 103750, and most studies used a patient registry or database to train and validate the AI models. One study used external validation.²² Main data were from UNOS for 12 studies.^{9-14,16-18,20-23} A few studies did not provide any information about the target population. Mean age of recipients was 50.7 years, and mean age of donors was 32.8 years (excluding studies that reported age by range or median, with no information about the mean^{9,10,14,17,18,21}).

Most transplant recipients were White (64%-83.2%), followed by Black or African American (8.5%-23.4%), Hispanic (5.9%-8%), and Asian (1.5%-3.4%). Details on ethnicity of donors were only provided in 1 US study,²⁰ which showed that, in the period from 2010 to 2018, almost 64% of 18625 transplantations were White donors, followed by Black (16.3%), Hispanic (16%), and Asian donors (1.8%). Among recipients, 71.4% were males and 28.6% were females, whereas 79.8% of overall donors were males and 30% were females; no data was provided about other gender groups.

The targeted health outcomes were 1-year posttransplant survival^9,12-16,21 (n = 7), 1-year mortality^{10,11,20,22,23} (n = 5), 1-year graft survival¹⁷ (n = 1), 3-year posttransplant survival^12,21 (n = 2), 3-year mortality²⁰ (n = 1), 5-year posttransplant survival²¹ (n=1), 5-year mortality^19,20 (n=2), 5-year graft survival¹⁷ (n = 1) and failure¹⁹ (n = 1), 1-month or 30-day survival^14,15 (n = 2), 30-day graft failure¹⁸ (n = 1), 9-year graft survival¹⁷ (n = 1), 90-day survival^9,14 (n = 2), and multiple posttransplant survival periods^14,21 (n = 2).

The most used AI algorithms were logistic regression^{9,11,14,16-18,20,21,23} (n = 9), followed by random forest^{9,11,13,14,16,21} (n = 6), support vector machines^{11,14,17,20,21} (n = 5), extreme gradient boosting^9,10,11,21 (n = 4), decision trees^14,17,20,21 (n = 4), and gradient boosted trees¹⁴ (n = 1) (Table 2). Moreover, to evaluate the efficacy of AI models to predict posttransplant health outcomes, data on performance were collected (such as discrimination or calibration metrics), which indicated the model’s accuracy, sensitivity and specificity. The metrics for discrimination were the AUROC (n = 15), precision-recall curve^10,17,18 (n = 3), sensitivity^11-15,20,21 (n = 7), specificity^{11-15,17,20,21} (n = 8), positive predictive value (PPV)^12,13,20 (n = 3), negative predictive value (NPV)^12,13,20 (n = 3), Harrell C index or concordance index or C statistic²³ (n = 1), G mean²¹ (n = 1), integrated area under the curve (AUC)/time-dependent AUC¹⁵ (n = 1), net reclassification index¹⁶ (n = 1), and decision curve analysis¹⁶ (n = 1). Regarding calibration, the used metrics were calibration plots^{9,13,16,21,22} (n = 5), Hosmer-Lemeshow test^22,23 (n = 2), and integrated Brier score/predictive error¹⁵ (n = 1). Model accuracy was used for either the discrimination or calibration evaluation of the models^7,8,14,21 (n = 4).

The AI models showed very good discriminatory performance in predicting posttransplant health outcomes, and the predicted values were well aligned with the observed values for long-term outcomes.²¹ The AI models with the best performance, based on AUROC metrics, for the prediction of 1-year posttransplant mortality, were extreme gradient boosting (0.71)¹⁰ and random forest (0.801).¹¹ For 1-year posttransplant survival outcome, the models that showed the highest performance, based on AUROC metrics, were the shuffled 10-fold cross-validation random forest (0.893),⁹ random forest (0.790),¹⁴ random survival forest (RSF) (0.921),¹⁵ CatBoost (0.800),¹² and ensemble machine-learning method (0.764).¹⁶ For 3-year survival, the highest discriminatory performance was shown by CatBoost (0.690),¹² with an accuracy of 74.2%. For the prediction of graft failure and survival, the most effective model was logistic regression (0.840) for 9-year survival,¹⁷ a multilayer perceptron with only one hidden layer (0.690) for 30-day survival,¹⁸ and gradient boosting machine (0.716) for 5-year graft failure.¹⁹ For prediction of 1-year survival, higher sensitivity and specificity were demonstrated by RSF (88.7%¹⁵ and 79.6%¹⁵). In addition, RSF showed the best accuracy at 82.8%,¹⁵ and the CatBoost model showed the best PPV at 42%¹² and NPV at 93%.¹² For 1-year mortality, support vector machines and naive models showed the highest accuracy (84.9%),¹¹ and gradient boosted machine showed the highest specificity 91.6%,¹¹ whereas Catboost showed the highest sensitivity 75%.¹²

Regarding calibration of AI models, random forest, extreme gradient boosting, and Cox regression showed good calibration based on their Brier scores (0.060, 0.072, 0.096, respectively)⁹ for 1-year survival. Likewise, based on the calibration plots, for 1-year survival, Cox regression (intercept: 0.003, slope: 0.902)⁹ and RSF (intercept: 0.008, slope: 0.752)⁹ showed the best alignment of predicted probabilities to observed probabilities. Based on the Hosmer-Lemeshow test, the partial response network-LASSO showed the best calibration (15.01, P = .135),²² compared with the international heart transplant survival algorithm (40, P < .05) and the index for mortality prediction after cardiac transplantation (101, P < .05)²³ for prediction of 1-year mortality.

In the risk of bias assessment (Table 3), most studies were highly biased. Bias emerged because of either the exclusion of certain groups of patients during the selection of a representative sample for the research,^9-22 the inclusion of variables in the development dataset that may not be available during the application of the predictive machine,^{10,12,15,16,19,20,22} the use of small, unstandardized patient registries or cohorts,^11,15 the low number of patients with the relative outcome^{10,11,14,15,21} or the absence of calibration metric methods.^{10-12,14,17,18,19,20}

Discussion
Heart or lung transplantation is a complex procedure demanding high-quality health services, prognostic and diagnostic infallibility of technical equipment, and specially developed infrastructure that should meet the unique requirements of this sensitive and technical medical operation. Given the questioning and contestation over the overall efficacy of AI applications in heart and lung transplantations, we aimed, through a systematic review, to assess the performance of AI models and to address the biases and bioethical concerns that arise from their use.

Machine-learning algorithms depend on the size of the input dataset to develop accurate and efficient models and certainly benefit from large datasets.^25,26 In many reviews, large datasets had been used, providing more data points for investigation, capturing more patterns that could be interpreted, and leading to more reliable conclusions. In the systematic reviews by Gholamzadeh and colleagues²⁷ and Naruka and colleagues,²⁸ dataset sizes ranged from 30 to 310773 records. Smaller samples indicated a strong interest of the research community; in recent years, small-scale datasets have been used to develop and train AI models because of their capacity to generate high-quality output relying on more specific groups, to reduce overconsumption of resources and energy.²⁹ In addition, novel AI technologies can generate high-quality artificial samples or attain better performance using different techniques like hyperparameter tuning,^30,31 a technique that is applied to control the learning process of AI models, by optimizing certain hyperparameters (eg, learning rate, number of nodes per layer); novel AI technologies were used in many of the reviewed studies.^{10,14,15,18-20,22}

With regard to demographic results (Table 1), the main data resource was the UNOS database, a North American transplant database, and most transplant patients were male and White, outweighing non-White and female subjects. No mentions were made about socioeconomic backgrounds, a finding seemingly in line with other reviews like Naruka and colleagues,²⁸ Palmieri and colleagues,³² Gholamzadeh and colleagues,²⁷ and Rahman and colleagues.³³

In our systematic review, prediction of 1-year posttransplant health outcome was the most frequently studied outcome (Table 2), with mortality or patient survival being the main outcomes. Results were also reported on 30-day, 90-day, 1-month, 3-year, 5-year, 9-year, and long-term outcomes. In transplant research, long-term follow-up and clinical evaluation get more complicated because of patient withdrawal and, subsequently, the loss of essential input data. The 1-year timeframe was often regarded as a more stable and trustworthy timeframe when most complications are likely to manifest, patients are expected to be adapted to the new organ, and proper treatment is expected to have a greater possibility of success.^34,35,36 In reviews from Palmieri and colleagues³² and Naruka and colleagues,²⁸ results are presented for prediction of 3-month, 1-year, 3-year, 5-year health outcomes, as well as for other years, among transplant recipients.

The variety of AI model types that have been used in the reviewed studies was wide, encompassing different kinds of supervised or semi-supervised algorithmic classes like linear models (logistic regression),^{9,11,14,16-18,20,21,23} decision tree-based models,^{9,11,13,14,16,17,20,21} support vector machines,^14,17,20,21 and neural network models.^{11,14,16,17,21,23} In most of the studies, different techniques were applied for the assessment of “overfitting” (the remarkably good performance of a model when applied to training data), like cross-validation,^{9,11,14,16-19,21-23} bootstrapping,^10,11,15 or splitting dataset in training and test set^{10-13,15,16,19-22} instead of implementing external validation.²² In the study of Lisboa and colleagues, the external validation raised concerns about experimenter, confirmation, or publication biases introduced by the development scientific team, as well as questioning of the integrity of the evaluation process.³⁷

For the evaluation of the models’ performance, the process involved 2 important components: the calculation of discrimination and calibration. In their study, Agasthi and colleagues considered AUROC or AUC as the best discrimination metric among others, which can take values from 0 to 1, showing exceptional discriminative strength when it is closer to 1.³⁸ Our systematic review showed that the 10-fold cross-validation random forest,⁹ random forest,¹¹ and CatBoost¹² exhibited excellent discrimination and good calibration in predicting 1-year mortality. For the 1-year posttransplant survival, random forest,¹⁴ RSF,¹⁵ and ensemble machine-learning method¹⁶ seemed to have the highest AUC values. Likewise, sensitivity and specificity values pointed out the elevated accuracy of forest AI models in the prediction and classification of binary outcomes like mortality or survival. In the reviews from Clement and colleagues³⁹ and Senanayake and colleagues,⁴⁰ the most commonly used solid-organ transplantations AI models were random forest and artificial neural networks, where the models showed very good discriminative strength and outperformed traditional linear methods, a point that was similarly highlighted in our systematic review.

Ethical aspects and risk of bias
Different conclusions have been made on the biases and ethical concerns that arise from the use of AI in transplantation. Biases can be found in numerous phases of AI model development and training, as they are naturally inherent in social values and data structures like datasets.⁴¹ Small samples, exclusion of patients because of missing values, previous transplantation, or exclusion of data from younger donors introduce sampling and selection biases. Transparency concerns arose because of the internal complexity of machine learning models, rendering them difficult to interpret, affecting the opacity of procedures, and making health care providers reluctant to fully trust the machines.⁴² Clinicians may be unable to evaluate the AI recommendations, thus drawing false conclusions and making wrong decisions.⁴³ Concerning liability, different parties are included in the development of AI models, such as software developers, who have no duty of care toward patients and do not take responsibility when patients are harmed or negligence occurs.⁴⁴ Finally, AI applications in transplantation require vast amounts of sensitive health data that may affect the privacy and safety of patients. Fully comprehensible information on transplant patients by clinicians seems to be challenging because patients may not understand the internal functions of AI and because clinicians or medical staff may not be educated on AI procedures.⁴⁵ So far, humans have been actively performing efficiently, questioning the abilities or potential risks that might lurk behind the benefits of AI applications.⁴⁶ Therefore, legal, ethical, and regulatory issues must be resolved and legal frameworks must define the importance of informed consent in each phase of a transplant procedure.⁴⁷

Conclusions
This systematic review addressed gaps in the literature on AI applications in heart and lung transplant patients, focusing on the prediction of posttransplant health outcomes. In the analyzed studies, large datasets and patient registries were used and different validation techniques were implemented, like bootstrapping or cross-validation, instead of external validation. Ensemble methods like random forest and extreme gradient boosting emerged as the most effective AI models. The discriminatory power and calibration of AI models revealed very good accuracy in the prediction of 1-year health outcomes, like mortality and survival. Risk of bias assessment revealed biases and ethical concerns, such as liability, discrimination, transparency, and patient data privacy. Despite evidence of the potential of AI to predict post-transplant health outcomes, inconsistencies in performance highlighted the need for further research on AI applications in heart and lung transplants to draw safe conclusions about their efficacy.

Volume: 22 Issue: 11 November 2024

FULL TEXT