Objectives: Under the present liver transplant policy, patients with the highest risk of death receive preference for organ placement. The aim of this study was to evaluate the Model for End-stage Liver Disease (MELD) and seven prognostic derivatives of this test for outcome prediction in cirrhotic patients on liver transplant wait lists.
Materials and Methods: The study included 416 patients (65.9% male; age 49 ± 13.9 years) who were placed on the liver transplant wait list from January 2013 to October 2016. Study endpoints were 3-month, 6-month, and 1-year mortality.
Results: All prognostic models had acceptable overall performances (0.12 < Brier score < 0.21). In terms of the Brier score, the MELD-to-serum sodium ratio test had the best value at all 3 endpoints. Estimated C statistics ranged from 0.77 to 0.83. The largest value at 3 months was for the 5-variable MELD score (0.83), and the largest value at 6 months (0.82) and 1 year (0.83) was for the MELD-albumin score. The Hosmer-Lemeshow goodness-of-fit test and calibration plots revealed underestimation for the entire range of predicted risk (P < .001). With decision curve analysis, the MELD-to-serum sodium ratio and United Kingdom Model for End-Stage Liver Disease scoring tests covered the most extensive range of optimal threshold probabilities.
Conclusions: Although some derivations that incorporate serum sodium and serum albumin showed effective prioritization of liver transplant candidates, poor calibration statistics highlighted the need for recalibration as a prerequisite before daily clinical use of these tests at the individual level.
Key words : Iran, Prognostic models, Wait lists
The Model for End-stage Liver Disease (MELD) is a clinical triage system that quantifies the severity of liver cirrhosis and is aimed at ensuring equitable allocation of donated livers among transplant candidates.1 This prioritization model helps surgeons make evidence-based decisions about medical urgency, supports effective management of organ placement, and helps to avoid futile transplants by means of 3 objective, verifiable, and readily available parameters: serum creatinine, serum bilirubin, and international normalized ratio (INR) of prothrombin time.1,2 In February 2002, the MELD score replaced the Child-Turcotte-Pugh score for the prediction of short-term mortality in cirrhotic patients on liver transplant wait lists in the United States to eliminate the drawbacks of the latter (including subjective scoring of ascites and hepatic encephalopathy, the ceiling and floor effects of continuous laboratory parameters, and absence of renal function assessment).3,4 Since then, numerous other jurisdictions around the world have adopted the MELD allocation policy as the major reference system for optimizing organ pools.5
Over the past few years, potential limitations of the MELD score (including the development of a cohort of patients specifically not selected for transplant,2 its poor estimation for noncholestatic liver diseases,6 and the confounding effects of age, sex, ethnicity, and muscle mass on serum creatinine7) have become apparent, and investigators have sought to refine the score to counter these effects on outcome prediction.8 Previous studies have examined the predictive power of the MELD score after reweighting the predictors’ coefficients,9 transforming its laboratory components,10-12 and appending new parameters.13-15 Although the performance analyses reported by similar studies are limited to discrimination measures,6 here we aimed to compare the short-term and intermediate-term prognostic ability of the standard MELD with that of 7 alternative scores in terms of overall performance, discrimination, calibration, and decision analysis, using a combination of traditional and novel statistical performance measures and a province-wide Iranian liver transplant database.
Materials and Methods
Study population and data
This single-center retrospective study included 416 consecutive patients with cirrhosis who were placed on the Mashhad University of Medical Sciences liver transplant wait list between January 2013 and October 2016. The study was approved by the Ethics Committee of the institution, and the study protocol conformed to the ethical guidelines of the 1975 Helsinki Declaration. At the time of study, this institution was the sole liver transplant center for approximately 8 million residents in the northeast region of Iran. Exclusion criteria were age less than 18 years, diagnosis of fulminant hepatic failure or hepatocellular carcinoma, need for combined liver-kidney transplant, and retransplant. In addition, 14 patients who underwent liver transplant after 1 year of their first visit were included. All patients had a known survival status at 1 year of hemodynamic measurement. At the time of listing, demographic, clinical, laboratory, and imaging data were recorded and considered for analyses.
The missing values (27 patients [6.5%] with missing serum sodium values and 42 patients [10.1%] with missing serum albumin values) were imputed for model calculation. We used serum bilirubin, serum creatinine, and INR, which were recorded for all patients, in a linear regression model to predict the missing values. All data were rendered anonymous before analysis.
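The regression-based imputation described above can be sketched as follows. This is only an illustration, not the authors' R code: the function name `impute_linear`, the use of ordinary least squares through NumPy, and the toy laboratory values are all assumptions made for the example.

```python
import numpy as np

def impute_linear(predictors, target):
    """Fill NaN entries of `target` with predictions from an ordinary
    least-squares fit on the fully observed predictors (plus intercept)."""
    X = np.column_stack([np.ones(len(target)), predictors])
    observed = ~np.isnan(target)
    # Fit only on patients whose target value was actually measured.
    coef, *_ = np.linalg.lstsq(X[observed], target[observed], rcond=None)
    filled = target.copy()
    filled[~observed] = X[~observed] @ coef
    return filled

# Hypothetical toy data: bilirubin (mg/dL), creatinine (mg/dL), INR.
predictors = np.array([
    [1.2, 0.8, 1.1],
    [3.5, 1.1, 1.6],
    [8.0, 1.9, 2.3],
    [2.1, 0.9, 1.3],
    [5.4, 1.4, 1.9],
    [4.0, 1.2, 1.7],
])
# Serum sodium (mmol/L); the last patient has a missing value.
sodium = np.array([140.0, 136.0, 128.0, 139.0, 132.0, np.nan])
sodium_filled = impute_linear(predictors, sodium)
```

Measured values are left untouched; only the missing entries are replaced by the fitted prediction.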
Table 1 summarizes the formulas and important developmental characteristics of the investigated models: MELD, integrated MELD (iMELD), MELD-to-serum sodium ratio (MESO), MELD with the incorporation of serum sodium (MELD-Na), updated MELD (uMELD), United Kingdom Model for End-Stage Liver Disease (UKELD), MELD with the incorporation of albumin (MELD-albumin), and the 5-variable MELD (5vMELD). These scores were analyzed as prognostic models for the prediction of 3-month, 6-month, and 1-year mortality after listing.
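For orientation, a few of these scores can be sketched in code. Table 1 is not reproduced in this text, so the coefficients and bounds below are the versions commonly cited in the literature and are assumptions that should be checked against Table 1; the function names are illustrative only.

```python
import math

def meld(bilirubin, creatinine, inr):
    """Standard MELD from bilirubin (mg/dL), creatinine (mg/dL), and INR.
    Assumed conventional bounds: values below 1 are set to 1 and
    creatinine is capped at 4 mg/dL."""
    bilirubin = max(bilirubin, 1.0)
    creatinine = min(max(creatinine, 1.0), 4.0)
    inr = max(inr, 1.0)
    return (3.78 * math.log(bilirubin)
            + 11.2 * math.log(inr)
            + 9.57 * math.log(creatinine)
            + 6.43)

def meso(meld_score, sodium):
    """MELD-to-serum sodium ratio (sodium in mmol/L), scaled by 100."""
    return meld_score / sodium * 100.0

def meld_na(meld_score, sodium):
    """MELD-Na, with sodium bounded to 120-135 mmol/L (assumed variant)."""
    sodium = min(max(sodium, 120.0), 135.0)
    return meld_score + 1.59 * (135.0 - sodium)

def imeld(meld_score, age, sodium):
    """Integrated MELD, adding age (years) and serum sodium (mmol/L)."""
    return meld_score + 0.3 * age - 0.7 * sodium + 100.0
```

Because the scores are on different scales (for example, iMELD adds a constant of 100), their raw values are not directly comparable between models.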
The predictive performance of a prognostic model was assessed in terms of overall performance, discrimination, calibration, and decision analysis. The overall performance was estimated using the Brier score (BS), which is a quadratic scoring rule, where the mean squared differences between actual outcomes and predicted probabilities are calculated:
$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(P_i - O_i\right)^2$$
where $N$ is the number of patients and $P_i$ is the predicted probability for individual $i$ to have the true outcome $O_i$ (0 for nonsurvivors and 1 for survivors). The Brier score can vary from 0 (a perfect model with equal predicted probabilities and actual outcomes) to 0.25 (a noninformative model with 50% incidence of the outcome).16,17
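The Brier score, as the mean squared difference between predicted probabilities and observed outcomes, can be computed in a few lines. The sketch below is illustrative; the function name `brier_score` and the toy inputs are assumptions, not part of the study's code.

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and
    observed binary outcomes (coded 0 or 1)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sum(squared) / len(squared)

# A perfect model scores 0; always predicting 0.5 scores 0.25.
perfect = brier_score([1.0, 0.0, 1.0], [1, 0, 1])
noninformative = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Lower values indicate better overall performance, which is why the models with the smallest Brier scores are treated as the most accurate.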
Discrimination, which refers to a model’s ability to correctly distinguish between 2 classes of outcomes (ie, survival and death), was assessed by the area under the receiver operating characteristic curve (AUC or C statistic).18 An AUC of 1.0 indicates perfect identification of the relative survival of all possible pairs of patients, whereas an AUC of 0.5 indicates that the model does not predict better than chance.19 P values for the comparison of C statistics were calculated with the use of the DeLong method.20 The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated using the best threshold (Youden index) to predict mortality.21 Discrimination slope was evaluated as the difference in mean of predictions between survivors and nonsurvivors.22
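The three discrimination measures named above can be illustrated with a small sketch. It is a simplified stand-in for the R packages used in the study: the function names are hypothetical, the AUC is computed by direct pairwise comparison rather than from a ROC curve, and which class is coded as 1 is left to the analyst.

```python
def auc(scores, labels):
    """Proportion of (positive, negative) pairs in which the positive
    case has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(scores, labels):
    """Cutoff maximizing J = sensitivity + specificity - 1, where a
    score at or above the cutoff is classified as positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

def discrimination_slope(predicted, labels):
    """Difference in mean predicted probability between the two classes."""
    pos = [p for p, y in zip(predicted, labels) if y == 1]
    neg = [p for p, y in zip(predicted, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

# Hypothetical toy predictions and outcomes.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
```

On this toy input, 5 of the 6 positive-negative pairs are ordered correctly, so the AUC is 5/6.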
Calibration refers to the agreement between actual outcomes and predictions by decile of predicted probabilities.23 The Hosmer-Lemeshow goodness-of-fit test is a measure of discrepancy between the actual and predicted outcomes. A significant P value indicates poor calibration, whereas a nonsignificant P value suggests good calibration.24 However, sensitivity to sample size and poor interpretability are known as its potential drawbacks.25 A graphical assessment of calibration (calibration plot) was performed with the predicted mortality rates on the x-axis and the actual mortality rates on the y-axis. When the plotted points lie on the 45-degree line, the model fits the data well.26
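A minimal sketch of the Hosmer-Lemeshow calculation and the points of a calibration plot is given below. It is not the study's implementation: the function names are hypothetical, groups are formed by simple equal-size splitting of the sorted predictions, and the P value helper uses a closed form that holds only for an even number of degrees of freedom (as with 10 groups, which gives 8).

```python
import math

def hosmer_lemeshow(predicted, observed, groups=10):
    """Return (statistic, points), where points holds one
    (mean predicted risk, observed event rate) pair per risk group."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    statistic, points = 0.0, []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        events = sum(o for _, o in chunk)
        mean_p = expected / size
        points.append((mean_p, events / size))
        # Groups with a mean prediction of exactly 0 or 1 are skipped
        # to avoid division by zero.
        if 0.0 < mean_p < 1.0:
            statistic += (events - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return statistic, points

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Usage with the conventional groups - 2 degrees of freedom:
# stat, points = hosmer_lemeshow(predicted, observed, groups=10)
# p_value = chi2_sf_even_df(stat, df=8)
```

Plotting `points` against the 45-degree line gives the calibration plot; points above the line correspond to underestimated risk.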
Decision curve analysis is a simple and novel method for comparing alternative prognostic strategies as clinical decision-making tools. This method is based on the principle that the relative harms of false positives and false negatives can be represented in terms of a probability threshold, using the clinical net benefit (NB) function
$$\mathrm{NB} = \frac{\mathrm{TP}}{N} - \frac{\mathrm{FP}}{N}\cdot\frac{P_t}{1 - P_t}$$
where TP and FP are the numbers of true-positive and false-positive classifications at the given threshold, $N$ is the total number of patients, and $P_t$ is the threshold probability. A range of threshold probabilities can then be used to create the decision curve for a prognostic model. Interpretation of the decision curve is based on comparing the net benefit of the model with that of a strategy of “treat all” (the gray line from top-left corner as shown in Figure 3) and “treat none” (parallel to the x-axis at the net benefit of zero as shown in Figure 3). The prognostic model with the highest net benefit at a particular $P_t$ is optimal, regardless of the size of the difference.27,28
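The net benefit calculation and its two reference strategies can be sketched as follows. This follows the usual definition of net benefit in decision curve analysis (true positives minus false positives weighted by the odds of the threshold, both divided by N); the function names and toy data are assumptions for illustration.

```python
def net_benefit(predicted, observed, threshold):
    """Net benefit at one threshold: TP/N - FP/N * Pt / (1 - Pt).
    A patient is classified positive when predicted >= threshold."""
    n = len(predicted)
    tp = sum(1 for p, o in zip(predicted, observed) if p >= threshold and o == 1)
    fp = sum(1 for p, o in zip(predicted, observed) if p >= threshold and o == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(observed, threshold):
    """Net benefit of classifying every patient as positive."""
    prevalence = sum(observed) / len(observed)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# "Treat none" has a net benefit of 0 at every threshold.
predicted = [0.9, 0.8, 0.4, 0.3, 0.2]   # hypothetical predictions
observed = [1, 1, 0, 1, 0]              # hypothetical outcomes

thresholds = [0.25, 0.5, 0.75]
curve = [(t, net_benefit(predicted, observed, t)) for t in thresholds]
```

A model is useful at a given threshold only when its net benefit exceeds both the "treat all" value and zero, which is how the ranges of optimal threshold probabilities are read from a decision curve.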
The chi-square test or the Fisher exact test (two-tailed) was used for categorical variables, and the independent-samples t test was used for continuous variables. For all tests, P < .05 was considered to be statistically significant. All analyses were conducted in the R statistical software version 3.3.2 (PredictABEL, pROC, and stdca packages, R Core Team, Vienna, Austria).
Results
Patients were predominantly older (median age 51 y; interquartile range, 39-59 y) and male (65.9%), and 29.8% had evidence of chronic hepatitis B cirrhosis. Of all patients, 42 (10.1%), 59 (14.2%), and 77 (18.5%) died 3 months, 6 months, and 1 year after listing, respectively. Regarding the different definitions in the formulas, UKELD tended to have the highest scores (39.2 ± 4.9), followed by iMELD (35 ± 8.4) and 5vMELD (19.8 ± 6.3) at the time of initial evaluation. The differences of serum bilirubin, serum creatinine, INR, serum albumin, and serum sodium, 5 parameters that are the foundation of most calculations, were statistically significant between the alive and deceased groups, as was the distribution of the listed portal hypertensive complications. Further basic and relevant clinical characteristics of the study population are summarized in Table 2.
All calculated Brier scores were smaller than 0.213 (Table 3), indicating an overall acceptable accuracy of all models at the 3 endpoints. The MESO (0.116), UKELD (0.117), and MELD-Na scores (0.122) had the smallest Brier score values for prediction of 3-month mortality. The MESO and UKELD, which had a Brier score of 0.161, and the MELD-albumin, which was similar to 5vMELD with Brier score of 0.166, were the scores with the most acceptable values at 6 months. In addition, MESO, MELD-albumin, and 5vMELD were associated with the smallest Brier scores (0.194, 0.195, and 0.196) for estimating the 1-year mortality.
The estimated AUCs for the 8 prognostic models in predicting 3-month, 6-month, and 1-year mortality are given in Table 3 and Figure 1. At 3 months, the 5vMELD outperformed its counterparts (0.829), followed by the MESO (0.828) and UKELD (0.828) scores. Use of MESO resulted in a significantly higher AUC than with the standard MELD (P = .042). At 6 months, the MELD-albumin test had the highest AUC (0.817), followed by 5vMELD (0.815) and MESO (0.803); comparisons between these showed no significant differences. At 1 year, the MELD-albumin had the highest AUC (0.826), followed by the 5vMELD (0.823) and MESO (0.804). Both MELD-albumin (P = .013) and 5vMELD (P = .049) had significantly higher AUCs than the standard MELD. Complete pairwise comparisons of the 8 models are listed in Table 4. The iMELD was associated with the largest discrimination slope for predicting 3-month (slope = 0.295), 6-month (slope = 0.255), and 1-year (slope = 0.250) mortality. The sensitivity (61.7% to 86.4%), specificity (66.1% to 88.3%), PPV (91.1% to 96.7%), NPV (24.8% to 42.1%), and accuracy (66.6% to 84.4%) of each prognostic model are reported in Table 5.
The Hosmer-Lemeshow goodness-of-fit test revealed a poor calibration for all scores at the 3 endpoints (P < .001) (Table 3). Calibration power was also explored by the calibration plot, which showed that the stratified predicted risk of all prognostic models did not completely overlap the 45-degree line. The scores with the most accurate calibration curve at 3 months (UKELD), 6 months (UKELD), and 1 year (MESO) are shown in Figure 2. All scores underestimated the mortality rate for the entire range of predicted risk.
Decision curve analysis
The maximum range of threshold probabilities in which a model performs better than both “treat all” and “treat none” strategies is summarized in Table 3. The top 3 most optimal models compared with the MELD-based alternatives that support the most extensive range of threshold probabilities are shown in Figure 3. At 3 months, MESO was identified as the most useful clinical decision-making tool, being optimal over threshold probabilities from 0.45 to 0.75, followed by UKELD and uMELD. At 6 months, MESO, UKELD, and 5vMELD covered the most extensive area. At 1 year, iMELD, 5vMELD, and UKELD were the most optimal models in the maximum number of threshold probabilities (please contact the corresponding author [AliakbarianM@mums.ac.ir] to receive all calculated net benefits and determine the most optimal prognostic model at each particular point of threshold probability).
Discussion
The present study assessed 7 MELD-based prognostic models and compared their performance to that of the standard MELD. We found MESO to be the top prognostic model to predict short-term mortality in terms of the overall performance, discrimination, and decision analysis. For prediction of 6-month mortality, although MESO was associated with the best Brier score and covered the most extensive range of optimal threshold probabilities, the albumin-based models (MELD-albumin and 5vMELD) provided the most acceptable discrimination between deceased and living patients. Similar results were shown for the 1-year endpoint. However, the iMELD and UKELD were also identified as useful clinical decision-making tools over a considerable range of threshold probabilities. The better performance of MESO compared with other derivations may be due to the structure of its development database in Taiwan, which may resemble our study population with regard to the prevalence of viral infections in Asian countries.14
In terms of calibration, none of the investigated prognostic models fitted to our data, which may highlight the effects of database structure on model behavior. Although our population was younger than those in the developmental studies, regarding the standard MELD score, patients experienced a more critical level of liver cirrhosis at the time of listing. The continuous underestimation of all prognostic models may be due to the nature of the most prevalent cause in our region accompanied by the severity of liver disease.
Our discrimination analyses showed that only 3 of 7 derivations of MELD scores that incorporated serum albumin and serum sodium (MESO at 3 months and MELD-albumin and 5vMELD at 1 year) predicted mortality better than the standard MELD. Among the MELD-based scores, MELD-albumin and 5vMELD were comparable, having AUC values in excess of 0.815 at all endpoints, confirming that serum albumin concentration is a strong predictor of wait list mortality.15,29 In addition, high PPVs, despite low NPVs, indicated that MELD-based models predict survival better than nonsurvival in our population.
In our population, age seems to be a poor predictor of mortality (lowest AUC of iMELD in both short-term and intermediate-term analyses). However, the growing age of the Iranian population may highlight this variable in the next generation of MELD-based-modified versions. Moreover, premature mortality attributable to viral infections may be noted as one of the most important lifestyle issues in Iran. However, a national immunization program against hepatitis B infection has been in existence since 1993.30
Comparison to similar studies
The excellent discrimination power of MELD, MELD-Na, UKELD, uMELD, iMELD, and MESO confirmed previous evaluations.31-34 However, Kaltenborn and associates reported contradictory results (AUC < 0.7).35 The pattern of provided calibration plots for MELD-Na and iMELD did not confirm previous results.33,35 However, considerable underestimation of mortality rate for high-risk patients was also shown by Biselli and associates.33 It is noteworthy that the acceptable Brier scores for MELD, MESO, MELD-Na, UKELD, iMELD, and uMELD in our study were not shown in the study from Kaltenborn and associates (Brier score > 0.25).35 Inconsistent performance measures from Western countries may be explained by differences in ethnicity, lifestyle, and major causes of disease, which may affect model behavior. Because of the lack of decision curve analyses in similar papers, decision analysis interpretations are not comparable to previous reports.
Limitations and strengths
A potential limitation of this study is that the most prevalent cause of disease was chronic hepatitis B cirrhosis, similar to most other Asian countries, which may affect the generalizability of results to Western countries. However, previous studies have shown that the cause of cirrhosis is not an important predictor in MELD-based prognostic models.1,36,37 In addition, analyses were performed on patients referred to 1 of 3 liver transplant centers in Iran, which may prevent us from drawing broad generalized conclusions. In contrast, to the best of our knowledge, this is the first comprehensive evaluation study that investigated most of the recently proposed MELD-based prognostic models (from 2001 to 2014) using a combination of traditional and novel statistical measures. Detailed examination of model behavior over a maximal range of threshold probabilities may pave the way for the application of multimodel strategies with regard to the specific therapeutic considerations in different centers.
In conclusion, among the prognostic models proposed for prioritizing liver transplant candidates, some of those incorporating serum sodium and serum albumin, namely MESO, MELD-albumin, and 5vMELD, are the most effective scores in predicting short-term and medium-term mortality. However, poor calibration may affect their usefulness at the individual level.
Volume : 16
Issue : 6
Pages : 721 - 729
DOI : 10.6002/ect.2017.0149
From the 1Student Research Committee, Department of Medical Informatics, Faculty of Medicine and the 2Pharmaceutical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; the 3Department of Medical Informatics, University of Amsterdam, Amsterdam, The Netherlands; and the 4Surgical Oncology Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
Acknowledgements: This study received no funding support, and the authors have no conflicts of interest to declare. AA, SE, MA, and FT conceived the study idea and design. FT and MA collected data. FT and AA performed statistical analyses. FT and SE drafted the manuscript. All authors were involved in critically revising the manuscript and approved the final version. The authors appreciate the nursing personnel of liver transplant wait list coordination at the Montaserieh Transplant Center, Mashhad University of Medical Sciences, especially Monir Mirzadeh, for their kind assistance during patient follow-up.
Corresponding author: Mohsen Aliakbarian, Surgical Oncology Research Center, Imam Reza Hospital, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
Phone: +98 5138022664
Table 1. Developmental Characteristics of the Investigated Model for End-Stage Liver Disease-Based Prognostic Models As Published Previously
Table 2. Distribution of Baseline Demographic and Clinical Characteristics of the Study Population
Table 3. Performance Measures of the Model for End-Stage Liver Disease-Based Prognostic Models to Predict 3-Month, 6-Month, and 1-Year Mortality on Liver Transplant Wait List
Table 4. Pairwise Comparison of the Area Under the Curve to Predict 3-Month, 6-Month, and 1-Year Mortality Among the Model for End-Stage Liver Disease-Based Prognostic Models
Table 5. Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, and Accuracy for the Model for End-Stage Liver Disease-Based Prognostic Models Used to Predict Mortality With the Best Predictive Threshold at 3 Months, 6 Months, and 1 Year
Figure 1. Receiver Operating Characteristic Curves of the Top 3 Models
Figure 2. Calibration Plots of the Prognostic Model
Figure 3. Decision Curves of the Top 3 Models