Objectives: Living donor kidney transplant requires complex recipient evaluation under substantial cognitive load. Large language models may augment clinician reasoning; however, their role in transplant assessment remains unexplored.
Materials and Methods: We conducted a prospective, vignette-based study evaluating ChatGPT (GPT-5, OpenAI) as a decision-support tool to assess living donor kidney transplant recipients. Fourteen nephrology fellows completed 22 expert-validated vignettes unaided and, after a 72-hour washout, completed the same vignettes with the assistance of ChatGPT. Primary outcomes were accuracy (agreement with expert reference) and completeness (coverage of critical decision elements). Secondary outcomes included unsafe potential (risk of harm from omission/error) and cognitive workload (NASA Task Load Index).
Results: We analyzed 308 paired responses. Accuracy improved from 68.4% (SD 7.5) to 86.2% (SD 5.6; mean change of 17.8%, d=1.25). Completeness rose from 63.5% (SD 8.1) to 82.1% (SD 6.9; mean change of 18.6%, d=1.31). Gains were greatest in complex vignettes (change of 24.5%; P = .002). Unsafe potential was present in 18 vignettes (82%) unaided but was reduced to 5 (23%) with ChatGPT5, an absolute reduction of 59% (P < .001). Omissions fell from 6 (27%) to 2 (9%) (reduction of 18%; P = .01). NASA Task Load Index scores declined substantially, with large effect sizes in mental demand (d=3.55), effort (d=3.65), and frustration (d=2.97).
Conclusions: In our study, large language model support significantly enhanced accuracy, completeness, and safety while reducing cognitive workload among nephrology fellows. These findings suggest that large language models can operate as both cognitive aids and safety nets in transplant evaluation. Real-world validation and continuous audits are necessary before integration into clinical workflows.
Key words: Artificial intelligence, ChatGPT5, Nephrology, Renal transplant
Introduction
Across the world, living donor kidney transplant (LDKT) remains the definitive treatment for end-stage kidney disease when deceased donation is unavailable. Living donor kidney transplant demands complex clinical reasoning and imposes substantial cognitive load, especially during recipient evaluation. Clinicians must weigh comorbidities, immunologic risks, timing of transplant, and psychosocial issues during LDKT. In these scenarios, safe, guideline-concordant decisions and a personalized approach are desired.1
In the current era, clinical decision support systems are rigid, lack contextual flexibility, and require structured data inputs. These limitations question the usability of such systems in dynamic, case-based evaluations, such as evaluations of LDKT recipients.2 In contrast, large language models (LLMs) like ChatGPT5 can generate nuanced, context-sensitive outputs from natural language prompts, potentially easing cognitive burden for clinicians navigating multifactorial scenarios.3,4
Emerging evidence has shown that generative artificial intelligence (AI) holds promise in health care. Artificial intelligence improves the accessibility of patient-facing information (for example, simplifying living kidney donation details to an 8th-grade reading level),5 enhances decision efficiency,6 and may reduce clinician cognitive load.7 However, most applications have been in patient communication rather than provider-facing decision support. In addition, no AI systems have been rigorously evaluated in the context of LDKT recipient assessment.
To address this gap, we conducted a vignette-based study to evaluate the impact of ChatGPT5 as a decision-support tool in LDKT recipient evaluation. We assessed changes in accuracy, completeness, safety, and cognitive workload when fellows received ChatGPT5 assistance compared with unaided performance.
Materials and Methods
Study design and oversight
We conducted a prospective, vignette-based evaluation of ChatGPT (GPT-5, OpenAI) as a decision-support tool in the assessment of LDKT recipients.
The study followed the principles outlined in the Declaration of Helsinki. The institutional ethics committee (ILBS Vasant Kunj, New Delhi) approved the protocol. Formal written informed consent was obtained from all participants before the start of the study.
Participants
We enrolled 14 nephrology fellows (6 first-year fellows and 8 senior fellows) from a tertiary academic transplant center in India between June and August 2025. We recorded the following baseline characteristics: age, sex, training stage, prior exposure to kidney transplant evaluations, familiarity with Kidney Disease: Improving Global Outcomes (KDIGO) guidelines, self-rated digital literacy, and prior use of ChatGPT5.
Vignette development
We designed 22 clinical vignettes that reflected real-world LDKT recipient scenarios (Table 1). Each vignette was reviewed by a multidisciplinary panel of 3 transplant nephrologists for validity, realism, and safety implications. Vignettes were mapped to Bloom’s taxonomy (apply, analyze, evaluate) and stratified by complexity (easy, typical, complex/edge-case). Clinical domains included infection and timing, malignancy, glomerular disease recurrence, cardiovascular risk, neurologic comorbidities, anticoagulation, and frailty. Experts independently flagged unsafe decision potential.
Intervention and study arms
The study comprised 2 arms: (1) an unaided condition (pre-ChatGPT5), in which fellows independently completed all 22 vignettes; and (2) a ChatGPT5-aided condition (post-ChatGPT5), in which, after a 72-hour washout to minimize recall bias, fellows completed the same vignettes with ChatGPT5 outputs provided alongside case details. The ChatGPT5 prompts included a role/system prompt (expert nephrologist and transplant physician; timestamp: 2025-09-13, 13:30 IST). Generation parameters were temperature 0.2 (low randomness, stable outputs), top-p 1.0 (standard nucleus sampling), maximum tokens 2000, frequency penalty 0, and presence penalty 0. All ChatGPT5 responses were archived in a model audit file (API-generated, timestamped outputs) to ensure reproducibility and external verification. The data are available as an open access source (https://doi.org/10.5281/zenodo.17110754).
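The generation settings above can be expressed as a reproducible request payload, which is one way to build the kind of auditable, timestamped record the Methods describe. This is a minimal sketch: the model identifier ("gpt-5") and the exact system-prompt wording are illustrative assumptions, and `build_request` is a hypothetical helper; only the sampling parameters are taken from the paper.

```python
# Sampling parameters as reported in the Methods; the model name and the
# system-prompt text are ASSUMPTIONS, not taken verbatim from the study.
GENERATION_PARAMS = {
    "model": "gpt-5",        # assumption: exact API model identifier not stated
    "temperature": 0.2,      # low randomness, stable outputs
    "top_p": 1.0,            # standard nucleus sampling
    "max_tokens": 2000,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}

def build_request(vignette_text: str) -> dict:
    """Assemble one auditable chat-completion request for a single vignette."""
    return {
        **GENERATION_PARAMS,
        "messages": [
            {"role": "system",
             "content": "You are an expert nephrologist and transplant physician."},
            {"role": "user", "content": vignette_text},
        ],
    }
```

Archiving each returned response alongside its request payload and a timestamp yields the model audit file used for external verification.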
Measured outcomes
The primary outcomes were accuracy (agreement with expert reference answers) and completeness (coverage of all critical decision elements). Secondary outcomes included unsafe response potential (risk of harm from omission/error) and cognitive workload, assessed using the NASA Task Load Index across 6 subscales.
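For the workload outcome, a brief scoring sketch may be helpful. The paper does not state whether the weighted or raw (unweighted) NASA Task Load Index was used; the sketch below assumes the common raw-TLX convention, in which the 6 subscale ratings (each 0-100) are simply averaged.

```python
from statistics import mean

# The six NASA-TLX subscales, each rated on a 0-100 scale.
SUBSCALES = ("mental_demand", "physical_demand", "temporal_demand",
             "performance", "effort", "frustration")

def raw_tlx(ratings: dict) -> float:
    """Raw (unweighted) NASA-TLX: the mean of the six subscale ratings."""
    missing = [s for s in SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return mean(ratings[s] for s in SUBSCALES)
```

Pre/post composite scores computed this way, or the individual subscale ratings, can then be compared with the paired tests described under Statistical analyses.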
Statistical analyses
Sample size was calculated a priori. For paired comparisons before and after ChatGPT5, a minimum of 12 participants completing 20 vignettes each (yielding more than 240 paired responses) provided greater than 80% power to detect a medium effect size (d = 0.5) at α = 0.05. With 14 fellows and 22 vignettes, our study generated 308 paired evaluations, exceeding the required number. Paired t tests or Wilcoxon signed-rank tests were used for continuous outcomes, and the McNemar test was used for categorical comparisons. Effect sizes were expressed as Cohen d. Data were visualized with a raindrop plot, a heat map, and a bar chart. All analyses were performed in R version 4.3.2. P < .05 was considered statistically significant.
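The core quantities above can be sketched in a few lines. Analyses in the study were performed in R; the Python sketch below is illustrative only, using the paired-samples Cohen d (d_z = mean difference / SD of differences), the discordant-pair counts that feed the McNemar test, and a normal-approximation sample size for a paired comparison (two-sided α = .05, 80% power). The function names are our own; the exact power formula used in the study is not stated.

```python
import math
from statistics import mean, stdev

def cohens_dz(pre, post):
    """Paired-samples Cohen's d (d_z): mean of differences / SD of differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)

def mcnemar_counts(pre_flags, post_flags):
    """Discordant-pair counts (b, c) for McNemar's test on paired binary outcomes."""
    b = sum(1 for p, q in zip(pre_flags, post_flags) if p and not q)
    c = sum(1 for p, q in zip(pre_flags, post_flags) if not p and q)
    return b, c

def pairs_needed(d, z_alpha=1.96, z_beta=0.84):
    """Normal-approximation pairs for a paired t test (alpha=.05 two-sided, power=.80)."""
    return math.ceil(((z_alpha + z_beta) / d) ** 2)
```

Under this approximation, detecting d = 0.5 requires roughly 32 pairs, so the 240-plus paired responses anticipated in the power calculation comfortably exceed the requirement.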
Results
Tables 1, 2, and 3 describe the vignette bank, comprising 22 unique clinical scenarios representing a broad spectrum of evaluations of patients before kidney transplant. Cognitive complexity was skewed toward higher-order reasoning: 7 vignettes (32%) primarily tested straightforward application of knowledge, 6 (27%) required analytic reasoning, 6 (27%) required evaluative judgment, and 3 (14%) demanded combined application-evaluation skills. In terms of difficulty, 11 (50%) were classified as complex or edge-case scenarios (8 [36%] as typical and 3 [14%] as easy). Thematically, infection and timing (5 [23%]), glomerular disease recurrence risk (5 [23%]), and malignancy, including posttransplant lymphoproliferative disorder and onco-urological issues (4 [18%]), were the most represented domains, followed by cardiovascular (2 [9%]), neurology (2 [9%]), anticoagulation (2 [9%]), frailty (1 [5%]), and bone-mineral disorders (1 [5%]). Importantly, 18 of 22 vignettes (82%) were deliberately constructed with the potential for unsafe recommendations if misinterpreted, thereby enabling robust evaluation of ChatGPT5’s safety performance.
Among the 14 fellows who participated, mean age was 31.5 ± 2.4 years. Twelve were men (86%), and 2 were women (14%). Six (43%) were in the first year of nephrology training, and 8 (57%) were in the second or third year. Regarding prior exposure to kidney transplant evaluations, 2 (14%) had attended none, 7 (50%) had attended 1 to 20 cases, and 5 (36%) had attended more than 20 cases. Familiarity with KDIGO guidelines was reported as none in 3 (21%), some in 7 (50%), and good in 4 (29%). Five participants (36%) reported prior use of ChatGPT5 for educational or clinical tasks, and 9 participants (64%) had no prior use. Rated comfort with digital tools (on a 1 to 5 scale) had a median score of 3 (interquartile range, 2-4).
Overall accuracy improved from 68.4% (SD 7.5) before aid with ChatGPT5 to 86.2% (SD 5.6) after aid with ChatGPT5 (mean change of 17.8%, SD 6.3). Completeness rose from 63.5% (SD 8.1) to 82.1% (SD 6.9), corresponding to a mean change of 18.6% (SD 7.4). Effect sizes were large for both accuracy (Cohen d = 1.25) and completeness (d = 1.31). Improvements were most pronounced in complex vignettes (change of 24.5%) compared with typical cases (change of 11.2%; P = .002) (Figure 1).
Across the 6 NASA Task Load Index subscales, scores decreased with ChatGPT5 aid (Figure 2). Mental demand decreased from 69.8 ± 7.2 before aid with ChatGPT5 to 50.7 ± 9.1 after aid (improvement of 19.0 ± 5.4; P < .001; d = 3.55). Physical demand decreased from 69.3 ± 7.6 to 68.4 ± 7.8 (change of 0.8 ± 1.1; P = .011; d = 0.80). Temporal demand decreased from 73.6 ± 9.0 to 61.2 ± 9.8 (change of 12.4 ± 5.0; P < .001; d = 2.49). Performance scores changed from 71.6 ± 8.1 to 70.6 ± 7.8 (change of 1.0 ± 1.3; P = .014; d = 0.76). Effort decreased from 69.5 ± 8.4 to 54.9 ± 8.2 (change of 14.7 ± 4.0; P < .001; d=3.65). Frustration decreased from 74.5 ± 9.6 to 65.0 ± 10.3 (change of 9.5 ± 3.2; P < .001; d = 2.97).
Across the 22 vignettes, unsafe potential (Figure 3), defined as a response that could plausibly lead to patient harm if acted upon, was present in 18 cases (82%) when fellows responded unaided. With ChatGPT5 intervention, this proportion decreased substantially to 5 cases (23%) (absolute reduction, 59%; P < .001). Omissions of key clinical considerations were observed in 6 vignettes (27%) at baseline, most notably in scenarios involving frailty (vignette 1), retransplant with nonadherence (vignette 2), high-risk glomerulopathies (vignettes 4 and 15), bone-mineral disease (vignette 10), and pulmonary hypertension (vignette 21). After the ChatGPT5 intervention, omissions decreased to 2 vignettes (9%), representing an absolute reduction of 18% (P = .01). The greatest benefit was observed in complex glomerular recurrence and cardiovascular cases, where omission rates were nearly eliminated.
Discussion
In this vignette-based assessment of LLMs in the evaluation of LDKT recipients, we noted significant gains in accuracy (17.8%) and completeness (18.6%). These gains were consistent with recent reports that LLMs improve structured reasoning in domains that require the synthesis of diverse clinical inputs. Importantly, these improvements were most pronounced in complex and edge-case scenarios, suggesting that AI augmentation may be most valuable where both the stakes and human error rates are highest. This observation aligns with prior evidence that clinicians’ cognitive load peaks during uncertain or high-stakes transplant evaluations and that support systems are most beneficial in such contexts. This is particularly relevant for transplant nephrology, where burnout and cognitive overload are well documented and where AI can improve both patient safety and trainee well-being.8 Although burnout has not been systematically quantified in emerging nations like India, it is expected to be high given the workload and limited infrastructure.
Our participant cohort comprised relatively young, predominantly male fellows, reflecting current trainee demographics in tertiary care nephrology centers in India. We observed some variation in prior exposure to kidney transplant evaluations and KDIGO guidelines, highlighting heterogeneity in baseline preparedness that may influence both learning curves and reliance on decision-support tools. Baseline data also showed limited prior use of ChatGPT5 and only moderate comfort with digital technologies. This underscores that the observed workload and accuracy benefits of LLMs were achieved despite minimal preexisting familiarity, suggesting a low barrier to adoption in real-world training settings such as India.9 Hence, this study has important training implications: it positions LLMs not only as adjunct decision tools but also as educational scaffolds,10 helping trainees internalize guideline-concordant reasoning patterns.
In this study, aid with ChatGPT5 was associated with substantial reductions in cognitive workload among nephrology fellows, particularly in the domains of mental demand, effort, temporal demand, and frustration. The large effect sizes in these subscales suggest that AI assistance may meaningfully ease the cognitive and emotional burden of complex case analysis. In contrast, changes in physical demand and perceived performance were small, suggesting that ChatGPT5 predominantly supports cognitive rather than physical or outcome-related aspects of task execution.11 These findings align with prior literature on clinical decision support systems,12 which has demonstrated improvements in efficiency and mental workload without directly altering user performance metrics.
Across the 22 vignettes, an unsafe potential, defined as a response that could plausibly lead to patient harm if acted upon, was identified in 18 cases (82%) (Figure 3). These cases were predominantly concentrated in scenarios involving glomerular disease recurrence, malignancy, and complex cardiovascular or neurological comorbidities. Without ChatGPT5 aid, fellows’ responses showed frequent omissions, most notably in vignette 1 (frailty), vignette 2 (retransplant nonadherence), vignette 4 (C3 glomerulopathy), vignette 10 (bone-mineral disease), vignette 15 (anti-glomerular basement membrane nephritis), and vignette 21 (pulmonary hypertension). Assistance with ChatGPT5 markedly reduced both unsafe potential and omission rates, ensuring that key risk factors such as recurrence, perioperative bleeding, and infection control were consistently addressed. In brief, with ChatGPT5 integration, the proportion of unsafe responses fell from more than four-fifths to less than one-quarter, whereas omissions decreased by two-thirds. This dual improvement suggests that LLMs can act as a safeguard against overlooked clinical considerations, particularly in complex or high-risk cases, by standardizing attention to detail and ensuring guideline-concordant reasoning.13
The strength of our study is underscored by our rigorous methodology and reporting. Our evaluation used a model audit file to ensure reproducibility and accountability, a crucial safeguard given the nondeterministic nature of generative AI outputs. However, despite encouraging results, LLM integration should be approached cautiously.14 The study design cannot establish generalizability beyond the vignette setting; real-world prospective validation will be necessary. Importantly, ChatGPT5 outputs occasionally required human correction, affirming that AI should augment, not replace, expert judgment.
Conclusions
Our report underscores the dual role of LLMs as a cognitive aid and a safety enhancer in LDKT recipient assessment. Integration of LLMs into routine training and multidisciplinary evaluation workflows holds substantial promise,15 with benefits spanning standardized quality, fewer unsafe omissions, and faster decision-making. Future work should explore hybrid human-AI models, prospective clinical trials, and continuous audit pipelines to ensure both fidelity and safety.
References:

Volume : 24
Issue : 1
Pages : 43 - 50
DOI : 10.6002/ect.2025.0233
From the Department of Nephrology, ILBS Vasant Kunj, New Delhi, India
Acknowledgements: The authors have not received any funding or grants in support of the presented research or for the preparation of this work and have no declarations of potential conflicts of interest.
Corresponding author: Hari Shankar Meshram, Department of Nephrology, ILBS Vasant Kunj, New Delhi, India
Phone: +91 7999514306 E-mail: hsnephrology@gmail.com
Table 1. Detailed Question Quality
Table 1 (Continued). Detailed Question Quality
Table 2. Vignette Description
Table 3. Vignette Categorization
Figure 1. Accuracy and Completeness Improvement Before and After ChatGPT5 Intervention
Figure 2. NASA Task Load Index Scores Before (Pre) and After (Post) ChatGPT5 Intervention
Figure 3. Error Improvement