ChatGPT in clinical decision support: capabilities and limitations

Explore ChatGPT's role in clinical decision support: where it helps, where it fails, and how to evaluate AI tools safely in healthcare

Clinical decision support has existed in healthcare for decades, from simple drug interaction alerts embedded in prescribing systems to sophisticated rule-based diagnostic tools. The arrival of large language models (LLMs), such as ChatGPT, has changed the picture. These tools can engage with complex clinical questions in natural language, summarise lengthy patient histories, and generate plausible-sounding clinical reasoning in seconds. For clinicians, this creates a genuine dilemma: these tools are already being used informally in clinical workflows, yet their validation for high-stakes decision-making remains limited and uneven. Understanding where LLMs genuinely help, where they fail, and what governance frameworks should surround their use is now a practical necessity for anyone working in healthcare.

What 'clinical decision support' actually means in 2026

Clinical decision support (CDS) is not a single technology. It is a spectrum. At one end sit passive, rule-based alerts: a prescribing system flagging a contraindicated drug combination, or a reminder that a patient is overdue for a cervical smear. At the other end are active, AI-driven tools that interpret unstructured clinical data, generate differential diagnoses, or recommend treatment pathways in real time.

This distinction matters for risk stratification. A rule-based alert that fires incorrectly is annoying and can cause alert fatigue. An LLM that confidently generates a plausible but incorrect differential diagnosis in a complex case can contribute to patient harm. Not all CDS tools carry the same risk profile, and evaluating general-purpose LLMs like ChatGPT against this spectrum requires clarity about which part of the spectrum is being discussed.

In 2026, most clinical AI sits somewhere in the middle: tools that blend pattern recognition on large datasets with structured clinical logic. ChatGPT and its equivalents occupy a distinctive position. They are extraordinarily capable at language tasks, but they were not designed, trained, or validated as medical devices.

How large language models work: a clinician's primer

LLMs like ChatGPT generate text by predicting the most statistically probable next token (a word fragment) based on patterns learned from vast quantities of training data. They do not reason in the way a clinician reasons. They do not access a structured knowledge base, consult a pharmacopoeia, or apply logical rules. They generate responses that are linguistically coherent and often factually accurate, because accurate information was well-represented in their training data.
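The next-token mechanism described above can be illustrated with a toy sketch. It is not how any production model is implemented: the three candidate tokens and their scores are hypothetical, and a real model scores tens of thousands of tokens using a neural network. The point is only that the output is drawn from a probability distribution rather than looked up.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def sample_next_token(logits, rng):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for three candidate continuations of a prompt.
toy_logits = {"drug_a": 2.0, "drug_b": 1.2, "drug_c": 0.1}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_next_token(toy_logits, rng) for _ in range(10)]

print(softmax(toy_logits))  # "drug_a" is the most probable, but not certain
print(draws)                # less probable tokens can still be drawn
```

Because each token is sampled, the same prompt can produce different continuations on different runs, which is the variability the following paragraphs contrast with rule-based systems.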

This mechanism is fundamentally different from the rule-based CDS systems most clinicians already encounter in their medical record systems. A rule-based system will always flag a specific drug interaction if the logic is correctly coded. An LLM may or may not, depending on how the question is phrased, what was in its training data, and the inherent probabilistic variability of its outputs.
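For contrast, a rule-based interaction check can be sketched as a plain lookup. The table entries, drug names, and the `check_interaction` helper below are hypothetical placeholders rather than clinical guidance or any vendor's implementation; the sketch only shows why a correctly coded rule returns the same answer every time.

```python
# Hypothetical interaction table; entries are illustrative, not clinical guidance.
INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): "contraindicated",
    frozenset({"drug_a", "drug_c"}): "monitor closely",
}


def check_interaction(first, second):
    """Return the coded alert for a drug pair, or None if no rule exists."""
    # Normalise spelling and ignore order so the same pair always hits the same rule.
    pair = frozenset({first.strip().lower(), second.strip().lower()})
    return INTERACTIONS.get(pair)


print(check_interaction("Drug_A", "drug_b"))  # same result however the pair is written
print(check_interaction("drug_b", "drug_a"))
print(check_interaction("drug_b", "drug_c"))  # no rule coded, so no alert
```

The trade-off is visible in the last call: the lookup is perfectly consistent, but it knows nothing beyond the rules that were coded into it.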

Understanding this distinction is clinically important. LLMs are not lookup systems with a reliable index. They are sophisticated pattern-matching engines that produce language, and their outputs should be evaluated accordingly.

Where ChatGPT shows genuine promise in clinical workflows

Despite these architectural limitations, the evidence base for LLM utility in specific clinical tasks is growing. A scoping review of 28 studies published in Health Science Reports identified clinical decision support and medical education as the most commonly cited advantages of ChatGPT in healthcare settings.

The areas where evidence of utility is strongest include:

  • Clinical documentation: A synthesis published in Frontiers in Artificial Intelligence found 40–70 per cent time savings in clinical documentation tasks when LLMs assisted with note generation and summarisation.

  • Supporting non-specialists: A June 2026 study in NPJ Digital Medicine evaluating ChatGPT-4o on 100 real-world polyneuropathy cases found it achieved 65.5 per cent leading diagnosis accuracy, comparable to non-specialist neurologists (63.0 per cent), though lower than specialists (74.0 per cent). Non-specialists revised their assessments in 21.8 per cent of cases after reviewing ChatGPT-4o's suggestions, improving their accuracy.

  • Differential diagnosis breadth: In the same polyneuropathy study, ChatGPT-4o outperformed non-specialists on differential diagnoses (82.0 per cent vs 77.5 per cent) and recommended more appropriate confirmatory tests (68.0 per cent vs 53.0 per cent).

  • Patient communication: LLMs demonstrate measurable utility in drafting patient letters, explaining diagnoses in accessible language, and supporting health literacy tasks.

  • Literature synthesis: For clinicians conducting rapid evidence reviews, LLMs can accelerate the identification and summarisation of relevant studies, though outputs require verification.

A systematic review of LLMs across the colorectal cancer care continuum, published in the Journal of Medical Internet Research in May 2026, found utility in automating data extraction from clinical texts, supporting patient education, and assisting clinical decision-making. Domain-specific and multimodal models showed advantages over general-purpose models in certain tasks.

The accuracy problem: hallucinations, outdated knowledge, and confidence without certainty

The most clinically significant limitation of LLMs is hallucination, the generation of plausible-sounding but factually incorrect information. In everyday language tasks, a hallucinated detail is a minor inconvenience. In clinical contexts, it can be dangerous.

A November 2025 analysis of hallucination risks in LLMs deployed in healthcare found that hallucinated outputs appearing credible can guide clinicians toward harmful interventions, influencing both diagnostic pathways and therapeutic choices. The analysis noted that even minor inaccuracies can escalate clinical risk when they align with existing clinician cognitive bias. The model confirms what the clinician already suspects, and the error goes unchallenged.

The hallucination problem is compounded by a confidence calibration failure. A comprehensive evaluation of four LLMs on complex rheumatology cases, published in Health Informatics Journal in April 2026, found that while ChatGPT-4 achieved 100 per cent primary diagnosis accuracy, albeit on a small test set, Spearman correlation analysis revealed uniformly weak and non-significant associations between reasoning quality and expressed certainty across all models. The models did not reliably know when they were wrong, and expressed similar confidence regardless of whether their reasoning was sound.
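For readers unfamiliar with the statistic, Spearman correlation measures whether two sets of scores rise and fall together by comparing their ranks. The sketch below is a minimal standard-library version with made-up scores; the study's actual data and analysis code are not reproduced here, and in practice a statistics package would also report a p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for six cases: expert-rated reasoning quality (1-5)
# and the model's self-reported confidence (0-100).
reasoning_quality = [2, 5, 3, 4, 1, 5]
stated_confidence = [90, 85, 90, 88, 92, 86]

print(round(spearman(reasoning_quality, stated_confidence), 2))
```

A well-calibrated model would show a clearly positive value, with confidence rising as reasoning quality rises. A value near zero means stated confidence carries little information about whether the reasoning can be trusted.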

ChatGPT demonstrated the lowest hallucination rate of the four models tested (7.4 per cent), compared to Gemini's 18.5 per cent. Even a 7.4 per cent hallucination rate in a clinical context is not a trivial figure.
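A rough calculation shows why. The sketch below assumes, purely for illustration, that the 7.4 per cent figure applies to each output independently; the study may have measured the rate differently, and real errors are unlikely to be fully independent.

```python
def probability_of_at_least_one(rate, n_outputs):
    """Chance that at least one of n independent outputs contains a hallucination."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return 1 - (1 - rate) ** n_outputs


# 0.074 is the rate reported above; the independence assumption is ours.
for n in (1, 10, 20, 50):
    print(n, round(probability_of_at_least_one(0.074, n), 3))
```

Under these assumptions, a clinician who consults the model ten times has a little better than even odds of seeing at least one hallucinated output.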

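Additional accuracy limitations include:

  • Inconsistent answers to repeated queries: A European study published in Clinical Pharmacology and Therapeutics in February 2025 found that ChatGPT produced different answers in 90 per cent of cases when the same drug interaction query was repeated, and missed clinically relevant interactions.

  • Degraded performance in complex patients: A simulation-based study published in Healthcare (MDPI) in July 2025 found that ChatGPT's reliability declined significantly in cardiovascular cases with comorbidities.

  • Outdated knowledge: A model's knowledge stops at its training cut-off, so guidance published afterwards is not reflected unless it is supplied in the prompt.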

Regulatory status: is ChatGPT a medical device?

The regulatory position of general-purpose LLMs under the EU Medical Device Regulation (MDR) is not straightforward, and clarity in this area is still evolving. The MDR defines a medical device partly by its intended purpose. Software intended for diagnosis, prevention, monitoring, or treatment of disease falls within scope. General-purpose LLMs like ChatGPT are not marketed with a medical intended purpose, which means they do not currently carry CE marking as medical devices.

This creates a significant accountability gap. When a clinician uses a regulated clinical decision support tool, there is a validated evidence base, a defined intended use, a manufacturer's liability framework, and a regulatory audit trail. When a clinician uses consumer ChatGPT to support a clinical decision, none of those safeguards apply. Clinical accountability rests entirely with the clinician.

This distinction is not merely bureaucratic. It affects whether a tool has been tested against real patient populations, whether its outputs have been validated against clinical outcomes, and whether any mechanism exists for post-market surveillance if the tool causes harm.

Patient safety risks clinicians should not overlook

Beyond hallucination, several distinct safety failure modes are relevant to clinical practice.

Automation bias is the tendency for clinicians to defer to an AI-generated suggestion even when their own clinical judgement would have led to a different, and potentially more accurate, conclusion. This risk is well-documented in the human factors literature and is directly applicable to LLM use in clinical settings.

Performance in sequential decision-making is substantially worse than in static question-answering. The AgentClinic benchmark, published in NPJ Digital Medicine in April 2026, found that LLM diagnostic accuracy in sequential, interactive clinical scenarios can drop to below one-tenth of the accuracy achieved on equivalent static questions. Real clinical encounters are sequential and dynamic. Most benchmark evaluations are not.

Rare diagnoses present a particular vulnerability. LLMs are trained on population-level data, and rare conditions are by definition underrepresented. A model that performs well on common presentations may perform poorly on the atypical or rare case that most requires diagnostic support.

Unsupervised patient access introduces a separate risk category. Patients are already using consumer LLMs to interpret symptoms, assess medications, and make decisions about whether to seek care. The quality controls that apply to clinician-facing tools do not apply in this context, and the potential for harm from unsupervised AI-generated clinical information is a recognised concern in the patient safety literature.

A study evaluating five LLMs on 50 multiple myeloma clinical scenarios, assessed by three independent haematologist-oncologists, concluded that LLM treatment recommendations require careful clinical supervision to ensure patient safety, even in a structured, oncology-specific evaluation context.

Data security, GDPR, and the problem of inputting patient data

A compliance risk that clinicians frequently underestimate is entering identifiable patient data into consumer-facing LLM tools, which raises significant data security and privacy concerns. Under both the UK GDPR and the EU General Data Protection Regulation (GDPR), patient health data is special category data requiring an explicit legal basis for processing. Entering a patient's name, date of birth, clinical history, or test results into a consumer LLM tool almost certainly constitutes processing of personal data by a third-party controller, without a data processing agreement, without a lawful basis, and potentially with data residency in jurisdictions outside the UK or EU.

The distinction between consumer ChatGPT and enterprise or healthcare-specific deployments is material here. Enterprise agreements with OpenAI or equivalent providers can include data processing agreements, commitments not to use inputs for model training, and defined data residency. Consumer-facing tools carry none of these protections by default.

Clinicians using any AI tool in a clinical context should verify:

  • Whether a data processing agreement is in place between their organisation and the AI provider

  • Where patient data is processed and stored

  • Whether the tool has been approved by their organisation's information governance team

  • Whether patient data can be de-identified before input without losing clinical utility
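The last point can be made concrete with a deliberately simple sketch. The patterns, the placeholder labels, and the `redact` helper are hypothetical, and pattern matching of this kind is not sufficient de-identification on its own: free-text notes contain many indirect identifiers that it will miss. Any real workflow would need validated tooling and sign-off from information governance.

```python
import re

# Hypothetical patterns for illustration only; they will miss many identifiers.
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}\s?\d{3}\s?\d{4}\b"), "[ID_NUMBER]"),
]


def redact(text, known_names=()):
    """Replace known names and simple date/ID patterns with placeholders."""
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Entirely fictional note.
note = "Jane Example, DOB 04/07/1961, ID 123 456 7890, presents with numbness in both feet."
print(redact(note, known_names=["Jane Example"]))
```

The clinical content of the fictional note survives while the direct identifiers are replaced, which is the balance the checklist item asks about. Whether enough clinical utility remains after redaction has to be judged case by case.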

How purpose-built clinical AI differs from general-purpose LLMs

The contrast between general-purpose LLMs and purpose-built clinical AI tools is not simply a matter of the underlying model. It is a matter of the entire deployment architecture.

A benchmark study published in NPJ Digital Medicine in December 2025 developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB) and tested six LLMs across 30 metrics covering clinical decision support domains. Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, reaching higher top scores in both safety (0.912) and effectiveness (0.861). Overall performance across all models averaged 57.2 per cent, with a 13.3 per cent performance drop in high-risk clinical scenarios.

Purpose-built clinical AI tools designed for regulated healthcare environments typically offer:

  • Validated outputs tested against clinical outcomes in defined patient populations

  • Audit trails that support clinical governance and accountability

  • Medical record system integration that allows the tool to access structured patient data without manual re-entry

  • Medical device certification under relevant regulatory frameworks (MDR in the EU, UKCA marking or MHRA registration in the UK)

  • Defined intended uses that constrain the tool to tasks where its performance has been evaluated

The underlying language model may be similar or identical to a general-purpose LLM. What differs is the governance layer, the validation evidence, and the regulatory accountability.

What the evidence actually says: a review of clinical studies

The published evidence on LLM performance in clinical tasks is growing rapidly but remains methodologically uneven. Several important caveats apply when interpreting the literature.

Benchmark performance does not equal real-world clinical utility. LLMs have achieved scores in the 80–90 per cent range on medical licensing examinations such as the USMLE, a result that generated significant media attention. Licensing examinations test recall and structured reasoning on well-defined problems. They do not replicate the ambiguity, time pressure, incomplete information, and sequential decision-making of real clinical encounters.

Study quality is variable. The systematic review of LLMs in colorectal cancer care found that only 27 per cent of included studies had low risk of bias, with problematic domains including outcome measurement, patient selection, and lack of blinded assessment. Results from higher-risk-of-bias studies should be interpreted with caution.

Evaluation frameworks are not yet standardised. A 2025 expert consensus hosted on ScienceDirect, covering disease screening, diagnostic assistance, and health management, identified the lack of consistent evaluation methodologies as a significant barrier to safe deployment. Without standardised evaluation, comparing performance across studies is difficult.

Specialty-specific performance varies substantially. The evidence base is strongest in areas with large, well-structured training datasets, including common presentations in primary care, oncology staging, and radiology interpretation. Performance in rare diseases, complex multi-morbidity, and time-sensitive acute presentations is less well characterised.

A JAMIA commentary from January 2025 exploring ChatGPT's potential to augment CDS logic acknowledged both the promise and the significant gaps in evidence, noting that key limitations and areas for future research remain unresolved.

A framework for clinicians evaluating AI tools for decision support

Before adopting any AI tool in a clinical workflow, clinicians and healthcare decision-makers should apply a structured evaluation lens. The following questions are grounded in current regulatory and clinical governance frameworks, and are particularly relevant when evaluating clinical decision support tools.

On accuracy and validation:

  • Has the tool been validated on patient populations comparable to those in your clinical setting?

  • What is the published evidence on hallucination rate, calibration, and performance in high-risk scenarios?

  • Does performance degrade in patients with comorbidities or atypical presentations?

On regulatory status:

  • Is the tool classified as a medical device under the relevant regulatory framework?

  • If not, what is the manufacturer's stated intended use, and does your clinical use case fall within it?

On data governance:

  • Is there a data processing agreement between your organisation and the AI provider?

  • Where is patient data processed and stored?

  • Has the tool been approved by your organisation's information governance and clinical safety teams?

On clinical oversight:

  • Does the tool's deployment model require clinician review of all outputs before they influence clinical decisions?

  • Is there an audit trail that supports accountability if an AI-assisted decision is later questioned?

  • Are clinicians using the tool trained to recognise its specific failure modes, including hallucination and automation bias?
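One way to apply the questions above consistently is to record the answers in a structured form and list what is still unmet. The sketch below is a hypothetical illustration, not a recognised governance instrument: the field names paraphrase the checklist, and the example answers describe an imaginary tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ToolAssessment:
    """Yes/no answers to the evaluation questions (field names are our paraphrase)."""
    validated_on_comparable_population: bool
    failure_rates_published: bool
    robust_to_comorbidities: bool
    intended_use_covers_task: bool
    data_processing_agreement: bool
    data_location_known: bool
    governance_approved: bool
    clinician_review_required: bool
    audit_trail: bool
    staff_trained_on_failure_modes: bool


def unmet_criteria(assessment):
    """Return the names of every criterion answered 'no'."""
    return [f.name for f in fields(assessment) if not getattr(assessment, f.name)]


# Hypothetical answers for an imaginary consumer chatbot.
consumer_chatbot = ToolAssessment(
    validated_on_comparable_population=False,
    failure_rates_published=True,
    robust_to_comorbidities=False,
    intended_use_covers_task=False,
    data_processing_agreement=False,
    data_location_known=False,
    governance_approved=False,
    clinician_review_required=True,
    audit_trail=False,
    staff_trained_on_failure_modes=False,
)

gaps = unmet_criteria(consumer_chatbot)
print("No unmet criteria" if not gaps else f"Unmet criteria: {gaps}")
```

Treating every criterion as a hard requirement is a design choice made for simplicity here. An organisation might reasonably weight the criteria differently depending on the risk of the task.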

The ChatGPT-CARE preprint identified specific limitations of base ChatGPT in clinical settings, including ignoring clinical practice guidelines, lack of reasoning transparency, and overly general responses. It proposed in-context learning and chain-of-thought prompting as mitigation strategies. These represent a direction of travel for purpose-built tools rather than a solution available in consumer deployments.

The bottom line: appropriate use versus inappropriate reliance

The evidence supports a nuanced position. ChatGPT and comparable LLMs can reduce documentation burden, support non-specialist clinicians in generating differential diagnoses, and improve the accessibility of clinical communication, particularly in resource-limited or non-specialist settings. These are real and measurable benefits.

What the evidence does not support is autonomous clinical decision-making by LLMs, or the use of general-purpose consumer tools as a substitute for validated clinical decision support systems in high-stakes contexts. The Annals of Biomedical Engineering study on ChatGPT in CDS concluded that human expert oversight remains essential across all applications, a conclusion echoed consistently across the literature.

The boundary between appropriate use and inappropriate reliance is not always obvious in practice, but several markers are useful:

  • Low-risk, high-volume tasks such as documentation assistance, patient letter drafting, and literature summarisation are where LLMs offer the most benefit with the least risk, provided patient data governance requirements are met.

  • High-stakes diagnostic or prescribing decisions, particularly in complex, multi-morbid, or rare-disease presentations, require validated, regulated tools with defined intended uses and clinical oversight built into the workflow.

  • Consumer tools without enterprise data agreements should not be used with identifiable patient data under any circumstances, regardless of the clinical task.

Clinician oversight is not a temporary safeguard pending better AI. It is a structural requirement given the current state of LLM accuracy, calibration, and regulatory validation. The tools are improving, the evidence base is growing, and the regulatory frameworks are evolving. Until validated clinical AI tools demonstrate consistent, safe performance across the full range of clinical complexity, the clinician remains the essential checkpoint in any AI-assisted decision.

Frequently asked questions

▶ Is ChatGPT a regulated medical device for clinical decision support?

No. General-purpose large language models like ChatGPT are not marketed with a medical intended purpose, so they don't carry CE marking as medical devices under the EU Medical Device Regulation. When a clinician uses consumer ChatGPT to support a clinical decision, there's no validated evidence base, no manufacturer liability framework, and no regulatory audit trail. Clinical accountability rests entirely with the clinician.

▶ What clinical tasks show the strongest evidence for LLM utility?

The evidence is strongest in clinical documentation, where a synthesis published in Frontiers in Artificial Intelligence found 40–70 per cent time savings when large language models assisted with note generation and summarisation. LLMs also show measurable utility in drafting patient letters, supporting non-specialist clinicians with differential diagnoses, and accelerating literature summarisation. High-stakes diagnostic or prescribing decisions in complex cases are where the evidence is weakest and the risks are greatest.

▶ What is hallucination, and why does it matter in clinical settings?

Hallucination is when a large language model generates plausible-sounding but factually incorrect information. In clinical contexts, this can be dangerous. A November 2025 analysis found that hallucinated outputs appearing credible can guide clinicians toward harmful interventions, particularly when the incorrect output aligns with an existing clinical suspicion. ChatGPT demonstrated the lowest hallucination rate among four models tested, at 7.4 per cent, but even that figure isn't trivial in a clinical context.

▶ Can ChatGPT reliably detect drug–drug interactions?

No. A European study published in Clinical Pharmacology and Therapeutics in February 2025 found that ChatGPT produced different answers in 90 per cent of cases when the same drug interaction query was repeated, and missed clinically relevant interactions. The study concluded it can't currently be recommended for this use case. Unlike rule-based prescribing systems, which apply consistent coded logic, large language models produce probabilistic outputs that vary with phrasing and context.

▶ What are the GDPR risks of entering patient data into consumer LLM tools?

Entering identifiable patient data into consumer-facing large language model tools almost certainly constitutes processing of special category data by a third-party controller, without a data processing agreement, without a lawful basis, and potentially with data stored outside the UK or EU. Consumer tools carry none of the protections that enterprise or healthcare-specific deployments can include. Clinicians should verify whether a data processing agreement is in place, where data is processed and stored, and whether their organisation's information governance team has approved the tool before use.

▶ How does a purpose-built clinical AI tool differ from general-purpose ChatGPT?

The difference isn't just the underlying model. It's the entire deployment architecture. Purpose-built clinical AI tools offer validated outputs tested against clinical outcomes, audit trails that support governance and accountability, medical record system integration, and medical device certification under relevant regulatory frameworks. A benchmark study published in NPJ Digital Medicine in December 2025 found that domain-specific medical large language models showed consistent performance advantages over general-purpose models, with higher scores in both safety and effectiveness metrics.

▶ Does passing medical licensing exams mean ChatGPT is safe for clinical decision support?

Not directly. Large language models have achieved scores in the 80–90 per cent range on medical licensing examinations such as the USMLE, but licensing exams test recall and structured reasoning on well-defined problems. They don't replicate the ambiguity, incomplete information, and sequential decision-making of real clinical encounters. A simulation-based study published in Healthcare (MDPI) in July 2025 found that ChatGPT's reliability declined significantly in cardiovascular cases with comorbidities, precisely the patients most commonly seen in secondary care.

▶ What is automation bias, and how does it affect clinicians using AI tools?

Automation bias is the tendency for clinicians to defer to an AI-generated suggestion even when their own clinical judgement would have led to a different, and potentially more accurate, conclusion. This risk is well-documented in the human factors literature and applies directly to large language model use in clinical settings. It's compounded by a confidence calibration failure in current models: a comprehensive evaluation published in Health Informatics Journal in April 2026 found that models expressed similar confidence regardless of whether their reasoning was sound.

▶ Where should clinicians draw the line between appropriate use and inappropriate reliance on LLMs?

Low-risk, high-volume tasks such as documentation assistance, patient letter drafting, and literature summarisation offer the most benefit with the least risk, provided patient data governance requirements are met. High-stakes diagnostic or prescribing decisions, particularly in complex, multi-morbid, or rare-disease presentations, require validated, regulated tools with defined intended uses and clinical oversight built into the workflow. Consumer tools without enterprise data agreements should not be used with identifiable patient data under any circumstances, regardless of the clinical task.

Get started with Tandem today

Join the thousands of others enjoying stress-free documentation.
