Rare disease diagnosis: limits of clinical decision support

How frequency-weighted AI tools miss rare diseases in primary care, and what phenotype-based approaches can do better

Rare diseases are individually uncommon, but collectively they represent one of the most significant diagnostic challenges in European general practice. With over 6,000 recognised rare diseases affecting approximately 30 million people across Europe, most GPs will encounter the majority of these conditions either infrequently or never. Yet primary care remains the first point of contact for patients whose symptoms have not yet been named, and the burden of initial suspicion falls squarely on the GP. Clinical decision support tools have become a routine feature of that environment, but how well they serve patients with rare disease presentations deserves careful, evidence-grounded scrutiny.

Why rare diseases are a diagnostic blind spot in general practice

The structural problem isn't one of clinical competence. As the British Journal of General Practice has noted, GPs have considerable expertise managing multisystem disease, but few have the resources to thoroughly research rare conditions, and many report becoming overwhelmed when patients attend with detailed information about conditions the GP has never encountered. The average time from symptom onset to confirmed rare disease diagnosis is consistently cited at five to six years in the UK and Europe, with half of patients receiving at least one misdiagnosis along the way.

This delay isn't primarily a hospital problem. It's a primary care problem. Rare disease patients typically pass through multiple GP consultations before reaching a specialist who recognises the condition. The diagnostic odyssey, a term now embedded in rare disease literature, begins and often stalls in general practice, not because GPs are inattentive, but because the cognitive architecture of clinical pattern recognition is calibrated to frequency. Conditions a clinician has never seen are, almost by definition, harder to consider.

How clinical decision support tools are trained, and where that creates gaps

Clinical decision support (CDS) tools generate differential diagnoses by drawing on training data and diagnostic logic weighted toward high-frequency presentations. This is a rational design choice: tools built to assist in the broadest range of consultations should perform well on the conditions GPs encounter most often. The consequence is that rare disease pathways are systematically under-represented, not because of deliberate exclusion, but because of a data frequency problem. Conditions that appear rarely in training datasets generate weak or absent signal in frequency-weighted differential logic.

A 2026 scoping review in the International Journal of Medical Informatics mapped the technological approaches underpinning CDS systems for rare disease diagnosis and identified four main categories: information-retrieval systems, phenotype-driven reasoning, ontology-based methods, and AI-based approaches. The review found that translation into routine clinical practice remains limited across all four categories, and that the gap between research-grade tools and those available at the point of care is substantial.

The SATURN project in Germany, which developed a CDS prototype specifically for primary care targeting unclear and rare disease presentations, found in qualitative evaluation that even purpose-built tools face significant usability barriers. These included the inability to enter unlisted symptoms and the absence of direct data import from practice management systems. These aren't minor refinements. They're obstacles that determine whether a tool gets used at all.

What the literature says about rare disease misdiagnosis in primary care

The evidence on diagnostic delay in rare diseases is consistent across European settings. Diagnostic odysseys averaging four to eight years are reported across the literature, though estimates vary by condition and country, with some studies citing five to six years in the UK and Europe specifically. These delays are accompanied by frequent misdiagnoses and unnecessary investigations. A 2025 Delphi consensus study published in Scientific Reports, involving 55 multidisciplinary experts, identified the key reasons for delay: low prevalence, limited awareness among primary healthcare professionals, heterogeneous clinical presentation, and unusual inheritance patterns.

The same consensus identified the presentation features most commonly associated with missed rare disease diagnoses:

  • Family history of unexplained or severe illness

  • Clusters of birth defects or congenital anomalies

  • Unusual presentations of otherwise common diseases

  • Neurodevelopmental delays or unexplained cognitive decline

  • Severe pathology disproportionate to apparent cause

These aren't obscure signals. Many are visible in GP records across multiple consultations. The problem is that, in isolation, each can appear to fit a more common explanation, and frequency-weighted tools will consistently surface that more common explanation first.

The difference between frequency-weighted and phenotype-based decision support

The distinction between frequency-weighted and phenotype-based differential logic is central to understanding how CDS tools perform on rare disease presentations.

Frequency-weighted tools rank diagnostic suggestions by population prevalence. In a consultation involving fatigue, joint pain, and a rash in a 30-year-old, such a tool will reliably surface anaemia, viral illness, or reactive arthritis before it considers systemic lupus erythematosus, because the former are more common. This is appropriate for most consultations. It becomes a structural limitation when the patient's symptom cluster is genuinely more consistent with a rare condition.

Phenotype-based differential logic takes a different approach. Rather than ranking by prevalence, it maps the specific combination of symptoms, the phenotype, against disease profiles regardless of how frequently those diseases occur in the general population. This approach is more likely to surface rare disease candidates when the clinical picture is atypical. The Human Phenotype Ontology (HPO) is the most widely used structured vocabulary for this purpose, and it supports systematic phenotype-to-disease mapping across thousands of conditions.
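The contrast between the two ranking strategies can be sketched in a few lines of Python. This is a minimal illustration, not how any named tool works: the symptom profiles, the prevalence figures, and the function names (`rank_by_frequency`, `rank_by_phenotype`) are all invented for the example, and plain-text symptom labels stand in for HPO terms. Phenotype matching is approximated here with simple Jaccard overlap, which is an assumption; real phenotype-driven systems use richer ontology-aware similarity measures.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy knowledge base. Symptom profiles and prevalence values are
# invented for illustration and are NOT clinical reference data.
PROFILES: Dict[str, FrozenSet[str]] = {
    "viral illness": frozenset({"fatigue", "fever", "sore throat"}),
    "anaemia": frozenset({"fatigue", "pallor", "breathlessness"}),
    "reactive arthritis": frozenset(
        {"joint pain", "eye irritation", "urinary symptoms"}
    ),
    "systemic lupus erythematosus": frozenset(
        {"fatigue", "joint pain", "rash", "photosensitivity"}
    ),
}
PREVALENCE: Dict[str, float] = {  # made-up relative frequencies
    "viral illness": 0.20,
    "anaemia": 0.05,
    "reactive arthritis": 0.001,
    "systemic lupus erythematosus": 0.0005,
}


def rank_by_frequency(symptoms: FrozenSet[str]) -> List[str]:
    """Keep any condition sharing at least one symptom, most prevalent first."""
    candidates = [name for name, profile in PROFILES.items() if profile & symptoms]
    return sorted(candidates, key=lambda name: PREVALENCE[name], reverse=True)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of the combined symptom set that both sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_phenotype(symptoms: FrozenSet[str]) -> List[str]:
    """Order conditions by overlap with the patient's phenotype, ignoring prevalence."""
    return sorted(
        PROFILES,
        key=lambda name: jaccard(symptoms, PROFILES[name]),
        reverse=True,
    )


patient = frozenset({"fatigue", "joint pain", "rash"})
print("Frequency-weighted:", rank_by_frequency(patient))
print("Phenotype-based:   ", rank_by_phenotype(patient))
```

With these toy values, the frequency-weighted ranking places the rare condition last even though it matches the whole symptom cluster, while the overlap-based ranking places it first. A production tool would most likely combine both signals rather than choose one.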

A 2026 study in EBioMedicine evaluated large language model performance for rare disease diagnosis across ten languages, including English, French, German, Dutch, Spanish, and Italian, using 4,917 clinical vignettes derived from Human Phenotype Ontology-structured data. GPT-4o placed the correct rare disease diagnosis within the top three ranked differentials in 27 per cent of cases in English, with broadly consistent performance across European languages. A 27 per cent top-three accuracy rate isn't sufficient for standalone clinical reliance, but it represents a meaningful signal that phenotype-structured prompting can surface rare diagnoses that frequency-weighted logic would not.
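For readers unfamiliar with the metric, top-three accuracy is simply the share of cases in which the correct diagnosis appears among the first three ranked suggestions. The sketch below shows the arithmetic on four invented cases with placeholder condition names; the function name `top_k_accuracy` and the data are assumptions for illustration and have no connection to the study's vignettes or code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 3
) -> float:
    """Fraction of cases whose correct diagnosis is within the first k suggestions."""
    if len(ranked) != len(truth):
        raise ValueError("need exactly one correct diagnosis per ranked list")
    if not truth:
        raise ValueError("at least one case is required")
    hits = sum(1 for suggestions, correct in zip(ranked, truth) if correct in suggestions[:k])
    return hits / len(truth)


# Invented example: four cases, each with a ranked differential of
# placeholder conditions and one known correct diagnosis.
ranked_differentials = [
    ["condition A", "condition B", "condition C", "condition D"],
    ["condition B", "condition D", "condition A", "condition C"],
    ["condition C", "condition A", "condition B", "condition D"],
    ["condition D", "condition C", "condition B", "condition A"],
]
correct_diagnoses = ["condition C", "condition C", "condition D", "condition D"]

print(top_k_accuracy(ranked_differentials, correct_diagnoses, k=3))
print(top_k_accuracy(ranked_differentials, correct_diagnoses, k=1))
```

The metric says nothing about how plausible the other two suggestions are, which is one reason a top-three figure should be read alongside clinical review of the full differential.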

How leading European clinical decision support tools approach rare disease coverage

The landscape of CDS tools available in European general practice is heterogeneous, and rare disease coverage varies considerably.

Orphanet, the European reference database for rare diseases, provides the most comprehensive structured resource for rare disease nomenclature, prevalence data, and clinical descriptions. Tools that integrate Orphanet data, or cross-reference OMIM (Online Mendelian Inheritance in Man), have a structural advantage in surfacing rare disease candidates. Integration of these databases into tools embedded within primary care medical record systems remains inconsistent.

DxGPT, a GPT-4-based tool developed with rare disease diagnosis as an explicit design objective, generates a ranked, reasoned top-five differential diagnosis specifically intended to counter cognitive biases in complex cases. It has been evaluated in UK and Spanish clinical contexts. Tools such as Ada Health, currently being evaluated in a quality improvement study across the CUF Hospital Network in Portugal, take a symptom-assessment approach that the study protocol notes has potential to assist users with rare disease cases where timely diagnosis remains a significant challenge.

DeepRare, described in a recent Nature paper, represents the current research frontier. It's a multi-agent system integrating more than 40 specialised tools and knowledge sources, processing free-text descriptions, structured Human Phenotype Ontology terms, and genetic results to generate ranked diagnostic hypotheses with transparent reasoning. In HPO-based tasks, it achieved strong performance metrics, outperforming comparable methods. Expert review agreed with its reasoning chains in a substantial majority of cases. DeepRare isn't yet embedded in routine GP workflows, but it illustrates the performance ceiling that phenotype-driven, knowledge-integrated approaches can reach.

The gap between research-grade tools like DeepRare and the tools available in a typical European GP surgery remains significant. Integrating primary care data in Germany, as one example, remains challenging due to country-specific vocabularies and heterogeneous data structures, which limits the ability of even well-designed CDS prototypes to function in real-world primary care settings.

Red flag symptom clusters that decision support tools commonly miss

The Argo Delphi consensus established a set of clinical red flags that should trigger rare disease suspicion in primary care. These are the patterns that frequency-weighted tools are most likely to rank low or omit entirely:

  • Multi-system involvement in a young patient, particularly when symptoms span cardiovascular, neurological, and musculoskeletal domains without a unifying common diagnosis

  • Unexplained fatigue with atypical co-morbidities, especially when standard investigations are unremarkable and the clinical picture does not evolve toward a recognised common diagnosis

  • Recurrent presentations without convergence, meaning multiple consultations for related or overlapping symptoms that have not resolved into a clear diagnostic category

  • Disproportionate severity, where the clinical course is more severe than would be expected for the apparent diagnosis

  • Positive family history of unexplained serious illness, particularly in conditions with autosomal recessive or X-linked inheritance patterns that may not be immediately apparent

These clusters share a common feature: each is individually explicable by common conditions, but their combination, particularly over time, should prompt consideration of a rare disease differential. A tool that ranks by frequency will consistently offer the common explanation first. It won't flag the combination as unusual unless it has been explicitly designed to do so.
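What "explicitly designed to flag the combination" might look like can be sketched as a check that accumulates red flags across the consultation history instead of scoring each visit alone. Everything here is hypothetical: the flag identifiers are shorthand for the five clusters listed above, the `Consultation` structure is invented, and the threshold of two distinct flags is an arbitrary assumption, not a validated clinical rule.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set

# Shorthand labels for the five red-flag clusters listed above (hypothetical names).
RED_FLAGS = frozenset({
    "multi_system_young_patient",
    "unexplained_fatigue_atypical_comorbidity",
    "recurrent_without_convergence",
    "disproportionate_severity",
    "family_history_unexplained_illness",
})


@dataclass
class Consultation:
    date: str  # ISO date string, e.g. "2024-03-01"
    flags: Set[str] = field(default_factory=set)


def cumulative_flags(history: Iterable[Consultation]) -> Set[str]:
    """Union of red flags recorded across all consultations."""
    seen: Set[str] = set()
    for visit in history:
        unknown = visit.flags - RED_FLAGS
        if unknown:
            raise ValueError(f"unrecognised flags: {sorted(unknown)}")
        seen |= visit.flags
    return seen


def should_prompt_review(history: Iterable[Consultation], min_distinct: int = 2) -> bool:
    """True when enough distinct red flags have accumulated over time.

    min_distinct is an arbitrary illustrative threshold, not a clinical standard.
    """
    return len(cumulative_flags(history)) >= min_distinct


history = [
    Consultation("2023-01-10", {"unexplained_fatigue_atypical_comorbidity"}),
    Consultation("2023-06-02"),
    Consultation("2024-02-19", {"family_history_unexplained_illness"}),
]
print(should_prompt_review(history))
```

In this invented history no single visit carries more than one flag, so a per-consultation check would stay silent, while the cumulative check prompts a review.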

How to recognise when a tool's differential is likely to be incomplete

Recognising the limits of a CDS tool's differential in real time is itself a clinical skill. Several indicators should prompt a GP to treat a tool's output with additional scepticism:

  • The patient's age and symptom chronicity don't fit the common differentials offered. If a tool suggests a diagnosis that is statistically implausible given the patient's age, duration of symptoms, or prior investigation results, the differential may be anchored on frequency rather than fit.

  • Previous investigations have been unremarkable. When the standard workup for the suggested diagnoses has returned normal results, those diagnoses remain unconfirmed. A suggestion that has not been excluded is not the same as a diagnosis.

  • The presentation spans multiple organ systems. Single-system CDS tools, or tools trained predominantly on single-specialty data, are structurally less equipped to surface diagnoses that require multi-system pattern recognition.

  • The tool returns a high-confidence suggestion for a common condition despite a poor clinical fit. High confidence in a frequency-weighted tool reflects prevalence, not match quality. Tool confidence isn't diagnostic confirmation.

  • The patient has been seen multiple times for the same or related symptoms. Chronicity and recurrence should lower the threshold for considering rare disease, even when each individual consultation appears to fit a common explanation.

Tool silence on a diagnosis is not the same as ruling it out. A CDS system that doesn't list a condition in its differential hasn't excluded it. It simply hasn't generated sufficient signal to surface it, which, for rare diseases, is precisely the problem these tools are least equipped to solve.

The role of specialist networks and European rare disease registries

Where CDS tools reach their limits, European infrastructure provides structured pathways for escalation. The European Reference Networks (ERNs), 24 thematic networks connecting specialist centres across EU member states, exist to provide expert input on rare and complex conditions. The ERNs cover areas including neurological diseases, connective tissue disorders, immunodeficiencies, and metabolic conditions, among others. GPs can, in appropriate cases, initiate contact through national rare disease centres or via Advice and Guidance pathways to access specialist opinion without requiring a formal referral.

National rare disease registries, where they exist, provide epidemiological data that can contextualise a clinical presentation, particularly for conditions with known geographic or ethnic clustering. ERDERA, the European partnership on rare disease research launched in 2024 under Horizon Europe with a budget of approximately €380 million to 2031, is intended in part to strengthen this data infrastructure.

Orphanet remains the most accessible reference point for GPs seeking information on a specific suspected rare condition. It provides disease summaries, prevalence estimates, diagnostic criteria, and links to specialist centres, none of which require a subscription or specialist access.

What good rare disease coverage looks like in a clinical decision support tool

For GPs and procurement leads evaluating CDS tools, rare disease capability isn't a binary feature. It exists on a spectrum, and the following criteria provide a practical framework for assessment:

  • Integration with validated rare disease databases. Does the tool draw on Orphanet, Online Mendelian Inheritance in Man, or equivalent structured rare disease knowledge sources? If not, its rare disease differential logic is likely limited to conditions that appear in general clinical training data.

  • Phenotype-based differential logic. Does the tool map symptom combinations to disease profiles, or does it rank purely by population prevalence? The former is a prerequisite for reliable rare disease performance.

  • Transparency about training data scope. Can the tool or its documentation specify which disease categories are covered, and which are not? A tool that can't answer this question can't be evaluated for fitness for purpose.

  • Clear escalation prompts. Does the tool flag when a symptom cluster exceeds its confident range, or when a rare disease referral pathway should be considered? High-yield, low-volume alerts seamlessly integrated into daily workflow are identified in the literature as the design standard for effective rare disease clinical support.

  • Multilingual consistency. For tools used across European healthcare systems, performance should be evaluated in each relevant clinical language, not assumed to be equivalent to English-language performance.

No currently available tool meets all of these criteria fully. The non-algorithmic barriers to deployment, including absent implementation frameworks and the failure of biological models to capture real-world clinical complexity, remain significant. The gap between research performance and routine clinical utility is an acknowledged limitation across the field.
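Because rare disease capability sits on a spectrum, the five criteria above can be recorded as graded scores instead of yes/no answers. The sketch below is one hypothetical way to do that: the 0 to 2 grading scale, the field names, and the equal weighting are assumptions made for illustration and are not drawn from any published evaluation framework.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RareDiseaseCoverage:
    """Graded assessment of one tool. Each field: 0 = absent, 1 = partial, 2 = full."""

    rare_disease_databases: int
    phenotype_based_logic: int
    training_scope_transparency: int
    escalation_prompts: int
    multilingual_consistency: int

    def __post_init__(self) -> None:
        # Reject grades outside the assumed 0-2 scale.
        for f in fields(self):
            if getattr(self, f.name) not in (0, 1, 2):
                raise ValueError(f"{f.name} must be 0, 1 or 2")

    def score(self) -> float:
        """Equal-weighted share of the maximum possible grade, from 0.0 to 1.0."""
        grades = [getattr(self, f.name) for f in fields(self)]
        return sum(grades) / (2 * len(grades))

    def gaps(self) -> List[str]:
        """Criteria that the tool does not fully meet."""
        return [f.name for f in fields(self) if getattr(self, f.name) < 2]


# Invented assessment of a fictional tool.
example_tool = RareDiseaseCoverage(
    rare_disease_databases=2,
    phenotype_based_logic=1,
    training_scope_transparency=0,
    escalation_prompts=1,
    multilingual_consistency=2,
)
print(example_tool.score())
print(example_tool.gaps())
```

Listing the gaps alongside the score keeps the assessment from collapsing into a single number, which matters when one missing criterion, such as phenotype-based logic, weighs more heavily in a given setting than the others.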

The clinician's role when decision support reaches its limits

Clinical decision support tools are decision aids, not replacements for clinical judgement. In rare disease presentations, this distinction matters more than in almost any other clinical context. A tool that performs well on the most common presentations may perform poorly on the rarest, and rare diseases collectively affect millions of patients across Europe.

The British Journal of General Practice has described GPs as having considerable expertise in managing multisystem disease, an expertise that no current CDS tool replicates. Longitudinal knowledge of a patient, familiarity with the trajectory of their symptoms over time, and the clinical instinct that something doesn't fit a common diagnosis aren't features that can be encoded in a differential ranking algorithm. They're the product of sustained clinical relationship and experienced pattern recognition.

Education, increased awareness, and the use of technology are identified in the Delphi consensus as complementary gateways to earlier rare disease diagnosis, not alternatives to clinical judgement, but supports for it. Understanding what a clinical decision support tool can and can't do isn't a technical question. It's a clinical competency, and in rare disease presentations, it may be the most important one a GP brings to the consultation.

Frequently asked questions

▶ Why does rare disease diagnosis take so long in primary care?

The average time from symptom onset to confirmed rare disease diagnosis is five to six years in the UK and Europe, with half of patients receiving at least one misdiagnosis along the way. The delay isn't primarily a hospital problem. Rare disease patients typically pass through multiple GP consultations before reaching a specialist who recognises the condition. The core difficulty is that clinical pattern recognition is calibrated to frequency: conditions a clinician has never seen are harder to consider, even when the clinical signals are present across multiple visits.

▶ Why do clinical decision support tools miss rare disease diagnoses?

Most clinical decision support tools rank diagnostic suggestions by population prevalence. Rare conditions are systematically under-represented in training data, not because of deliberate exclusion, but because of a data frequency problem. Conditions that appear rarely in training datasets generate weak or absent signal in frequency-weighted differential logic. A 2026 scoping review in the International Journal of Medical Informatics confirmed that translation of rare disease decision support into routine clinical practice remains limited across all four main technological approaches.

▶ What is the difference between frequency-weighted and phenotype-based clinical decision support?

Frequency-weighted tools rank diagnostic suggestions by how commonly a condition occurs in the general population. Phenotype-based tools take a different approach: they map the specific combination of symptoms against disease profiles regardless of how frequently those diseases occur. This makes phenotype-based tools more likely to surface rare disease candidates when the clinical picture is atypical. The Human Phenotype Ontology is the most widely used structured vocabulary for this purpose, supporting systematic phenotype-to-disease mapping across thousands of conditions.

▶ How accurately do large language models identify rare diseases?

A 2026 study in EBioMedicine evaluated large language model performance for rare disease diagnosis across ten languages using 4,917 clinical vignettes structured with Human Phenotype Ontology data. GPT-4o placed the correct rare disease diagnosis within the top three ranked differentials in 27 per cent of cases in English, with broadly consistent performance across European languages. That figure isn't sufficient for standalone clinical reliance, but it does show that phenotype-structured prompting can surface rare diagnoses that frequency-weighted logic would not.

▶ What red flag symptom clusters should prompt rare disease suspicion in general practice?

A 2025 Delphi consensus study identified several clinical patterns most commonly associated with missed rare disease diagnoses: multi-system involvement in a young patient, unexplained fatigue with atypical co-morbidities where standard investigations are unremarkable, recurrent presentations that haven't resolved into a clear diagnostic category, disproportionate severity relative to the apparent diagnosis, and a positive family history of unexplained serious illness. Each cluster is individually explicable by common conditions, but their combination over time should prompt consideration of a rare disease differential.

▶ How can a GP tell when a clinical decision support tool's differential is likely to be incomplete?

Several indicators should prompt additional scepticism about a tool's output. These include a suggested diagnosis that doesn't fit the patient's age or symptom chronicity, previous investigations that have returned unremarkable results, a presentation spanning multiple organ systems, high tool confidence for a common condition despite a poor clinical fit, and a patient who has been seen multiple times for the same or related symptoms. Tool silence on a diagnosis isn't the same as ruling it out: a clinical decision support system that doesn't list a condition hasn't excluded it.

▶ What European resources exist to support GPs when clinical decision support reaches its limits?

The European Reference Networks, 24 thematic networks connecting specialist centres across EU member states, provide expert input on rare and complex conditions covering areas including neurological diseases, connective tissue disorders, immunodeficiencies, and metabolic conditions. GPs can initiate contact through national rare disease centres or via Advice and Guidance pathways without requiring a formal referral. Orphanet, the European reference database for rare diseases, provides disease summaries, prevalence estimates, diagnostic criteria, and links to specialist centres, and requires no subscription or specialist access.

▶ What should GPs and procurement leads look for when evaluating a clinical decision support tool's rare disease capability?

Rare disease capability in a clinical decision support tool exists on a spectrum rather than as a binary feature. Key criteria include integration with validated rare disease databases such as Orphanet or Online Mendelian Inheritance in Man, phenotype-based differential logic rather than purely prevalence-based ranking, transparency about training data scope, clear escalation prompts when a symptom cluster exceeds the tool's confident range, and consistent performance in each relevant clinical language for tools used across European healthcare systems. No currently available tool meets all of these criteria fully.

▶ What is DeepRare and how does it differ from tools available in routine GP practice?

DeepRare, described in a recent Nature paper, is a multi-agent system integrating more than 40 specialised tools and knowledge sources. It processes free-text descriptions, structured Human Phenotype Ontology terms, and genetic results to generate ranked diagnostic hypotheses with transparent reasoning. In Human Phenotype Ontology-based tasks it achieved strong performance metrics, and expert review agreed with its reasoning chains in a substantial majority of cases. DeepRare isn't yet embedded in routine GP workflows, but it illustrates the performance ceiling that phenotype-driven, knowledge-integrated approaches can reach.

▶ What role does clinical judgement play when decision support tools fall short in rare disease presentations?

Clinical decision support tools are decision aids, not replacements for clinical judgement. In rare disease presentations, longitudinal knowledge of a patient, familiarity with the trajectory of their symptoms over time, and the clinical recognition that something doesn't fit a common diagnosis aren't features that a differential ranking algorithm can replicate. A 2025 Delphi consensus identified education, increased awareness, and the use of technology as complementary supports for earlier rare disease diagnosis, not alternatives to clinical judgement. Understanding what a clinical decision support tool can and can't do is itself a clinical competency.

Get started with Tandem today

Join thousands of clinicians enjoying stress-free documentation.
