
Overview of the study and datasets
We conducted this study at three centers, collecting 15640 data entries from 9825 subjects (4554 male, 5271 female) to develop and evaluate the IOMIDS system (Fig. 1a). Of these, 6551 entries belong to the model development dataset, 912 to the silent evaluation dataset, and 8177 to the clinical trial dataset (Supplementary Fig. 1). Specifically, we first collected a doctor-patient communication dialog dataset of 450 entries to train the text model through prompt engineering. Next, to assess the diagnostic and triage efficiency of the text model, we collected Dataset A (Table 1), consisting of simulated patient data derived from outpatient records. We then gathered two image datasets (Table 1, Dataset B and Dataset C) for training and validating the image diagnostic models; these contain only images and the corresponding image-based diagnoses. Dataset D, Dataset E, and Dataset F (Table 1) were then collected to evaluate image diagnostic model performance and to develop a text-image multimodal model; these datasets include both patient medical histories and anterior segment images. Following in silico development of the IOMIDS program, we collected a silent evaluation dataset to compare diagnostic and triage efficacy among the different models (Table 1, Dataset G). The early clinical evaluation consisted of internal evaluation (Shanghai center) and external evaluation (Nanjing and Suqian centers), with 3519 entries from 2292 patients in Shanghai, 2791 entries from 1748 patients in Nanjing, and 1867 entries from 1192 patients in Suqian. Comparison among these centers revealed significant differences in subspecialties, disease classifications, gender, age, and laterality (Supplementary Table 1), suggesting that these factors may influence model performance and should be considered in further analyses.
a Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) is an embodied conversational agent integrated with ChatGPT designed for multimodal diagnosis using eye images and medical history. It comprises a text model and an image model. The text model employs classifiers for chief complaints, along with question and analysis prompts developed from real doctor-patient dialogs. The image model utilizes eye photos taken with a slit-lamp and/or smartphone for image-based diagnosis. These modules combine through diagnostic prompts to create a multimodal model. Patients with eye discomfort can interact with IOMIDS using natural language. This interaction enables IOMIDS to gather patient medical history, guide them in capturing eye lesion photos with a smartphone or uploading slit-lamp images, and ultimately provide disease diagnosis and ophthalmic subspecialty triage information. b Both the text model and the multimodal models follow a similar workflow for text-based modules. After a patient inputs their chief complaint, it is classified by the chief complaint classifier using keywords, triggering relevant question and analysis prompts. The question prompt guides ChatGPT to ask specific questions to gather the patient’s medical history. The analysis prompt considers the patient’s gender, age, chief complaint, and medical history to generate a preliminary diagnosis. If no image information is provided, IOMIDS provides the preliminary diagnosis along with subspecialty triage and prevention, treatment, and care guidance as the final response. If image information is available, the diagnosis prompt integrates image analysis with the preliminary diagnosis to provide a final diagnosis and corresponding guidance. c The text + image multimodal model is divided into text + slit-lamp, text + smartphone, and text + slit-lamp + smartphone models based on image acquisition methods. For smartphone-captured images, YOLOv7 segments the image to isolate the affected eye, removing other facial information, followed by analysis using a ResNet50-trained diagnostic model. Slit-lamp captured images skip segmentation and are directly analyzed by another ResNet50-trained model. Both diagnostic outputs undergo threshold processing to exclude non-relevant diagnoses. The image information is then integrated with the preliminary diagnosis derived from textual information via the diagnosis prompt to form the multimodal model.
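To make the image branch of this workflow concrete, the sketch below shows how a ResNet50 classifier could score an eye image for the target conditions before the diagnosis prompt merges image and text information. The class list, weight file, and preprocessing are assumptions for illustration, and the YOLOv7 eye-cropping step applied to smartphone photos is omitted; this is not the authors' released implementation.

```python
# Minimal sketch of the image-diagnosis branch (illustrative; class order, weights,
# and preprocessing are assumptions, not the authors' released implementation).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

CLASSES = ["cataract", "keratitis", "pterygium", "other"]  # assumed label set

def load_image_model(weight_path: str) -> torch.nn.Module:
    # ResNet50 backbone with a head sized to the diagnostic classes,
    # as used for both the slit-lamp and the smartphone models.
    net = models.resnet50(weights=None)
    net.fc = torch.nn.Linear(net.fc.in_features, len(CLASSES))
    net.load_state_dict(torch.load(weight_path, map_location="cpu"))
    return net.eval()

def image_probabilities(net: torch.nn.Module, eye_image: Image.Image) -> dict:
    # Smartphone photos would first be cropped to the affected eye by YOLOv7 (omitted here);
    # slit-lamp images skip segmentation and are scored directly.
    tfm = T.Compose([T.Resize((224, 224)), T.ToTensor()])
    with torch.no_grad():
        probs = torch.softmax(net(tfm(eye_image).unsqueeze(0)), dim=1).squeeze(0)
    return {c: float(p) for c, p in zip(CLASSES, probs)}
```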
Development of the IOMIDS system
To develop the text model, we categorized doctor-patient dialogs according to chief complaint themes (Supplementary Table 2). Three researchers independently reviewed the dataset and each selected a set of 90 dialogs for training. Based on these dialogs, we used prompt engineering (Fig. 1b) to develop an embodied conversational agent with ChatGPT. After comparison, the most effective set of 90 dialogs (Supplementary Data 1) was identified, finalizing the text model for further research. These included 11 dialogs on “dry eye”, 10 on “itchy eye”, 10 on “red eye”, 7 on “eye swelling”, 10 on “eye pain”, 8 on “eye discharge”, 5 on “eye masses”, 13 on “blurry vision”, 6 on “double vision”, 6 on “eye injuries or foreign bodies”, and 4 on “proptosis”. This text model can reliably generate questions related to the chief complaint and provide a final response based on the patient’s answers, which includes diagnostic, triage, and other preventive, therapeutic, and care guidance.
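As an illustration of how keyword-based chief-complaint routing can drive prompt selection, the sketch below maps a free-text complaint to a theme and its question prompt. The keyword lists, prompt wording, and helper names are hypothetical placeholders, not the prompts engineered for IOMIDS.

```python
# Illustrative chief-complaint classifier and prompt routing (keywords and prompt
# texts are hypothetical, not the study's actual prompts).
COMPLAINT_KEYWORDS = {
    "dry eye": ["dry", "gritty"],
    "red eye": ["red", "bloodshot"],
    "blurry vision": ["blurry", "blurred", "cannot see clearly"],
    # ... remaining chief-complaint themes
}

QUESTION_PROMPTS = {
    "dry eye": "Ask about duration, screen time, contact lens use, and tearing.",
    "red eye": "Ask about pain, discharge, photophobia, and trauma history.",
    "blurry vision": "Ask about onset, laterality, distance vs. near vision, and flashes or floaters.",
}

def classify_chief_complaint(text: str) -> str:
    """Return the first chief-complaint theme whose keywords appear in the input."""
    lowered = text.lower()
    for theme, words in COMPLAINT_KEYWORDS.items():
        if any(w in lowered for w in words):
            return theme
    return "unclassified"

def build_question_prompt(chief_complaint: str) -> str:
    theme = classify_chief_complaint(chief_complaint)
    instruction = QUESTION_PROMPTS.get(theme, "Ask general ophthalmic history questions.")
    # The instruction would be prepended to the ChatGPT system prompt so the agent asks
    # theme-specific history questions before the analysis prompt generates a diagnosis.
    return f"The patient reports: {chief_complaint}\n{instruction}"

print(build_question_prompt("My right eye has been red and painful since yesterday"))
```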
After developing the text model, we evaluated its performance using Dataset A (Table 1). The results demonstrated varying diagnostic accuracy across diseases (Fig. 2a). Specifically, the model performed least effectively for primary anterior segment diseases (cataract, keratitis, and pterygium), achieving only 48.7% accuracy (Supplementary Fig. 2a). To identify conditions that did not meet development goals, we analyzed the top 1–3 diseases in each subspecialty. The following did not achieve the targets of sensitivity ≥ 90% and specificity ≥ 95% (Fig. 2a): keratitis, pterygium, cataract, glaucoma, and thyroid eye disease. Clinical experience suggests that slit-lamp and smartphone-captured images are valuable for diagnosing cataract, keratitis, and pterygium. Development of the image-based diagnostic models therefore focused on these three conditions.
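The per-disease check against these targets can be expressed compactly; the following is a minimal sketch in which the helper names and toy label lists are illustrative assumptions rather than the study's actual evaluation code.

```python
# Minimal sketch of the per-disease target check (sensitivity >= 90%, specificity >= 95%).
# Helper names and example labels are illustrative, not the study's evaluation code.
def sensitivity_specificity(y_true, y_pred, disease):
    tp = sum(t == disease and p == disease for t, p in zip(y_true, y_pred))
    fn = sum(t == disease and p != disease for t, p in zip(y_true, y_pred))
    tn = sum(t != disease and p != disease for t, p in zip(y_true, y_pred))
    fp = sum(t != disease and p == disease for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def meets_target(y_true, y_pred, disease, sens_goal=0.90, spec_goal=0.95):
    sens, spec = sensitivity_specificity(y_true, y_pred, disease)
    return sens >= sens_goal and spec >= spec_goal

# Example with toy labels: clinical diagnoses vs. chatbot outputs.
truth = ["cataract", "keratitis", "pterygium", "cataract", "glaucoma"]
preds = ["cataract", "others", "pterygium", "cataract", "glaucoma"]
print(meets_target(truth, preds, "keratitis"))  # False: keratitis sensitivity is 0/1
```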
a Heatmaps of diagnostic (top) and triage (bottom) performance metrics after in silico evaluation of the text model (Dataset A). Metrics are column-normalized from -2 (blue) to 2 (red). Disease types are categorized into six major classifications. The leftmost lollipop chart displays the prevalence of each diagnosis and triage. b Radar charts of disease-specific diagnosis (red) and triage (green) accuracy in Dataset A. Rainbow ring represents six disease classifications. Asterisks indicate significant differences between diagnosis and triage accuracy based on Fisher’s exact test. c Bar charts of overall accuracy and disease-specific accuracy for diagnosis (red) and triage (green) after silent evaluation across different models (Dataset G). The line graph below denotes the model used: text model, text + slit-lamp model, text + smartphone model, and text + slit-lamp + smartphone model. d Sankey diagram of Dataset G illustrating the flow of diagnoses across different models for each case. Each line represents a case. PPV, positive predictive value; NPV, negative predictive value; * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001.
Beyond diagnosis, the chatbot effectively provided triage information. Statistical analysis revealed high overall triage accuracy (88.3%), significantly outperforming diagnostic accuracy (84.0%; Fig. 2b; Fisher’s exact test, P = 0.0337). All subspecialties achieved a negative predictive value ≥ 95%, and all, except optometry (79.7%) and retina (77.6%), achieved a positive predictive value ≥ 85% (Dataset A in Supplementary Data 2). Thus, eight out of ten subspecialties met the predefined developmental targets. Future multimodal model development will focus on enhancing diagnostic capabilities while utilizing the text model’s triage prompts without additional refinement.
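The diagnosis-versus-triage comparisons above rely on Fisher's exact test applied to 2 × 2 tables of correct versus incorrect outcomes; a minimal example follows, with counts that are placeholders chosen only to approximate the reported percentages rather than the actual Dataset A tallies.

```python
# Illustrative Fisher's exact test comparing triage vs. diagnostic accuracy.
# The counts below are hypothetical placeholders, not Dataset A values.
from scipy.stats import fisher_exact

triage_correct, triage_total = 883, 1000        # assumed counts giving ~88.3%
diagnosis_correct, diagnosis_total = 840, 1000  # assumed counts giving ~84.0%

table = [
    [triage_correct, triage_total - triage_correct],
    [diagnosis_correct, diagnosis_total - diagnosis_correct],
]
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, P = {p_value:.4f}")
```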
To develop a multimodal model combining text and images, we first created two image-based diagnostic models based on Dataset B and Dataset C (Table 1), with 80% of the images used for training and 20% for validation. The slit-lamp model achieved disease-specific accuracies of 79.2% for cataract, 87.6% for keratitis, and 98.4% for pterygium (Supplementary Fig. 2b). The smartphone model achieved disease-specific accuracies of 96.2% for cataract, 98.4% for keratitis, and 91.9% for pterygium (Supplementary Fig. 2c). After developing the image diagnostic models, we collected Dataset D, Dataset E and Dataset F (Table 1), which included both imaging results and patient history. Clinical diagnosis requires integrating medical history and eye imaging features, so clinical and image diagnoses may not always align (Supplementary Fig. 3a). To address this, we used image information only to rule out diagnoses. Using image diagnosis as the gold standard, we plotted the receiver operating characteristic (ROC) curves for cataract, keratitis, and pterygium in Dataset D (Supplementary Fig. 3b) and Dataset E (Supplementary Fig. 3c). The threshold >0.363 provided high specificity for all three conditions (cataract 83.5%, keratitis 99.2%, pterygium 96.6%) in Dataset D and was used to develop the text + slit-lamp multimodal model. Similarly, in Dataset E, the threshold >0.315 provided high specificity for all three conditions (cataract 96.8%, keratitis 98.5%, pterygium 95.0%) and was used to develop the text + smartphone multimodal model. In the text + slit-lamp + smartphone multimodal model, we tested two methods to combine the results from slit-lamp and smartphone images. The first method used the union of the diagnoses excluded by each model, while the second used the intersection. Testing on Dataset F showed that the first method achieved significantly higher accuracy (52.2%, Supplementary Fig. 3d) than the second method (31.9%, Supplementary Fig. 3e; Fisher’s exact test, P < 0.0001). Therefore, we applied the first method in all subsequent evaluations for the text + slit-lamp + smartphone model.
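The rule-out logic and the two combination strategies can be summarized as set operations over the per-model exclusions. The sketch below uses the thresholds reported above (0.363 for slit-lamp, 0.315 for smartphone); the probability dictionaries and function names are illustrative assumptions.

```python
# Sketch of image-based rule-out and of combining exclusions from the two image models.
# Thresholds come from the text; probabilities and names are illustrative.
SLIT_LAMP_THRESHOLD = 0.363
SMARTPHONE_THRESHOLD = 0.315
TARGET_DISEASES = ("cataract", "keratitis", "pterygium")

def ruled_out(probs: dict, threshold: float) -> set:
    # A disease is ruled out when its image probability does not exceed the threshold.
    return {d for d in TARGET_DISEASES if probs.get(d, 0.0) <= threshold}

def combine_exclusions(slit_probs: dict, phone_probs: dict, method: str = "union") -> set:
    slit_excluded = ruled_out(slit_probs, SLIT_LAMP_THRESHOLD)
    phone_excluded = ruled_out(phone_probs, SMARTPHONE_THRESHOLD)
    if method == "union":                      # excluded by either model (the adopted rule)
        return slit_excluded | phone_excluded
    return slit_excluded & phone_excluded      # excluded only when both models agree

# Example: cataract kept (above both thresholds), keratitis and pterygium ruled out.
print(combine_exclusions(
    {"cataract": 0.81, "keratitis": 0.10, "pterygium": 0.05},
    {"cataract": 0.62, "keratitis": 0.20, "pterygium": 0.30},
))
```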
Using clinical diagnosis as the gold standard, the diagnostic accuracy of all multimodal models significantly improved compared to the text model: the text + slit-lamp model increased from 32.0% to 65.5% (Fisher's exact test, P < 0.0001), the text + smartphone model from 41.6% to 64.2% (Fisher's exact test, P < 0.0001), and the text + slit-lamp + smartphone model from 37.4% to 52.2% (Fisher's exact test, P = 0.012). Therefore, we successfully developed four models for the IOMIDS system: the unimodal text model, the text + slit-lamp multimodal model, the text + smartphone multimodal model, and the text + slit-lamp + smartphone multimodal model.
Silent evaluation of diagnostic and triage performance
During the silent evaluation phase, Dataset G was collected to validate the diagnostic and triage performance of the IOMIDS system. Although the diagnostic performance for cataract, keratitis, and pterygium (Dataset G in Supplementary Data 3) did not meet the established clinical goal, significant improvements in diagnostic accuracy were observed for all multimodal models compared to the text model (Fig. 2c). The Sankey diagram revealed that in the text model, 70.8% of cataract cases and 78.3% of pterygium cases were misclassified as “others” (Fig. 2d). In the “others” category, the text + slit-lamp multimodal model correctly identified 88.2% of cataract cases and 63.0% of pterygium cases. The text + smartphone multimodal model performed even better, correctly diagnosing 93.3% of cataract cases and 80.0% of pterygium cases. Meanwhile, the text + slit-lamp + smartphone multimodal model accurately identified 90.5% of cataract cases and 68.2% of pterygium cases in the same category.
Regarding triage accuracy, overall performance improved with the multimodal models. However, cataract triage accuracy decreased markedly, dropping from 91.7% to 62.5% in the text + slit-lamp model (Fig. 2c; Fisher's exact test, P = 0.0012), to 58.3% in the text + smartphone model (Fig. 2c; Fisher's exact test, P = 0.0003), and to 53.4% in the text + slit-lamp + smartphone model (Fig. 2c; Fisher's exact test, P = 0.0001). Moreover, neither the text model nor any of the three multimodal models met the established clinical goal in any subspecialty (Supplementary Data 2).
We also investigated whether medical histories from the outpatient electronic system alone were sufficient for the text model to achieve accurate diagnostic and triage results. We randomly sampled 104 patients from Dataset G and re-entered their medical dialogs into the text model (Supplementary Fig. 1). For information not recorded in the outpatient history, the response was given as “no information available”. Diagnostic accuracy decreased significantly, dropping from 63.5% to 20.2% (Fisher's exact test, P < 0.0001), while triage accuracy remained relatively unchanged, decreasing only slightly from 72.1% to 70.2% (Fisher's exact test, P = 0.8785). This analysis suggests that while the triage accuracy of the text model does not depend on dialog completeness, its diagnostic accuracy is affected by the completeness of the answers provided. Thorough responses to AI chatbot queries are therefore crucial in clinical applications.
Evaluation in real clinical settings with trained researchers
The clinical trial involved two parts: researcher-collected data and patient-entered data (Table 2). The number of words entered and the input duration differed significantly between researchers and patients: researcher-collected entries averaged 38.5 ± 8.2 words and 58.2 ± 13.5 s, whereas patient-entered entries averaged 55.5 ± 10.3 words (t-test, P = 0.002) and 128.8 ± 27.1 s (t-test, P < 0.0001). We first assessed diagnostic performance during the researcher-collected data phase. For the text model across six datasets (Dataset 1–3, 6–8 in Supplementary Data 3), the number of diseases meeting the clinical goal for diagnosis was as follows: 16 out of 46 diseases (46.4% of all cases) in Dataset 1, 16 out of 32 diseases (16.4% of all cases) in Dataset 2, 18 out of 28 diseases (61.4% of all cases) in Dataset 3, 19 out of 48 diseases (43.9% of all cases) in Dataset 6, 14 out of 28 diseases (35.3% of all cases) in Dataset 7, and 11 out of 42 diseases (33.3% of all cases) in Dataset 8. Thus, fewer than half of the cases in the researcher-collected data phase met the clinical goal for diagnosis.
Next, we investigated the subspecialty triage accuracy of the text model across various datasets (Dataset 1–3, 6–8 in Supplementary Data 2). Our findings revealed that during internal validation, the cornea subspecialty achieved the clinical goal for triaging ophthalmic diseases. In external validation, the general outpatient clinic, cornea subspecialty, optometry subspecialty, and glaucoma subspecialty also met these clinical criteria. We further compared the diagnostic and triage outcomes of the text model across six datasets. Data analysis demonstrated that triage accuracy exceeded diagnostic accuracy in most datasets (Supplementary Fig. 4a–c, e–g). Specifically, triage accuracy was 88.7% compared to diagnostic accuracy of 69.3% in Dataset 1 (Fig. 3a; Fisher’s exact test, P < 0.0001), 84.1% compared to 62.4% in Dataset 2 (Fisher’s exact test, P < 0.0001), 82.5% compared to 75.4% in Dataset 3 (Fisher’s exact test, P = 0.3508), 85.7% compared to 68.6% in Dataset 6 (Fig. 3a; Fisher’s exact test, P < 0.0001), 80.5% compared to 66.5% in Dataset 7 (Fisher’s exact test, P < 0.0001), and 84.5% compared to 65.1% in Dataset 8 (Fisher’s exact test, P < 0.0001). This suggests that while the text model may not meet clinical diagnostic needs, it could potentially fulfill clinical triage requirements.
a Radar charts of disease-specific diagnosis (red) and triage (green) accuracy after clinical evaluation of the text model in internal (left, Dataset 1) and external (right, Dataset 6) centers. Asterisks indicate significant differences between diagnosis and triage accuracy based on Fisher’s exact test. b Circular stacked bar charts of disease-specific diagnostic accuracy across different models from internal (left, Dataset 2–4) and external (right, Dataset 7–9) evaluations. Solid bars represent the text model, while hollow bars represent multimodal models. Asterisks indicate significant differences in diagnostic accuracy between two models based on Fisher’s exact test. c Bar charts of overall accuracy (upper) and accuracy of primary anterior segment diseases (lower) for diagnosis (red) and triage (green) across different models in Dataset 2–5 and Dataset 7–10. The line graphs below denote study centers (internal, external), models used (text, text + slit-lamp, text + smartphone, text + slit-lamp + smartphone), and data provider (researchers, patients). * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001.
We then investigated the diagnostic performance of multimodal models in Dataset 2, 3, 7, and 8 (Supplementary Data 3). Both the text + slit-lamp model and text + smartphone model demonstrated higher overall diagnostic accuracy compared to the text model in internal and external validations, with statistically significant improvements noted for the text + smartphone model in Dataset 8 (Fig. 3c). The clinical goal for diagnosing ophthalmic diseases was achieved by 11 out of 32 diseases (13.8% of all cases) in Dataset 2, 21 out of 28 diseases (70.2% of all cases) in Dataset 3, 11 out of 28 diseases (28.5% of all cases) in Dataset 7, and 15 out of 42 diseases (50.6% of all cases) in Dataset 8. The text + smartphone model outperformed the text model by meeting the clinical goal for diagnosis in more cases and disease types. For some other diseases that did not meet the clinical goal for diagnosis, significant improvements in diagnostic accuracy were also found within the multimodal models (Fig. 3b). Therefore, the multimodal model exhibited better diagnostic performance compared to the text model.
Regarding triage, some datasets showed a minor decrease in accuracy for the multimodal models compared to the text model; however, these differences were not statistically significant (Fig. 3c). Unlike in the silent evaluation phase, neither of the two multimodal models showed a notable decline in triage accuracy across different diseases in clinical applications, including cataract (Supplementary Fig. 5). In summary, data collected by researchers indicated that the multimodal models outperformed the text model in diagnostic accuracy but were slightly less accurate in triage.
Evaluation in real clinical settings with untrained patients
During the patient-entered data phase, considering the convenience of smartphones, we focused on the text model, the text + smartphone model, and the text + slit-lamp + smartphone model. First, we compared triage accuracy. Consistent with the researcher-collected data phase, the overall triage accuracy of the multimodal models was slightly lower than that of the text model, but the difference was not statistically significant (Fig. 3c). At the subspecialty level, in both internal and external validation, the text and multimodal models met the clinical goals for triage in the general outpatient clinic and the glaucoma subspecialty. Additionally, internal validation showed that the multimodal models met these standards for the cornea, optometry, and retina subspecialties. In external validation, the text model met the standards for cornea and retina, while the multimodal models met the standards for cataract and retina. These results suggest that both the text model and the multimodal models meet triage requirements when patients input their own data.
Next, we compared the diagnostic accuracy of the text model and the multimodal models. Results revealed that in both internal and external validations, all diseases met the specificity criterion of ≥ 95%. In Dataset 4, the text model met the clinical criterion of sensitivity ≥ 75% in 15 out of 42 diseases (40.5% of cases), while the text + smartphone multimodal model met this criterion in 24 out of 42 diseases (78.6% of cases). In Dataset 5, the text model achieved the sensitivity threshold of ≥ 75% in 14 out of 40 diseases (48.3% of cases), while the text + slit-lamp + smartphone multimodal model met this criterion in only 10 out of 40 diseases (35.0% of cases). In Dataset 9, the text model achieved the clinical criterion in 24 out of 43 diseases (57.1%), while the text + smartphone model met the criterion in 28 out of 43 diseases (81.9%). In Dataset 10, the text model achieved the criterion in 25 out of 42 diseases (62.5%), whereas the text + slit-lamp + smartphone model met the criterion in 22 out of 42 diseases (50.8%). This suggests that the text + smartphone model outperforms the text model, while the text + slit-lamp + smartphone model does not. Further statistical analysis confirmed the superiority of text + smartphone model when comparing its diagnostic accuracy with the text model in both Dataset 4 and Dataset 9 (Fig. 3c). We also conducted an analysis of diagnostic accuracy for individual diseases, identifying significant improvements for certain diseases (Fig. 3b). These findings collectively show that during the patient-entered data phase, the text + smartphone model not only meets triage requirements but also delivers better diagnostic performance than both the text model and the text + slit-lamp + smartphone model.
We further compared the diagnostic and triage accuracy of the text model in Dataset 4 and Dataset 9. Consistent with previous findings, both internal validation (triage: 80.4%, diagnosis: 69.6%; Fisher’s exact test, P < 0.0001) and external validation (triage: 84.7%, diagnosis: 72.5%; Fisher’s exact test, P < 0.0001) demonstrated significantly higher triage accuracy compared to diagnostic accuracy for the text model (Supplementary Fig. 4d, h). Examining individual diseases, cataract exhibited notably higher triage accuracy than diagnostic accuracy in internal validation (Dataset 4: triage 76.8%, diagnosis 51.2%; Fisher’s exact test, P = 0.0011) and external validation (Dataset 9: triage 87.3%, diagnosis 58.2%; Fisher’s exact test, P = 0.0011). Interestingly, in Dataset 4, the diagnostic accuracy for myopia (94.0%) was significantly higher (Fisher’s exact test, P = 0.0354) than the triage accuracy (80.6%), indicating that the triage accuracy of the text model may not be influenced by diagnostic accuracy. Subsequent regression analysis is necessary to investigate the factors determining triage accuracy.
Due to varying proportions of the disease classifications across the three centers (Supplementary Table 1), we further explored changes in diagnostic and triage accuracy within each classification. Results revealed that, regardless of whether data was researcher-collected or patient-reported, diagnostic accuracy for primary anterior segment diseases (cataract, keratitis, pterygium) was significantly higher in the multimodal model compared to the text model in both internal and external validation (Fig. 3c). Further analysis of cataract, keratitis, and pterygium across Datasets 2, 3, 4, 7, 8, and 9 (Fig. 3b) also showed that, similar to the silent evaluation phase, multimodal model diagnostic accuracy for cataract significantly improved compared to the text model in most datasets. Pterygium and keratitis exhibited some improvement but showed no significant change across most datasets due to sample size limitations. For the other five major disease categories, multimodal model diagnostic accuracy did not consistently improve and even significantly declined in some categories (Supplementary Fig. 6). These findings indicate that the six major disease categories may play crucial roles in influencing the diagnostic performance of the models, underscoring the need for further detailed investigation.
Comparison of diagnostic performance in different models
To further compare the diagnostic accuracy of different models across various datasets, we conducted comparisons within six major disease categories. The results revealed significant differences in diagnostic accuracy among the models across these categories (Fig. 4a). For example, when comparing the text + smartphone model (Datasets 4, 9) to the text model (Datasets 1, 6), both internal and external validations showed higher diagnostic accuracy for the former in primary anterior segment diseases, other anterior segment diseases, and intraorbital diseases and emergency categories compared to the latter (Fig. 4a, b). Interestingly, contrary to previous findings within datasets, comparisons across datasets demonstrated a notable decrease in diagnostic accuracy for the text + slit-lamp model (Dataset 1 vs 2, Dataset 6 vs 7) and the text + slit-lamp + smartphone model (Dataset 4 vs 5, Dataset 9 vs 10) in the categories of other anterior segment diseases and vision disorders in both internal and external validations (Fig. 4a). This suggests that, in addition to the model used and the disease categories, other potential factors may influence the model’s diagnostic accuracy.
a Bar charts of diagnostic accuracy calculated for each disease classification across different models from internal (upper, Dataset 1–5) and external (lower, Dataset 6–10) evaluations. The bar colors represent disease classifications. The line graphs below denote study centers, models used, and data providers. b Heatmaps of diagnostic performance metrics after internal (left) and external (right) evaluations of different models. For each heatmap, metrics in the text model and text + smartphone model are normalized together by column, ranging from -2 (blue) to 2 (red). Disease types are classified into six categories and displayed by different colors. c Multivariate logistic regression analysis of diagnostic accuracy for all cases (left) and subgroup analysis for follow-up cases (right) during clinical evaluation. The first category in each factor is used as a reference, and OR values and 95% CIs for other categories are calculated against these references. OR, odds ratio; CI, confidence interval; *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.
We then conducted univariate and multivariate regression analyses to explore factors influencing diagnostic accuracy. Univariate analysis revealed that seven factors (age, laterality, number of visits, disease classification, model, data provider, and words input) significantly influence diagnostic accuracy (Supplementary Table 3). In multivariate analysis, six factors (age, laterality, number of visits, disease classification, model, and words input) remained significant, while the data provider was no longer a critical factor (Fig. 4c). Subgroup analysis of follow-up cases showed that only the model type significantly influenced diagnostic accuracy (Fig. 4c). For first-visit patients, three factors (age, disease classification, and model) were still influential. Further analysis across different age groups within each disease classification revealed that the multimodal models generally outperformed or performed comparably to the text model in most disease categories (Table 3). However, all multimodal models, including the text + slit-lamp model (OR: 0.21 [0.04–0.97]), the text + smartphone model (OR: 0.17 [0.09–0.32]), and the text + slit-lamp + smartphone model (OR: 0.16 [0.03–0.38]), showed limitations in diagnosing visual disorders in patients over 45 years old compared to the text model (Table 3). Additionally, both the text + slit-lamp model (OR: 0.34 [0.20–0.59]) and the text + slit-lamp + smartphone model (OR: 0.67 [0.43–0.89]) were also less effective for diagnosing other anterior segment diseases in this age group. In conclusion, for follow-up cases, both text + slit-lamp and text + smartphone models are suitable, with a preference for the text + smartphone model. For first-visit patients, the text + smartphone model is recommended, but its diagnostic efficacy for visual disorders in patients over 45 years old (such as presbyopia) may be inferior to that of the text model.
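For reference, the odds ratios and confidence intervals above follow the standard multivariate logistic regression workflow sketched below. The synthetic DataFrame, column names, and category codings are assumptions for illustration only; the real analysis used the study's case-level data and covariate definitions.

```python
# Hypothetical sketch of multivariate logistic regression on diagnostic correctness.
# Synthetic data and column names are placeholders; only the OR/CI computation
# mirrors the analysis described above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "correct": rng.integers(0, 2, n),                                  # 1 = correct diagnosis
    "age_group": rng.choice(["0-18", "18-45", ">45"], n),
    "laterality": rng.choice(["right", "left", "both"], n),
    "visit": rng.choice(["first", "follow-up"], n),
    "disease_class": rng.choice(["primary_anterior", "other_anterior", "vision"], n),
    "model": rng.choice(["text", "text+slit-lamp", "text+smartphone"], n),
    "provider": rng.choice(["researcher", "patient"], n),
    "words_input": rng.normal(45, 10, n),
})

fit = smf.logit(
    "correct ~ C(age_group) + C(laterality) + C(visit) + C(disease_class)"
    " + C(model) + C(provider) + words_input",
    data=df,
).fit(disp=False)

# Odds ratios with 95% CIs, each category compared against its reference level.
or_table = pd.DataFrame({
    "OR": np.exp(fit.params),
    "CI_low": np.exp(fit.conf_int()[0]),
    "CI_high": np.exp(fit.conf_int()[1]),
    "p": fit.pvalues,
}).round(3)
print(or_table)
```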
We also performed a regression analysis on triage accuracy. In univariate logistic regression, the center and the data provider significantly influenced triage accuracy. Multivariate regression showed that only the data provider significantly impacted triage accuracy, with patient-entered data significantly improving accuracy (OR: 1.40 [1.25–1.56]). Interestingly, neither model type nor diagnostic accuracy affected triage outcomes. Considering the earlier results from the patient-entered data phase, both the text model and the text + smartphone model are recommended as self-service triage tools for patients in clinical applications. Collectively, among the four models developed in our IOMIDS system, the text + smartphone model is the most suitable for patient self-diagnosis and self-triage.
Model interpretability
In subgroup analysis, we identified limitations in the diagnostic accuracy for all multimodal models for patients over 45 years old. The misdiagnosed cases in this age group were further analyzed to interpret the limitations. Both the text + slit-lamp model (Datasets 2, 7) and the text + slit-lamp + smartphone model (Datasets 5, 10) frequently misdiagnosed other anterior segment and visual disorders as cataracts or keratitis. For instance, with the text + slit-lamp + smartphone model, glaucoma (18 cases, 69.2%) and conjunctivitis (22 cases, 38.6%) were often misdiagnosed as keratitis, while presbyopia (6 cases, 54.5%) and visual fatigue (11 cases, 28.9%) were commonly misdiagnosed as cataracts. In contrast, both the text model (Datasets 1–10) and the text + smartphone model (Datasets 3, 4, 8, 9) had relatively low misdiagnosis rates for cataracts (text: 23 cases, 3.5%; text + smartphone: 91 cases, 33.7%) and keratitis (text: 16 cases, 2.4%; text + smartphone: 25 cases, 9.3%). These results suggest that in our IOMIDS system, the inclusion of slit-lamp images, whether in the text + slit-lamp model or the text + slit-lamp + smartphone model, may actually hinder diagnostic accuracy due to the high false positive rate for cataracts and keratitis.
We then examined whether these misdiagnoses could be justified through image analysis. First, we reviewed the misdiagnosed cataract cases. In the text + slit-lamp model, 30 images (91.0%) were consistent with a cataract diagnosis. However, clinically, they were mainly diagnosed with glaucoma (6 cases, 20.0%) and dry eye syndrome (5 cases, 16.7%). Similarly, in the text + smartphone model, photographs of 80 cases (88.0%) were consistent with a cataract diagnosis. Clinically, these cases were primarily diagnosed with refractive errors (20 cases), retinal diseases (15 cases), and dry eye syndrome (8 cases). We then analyzed the class activation maps of the two multimodal models. Both models showed regions of interest for cataracts near the lens (Supplementary Fig. 7), in accordance with clinical diagnostic principles. Thus, these multimodal models can provide some value for cataract diagnosis based on images but may lead to discrepancies with the final clinical diagnosis.
Next, we analyzed cases misdiagnosed as keratitis by the text + slit-lamp model. The results showed that only one out of 25 cases had an anterior segment photograph consistent with keratitis, indicating a high false-positive rate for keratitis with the text + slit-lamp model. We then conducted a detailed analysis of the class activation maps generated by this model during clinical application. The areas of interest for keratitis were centered around the conjunctiva rather than the corneal lesions (Supplementary Fig. 7a). Thus, the model appears to interpret conjunctival congestion as indicative of keratitis, contributing to the occurrence of false-positive results. In contrast, the text + smartphone model displayed areas of interest for keratitis near the corneal lesions (Supplementary Fig. 7b), which aligns with clinical diagnostic principles. Taken together, future research should focus on refining the text + slit-lamp model for keratitis diagnosis and prioritize optimizing the balance between text-based and image-based information to enhance diagnostic accuracy across both multimodal models.
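The region-of-interest inspection described here corresponds to class-activation-style visualization. A rough Grad-CAM sketch for a ResNet50 classifier is shown below; the randomly initialized network, class index, and input tensor are placeholders, and the authors' exact visualization procedure may differ.

```python
# Rough Grad-CAM-style sketch for inspecting which image regions drive a ResNet50
# prediction. Weights, class index, and input are placeholders for illustration.
import torch
import torch.nn.functional as F
import torchvision.models as models

def grad_cam(net: torch.nn.Module, image: torch.Tensor, class_idx: int) -> torch.Tensor:
    feats, grads = [], []
    layer = net.layer4  # last convolutional block of ResNet50
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = net(image.unsqueeze(0))
        logits[0, class_idx].backward()
    finally:
        h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)       # global-average-pooled gradients
    cam = F.relu((weights * feats[0]).sum(dim=1))            # weighted feature-map sum
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:], mode="bilinear",
                        align_corners=False).squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

net = models.resnet50(weights=None)
net.fc = torch.nn.Linear(net.fc.in_features, 4)  # cataract / keratitis / pterygium / other
net.eval()
heatmap = grad_cam(net, torch.rand(3, 224, 224), class_idx=1)  # e.g. the keratitis class
```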
Inter-model variability and inter-expert variability
We further evaluated the diagnostic accuracy of GPT4.0 and the domestic large language model (LLM) Qwen using Datasets 4, 5, 9, and 10. Additionally, we invited three trainees and three junior doctors to independently diagnose these diseases. Since the text + smartphone model performed the best in the IOMIDS system, we compared its diagnostic accuracy with that of the other two LLMs and ophthalmologists with varying levels of experience (Fig. 5a-b). The text + smartphone model (80.0%) outperformed GPT4.0 (71.7%, χ² test, P = 0.033) and showed similar accuracy to the mean performance of trainees (80.6%). Among the three LLMs, Qwen performed the poorest, comparable to the level of a junior doctor. However, all three LLMs fell short of expert-level performance, suggesting there is still potential for improvement.
a Comparison of diagnostic accuracy of IOMIDS (text + smartphone model), GPT4.0, Qwen, expert ophthalmologists, ophthalmology trainees, and unspecialized junior doctors. The dotted lines represent the mean performance of ophthalmologists at different experience levels. b Heatmap of Kappa statistics quantifying agreement between diagnoses provided by AI models and ophthalmologists. c Kernel density plots of user satisfaction rated by researchers (red) and patients (blue) during clinical evaluation. d Example of an interactive chat with IOMIDS (left) and quality evaluation of the chatbot response (right). On the left, the central box displays the patient interaction process with IOMIDS: entering chief complaint, answering system questions step-by-step, uploading a standard smartphone-captured eye photo, and receiving diagnosis and triage information. The chatbot response includes explanations of the condition and guidance for further medical consultation. The surrounding boxes show a researcher’s evaluation of six aspects of the chatbot response. The radar charts on the right illustrate the quality evaluation across six aspects for chatbot responses generated by the text model (red) and the text + image model (blue). The axes for each aspect correspond to different coordinate ranges due to varying rating scales. Asterisks indicate significant differences between two models based on two-sided t-test. ** P < 0.01, *** P < 0.001, **** P < 0.0001.
We then analyzed the agreement between the answers provided by the LLMs and ophthalmologists (Fig. 5b). Agreement among expert ophthalmologists, who served as the gold standard in our study, was generally strong (κ: 0.85–0.95). Agreement among trainee doctors was moderate (κ: 0.69–0.83), as was the agreement among junior doctors (κ: 0.69–0.73). However, the agreement among the three LLMs was weaker (κ: 0.48–0.63). Notably, the text + smartphone model in IOMIDS showed better agreement with experts (κ: 0.72–0.80) compared to the other two LLMs (GPT4.0: 0.55–0.78; Qwen: 0.52–0.75). These results suggest that the text + smartphone model in IOMIDS demonstrates the best alignment with experts among the three LLMs.
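Pairwise agreement of this kind is typically quantified with Cohen's kappa; a minimal example follows, in which the two diagnosis lists are hypothetical and stand in for any rater pair in the agreement heatmap.

```python
# Illustrative pairwise Cohen's kappa between two raters' diagnoses.
# The label lists are hypothetical examples, not study data.
from sklearn.metrics import cohen_kappa_score

iomids = ["cataract", "keratitis", "pterygium", "cataract", "conjunctivitis"]
expert = ["cataract", "keratitis", "pterygium", "glaucoma", "conjunctivitis"]

kappa = cohen_kappa_score(iomids, expert)
print(f"kappa = {kappa:.2f}")
```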
Evaluation of user satisfaction and response quality
The IOMIDS responses contained not only diagnostic and triage results but also guidance on prevention, treatment, care, and follow-up (Fig. 5c). We first analyzed researcher and patient satisfaction with these responses. Satisfaction was rated by researchers during the model development and clinical trial phases, and by patients during the clinical trial phase regardless of the data collection method. Researchers rated satisfaction significantly higher (4.63 ± 0.92) than patients (3.99 ± 1.46; t-test, P < 0.0001; Fig. 5c). Patient ratings did not differ between researcher-collected (3.98 ± 1.45) and self-entered data (4.02 ± 1.49; t-test, P = 0.3996). Researchers most often rated chatbot responses as “very satisfied” (82.5%), whereas patient ratings varied: 20.2% chose “not satisfied” (11.7%) or “slightly satisfied” (8.5%), and 61.9% chose “very satisfied”. Demographic analysis revealed that patients giving low ratings (45.7 ± 23.8 years) were significantly older than those choosing “very satisfied” (37.8 ± 24.4 years; t-test, P < 0.0001), indicating greater acceptance and more positive evaluation of AI chatbots among younger individuals.
Next, we compared response quality between the multimodal models and the text model (Fig. 5d). The multimodal models exhibited significantly higher overall information quality (4.06 ± 0.12 vs. 3.82 ± 0.14; t-test, P = 0.0031) and better understandability (78.2% ± 1.3% vs. 71.1% ± 0.7%; t-test, P < 0.0001) than the text model. The multimodal models also showed significantly lower misinformation scores (1.02 ± 0.05 vs. 1.23 ± 0.11; t-test, P = 0.0003). Notably, the empathy score was significantly lower for the multimodal models than for the text model (3.51 ± 0.63 vs. 4.01 ± 0.56; t-test, P < 0.0001), indicating less empathetic chatbot responses from the multimodal models. There were no significant differences in grade level (readability), with both the text model and the multimodal models being suitable for users at a grade 3 literacy level. These findings suggest that the multimodal models generate high-quality chatbot responses with good readability. Future work may focus on enhancing the empathy of these multimodal models to better suit clinical applications.