International multicenter validation of AI-driven ultrasound detection of ovarian cancer

Data acquisition

In this international multicenter retrospective study, we included transvaginal and transabdominal ultrasound images from patients with an ovarian lesion, examined between 2006 and 2021 at 20 secondary or tertiary referral centers for gynecological ultrasound in eight countries. The images were acquired by examiners with varying levels of training and experience, using 21 different commercial ultrasound systems from nine manufacturers, primarily GE (91.8%), followed by Samsung (4.8%), Philips (1.4%) and Mindray (1.2%) (Supplementary Table 13). Participating centers were requested to provide images of at least 50 consecutive malignant cases and at least 50 benign cases, examined just before or after each malignant case, to ensure a similar temporal distribution between classes and avoid bias from potential variations in diagnostic practices or equipment over time. This enrichment strategy was designed to ensure an adequate representation of malignant cases, thereby more effectively capturing rare pathologies while minimizing potential biases17. The inclusion of images for a given patient was limited to the side of the lesion, and in cases of bilateral lesions, the side of the dominant lesion (that is, that with the most complex ultrasound morphology) was included. Anonymized images were submitted in JPEG format. Data transfer agreements were signed between the host institution, Karolinska Institute, and each of the participating centers. The study was preregistered at https://doi.org/10.1186/ISRCTN51927471, approved by the Swedish Ethics Review Authority (Dnr 2020-06919) and conducted in accordance with the Declaration of Helsinki. Informed consent had been obtained from all patients for the use of their data for research purposes.

After excluding 4.8% (n = 183/3,840) of the cases (91 benign and 92 malignant) due to inadequate image quality (for example, lesions that could not be identified, lesions with blurred margins and lesions that were only partially visible), 17,119 ultrasound images (10,626 grayscale and 6,493 Doppler) representing 3,652 cases remained for analysis (Extended Data Fig. 1). Of these cases, 3,419 were from patients who had undergone surgery, including histological assessment, within 120 days of their ultrasound examination. The remaining 233 patients had been managed conservatively with ultrasound follow-up until the resolution of the lesion, or for at least three years without a malignant diagnosis, and were thus regarded as benign. The median number of images per case was 4 (interquartile range (IQR): 3–6). A breakdown of the diagnoses is shown in Table 1 and by center in Supplementary Fig. 3. Specific histological diagnoses are provided in Supplementary Table 14; a detailed summary of the data by center can be found in Extended Data Table 3 and, separately for benign and malignant cases, in Supplementary Table 15.

Human examiner review

To ensure a thorough evaluation, we collected the assessments made by 66 human examiners, comprising 33 ultrasound experts and 33 non-experts, recruited at the participating centers. To establish a competitive baseline and ensure the validity of our results, expert examiners were recruited based on their extensive expertise in gynecological ultrasound imaging for the assessment of ovarian lesions. For our study, an ‘expert’ examiner was defined as a physician who performs second or third opinion gynecological ultrasound imaging, and who has at least 5 years’ experience or annually assesses at least 200 patients with a persistent ovarian lesion. Among the experts, the median experience in gynecological ultrasound imaging was 17 years (IQR: 10–27 years), with a median of 10 years as second or third opinion (IQR: 5–17 years). Most experts (91%, n = 30/33) were affiliated with a gynecologic oncology referral center, 61% (n = 20/33) performed over 1,500 gynecological ultrasound scans annually, and 64% (n = 21/33) reported seeing more than 200 patients with a persistent ovarian lesion each year. To keep the evaluation fair, we did not train the ‘non-expert’ examiners beyond providing them with instructions for the task. The specific prior training and certification varied among examiners, as they were recruited from centers in eight different countries. However, all non-expert examiners were certified physicians actively practicing gynecological ultrasound imaging. They had a median experience of 5 years (IQR: 3–6 years) and 52% (n = 17/33) were affiliated with a gynecologic oncology referral center. Furthermore, 24% (n = 8/33) of non-experts provided second or third opinion assessments but did not meet the criteria for an ‘expert’ examiner as defined in this study. When presented with a case, the examiner was asked to classify the lesion as benign or malignant using pattern recognition (that is, subjective ultrasound assessment)40, and to rate their confidence in the assessment as certain, probable, or uncertain. To prevent bias from previously seen cases, none of the examiners were asked to review cases originating from their own centers.

A total of 2,660 cases (1,575 benign and 1,085 malignant) were assessed by at least 7 expert (median: 10, IQR: 9–11) and 6 non-expert (median: 9, IQR: 8–10) examiners, with a total of 51,179 assessments. The median number of cases assessed by each expert and non-expert examiner was 696 (IQR: 628–886) and 610 (IQR: 583–655), respectively. One center (Olbia) was excluded from the review due to its limited sample size (n = 57) and its small number of malignant cases (n = 8). Additionally, 58 cases from three centers (Cagliari, Trieste and Pamplona) were excluded from our main analysis as these had not been included in compliance with our criterion on the temporal distribution of examination dates. After excluding 233 patients managed conservatively with ultrasound follow-up, we selected 300 cases (150 benign and 150 malignant) from the Stockholm center with known histological diagnoses for inclusion in the human review. We selected the most recent 150 consecutive malignant cases, followed by one benign case examined just before or after each malignant case. The remaining 644 cases from the Stockholm center were excluded to have a test set of comparable size to those of the other centers and to utilize our reviewer resources efficiently. The excluded cases (n = 57) from the Olbia center were used as supplementary training data for all models. The 877 cases excluded from the Stockholm center (233 conservatively managed and 644 with post-surgical histological diagnosis) were also used as supplementary training data; however, only when the Stockholm center was not the held-out test set.

Model training

The OMLC-RS dataset was used to train a series of 19 transformer-based neural network models, each using the DeiT architecture initialized with ImageNet pretraining20,41. We applied a leave-one-center-out cross-validation scheme, in which each center in turn was held out as the test set and the model was trained on the cases from the remaining centers. More specifically, in each iteration, the cases from the remaining centers were randomly split into a training (90%) and a validation (10%) set, with the validation set used for selection of the learning rate. The random split was constrained such that the validation set contained an equal number of malignant and benign cases. Throughout, a case described as used for training refers to a case included in either the training set or the validation set.
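
The splitting scheme can be summarized with a minimal sketch, shown below. The case table and its column names ("center", "malignant") are hypothetical placeholders for however the metadata is organized; this is an illustration of the described procedure, not the authors' code.

```python
# Illustrative sketch of the leave-one-center-out splits described above.
# The case table and its columns ("center", "malignant") are hypothetical.
import numpy as np
import pandas as pd

def leave_one_center_out_splits(cases: pd.DataFrame, seed: int = 0):
    """Yield (train, validation, test) case tables, holding out one center per iteration."""
    rng = np.random.default_rng(seed)
    for center in cases["center"].unique():
        test = cases[cases["center"] == center]
        rest = cases[cases["center"] != center]
        # Validation set: roughly 10% of the remaining cases, balanced between classes.
        n_per_class = int(0.10 * len(rest)) // 2
        val_parts = []
        for label in (0, 1):  # 0 = benign, 1 = malignant
            pool = rest[rest["malignant"] == label]
            val_parts.append(pool.loc[rng.choice(pool.index, size=n_per_class, replace=False)])
        val = pd.concat(val_parts)
        train = rest.drop(val.index)
        yield train, val, test
```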

Although our goal was to differentiate between benign and malignant lesions, the models were trained to discern ten different histological categories within the benign and malignant classes (Supplementary Table 14), which was done to leverage the richer information contained in the specific histological diagnoses. We trained the models using the multiclass focal loss42, which encourages the model to assign greater importance to often misclassified examples compared to the standard cross-entropy loss30.
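As a concrete illustration, a minimal PyTorch sketch of a multiclass focal loss is given below. The focusing parameter gamma shown here is the common default from the focal loss literature, not necessarily the value used in this study.

```python
# Minimal sketch of a multiclass focal loss: cross-entropy down-weighted for
# well-classified examples by the factor (1 - p_t)^gamma. gamma = 2.0 is a
# common default and is an assumption here.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)                       # (batch, n_classes)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob. of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```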

Image pre-processing

Before training, images were cropped to the regions of interest, unless otherwise stated. The cropped images were zero-padded to a square shape and resized to 256 × 256 × 3 pixels. The mean and standard deviation of the pixels for the images in the dataset were then computed for each color channel for later use.
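
A minimal sketch of this offline step is shown below, assuming PIL images as input and that the region-of-interest crop has already been applied; the centered placement of the padded image is an illustrative assumption.

```python
# Illustrative sketch: zero-pad a cropped image to a square, resize to 256 x 256,
# and accumulate per-channel pixel statistics over the dataset.
import numpy as np
from PIL import Image

def pad_and_resize(img: Image.Image, size: int = 256) -> np.ndarray:
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side))                    # zero (black) padding
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))      # centered placement (assumption)
    return np.asarray(canvas.resize((size, size)), dtype=np.float32) / 255.0

def channel_stats(arrays: list) -> tuple:
    """Per-channel mean and standard deviation over all images."""
    stacked = np.stack(arrays)                                 # (n_images, 256, 256, 3)
    return stacked.mean(axis=(0, 1, 2)), stacked.std(axis=(0, 1, 2))
```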

For each training epoch, images were loaded from disk and randomly cropped to 224 × 224 × 3 pixels. The RandAugment method was used for data augmentation43 with default hyperparameters, except that five sequential random transformations were applied and the color-related transformations were removed. Thereafter, the image pixels were normalized to zero mean and unit variance using the precomputed pixel statistics.
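
A simplified torchvision version of this per-epoch pipeline is sketched below. It does not reproduce the removal of the color-related RandAugment operations, and the per-channel statistics shown are placeholders for the precomputed values.

```python
# Simplified sketch of the training-time augmentation: random 224 x 224 crop,
# RandAugment with five sequential operations, and per-channel normalization.
# Note: excluding the color-related RandAugment operations is NOT shown here.
from torchvision import transforms

# Placeholder per-channel statistics from the pre-processing step above.
PIXEL_MEAN, PIXEL_STD = [0.2, 0.2, 0.2], [0.2, 0.2, 0.2]

train_transform = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandAugment(num_ops=5),     # five sequential random transformations
    transforms.ToTensor(),
    transforms.Normalize(mean=PIXEL_MEAN, std=PIXEL_STD),
])
```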

Additional training details

Transformer-based models originate from the field of natural language processing44, an area that has seen immense progress in recent years with the advent of large language models45. Transformer-based models have since been adapted for, and are increasingly used in, imaging tasks. Within the ultrasound domain, these models were first used by Gheflati et al. in 2022 for the classification of breast lesions46. In our study, we used the DeiT-S (DeiT small) architecture20, with transfer learning from model weights initialized with ImageNet pretraining41. Transfer learning from ImageNet has become a standard approach and has been shown to improve performance in medical imaging tasks21. In our preliminary investigation, we also tried the larger model version, DeiT-B (DeiT base); however, as there were no noticeable improvements, we used the smaller DeiT-S architecture for computational efficiency. The linear projection layer on top of the final hidden state of the class token was replaced by a new linear projection layer with ten nodes, that is, with the same dimensionality as the number of classes. The AdamW optimizer was used47, with default hyperparameters, except for the learning rate. For each experiment, four different learning rates (10⁻³, 10⁻⁴, 5 × 10⁻⁵ and 10⁻⁵) were tried, each with a linear warm-up for 500 training steps and a batch size of 128 images. When the performance on the validation set reached a plateau, the learning rate was reduced. This reduction was made twice, each time by a factor of 0.1.
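
A minimal sketch of the model and optimizer setup is shown below, assuming the timm implementation of DeiT-S; the training loop and the plateau-based learning-rate reductions are omitted, and the learning rate shown is only one of the four values tried.

```python
# Minimal sketch of the model and optimizer setup, assuming the timm DeiT-S model.
import timm
import torch

N_CLASSES = 10    # histological categories within the benign and malignant classes
BATCH_SIZE = 128

# ImageNet-pretrained DeiT-S with a new ten-node linear projection head.
model = timm.create_model("deit_small_patch16_224", pretrained=True, num_classes=N_CLASSES)

# AdamW with default hyperparameters except the learning rate (one of the four values tried).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Linear warm-up of the learning rate over the first 500 training steps.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=500)
```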

At the end of training, the model with the best performance on the validation set was selected, based on the case-wise binary classification performance in terms of the area under the ROC curve (AUC). An exponential moving average of the model weights from each training epoch was computed using a decay factor of 0.99. These model weights were later used for model evaluation.
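
The weight averaging can be sketched as follows, assuming a PyTorch model and an update once per training epoch as described above.

```python
# Minimal sketch of an exponential moving average (EMA) of model weights
# with a decay factor of 0.99.
import copy
import torch

def make_ema(model: torch.nn.Module) -> torch.nn.Module:
    return copy.deepcopy(model)   # EMA weights start from the current weights

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.99) -> None:
    # new_ema = decay * old_ema + (1 - decay) * current_weights
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```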

Model inference

After training, the multi-class neural network models provided probability estimates for each of the ten histological categories within the benign and malignant classes (Supplementary Table 14). Because our goal was to differentiate between benign and malignant lesions, we computed the risk of malignancy for an image by summing up the probabilities for the five malignant classes, in a manner similar to Esteva et al.48. The malignancy score for a case was then computed as the average of the malignancy scores of its images. A case was considered malignant if its malignancy score exceeded a given cut-off point. Unless otherwise stated, we used the default cut-off point of 0.5.
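
The case-level scoring can be expressed compactly as in the sketch below; the indices used for the five malignant categories are hypothetical placeholders.

```python
# Illustrative sketch of the case-level malignancy score: per-image class probabilities
# for the five malignant categories are summed, image scores are averaged within a case,
# and the case is called malignant above a cut-off (0.5 by default).
import torch

MALIGNANT_CLASSES = [5, 6, 7, 8, 9]   # hypothetical indices of the five malignant categories

def case_malignancy_score(image_logits: torch.Tensor) -> float:
    """image_logits: (n_images, 10) raw model outputs for one case."""
    probs = torch.softmax(image_logits, dim=-1)
    image_scores = probs[:, MALIGNANT_CLASSES].sum(dim=-1)   # risk of malignancy per image
    return image_scores.mean().item()

def classify_case(image_logits: torch.Tensor, cutoff: float = 0.5) -> bool:
    return case_malignancy_score(image_logits) > cutoff
```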

Evaluation procedure

To avoid overly optimistic results commonly seen in medical machine learning18, we conducted a rigorous assessment of the diagnostic performance of our models via separate test sets, each containing only data from the center withheld during training. We compared the predictions of the models and the expert and non-expert examiners with the histological diagnosis from surgery. We used the F1 score as the primary metric because it balances precision and recall and, unlike the AUC, can be computed in a straightforward and unbiased way for human examiners as well. The F1 score is the harmonic mean of the precision (that is, the positive predictive value) and the recall (that is, the sensitivity):

$${\mathrm{F}}1=2\frac{{\mathrm{PPV}}\times {\mathrm{sensitivity}}}{{\mathrm{PPV}}+{\mathrm{sensitivity}}}$$

Metrics were calculated at the case level rather than the image level. In addition to the F1 score, we also report accuracy, sensitivity, specificity, Cohen’s kappa coefficient, the Matthews correlation coefficient (MCC), the diagnostic odds ratio (DOR) and Youden’s J statistic, as well as the AUC and Brier score for the models. The primary evaluation in our study compared the performance of the AI models with each individual examiner’s assessments on matched case sets. When calculating the diagnostic performance of the models, we identified the originating center for each case and used the model that had not been exposed to cases from that center during training.
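
For reference, a minimal sketch of how some of these case-level metrics could be computed is given below, assuming binary case-level labels and predictions; it is an illustration, not the study's analysis code.

```python
# Illustrative computation of several case-level metrics using scikit-learn.
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score, matthews_corrcoef

def case_level_metrics(y_true, y_pred) -> dict:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "F1": f1_score(y_true, y_pred),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": cohen_kappa_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "DOR": (tp * tn) / (fp * fn),            # undefined if fp or fn is zero
        "Youden_J": sensitivity + specificity - 1.0,
    }
```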

Statistical analysis

To compare the diagnostic performance of the AI models with that of expert and non-expert examiners, we applied two-sided non-parametric Wilcoxon signed-rank tests (Supplementary Table 1)49, performed in JASP (version 0.18.3).

We evaluated the robustness of the AI models by examining performance variations across different centers, ultrasound systems, histological diagnoses, examiner confidence levels, patient age groups and years of examination. Rather than statistical tests, we provide box plots and nonparametric confidence intervals. Confidence intervals were estimated by bootstrapping using the percentile method50, as direct parametric calculation of the confidence intervals was not possible for the human examiners.
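
A minimal sketch of the percentile bootstrap for a case-level metric is shown below; the choice of metric function and the number of resamples are illustrative assumptions.

```python
# Illustrative percentile bootstrap confidence interval for a case-level metric.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))   # resample cases with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```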

To ensure unbiased examiner representation, we used a sampling strategy where each examiner was selected with a probability inversely proportional to their number of cases assessed. This strategy was consistently applied also in our triage simulation.
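
As an illustration, the inverse-probability sampling of examiners might look like the sketch below; the column name and data layout are hypothetical.

```python
# Illustrative sketch: draw one examiner with probability inversely proportional to
# the number of cases they assessed, so high-volume examiners are not over-represented.
import numpy as np
import pandas as pd

def sample_examiner(assessments: pd.DataFrame, rng: np.random.Generator) -> str:
    counts = assessments.groupby("examiner_id").size()   # cases assessed per examiner
    weights = 1.0 / counts
    probs = weights / weights.sum()
    return rng.choice(counts.index.to_numpy(), p=probs.to_numpy())
```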

Additionally, we assessed the sensitivity-specificity trade-off by presenting an ROC curve for the AI models, accompanied by 95% confidence bands. The confidence bands were constructed from the 2.5th and 97.5th percentiles of sensitivity values, at each level of specificity, from bootstrapped ROC curves. We also depicted 95% confidence regions for the mean diagnostic performance of expert and non-expert examiners. To account for the negative correlation between sensitivity and specificity, we applied a bivariate random-effects model39, implemented in SAS (version 9.04). The calibration plots were constructed using R (version 4.3.3).
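
The construction of the ROC confidence bands can be sketched as follows, assuming binary labels and continuous model scores; the specificity grid and the number of resamples are illustrative choices.

```python
# Illustrative sketch of a bootstrapped ROC confidence band: each resampled ROC curve
# is interpolated onto a common specificity grid, and the 2.5th and 97.5th percentiles
# of sensitivity are taken at each grid point.
import numpy as np
from sklearn.metrics import roc_curve

def roc_confidence_band(y_true, scores, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    spec_grid = np.linspace(0.0, 1.0, 101)
    sens_curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        fpr, tpr, _ = roc_curve(y_true[idx], scores[idx])
        spec = 1.0 - fpr
        # roc_curve yields decreasing specificity; reverse so np.interp gets increasing x.
        sens_curves.append(np.interp(spec_grid, spec[::-1], tpr[::-1]))
    lower, upper = np.percentile(np.stack(sens_curves), [2.5, 97.5], axis=0)
    return spec_grid, lower, upper
```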

All other analyses, including bootstrapping and triage simulations, were conducted using Python (version 3.8.13) with the pandas library (version 2.0.1). A significance level of 0.05 was used for all statistical tests.

Our initial power analysis, which was based on our plan to compare the AI models with the initial assessments of the ultrasound examiners who generated the images, resulted in a required sample size of 1,600 cases. To account for potential dropout, we initially requested a minimum of 100 cases from each of the 20 participating centers. The inclusion process exceeded our expectations, resulting in a total of 3,652 cases. However, as the examiners’ initial assessments had not been systematically documented for most centers, we adjusted our evaluation strategy as detailed in the ‘Human examiner review’ section.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
