An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews.

  • Additional Information
    • Source:
      Publisher: Wiley Blackwell Country of Publication: England NLM ID: 101543738 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1759-2887 (Electronic) Linking ISSN: 17592879 NLM ISO Abbreviation: Res Synth Methods Subsets: MEDLINE
    • Publication Information:
      Publication: Chichester : Wiley Blackwell
      Original Publication: Malden, MA : John Wiley & Sons, 2010-
    • Subject Terms:
    • Abstract:
      There are currently no abstract classifiers that can be used in new diagnostic test accuracy (DTA) systematic reviews to select primary DTA study abstracts from database searches. Our goal was to develop machine-learning-based abstract classifiers for new DTA systematic reviews through an open competition. We prepared a dataset of abstracts obtained through database searches from 11 reviews in different clinical areas. As the reference standard, we used the lists of abstracts that required manual full-text review. We randomly split the dataset into a training set, a public test set, and a private test set. Competition participants used the training set to develop classifiers and validated them on the public test set, refining the classifiers based on their performance on that set. Participants could submit as many times as they wanted during the competition. Finally, we used the private test set to rank the submitted classifiers. To reduce false exclusions, we evaluated classifiers using the Fbeta measure with beta set to seven. After the competition, we conducted an external validation using a dataset from a cardiology DTA review. We received 13,774 submissions from 1429 teams or persons over 4 months. The top-ranked classifier achieved an Fbeta score of 0.4036 and a recall of 0.2352 in the external validation. In conclusion, we were unable to develop an abstract classifier with sufficient recall for immediate application to new DTA systematic reviews. Further studies are needed to update and validate classifiers with datasets from other clinical areas.
      (© 2023 John Wiley & Sons, Ltd.)
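
The abstract says classifiers were evaluated with the Fbeta measure at beta = 7 to reduce false exclusions. As a rough illustration of why that choice favors recall, the sketch below computes the standard F-beta score from confusion-matrix counts. The function name `fbeta_score_from_counts` and the example counts are hypothetical and are not taken from the study; the competition's exact scoring code is not described in this record.

```python
def fbeta_score_from_counts(tp: int, fp: int, fn: int, beta: float = 7.0) -> float:
    """Standard F-beta score computed from confusion-matrix counts.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)

    With beta = 7, each false negative (a wrongly excluded abstract)
    is weighted 49 times as heavily as a false positive.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        # No positives predicted and none present: define the score as 0.
        return 0.0
    return (1 + b2) * tp / denominator


# Hypothetical screening result (illustrative counts only):
# 10 relevant abstracts, 8 retrieved, plus 90 irrelevant abstracts flagged.
tp, fp, fn = 8, 90, 2

f7 = fbeta_score_from_counts(tp, fp, fn, beta=7.0)  # recall-weighted score
f1 = fbeta_score_from_counts(tp, fp, fn, beta=1.0)  # balanced score

print(f"F7 = {f7:.3f}, F1 = {f1:.3f}")
```

In this made-up case precision is low (0.08) but recall is 0.8, so the F7 score is much higher than the F1 score. This matches the stated intent of penalizing missed studies more than extra screening work.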
    • Grant Information:
      Fujifilm Corporation
    • Contributed Indexing:
      Keywords: diagnostic test accuracy; machine learning; open competition; search filter; systematic review
    • Publication Date:
      Date Created: 20230620 Date Completed: 20230915 Latest Revision: 20230915
    • Accession Number: