


A Systematic Literature Review on the Applications of Robots and Natural Language Processing in Education


1. Introduction

  • Can NAO robots be introduced in education? How do student performance levels compare before and after the introduction of robots in education?
  • What is the status of NAO robots in education in articles published in the selected journals from 2014 to 2023? Is the number of articles per year, divided into two seasons, increasing or decreasing?
  • Which research-sample groups appear in the selected articles from 2014 to 2023? What are the applications of natural language processing? What are the advantages and disadvantages of NAO robots and NLP?
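The second research question, tallying articles per year split into two seasons and checking the trend, can be sketched as a simple count. The records below are hypothetical (year, month) pairs, not the review's actual corpus:

```python
from collections import Counter

# Hypothetical publication records for the 2014-2023 window: (year, month) pairs.
records = [(2014, 3), (2014, 9), (2016, 2), (2016, 11), (2019, 5),
           (2021, 1), (2021, 8), (2022, 4), (2023, 7), (2023, 10)]

def season(month):
    """Split the year into two 'seasons' as the research question suggests:
    first half (Jan-Jun) vs. second half (Jul-Dec)."""
    return "H1" if month <= 6 else "H2"

# Articles per (year, season) and per year.
counts = Counter((year, season(month)) for year, month in records)
yearly = Counter(year for year, _ in records)

# Crude trend check: compare the first and last year's totals.
years = sorted(yearly)
trend = "increasing" if yearly[years[-1]] >= yearly[years[0]] else "decreasing"
```

A real analysis would of course use the review's extracted publication dates and a proper trend test rather than a first-vs-last comparison.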

1.1. Natural Language Processing

1.2. Natural Language Processing Applications

2. Materials and Methods

2.1. Benchmark Dataset

2.2. Study Criteria Applied

3.1. Motivations

| No. | Reference | Aim of Study | Model(s) | Dataset(s) | Study Field/Area | Limitation(s) |
|---|---|---|---|---|---|---|
| 1 | Dehghanzadeh et al. [ ] | Increase engagement, learning, and behavioral change; gamification capitalizes on people's inherent desire to play and compete | Meta-analysis | (9) datasets | — | PRISMA guideline |
| 2 | Yang et al. [ ] | Childhood education: effects on computational thinking, sequencing ability, and self-regulation | Matatalab coding | 101 kindergarteners | — | Hypotheses |
| 3 | Zhang et al. [ ] | Explore Chinese EFL learners' acceptance of mobile dictionaries (MDs) and identify factors influencing their perceptions | Technology acceptance model, mobile technology evaluation framework | 125 participants | NLP | — |
| 4 | Veivo et al. [ ] | Analyze children's gaze behavior during dialogue breakdowns in robot-assisted language learning (RALL); gaze patterns identified through multimodal analysis of video recordings | IRE model | 18 videos, 36 primary school pupils | Robot (RALL) | — |
| 5 | Engwall et al. [ ] | Analyze robot behavior and interaction in RALL with adult learners | NA | 33 adults | Robot (RALL) | — |
| 6 | Hwang et al. [ ] | Smart UEnglish app improves English as a foreign language (EFL) conversation with authentic-context effectiveness | Smart UEnglish | English textbook | Robot + NLP | — |
| 7 | Cao et al. [ ] | Compare ASD and TD children's joint attention responses with an adult and a social robot (NAO) | Comparative study design | 27 ASD + 40 TD children | Robot + Education | — |
| 8 | Ko et al. [ ] | Create a human–human interaction dataset for teaching social behaviors to robots | NA | AIR-Act2Act | Robot | — |
| 9 | Belpaeme et al. [ ] | Explore social robots as tutors for second language learning | Simulation | 40 adults | Robot + NLP | — |
| 10 | Engwall et al. [ ] | Study the Furhat robot's interaction styles for language practice; assess learner satisfaction | Four interaction styles | 32 participants | Robot + NLP | — |
| 11 | Chew et al. [ ] | Identify educational barriers to child rights in Malaysia; propose robot activists as an innovative solution | Model design | Students | Robot + Education | — |
| 12 | Le et al. [ ] | Explore telepresence robot acceptance in education, analyze factors influencing use intention, and provide design recommendations for improved usability | Qualtrics platform | 60 participants | Robot + Education | — |
| 13 | Engwall et al. [ ] | Assess the feasibility of autonomous robot-led conversations for second language practice; evaluate speech recognition and utterance selection methods | Language model | 33 students | Robot | — |
| 14 | Esfandbd et al. [ ] | Examine the effects of using the RASA robot in speech therapy for children with language disorders | CNN architectures | CK+ dataset | Robot + Education | — |
| 15 | Zhou et al. [ ] | Assess online course quality and identify factors influencing implementation effectiveness | NA | 100 courses | Education | — |
| 16 | Peng et al. [ ] | Enhance student engagement in online collaborative writing by integrating intergroup and intragroup awareness information | Technology acceptance model (TAM) | 161 students | Education | — |
| 17 | Flanigan et al. [ ] | Explore online instructors' rapport-building strategies and factors for initiating and maintaining rapport with students | Community of inquiry (CoI) | 19 college instructors | Education | — |
| 18 | Selwyn et al. [ ] | Critique discriminatory learning analytics and explore alternatives aligned with diverse learners and learning experiences | Learning analytics | Students | NLP | — |
| 19 | Belpaeme et al. [ ] | Explore social robots' impact on education outcomes and challenges | Review | Several studies | Robot + Education | — |
| 20 | Ramirez et al. [ ] | Compare active and passive SDOH screening methods in clinical spaces | Retrospective cohort analysis | 1735 cases | NLP | — |
| 21 | Chang et al. [ ] | Enhance professional trainers' effectiveness through a robot-based digital storytelling (DST) approach | BSFE model | 40 trainers | Robot | — |
| 22 | Velentza et al. [ ] | Examine the performance of social robots as university professors in engineering education, measuring enjoyment and knowledge acquisition and exploring their correlation through a series of experiments | Questionnaire | 138 people (7 male, 131 female) | Robot | — |
| 23 | Smakman et al. [ ] | Identify and compare the moral considerations of introducing social robots in primary education; develop guidelines for responsible implementation | Questionnaire | 118 participants | Robot + Education | — |
| 24 | Konijn et al. [ ] | Investigate effects of robot behaviors on students' learning outcomes in multiplication | NA | 86 students | Robot | — |
| 25 | Atapattu et al. [ ] | Analyze and remove noise from lecture slides for structured data analysis | Rating | 7 university lecturers | NLP | — |
| 26 | Liu et al. [ ] | Explore an artificial intelligence (AI) chatbot as a book-talk companion to enhance the reading experience and maintain students' interest and social connection | AI techniques | 68 students | NLP | — |
| 27 | Rodrigues et al. [ ] | Formative assessment system for students and teachers, automating exam creation, monitoring progress, and providing feedback on free-text answers | Assessment | History teachers | NLP | — |
| 28 | Westera et al. [ ] | Automated essay scoring methodology using NLP to reduce teacher workload; high precision achieves substantial workload reduction | ReaderBench | 173 reports | NLP | — |
| 29 | Kyu et al. [ ] | Explore automatic methods for constructing an expert model from textual explanations, focusing on key concepts and evaluating different metrics | Design/technology integration in learning | 7 professors at 6 major US universities | NLP | — |
| 30 | Rico-Juan et al. [ ] | Automated detection of inconsistencies in peer assessment using machine learning, aiming to reduce teachers' workload and ensure a fair evaluation process | ML algorithms for NLP | 354 students + 2 activities | NLP | — |
| 31 | Gerard et al. [ ] | Automated, adaptive guidance: moving students forward with personalized assistance | Knowledge integration (KI) + c-raterML | 798 6th- and 7th-grade students | Education | — |
| 32 | Lee et al. [ ] | Develop an AI-based chatbot to enhance preservice teachers' responsive questioning skills in mathematics education | Chatbot | Private dataset | Education | — |
| 33 | Lu et al. [ ] | Analyze social media impact on mental health and well-being | Analysis of randomly selected courses | 4 courses | Education | — |
| 34 | Wambsganss et al. [ ] | Explore the impact of automated feedback and social comparison on students' logical argumentation writing abilities | Feedback mechanisms and novel NLP approaches | 71 students | NLP | — |
| 35 | Hsu et al. [ ] | Investigate differences in learning achievement, AI anxiety, computational thinking (CT), and learning behaviors in CT and AI concept learning | Voice assistant application (VA app) | 56 university first-year students | Education | — |
| 36 | Han et al. [ ] | Investigate demographic factors influencing the experience of chatbot implementation for inclusive learning | FAQ chatbot | 46 students | Education | — |
| 37 | Sikström et al. [ ] | Develop pedagogical agents with adaptive, adequate, relational, and logical communication for effective and usable learning support | Systematic review | Papers published 2010–2020 | Education | — |
| 38 | Zhu et al. [ ] | Investigate student reactions to automated feedback and the relationship between revisions and improvement in scientific-argument writing | NA | 374 students | NLP | — |
| 39 | Bywater et al. [ ] | Investigate the impact of the teacher responding tool (TRT) on high school teachers' practice in effectively responding to students' mathematical ideas | NA | 4 high school teachers | Education | — |
| 40 | Greenhalgh et al. [ ] | Analyze teacher-focused Twitter hashtags as distinct affinity spaces for learning | Descriptive and hierarchical analysis | #michED, 1 September 2015 to 31 August 2016 | Education | — |
| 41 | Wang et al. [ ] | Analyze students' interactions with AI for English as a foreign language (EFL) learning and identify factors for success | Cluster and epistemic network analysis | 16 students | Education | — |
| 42 | Yang et al. [ ] | Investigate massive open online course (MOOC) learners' forum participation patterns and their impact on performance | Latent semantic analysis (LSA) model + decision tree model | 69,867 learners | Education | — |

3.2. Contributions

  • Provide a guide for researchers, trainers, teachers, orientalists, and students;
  • Promote university and institute policies and initiatives that build the research capacity of academic staff and students to integrate the NAO robot into the classroom and the laboratory;
  • Cultivate a culture of learning, training, and teaching with NAO robots and make it essential in the field of education;
  • Develop students' skills and raise their scientific level through cooperation to reach the correct answer to any question.

4. Discussion

  • Personalized instruction: robots provide tailored feedback, adaptive content delivery, and personalized tutoring;
  • Collaborative learning: robots facilitate group discussions, fostering communication and critical thinking skills;
  • Automated grading and feedback: NLP algorithms enable automated grading and timely feedback;
  • Language learning support: NLP-driven robots aid language learners through conversations and explanations;
  • Ethical considerations: privacy, fairness, and ethical implications must be addressed in implementation;
  • Teacher–technology collaboration: educators play a vital role in guiding and contextualizing technology use;
  • Preparation for future skills: robots and NLP prepare students for future workforce demands;
  • Inclusive education: robots promote inclusivity by supporting diverse learners;
  • Student engagement: robots increase student engagement and active participation in learning;
  • Advancements and challenges: ongoing advancements and challenges shape the future of these applications.
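The automated grading and feedback point can be made concrete with a minimal sketch: scoring a short answer by TF-IDF cosine similarity against a reference answer (RA). This is a generic illustration of the technique, not the method of any particular reviewed study, and the reference and student answers below are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF vectors with smoothed IDF (stdlib only)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c / len(toks) * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Reference answer plus two hypothetical student answers.
reference = "photosynthesis converts light energy into chemical energy"
answers = [
    "plants use photosynthesis to convert light energy into chemical energy",
    "the mitochondria is the powerhouse of the cell",
]
ref_vec, *ans_vecs = tfidf_vectors([reference] + answers)
scores = [cosine(ref_vec, v) for v in ans_vecs]  # higher = closer to the RA
```

A production grader would add stemming, synonym handling, and a calibrated score-to-grade mapping; the point here is only the shape of the pipeline.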

4.1. Distribution of Publications over the Years of Publication

4.2. Distribution by the Author's Nationality

4.3. Advantages

4.4. Disadvantages

5. Conclusions

  • Transformational impact: robots and NLP have the potential to transform education;
  • Personalized learning: individualized instruction enhances engagement and knowledge retention;
  • Collaborative environment: robots foster communication, critical thinking, and teamwork skills;
  • Efficiency and feedback: automated grading saves time and provides immediate feedback;
  • Language learning enhancement: NLP-driven robots support language acquisition and accessibility;
  • Ethical considerations: privacy, fairness, and transparency must be prioritized;
  • Teacher empowerment: collaboration between educators and technology developers is crucial;
  • Future readiness: integrating robots and NLP equips students with essential skills;
  • Inclusivity and engagement: robots promote inclusive education and active student participation;
  • Constant evolution: advancements and challenges continue to shape these applications.

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

| Abbreviation | Meaning |
|---|---|
| NLP | Natural Language Processing |
| SJR | Scientific Journal Rankings |
| AI | Artificial Intelligence |
| Q1 | First Quartile |
| ACM | Association for Computing Machinery |
| IEEE | Institute of Electrical and Electronics Engineers |
| LA | Learning Analytics |
| ER | Educational Robotics |
| PP | Pedagogical Practices |
| UTAUT | Unified Theory of Acceptance and Use of Technology |
| HMA | Hot Melt Adhesives |
| SPARC | Supervised Progressively Autonomous Robot Competencies |
| CRR | Care-Receiving Robot |
| HRI | Human–Robot Interaction |
| ANOVA | Analysis of Variance |
| MALL | Mobile-Assisted Language Learning |
| BPE | Byte-Pair Encoding |
| KI | Knowledge Integration |
| CT | Computational Thinking |
| STEM | Science, Technology, Engineering, and Mathematics |
| OJAD | Online Japanese Accent Dictionary |
| PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
| MDs | Mobile Dictionaries |
| RALL | Robot-Assisted Language Learning |
| EFL | English as a Foreign Language |
| DST | Digital Storytelling |
| RA | Reference Answer |
| ML | Machine Learning |
| TRT | Teacher Responding Tool |
| LSA | Latent Semantic Analysis |
| MOOC | Massive Open Online Courses |
| No. | Name of Journal | H-Index (accessed 18 June 2020) | H-Index (accessed 18 June 2023) |
|---|---|---|---|
| 1 | British Journal of Educational Technology | 87 | 110 |
| 2 | Computer-Assisted Language Learning | 45 | 63 |
| 3 | Computers & Education | 164 | 215 |
| 4 | Foundations and Trends in Machine Learning | 30 | 36 |
| 5 | IEEE Robotics and Automation Letters | 34 | 82 |
| 6 | IEEE Transactions on Robotics | 146 | 177 |
| 7 | International Journal of Robotics Research | 155 | 180 |
| 8 | International Journal of Social Robotics | 44 | 68 |
| 9 | Internet and Higher Education | 81 | 109 |
| 10 | Journal of the ACM | 123 | 133 |
| 11 | Science Robotics | 30 | 79 |
| 12 | Soft Robotics | 32 | 63 |
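The H-index figures can be compared across the two access dates. The sketch below uses three journals from the table, assuming the concatenated source figures split into a 2020 value followed by a 2023 value:

```python
# A subset of the journal H-index table; each tuple is (value accessed
# 18 June 2020, value accessed 18 June 2023), as read from the source table.
h_index = {
    "British Journal of Educational Technology": (87, 110),
    "Computers & Education": (164, 215),
    "Journal of the ACM": (123, 133),
}

# Absolute growth per journal over the three-year window.
growth = {name: new - old for name, (old, new) in h_index.items()}
fastest = max(growth, key=growth.get)
```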


  • Bartneck, C. The end of the beginning: A reflection on the first five years of the HRI conference. Scientometr. J. 2011 , 86 , 487–504. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Métais, E. Enhancing information systems management with natural language processing techniques. Data Knowl. Eng. J. 2002 , 41 , 247–272. [ Google Scholar ] [ CrossRef ]
  • Yang, Y.; Long, Y.; Sun, D.; Van Aalst, J.; Cheng, S. Fostering students’ creativity via educational robotics: An investigation of teachers’ pedagogical practices based on teacher interviews. Br. J. Educ. Technol. 2020 , 51 , 1826–1842. [ Google Scholar ] [ CrossRef ]
  • Guggemos, J.; Seufert, S.; Sonderegger, S. Humanoid robots in higher education: Evaluating the acceptance of pepper in the context of an academic writing course using the UTAUT. Br. J. Educ. Technol. 2020 , 51 , 1864–1883. [ Google Scholar ] [ CrossRef ]
  • Hobbs, R.; Tuzel, S. Teacher motivations for digital and media literacy: An examination of turkish educators. Br. J. Educ. Technol. 2017 , 48 , 7–22. [ Google Scholar ] [ CrossRef ]
  • Yu, X.; Nurzaman, S.G.; Culha, U.; Iida, F. Soft Robotics Education. Soft Robot. 2014 , 1 , 202–212. [ Google Scholar ] [ CrossRef ]
  • Senft, E.; Lemaignan, S.; Baxter, P.E.; Bartlett, M.; Belpaeme, T. Teaching robots social autonomy from in situ human guidance. Soft Robot. 2019 , 4 , eaat1186. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Younis, H.A.; Ruhaiyem, N.I.R.; Badr, A.A.; Abdul-Hassan, A.K.; Alfadli, I.M.; Binjumah, W.M.; Altuwaijri, E.A.; Nasser, M. Multimodal Age and Gender Estimation for Adaptive Human-Robot Interaction: A Systematic Literature Review. Processes 2023 , 11 , 1488. [ Google Scholar ] [ CrossRef ]
  • Vollmer, A.L.; Read, R.; Trippas, D.; Belpaeme, T. Children conform, adults resist: A robot group induced peer pressure on normative social conformity. Sci. Robot. 2018 , 3 , eaat7111. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Michaelis, J.E.; Mutlu, B. Reading socially: Transforming the in-home reading experience with a learning-companion robot. Sci. Robot. 2018 , 3 , eaat5999. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Scassellati, B.; Boccanfuso, L.; Huang, C.; Mademtzi, M.; Qin, M.; Salomons, N.; Ventola, P.; Shic, F. Improving social skills in children with ASD using a long-term, in-home social robot. Sci. Robot. 2018 , 3 , eaat7544. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Yang, G.Z.; Bellingham, J.; Choset, H.; Dario, P.; Fischer, P.; Fukuda, T.; Jacobstein, N.; Nelson, B.; Veloso, M.; Berg, J. Science for robotics and robotics for science. Sci. Robot. 2016 , 1 , eaal2099. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Tanaka, F.; Matsuzoe, S. Learning verbs by teaching a care-receiving robot by children: An experimental report. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, Boston, MA, USA, 5–8 March 2012; pp. 253–254. [ Google Scholar ]
  • Tanaka, F.; Matsuzoe, S. Children teach a care-receiving robot to promote their learning: Field experiments in a classroom for vocabulary learning. J. Hum.-Robot. Interact. 2012 , 1 , 78–95. [ Google Scholar ] [ CrossRef ]
  • Majgaard, G.; Bertel, L.B. Initial phases of design-based research into the educational potentials of NAO-robots. In Proceedings of the HRI’14: ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany, 3–6 March 2014; pp. 238–239. [ Google Scholar ]
  • Diyas, Y.; Brakk, D.; Aimambetov, Y.; Sandygulova, A. Evaluating peer versus teacher robot within educational scenario of programming learning. In Proceedings of the IEEE 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Christchurch, New Zealand, 7–10 March 2016; pp. 425–426. [ Google Scholar ]
  • Sánchez, F.Á.B.; Correal, A.M.G.; Guerrero, E.G. Interactive drama with robots for teaching non-technical subjects. J. Hum.-Robot. Interact. 2017 , 6 , 48. [ Google Scholar ] [ CrossRef ]
  • Reich-Stiebert, N.; Eyssel, F. (Ir)relevance of gender? on the influence of gender stereotypes on learning with a robot. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, Vienna, Austria, 6–9 March 2017; pp. 166–176. [ Google Scholar ]
  • Cheng, X.; Sun, J.; Zarifis, A. Artificial intelligence and deep learning in educational technology research and practice. Br. J. Educ. Technol. 2020 , 51 , 1653–1656. [ Google Scholar ] [ CrossRef ]
  • Qin, F.; Li, K.; Yan, J. Understanding user trust in artificial intelligence-based educational systems: Evidence from China. Br. J. Educ. Technol. 2020 , 51 , 1693–1710. [ Google Scholar ] [ CrossRef ]
  • Shum, S.J.B.; Luckin, R. Learning analytics and AI: Politics, pedagogy and practices. Br. J. Educ. Technol. 2019 , 50 , 2785–2793. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Ahmad, M.I.; Mubin, O.; Orlando, J. Understanding behaviours and roles for social and adaptive robots in education: Teacher’s perspective. In Proceedings of the Fourth International Conference on Human Agent Interaction, Biopolis, Singapore, 4–7 October 2016; pp. 297–304. [ Google Scholar ]
  • Karayaneva, Y.; Hintea, D. Object recognition algorithms implemented on NAO robot for children’s visual learning enhancement. In Proceedings of the 2nd International Conference on Mechatronics Systems and Control, Amsterdam, The Netherlands, 21 February 2018; pp. 86–92. [ Google Scholar ]
  • Gressmann, A.; Weilemann, E.; Meyer, D.; Bergande, B. Nao robot vs. lego mindstorms: The influence on the intrinsic motivation of computer science non-majors. In Proceedings of the 19th Koli Calling International Conference on Computing Education Research, Koli, Finland, 21–24 November 2019. [ Google Scholar ]
  • Lee, A. Determining the effects of computer science education at the secondary level on STEM major choices in postsecondary institutions in the United States. Comput. Educ. 2015 , 88 , 241–255. [ Google Scholar ] [ CrossRef ]
  • Roberts, T.H.; Brown, D.; Boulton, H.; Burton, A.; Shopland, N.; Martinovs, D. Examining the potential impact of digital game making in curricula based teaching: Initial observations. Comput. Educ. 2020 , 158 , 103988. [ Google Scholar ] [ CrossRef ]
  • Mubin, O.; Alhashmi, M.; Baroud, R.; Alnajjar, F.S. Humanoid robots as teaching assistants in an arab school. In Proceedings of the 31st Australian Conference on Human-Computer-Interaction, Fremantle, WA, Australia, 2–5 December 2019; pp. 462–466. [ Google Scholar ]
  • Ahmad, M.I.; Khordi-Moodi, M.; Lohan, K.S. Social robot for STEM education. In Proceedings of the ACM/IEEE International Conference on Human-Computer-Interaction, Cambridge, UK, 23–26 March 2020; pp. 90–92. [ Google Scholar ]
  • Van Ewijk, G.; Smakma, M.; Konijn, E.A. Teachers’ perspectives on social robots in education: An exploratory case study. In Proceedings of the Interaction Design and Children Conference, London, UK, 21–24 June 2020; pp. 273–280. [ Google Scholar ]
  • Hamid, S.; Waycott, J.; Kurnia, S.; Chang, S. Understanding students’ perceptions of the benefits of online social networking use for teaching and learning. Internet High. Educ. 2015 , 26 , 1–9. [ Google Scholar ] [ CrossRef ]
  • So, S. Mobile instant messaging support for teaching and learning in higher education. Internet High. Educ. 2016 , 31 , 32–42. [ Google Scholar ] [ CrossRef ]
  • Song, Y.; Siu, C.K. Affordances and constraints of BYOD (Bring Your Own Device) for learning and teaching in higher education: Teachers’ perspectives. Internet High. Educ. 2017 , 32 , 39–46. [ Google Scholar ] [ CrossRef ]
  • Van den Heuvel, R.J.F.; Lexis, M.A.S.; de Witte, L.P. ZORA robot based interventions to achieve therapeutic and educational goals in children with severe physical disabilities. Int. J. Soc. Robot. 2020 , 12 , 493–504. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Zaraki, A.; Khamassi, M.; Wood, L.J.; Lakatos, G.; Tzafestas, C.; Amirabdollahian, F.; Robins, B.; Dautenhahn, K. A novel reinforcement-based paradigm for children to teach the humanoid kaspar robot. Int. J. Soc. Robot. 2020 , 12 , 709–720. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Yoshino, K.; Zhang, S. Construction of teaching assistant robot in programming class. In Proceedings of the 7th International Congress on Advanced Applied Informatics (IIAI-AAI), Yonago, Japan, 8–13 July 2018; pp. 215–220. [ Google Scholar ]
  • Vrochidou, E.; Najoua, A.; Lytridis, C.; Salonidis, M.; Ferelis, V.; Papakostas, G.A. Social robot NAO as a self-regulating didactic mediator: A case study of teaching/learning numeracy. In Proceedings of the 2018 26th International Conference on Software, Telecommunications and Computer Networks, SoftCOM, Split, Croatia, 13–15 September 2018; pp. 93–98. [ Google Scholar ]
  • Shirouzu, H.; Miyake, N. Effects of robots’ revoicing on preparation for future learning. In Proceedings of the 10th International Conference on Computer-Supported Collaborative Learning, Madison, WI, USA, 15–19 June 2013; Volume 1, pp. 438–445. [ Google Scholar ]
  • Mubin, O.; Stevens, C.J.; Shahid, S.; Al Mahmud, A.; Dong, J. A review of the applicability of robots in education. Technol. Educ. Learn. 2013 , 1 , 13. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Koizumi, S.; Kanda, T.; Miyashita, T. Collaborative learning experiment with social robot. J. Robot. Soc. Jpn. 2011 , 29 , 902–906. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Sun, Z.; Li, Z.; Nishimorii, T. Development and assessment of robot teaching assistant in facilitating learning. In Proceedings of the International Conference of Educational Innovation through Technology (EITT), Osaka, Japan, 7–9 December 2018; pp. 165–169. [ Google Scholar ]
  • Tanaka, F.; Takahashi, T.; Matsuzoe, S.; Tazawa, N.; Morita, M. Telepresence robot helps children in communicating with teachers who speak a different language. In Proceedings of the 9th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Bielefeld, Germany, 3–6 March 2014; pp. 399–406. [ Google Scholar ]
  • Magyar, G.; Cádrik, T.; Virčiková, M.; Sinčák, P. Towards adaptive cloud-based platform for robotic assistants in education. In Proceedings of the 2014 IEEE 12th International Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 23–25 January 2014; pp. 285–289. [ Google Scholar ]
  • Eteokleous, N.; Ktoridou, D. Educational robotics as learning tools within the teaching and learning practice. In Proceedings of the 2014 IEEE Global Engineering Education Conference (EDUCON), Istanbul, Turkey, 3–5 April 2014; pp. 1055–1058. [ Google Scholar ]
  • Montalvo, M.; Calle-Ortiz, E. Programming by demonstration for the teaching and preservation of intangible cultural heritage. In Proceedings of the 2017 IEEE XXIV International Conference on Electronics, Electrical Engineering and Computing (INTERCON), Cusco, Peru, 15–18 August 2017; Volume 2. [ Google Scholar ]
  • Johal, W.; Jacq, A.; Paiva, A.; Dillenbourg, P. Child-robot spatial arrangement in a learning by teaching activity. In Proceedings of the 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), New York, NY, USA, 26–31 August 2016; pp. 533–538. [ Google Scholar ]
  • Tosello, E.; Michieletto, S.; Pagello, E. Training master students to program both virtual and real autonomous robots in a teaching laborator students to program both virtual and real autonomous robots in a teaching laboratory. In Proceedings of the IEEE Global Engineering Education Conference (EDUCON), Abu Dhabi, United Arab Emirates, 10–13 April 2016; pp. 621–630. [ Google Scholar ]
  • Konijn, E.A.; Hoorn, J.F. Robot tutor and pupils’ educational ability: Teaching the times tables. Comput. Educ. 2020 , 157 , 103970. [ Google Scholar ] [ CrossRef ]
  • Xia, L.; Zhong, B. A systematic review on teaching and learning robotics content knowledge in K-12. Comput. Educ. 2018 , 127 , 267–282. [ Google Scholar ] [ CrossRef ]
  • Younis, H.A.; Jamaludin, R.; Wahab, M.N.A.; Mohamed, A.S.A. The review of NAO robotics in educational 2014-2020 in COVID-19 Virus (pandemic ara): Technologies, type of application, advantage, disadvantage and motivation. IOP Conf. Ser. Mater. Sci. Eng. 2020 , 928 , 032014. [ Google Scholar ] [ CrossRef ]
  • Younis, H.A.; Mohamed, A.S.A.; Wahab, M.N.A.; Jamaludin, R.; Salisu, S. A New speech recognition model in a human-robot interaction scenario using NAO robot. In Proceedings of the International Conference on Communication & Information Technology (ICICT), Basrah, Iraq, 5–6 June 2021; pp. 215–220. [ Google Scholar ]
  • Li, D.; Chen, X. Study on the application and challenges of educational robots in future education. In Proceedings of the 2020 International Conference on Artificial Intelligence and Education (ICAIE), Tianjin, China, 26–28 June 2020; pp. 198–201. [ Google Scholar ]
  • Tan, J.T.C.; Iocchi, L.; Eguchi, A.; Okada, H. Bridging robotics education between high school and university: RoboCup@Home education. In Proceedings of the 2019 IEEE AFRICON, Accra, Ghana, 25–27 September 2019; pp. 1–4. [ Google Scholar ]
  • Haerazi, H.; Utama, I.M.P.; Hidayatullah, H. Mobile applications to improve english writing skills viewed from critical thinking ability for pre-service teachers. Int. J. Interact. Mob. Technol. 2020 , 14 , 58–72. [ Google Scholar ] [ CrossRef ]
  • Mukhitdinova, K.; Asilova, G.; Salisheva, Z.; Rakhmatullaeva, M. Current issues of creating educational material for intensive teaching to the Uzbek language of foreigners. Int. J. Eng. Adv. Technol. 2019 , 9 , 5224–5226. [ Google Scholar ] [ CrossRef ]
  • Marello, C.; Marchisio, M.; Pulvirenti, M.; Fissore, C. Automatic assessment to enhance online dictionaries consultation skills. In Proceedings of the 16th International Conference on Cognition and Exploratory Learning in the Digital Age (Celda 2019), Cagliari, Italy, 7–9 November 2019; pp. 331–338. [ Google Scholar ]
  • Minematsu, N.; Nakamura, I.; Suzuki, M.; Hirano, H.; Nakagawa, C.; Nakamura, N.; Tagawa, Y.; Hirose, K.; Hashimoto, H. Development and evaluation of online infrastructure to aid teaching and learning of Japanese prosody. IEICE Trans. Inf. Syst. 2017 , 100 , 662–669. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Dmitrichenkova, S.V.; Chauzova, V.A.; Malykh, E.A. Foreign language training of IT-students with the programme—translator in the directions and specialties of engineering faculty. Procedia Comput. Sci. 2017 , 103 , 577–580. [ Google Scholar ] [ CrossRef ]
  • Balalaieva, O. Online resources and software for teaching and learning latin recursos online e software para ensinar e aprender latim. Texto Livre 2019 , 12 , 93–188. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Ngo, X.V.; Le Ha, T.; Nguyen, P.T.; Nguyen, L.M. Combining advanced methods in japanese-vietnamese neural machine translation. In Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE), Ho Chi Minh, Vietnam, 1–3 November 2018; pp. 318–322. [ Google Scholar ]
  • Younis, H.A.; Mohamed, A.S.A.; Jamaludin, R.; Wahab, M.N.A. Survey of robotics in education, taxonomy, applications, and platforms during COVID-19. Comput. Mater. Contin. J. 2021 , 67 , 687–707. [ Google Scholar ]
  • Dehghanzadeh, H.; Farrokhnia, M. Using Gamification to Support Learning in K-12 Education: A Systematic Literature Review. Br. J. Educ. Technol. 2023 , 1–37. [ Google Scholar ] [ CrossRef ]
  • Yang, W.; Tsz, D.; Ng, K.; Gao, H. Robot Programming versus Block Play in Early Childhood Education: Effects on Computational Thinking, Sequencing Ability, and Self-Regulation. Br. J. Educ. Technol. 2022 , 53 , 1817–1841. [ Google Scholar ] [ CrossRef ]
  • Zhang, D.; Hennessy, S.; Pérez-Paredes, P. An Investigation of Chinese EFL Learners’ Acceptance of Mobile Dictionaries in English Language Learning. Comput. Assist. Lang. Learn. 2023 , 1–25. [ Google Scholar ] [ CrossRef ]
  • Veivo, O.; Mutta, M. Dialogue Breakdowns in Robot-Assisted L2 Learning. Comput. Assist. Lang. Learn. 2022 , 1–22. [ Google Scholar ] [ CrossRef ]
  • Engwall, O.; Lopes, J. Interaction and Collaboration in Robot-Assisted Language Learning for Adults. Comput. Assist. Lang. Learn. 2022 , 35 , 1273–1309. [ Google Scholar ] [ CrossRef ]
  • Hwang, W.; Guo, B.; Hoang, A.; Chang, C. Facilitating Authentic Contextual EFL Speaking and Conversation with Smart Mechanisms and Investigating Its Influence on Learning Achievements. Comput. Assist. Lang. Learn. 2022 , 1–27. [ Google Scholar ] [ CrossRef ]
  • Cao, H.; Simut, R.E.; Desmet, N.; de Beir, A.; Van De Perre, G.; Vanderborght, B. Robot-Assisted Joint Attention: A Comparative Study between Children with Autism Spectrum Disorder and Typically Developing Children in Interaction with NAO. IEEE Access 2020 , 8 , 223325–223334. [ Google Scholar ] [ CrossRef ]
  • Ko, W.; Jang, M.; Lee, J.; Kim, J. AIR-Act2Act: Human—Human Interaction Dataset for Teaching Non-Verbal Social Behaviors to Robots. Int. J. Robot. Res. 2021 , 40 , 691–697. [ Google Scholar ] [ CrossRef ]
  • Belpaeme, T.; Vogt, P.; Van Den Berghe, R.; Bergmann, K.; Göksun, T.; De Haas, M.; Kanero, J.; Kennedy, J.; Küntay, A.C.; Papadopoulos, F.; et al. Guidelines for Designing Social Robots as Second Language Tutors. Int. J. Soc. Robot. 2018 , 10 , 325–341. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Engwall, O.; Lopes, J.; Åhlund, A. Robot Interaction Styles for Conversation Practice in Second Language Learning. Int. J. Soc. Robot. 2021 , 13 , 251–276. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Chew, E.; Sikander, U.; Pei, K.; Lee, H. Designing a Novel Robot Activist Model for Interactive Child Rights Education. Int. J. Soc. Robot. 2021 , 13 , 1641–1655. [ Google Scholar ] [ CrossRef ]
  • Lei, M.; Clemente, I.M.; Liu, H.; Bell, J. The Acceptance of Telepresence Robots in Higher Education. Int. J. Soc. Robot. 2022 , 14 , 1025–1042. [ Google Scholar ] [ CrossRef ]
  • Engwall, O.; Lopes, J.; Cumbal, R. Is a Wizard-of-Oz Required for Robot-Led Conversation Practice in a Second Language? Int. J. Soc. Robot. 2022 , 14 , 1067–1085. [ Google Scholar ] [ CrossRef ]
  • Esfandbod, A.; Rokhi, Z.; Meghdari, A.F. Utilizing an Emotional Robot Capable of Lip-Syncing in Robot-Assisted Speech Therapy Sessions for Children with Language Disorders. Int. J. Soc. Robot. 2023 , 15 , 165–183. [ Google Scholar ] [ CrossRef ]
  • Zhou, X.; Li, Q.; Xu, D.; Li, X.; Fischer, C. College Online Courses Have Strong Design in Scaffolding but Vary Widely in Supporting Student Agency and Interactivity. Internet High. Educ. 2023 , 58 , 100912. [ Google Scholar ] [ CrossRef ]
  • Peng, Y.; Li, Y.; Su, Y.; Chen, K.; Jiang, S. Effects of Group Awareness Tools on Students’ Engagement, Performance, and Perceptions in Online Collaborative Writing: Intergroup Information Matters. Internet High. Educ. 2022 , 53 , 100845. [ Google Scholar ] [ CrossRef ]
  • Flanigan, A.E.; Akcaoglu, M.; Ray, E. Initiating and Maintaining Student-Instructor Rapport in Online Classes. Internet High. Educ. 2022 , 53 , 100844. [ Google Scholar ] [ CrossRef ]
  • Selwyn, N. Re-Imagining ‘Learning Analytics’ … a Case for Starting Again? Internet High. Educ. 2020 , 46 , 100745. [ Google Scholar ] [ CrossRef ]
  • Belpaeme, T.; Kennedy, J.; Ramachandran, A.; Scassellati, B.; Tanaka, F. Social Robots for Education: A Review. Sci. Robot. 2018 , 3 , eaat5954. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • De Ramirez, S.S.; Shallat, J.; Mcclure, K.; Foulger, R.; Barenblat, L. Screening for Social Determinants of Health: Active and Passive Information Retrieval Methods. Popul. Health Manag. 2022 , 25 , 781–788. [ Google Scholar ] [ CrossRef ]
  • Chang, C.; Hwang, G.; Chen, K. Fostering Professional Trainers with Robot-Based Digital Storytelling: A Brainstorming, Selection, Forming and Evaluation Model for Training Guidance. Comput. Educ. 2023 , 202 , 104834. [ Google Scholar ] [ CrossRef ]
  • Velentza, A.; Fachantidis, N.; Lefkos, I. Learn with Surprize from a Robot Professor. Comput. Educ. 2021 , 173 , 104272. [ Google Scholar ] [ CrossRef ]
  • Smakman, M.; Vogt, P.; Konijn, E.A. Moral Considerations on Social Robots in Education: A Multi-Stakeholder Perspective. Comput. Educ. 2021 , 174 , 104317. [ Google Scholar ] [ CrossRef ]
  • Atapattu, T.; Falkner, K.; Falkner, N. A Comprehensive Text Analysis of Lecture Slides to Generate Concept Maps. Comput. Educ. 2017 , 115 , 96–113. [ Google Scholar ] [ CrossRef ]
  • Liu, C.; Liao, M.; Chang, C.; Lin, H. An Analysis of Children’s Interaction with an AI Chatbot and Its Impact on Their Interest in Reading. Comput. Educ. 2022 , 189 , 104576. [ Google Scholar ] [ CrossRef ]
  • Rodrigues, F.; Oliveira, P. A System for Formative Assessment and Monitoring of Students’ Progress. Comput. Educ. 2014 , 76 , 30–41. [ Google Scholar ] [ CrossRef ]
  • Westera, W.; Dascalu, M.; Kurvers, H.; Ruseti, S. Automated Essay Scoring in Applied Games: Reducing the Teacher Bandwidth Problem in Online Training. Comput. Educ. 2018 , 123 , 212–224. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Kyu, M.; Zouaq, A.; Mi, S. Automatic Detection of Expert Models: The Exploration of Expert Modeling Methods Applicable to Technology-Based Assessment and Instruction. Comput. Educ. 2016 , 101 , 55–69. [ Google Scholar ] [ CrossRef ]
  • Rico-Juan, J.R.; Gallego, A.; Calvo-Zaragoza, J. Automatic Detection of Inconsistencies between Numerical Scores and Textual Feedback in Peer-Assessment Processes with Machine Learning. Comput. Educ. 2019 , 140 , 103609. [ Google Scholar ] [ CrossRef ]
  • Gerard, L.; Linn, M.C.; Berkeley, U.C. Computer-Based Guidance to Support Students’ Revision of Their Science Explanations. Comput. Educ. 2022 , 176 , 104351. [ Google Scholar ] [ CrossRef ]
  • Lee, D.; Yeo, S. Developing an AI-Based Chatbot for Practicing Responsive Teaching in Mathematics. Comput. Educ. 2022 , 191 , 104646. [ Google Scholar ] [ CrossRef ]
  • Lu, X.; Liu, X.W.; Zhang, W. Diversities of Learners’ Interactions in Different MOOC Courses: How These Diversities Affects Communication in Learning. Comput. Educ. 2020 , 151 , 103873. [ Google Scholar ] [ CrossRef ]
  • Wambsganss, T.; Janson, A.; Leimeister, J.M. Enhancing Argumentative Writing with Automated Feedback and Social Comparison Nudging. Comput. Educ. 2022 , 191 , 104644. [ Google Scholar ] [ CrossRef ]
  • Hsu, T.; Chang, C.; Lin, Y. Effects of Voice Assistant Creation Using Different Learning Approaches on Performance of Computational Thinking. Comput. Educ. 2023 , 192 , 104657. [ Google Scholar ] [ CrossRef ]
  • Han, S.; Lee, M.K. FAQ Chatbot and Inclusive Learning in Massive Open Online Courses. Comput. Educ. 2022 , 179 , 104395. [ Google Scholar ] [ CrossRef ]
  • Sikström, P.; Valentini, C.; Sivunen, A.; Kärkkäinen, T. How Pedagogical Agents Communicate with Students: A Two-Phase Systematic Review. Comput. Educ. 2022 , 188 , 104564. [ Google Scholar ] [ CrossRef ]
  • Zhu, M.; Liu, O.L.; Lee, H.S. The Effect of Automated Feedback on Revision Behavior and Learning Gains in Formative Assessment of Scientific Argument Writing. Comput. Educ. 2020 , 143 , 103668. [ Google Scholar ] [ CrossRef ]
  • Bywater, J.P.; Chiu, J.L.; Hong, J.; Sankaranarayanan, V. The Teacher Responding Tool: Scaffolding the Teacher Practice of Responding to Student Ideas in Mathematics Classrooms. Comput. Educ. 2019 , 139 , 16–30. [ Google Scholar ] [ CrossRef ]
  • Greenhalgh, S.P.; Rosenberg, J.M.; Staudt Willet, K.B.; Koehler, M.J.; Akcaoglu, M. Identifying Multiple Learning Spaces within a Single Teacher-Focused Twitter Hashtag. Comput. Educ. 2020 , 148 , 103809. [ Google Scholar ] [ CrossRef ]
  • Wang, X.; Liu, Q.; Pang, H.; Tan, S.C.; Lei, J.; Wallace, M.P.; Li, L. What Matters in AI-Supported Learning: A Study of Human-AI Interactions in Language Learning Using Cluster Analysis and Epistemic Network Analysis. Comput. Educ. 2023 , 194 , 104703. [ Google Scholar ] [ CrossRef ]
  • Yang, B.; Tang, H.; Hao, L.; Rose, J.R. Untangling Chaos in Discussion Forums: A Temporal Analysis of Topic-Relevant Forum Posts in MOOCs. Comput. Educ. 2022 , 178 , 104402. [ Google Scholar ] [ CrossRef ]
| No. | Ref. | Technologies | Study Sample | Application | Target Party | Motivation |
|---|---|---|---|---|---|---|
| 1 | [ ] | Artificial neural networks + inverse kinematics algorithms | Human instructor | Teaching | Human + students | Intangible cultural heritage |
| 2 | [ ] | Child–robot interaction | Children | Educational | Students + children | Concentration + attention + visualize performance |
| 3 | [ ] | Telepresence | Children | Speak + teaching | Children | Foreign language teaching |
| 4 | [ ] | Project-based learning (constructivist approach) | Master's students | Training | Master's students | Program both virtual + real autonomous robots |
| 5 | [ ] | ___ | Pupils | Teaching | Students | Timetables |
| 6 | [ ] | Review | Children | Teaching + learning | Children | Improvement in student learning |
| 7 | [ ] | Review | Hybrid | Speech | All | Learning |
| 8 | [ ] | Proposal + preliminary study | Hybrid | Speech recognition | All | Learning + education |
| 9 | [ ] | Robots in education | Hybrid | -- | All | Study |
| 10 | [ ] | Bridging robotics education challenge | High school + university | Speak + teaching | Students | Education |
| No. | Ref. | Type of Application | Name/Technologies | Variables (Independent/Dependent) | Method | Outcome | Activity/Purpose of the Study |
|---|---|---|---|---|---|---|---|
| 1 | [ ] | Mobile applications | Mobile-Assisted Language Learning (MALL) | 58 teachers + WhatsApp + U-Dictionary + email | Evaluated using a writing test and a critical thinking test | Capability in critical thinking for teachers | Improve English writing skills |
| 2 | [ ] | - | Educational dictionary platform | Multimedia textbook + multilingual dictionary + audio of conversational phrasebooks | Multilingual electronic dictionary | Electronic educational complex | Teaching students of the Uzbek language outside Uzbekistan |
| 3 | [ ] | Tests | The project makes the most of a learning management system | 600 students + 5 different tests + different languages | Design and creation of tests carried out through the consultation of online dictionaries | Students who use dictionaries perform better | Teaching + translation |
| 4 | [ ] | Experiments | OJAD (Online Japanese Accent Dictionary) | Visual + auditory + systematic + comprehensive | Dictionary | Generates adequate word accent + phrase intonation | Aid teaching + learning |
| 5 | [ ] | Training IT students | Q&A | Foreign languages + IT students | General language training of students | Programs for machine translation | Technical translation + search + read |
| 6 | [ ] | - | Review | Digital libraries + databases + online courses + electronic textbooks + dictionary + translators in Latin | Review of electronic resources in Latin | Opened access to vast library resources + use of the scientific and educational potential and experience in teaching Latin | Development of e-learning tools + websites for the study of Latin in Ukraine |
| 7 | [ ] | Combining advanced methods | Proposed variant of the Byte-Pair Encoding (BPE) algorithm | Methods in Japanese–Vietnamese | Created the first NMT systems for Japanese to Vietnamese | Neural machine translation | Translation |
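The last row of the table mentions a variant of the Byte-Pair Encoding (BPE) algorithm for Japanese–Vietnamese NMT. For context, the classic BPE merge loop (not the paper's specific variant) can be sketched in a few lines of Python, assuming a corpus given as word-to-frequency counts:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus ({symbol tuple: frequency})."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a {word: frequency} corpus."""
    words = {tuple(w): f for w, f in corpus.items()}
    rules = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        rules.append(pair)
        words = merge_pair(words, pair)
    return rules, words
```

On the textbook toy corpus `{"low": 5, "lower": 2, "newest": 6, "widest": 3}`, the first two learned merges are `('e', 's')` and then `('es', 't')`, yielding the subword `est` shared by "newest" and "widest".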

Share and Cite

Younis, H.A.; Ruhaiyem, N.I.R.; Ghaban, W.; Gazem, N.A.; Nasser, M. A Systematic Literature Review on the Applications of Robots and Natural Language Processing in Education. Electronics 2023 , 12 , 2864. https://doi.org/10.3390/electronics12132864



A systematic review of natural language processing for classification tasks in the field of incident reporting and adverse event analysis

Affiliations.

  • 1 Department of Anaesthesia, Critical Care and Pain Medicine, Edinburgh Royal Infirmary, 51 Little France Crescent, Edinburgh, Scotland, EH16 4SA, United Kingdom. Electronic address: [email protected].
  • 2 Usher Institute of Population Health Sciences & Informatics, The University of Edinburgh, 9 Little France Rd, Edinburgh, Scotland EH16 4UX, United Kingdom. Electronic address: [email protected].
  • 3 Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG, United Kingdom. Electronic address: [email protected].
  • PMID: 31630063
  • DOI: 10.1016/j.ijmedinf.2019.103971

Context: Adverse events in healthcare are often collated in incident reports which contain unstructured free text. Learning from these events may improve patient safety. Natural language processing (NLP) uses computational techniques to interrogate free text, reducing the human workload associated with its analysis. There is growing interest in applying NLP to patient safety, but the evidence in the field has not been summarised and evaluated to date.

Objective: To perform a systematic literature review and narrative synthesis to describe and evaluate NLP methods for classification of incident reports and adverse events in healthcare.

Methods: Data sources included Medline, Embase, The Cochrane Library, CINAHL, MIDIRS, ISI Web of Science, SciELO, Google Scholar, PROSPERO, hand searching of key articles, and OpenGrey. Data items were manually abstracted to a standardised extraction form.

Results: From 428 articles screened for eligibility, 35 met the inclusion criteria of using NLP to perform a classification task on incident reports, or with the aim of detecting adverse events. The majority of studies used free text from incident reporting systems or electronic health records. Models were typically designed to classify by type of incident, type of medication error, or harm severity. A broad range of NLP techniques are demonstrated to perform these classification tasks with favourable performance outcomes. There are methodological challenges in how these results can be interpreted in a broader context.

Conclusion: NLP can generate meaningful information from unstructured data in the specific domain of the classification of incident reports and adverse events. Understanding what or why incidents are occurring is important in adverse event analysis. If NLP enables these insights to be drawn from larger datasets it may improve the learning from adverse events in healthcare.

Keywords: Adverse event analysis; Incident reporting; Machine learning; Natural language processing; Patient safety; Text classification.

Copyright © 2019 Elsevier B.V. All rights reserved.


Similar articles

  • Natural Language Processing and Its Implications for the Future of Medication Safety: A Narrative Review of Recent Advances and Challenges. Wong A, Plasek JM, Montecalvo SP, Zhou L. Wong A, et al. Pharmacotherapy. 2018 Aug;38(8):822-841. doi: 10.1002/phar.2151. Epub 2018 Jul 22. Pharmacotherapy. 2018. PMID: 29884988 Review.
  • Detecting Adverse Drug Events with Rapidly Trained Classification Models. Chapman AB, Peterson KS, Alba PR, DuVall SL, Patterson OV. Chapman AB, et al. Drug Saf. 2019 Jan;42(1):147-156. doi: 10.1007/s40264-018-0763-y. Drug Saf. 2019. PMID: 30649737 Free PMC article.
  • Natural language processing with machine learning methods to analyze unstructured patient-reported outcomes derived from electronic health records: A systematic review. Sim JA, Huang X, Horan MR, Stewart CM, Robison LL, Hudson MM, Baker JN, Huang IC. Sim JA, et al. Artif Intell Med. 2023 Dec;146:102701. doi: 10.1016/j.artmed.2023.102701. Epub 2023 Nov 1. Artif Intell Med. 2023. PMID: 38042599
  • Integrating natural language processing expertise with patient safety event review committees to improve the analysis of medication events. Fong A, Harriott N, Walters DM, Foley H, Morrissey R, Ratwani RR. Fong A, et al. Int J Med Inform. 2017 Aug;104:120-125. doi: 10.1016/j.ijmedinf.2017.05.005. Epub 2017 May 11. Int J Med Inform. 2017. PMID: 28529113
  • Evaluation of Natural Language Processing (NLP) systems to annotate drug product labeling with MedDRA terminology. Ly T, Pamer C, Dang O, Brajovic S, Haider S, Botsis T, Milward D, Winter A, Lu S, Ball R. Ly T, et al. J Biomed Inform. 2018 Jul;83:73-86. doi: 10.1016/j.jbi.2018.05.019. Epub 2018 Jun 1. J Biomed Inform. 2018. PMID: 29860093
  • ChatGPT for Automated Qualitative Research: Content Analysis. Bijker R, Merkouris SS, Dowling NA, Rodda SN. Bijker R, et al. J Med Internet Res. 2024 Jul 25;26:e59050. doi: 10.2196/59050. J Med Internet Res. 2024. PMID: 39052327 Free PMC article.
  • Construction of a Multi-Label Classifier for Extracting Multiple Incident Factors From Medication Incident Reports in Residential Care Facilities: Natural Language Processing Approach. Kizaki H, Satoh H, Ebara S, Watabe S, Sawada Y, Imai S, Hori S. Kizaki H, et al. JMIR Med Inform. 2024 Jul 23;12:e58141. doi: 10.2196/58141. JMIR Med Inform. 2024. PMID: 39042454 Free PMC article.
  • Validation and clinical discovery demonstration of breast cancer data from a real-world data extraction platform. Nottke A, Alan S, Brimble E, Cardillo AB, Henderson L, Littleford HE, Rojahn S, Sage H, Taylor J, West-Odell L, Berk A. Nottke A, et al. JAMIA Open. 2024 May 17;7(2):ooae041. doi: 10.1093/jamiaopen/ooae041. eCollection 2024 Jul. JAMIA Open. 2024. PMID: 38766645 Free PMC article.
  • Development of a scoring system to quantify errors from semantic characteristics in incident reports. Uematsu H, Uemura M, Kurihara M, Yamamoto H, Umemura T, Kitano F, Hiramatsu M, Nagao Y. Uematsu H, et al. BMJ Health Care Inform. 2024 Apr 19;31(1):e100935. doi: 10.1136/bmjhci-2023-100935. BMJ Health Care Inform. 2024. PMID: 38642920 Free PMC article.
  • Adverse Event Signal Detection Using Patients' Concerns in Pharmaceutical Care Records: Evaluation of Deep Learning Models. Nishioka S, Watabe S, Yanagisawa Y, Sayama K, Kizaki H, Imai S, Someya M, Taniguchi R, Yada S, Aramaki E, Hori S. Nishioka S, et al. J Med Internet Res. 2024 Apr 16;26:e55794. doi: 10.2196/55794. J Med Internet Res. 2024. PMID: 38625718 Free PMC article.


A systematic review of natural language processing applications for hydrometeorological hazards assessment

  • Review Article
  • Published: 08 February 2023
  • Volume 116, pages 2819–2870 (2023)


  • Achraf Tounsi ORCID: orcid.org/0000-0001-8166-616X
  • Marouane Temimi ORCID: orcid.org/0000-0003-0006-2685


Natural language processing (NLP) is a promising tool for collecting data that are usually hard to obtain during extreme weather, like community response and infrastructure performance. Patterns and trends in abundant data sources such as weather reports, news articles, and social media may provide insights into potential impacts and early warnings of impending disasters. This paper reviews the peer-reviewed studies (journals and conference proceedings) that used NLP to assess extreme weather events, focusing on heavy rainfall events. The methodology searches four databases (ScienceDirect, Web of Science, Scopus, and IEEE Xplore) for articles published in English before June 2022. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed to select and refine the search. The method led to the identification of thirty-five studies. In this study, hurricanes, typhoons, and flooding were considered. NLP models were implemented in information extraction, topic modeling, clustering, and classification. The findings show that NLP remains underutilized in studying extreme weather events. The review demonstrated that NLP could potentially improve the usefulness of social media platforms, newspapers, and other data sources for weather event assessment. In addition, NLP could generate new information to complement data from ground-based sensors, reducing monitoring costs. Key outcomes of NLP use identified in the study include improved accuracy, increased public safety, improved data collection, and enhanced decision-making. On the other hand, researchers must overcome data inadequacy, inaccessibility, nonrepresentative and immature NLP approaches, and computing-skill requirements to use NLP properly.

Similar content being viewed by others

  • A computational approach to analyzing climate strategies of cities pledging net zero
  • A quantitative analysis of research trends in flood hazard assessment
  • Natural hazards, disaster management and simulation: a bibliometric analysis of keyword searches


1 Introduction

Meteorologists have always faced the challenge of predicting extreme precipitation events (Lazo et al. 2009 ). According to recent surveys of the public’s use of weather forecasts, the most widely used element of standard forecasts is the precipitation prediction (e.g., where, when, and how much rain will fall) (Lazo et al. 2009 ). End users in different communities and fields, like transportation, water resources, flood control, and emergency management, require reliable precipitation forecasts. In the U.S. Weather Research Program’s (USWRP) community planning report, experts in hydrology, transportation, and emergency management discussed how quantitative precipitation forecasts (QPF) relate to their specific communities (Ralph 2005 ). On average, users receive weather information four to five times daily through mobile phone apps, TV, newspapers, tweets, etc. (Purwandari et al. 2021 ). When assessing climate data, a discrete–continuous option problem can arise because of the dynamic nature of precipitation and the variety of physical forms involved, making rainfall forecasts challenging (Purwandari et al. 2021 ). QPF at short, mid, and long range is determined with different levels of difficulty and is usually delivered with varying uncertainty (Purwandari et al. 2021 ). Meteorologists have always tried to assimilate as many observations as possible to enhance their weather forecast skills. Observations from satellite, airborne, or ground-based sources have been used: the more reliable observations, even qualitative ones, are examined, processed, and eventually assimilated into models, the better the forecast. The increasing use of the Internet, and of social media streams in particular, has created a wealth of information that holds the potential to improve weather models.
However, a large amount of unstructured data is available on the Internet; such data lack the identifiable tabular organization required by traditional data analysis methods, which diminishes their potential (Gandomi and Haider 2015 ). Although unstructured data, such as Web pages, emails, and mobile phone records, may contain numerical and quantitative information (e.g., dates), they are usually text-heavy. Unlike numbers, textual data are inherently imprecise and vague. According to Britton ( 1978 ), at least 32% of the words used in English text are lexically ambiguous. Textual data are often unstructured, making it difficult for researchers to use them to enhance meteorological models. Nevertheless, the large amount of textual data provides new opportunities for urban researchers to investigate people’s perceptions, attitudes, and behaviors, which will help them better understand the impact of natural hazards. Jang and Kim ( 2019 ) demonstrated that crowd-sourced text data gathered from social media can effectively represent the collective identity of urban spaces. Conventional methods of collecting data, such as surveys, focus groups, and interviews, are often time-consuming and expensive. Raw text data collected without a predetermined purpose can be compelling if used wisely and can complement purposefully designed data collection strategies.

Machines can analyze and comprehend human language thanks to a process known as natural language processing (NLP). It is at the heart of many technologies we use daily, including search engines, chatbots, spam filters, grammar checkers, voice assistants, and social media monitoring tools (Chowdhary 2020 ). By applying NLP, it is possible to better grasp human language’s syntax, semantics, pragmatics, and morphology. Computer science then uses this language understanding to create rule-based or machine learning algorithms that can solve particular issues and carry out specific tasks (Chowdhary 2020 ). NLP has demonstrated tremendous capabilities in harvesting the abundance of available textual data. Hirschberg and Manning ( 2015 ) define it as a form of artificial intelligence, similar to deep learning and machine learning, that uses computational algorithms to learn, understand, and produce human language content. Basic NLP procedures require processing text data, converting text into features, and identifying semantic relationships (Ghosh and Gunning 2019 ). In addition to structuring large volumes of unstructured data, NLP can also improve the accuracy of text processing and analysis because it applies rules and criteria consistently. A wide range of fields has been shown to benefit from NLP. Guetterman et al. ( 2018 ) conducted an experiment comparing the results of traditional text analysis with those of a natural language processing analysis; the authors report that NLP could identify the major themes manually summarized by conventional text analysis. Syntactic and semantic analysis is frequently employed in NLP to break human discourse into machine-readable segments (Chowdhary 2020 ). Syntactic analysis, commonly called parsing, detects a text’s syntactic structure and the dependencies between words, as shown in a parse tree diagram (Chowdhary 2020 ).
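The basic pipeline mentioned above — processing text data and converting text into features — can be illustrated with a minimal, pure-Python bag-of-words sketch. This is a deliberately simplified stand-in for the feature-extraction step (real systems use richer tokenizers and representations):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep only letter runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Convert raw documents into count-vector features over a shared vocabulary."""
    tokenized = [tokenize(doc) for doc in documents]
    vocab = sorted(set(tok for doc in tokenized for tok in doc))
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for term, count in Counter(doc).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors
```

For example, `bag_of_words(["Heavy rain in Houston", "rain and more rain"])` yields the sorted vocabulary `['and', 'heavy', 'houston', 'in', 'more', 'rain']` and one count vector per document, which downstream classifiers or clustering methods can consume.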
Semantic analysis aims to find out what the language means or, to put it another way, to extract the exact meaning or dictionary meaning from the text. However, semantics is regarded as one of the most challenging domains in NLP due to language’s polysemy and ambiguity (Chowdhary 2020 ). Semantic tasks examine sentence structure, word interactions, and related ideas to grasp the topic of a text and the meaning of words. One of the main reasons for NLP’s complexity is the ambiguity of human language. For instance, sarcasm is problematic for NLP (Suhaimin et al. 2017 ). It would be challenging to teach a machine to understand the irony in the statement “I was excited about the weekend, but then my neighbor rained on my parade” or “it is raining cats and dogs,” yet humans can do so quickly. Researchers have worked on instructing NLP technologies to look beyond word meanings and word order to thoroughly understand context, word ambiguities, and other intricate aspects of communication. However, they must consider other factors, such as culture, background, and gender, when adjusting natural language processing models. For instance, idioms, like those related to weather, can vary significantly from one nation to the next.

NLP can support extreme weather event data challenges by analyzing large amounts of weather data for patterns and trends (Kahle et al. 2022 ). These data can feed more accurate forecasts and early warning systems for extreme weather events (Kitazawa and Hale 2021 ; Rossi et al. 2018 ; Vayansky et al. 2019 ; Zhou et al. 2022 ). NLP can also monitor social media for information on extreme weather events, allowing the detection of local events that may not be reported in official channels (Kitazawa and Hale 2021 ; Zhou et al. 2022 ). Additionally, NLP can be used to create automated chatbots that provide information to those affected by extreme weather events, such as directions to shelters, medical assistance, and other resources.
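As a toy illustration of the social-media monitoring idea (not a model from any of the reviewed studies), the simplest possible approach is a rule-based filter that flags posts mentioning hazard-related terms; the lexicon below is invented for the example, whereas real systems typically learn such signals from labeled data:

```python
# Illustrative keyword lexicon; a real system would learn terms from labeled data.
FLOOD_TERMS = {"flood", "flooding", "flooded", "storm surge", "heavy rain", "evacuate"}

def looks_like_flood_report(post):
    """Flag a social media post if it mentions any flood-related term."""
    text = post.lower()
    return any(term in text for term in FLOOD_TERMS)

def filter_posts(posts):
    """Return only the posts that mention flood-related terms."""
    return [p for p in posts if looks_like_flood_report(p)]
```

Such keyword matching misses paraphrases and idioms (the "raining cats and dogs" problem discussed earlier), which is precisely why the reviewed studies move to statistical and neural classifiers.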

This article comprehensively reviews how researchers have used NLP in extreme weather event assessment. The present study is, to our knowledge, a first attempt to synthesize opportunities and challenges for extreme events assessment research by adopting Natural Language Processing. In the methodology section of the manuscript, the approach, the selection method, and the search terms used for the article selection are detailed. Then, the search results are summarized and categorized. Next, the role of NLP and its challenges in supporting extreme events are assessed. Finally, the limitations of this literature study are listed.

2 Methodology

First, the protocol registration and information sources were determined. In this study, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed to conduct the systematic review (Page et al. 2021 ). The protocol (Tounsi 2022 ) was registered with the Open Science Framework on June 6th, 2022. We searched databases for peer-reviewed publications to identify articles within this systematic literature review’s scope and eligibility criteria.

Then, the search strategy was defined. Our search terms were developed systematically to ensure that all related and eligible papers in the databases were captured. A preliminary literature review determined the keywords used in the search, which were then modified based on feedback from content experts and the librarian. Our review then incorporated a collaborative search strategy to ensure that all papers related to the use of NLP for the assessment of extreme weather events were included. We searched four databases: IEEE Xplore, Web of Science, ScienceDirect, and Scopus.

We grouped the query keywords to identify relevant studies meeting our scope and inclusion criteria, combining groups with AND and terms within a group with OR. For example, we paired broad terms such as “NLP OR Natural Language Processing” with narrower terms such as “Precipitation OR Rainfall”. Figure 1 shows all the combinations of search terms used in the keyword search.
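The grouping logic can be sketched as a small query builder; the keyword groups below are illustrative assumptions, as the complete set of terms appears in Fig. 1.

```python
# Illustrative keyword groups (the actual groups are those in Fig. 1).
NLP_TERMS = ["NLP", "Natural Language Processing"]
EVENT_TERMS = ["Precipitation", "Rainfall", "Flood"]

def build_query(group_a, group_b):
    # OR within each group, AND between the two groups.
    clause_a = " OR ".join(f'"{term}"' for term in group_a)
    clause_b = " OR ".join(f'"{term}"' for term in group_b)
    return f"({clause_a}) AND ({clause_b})"

print(build_query(NLP_TERMS, EVENT_TERMS))
# ("NLP" OR "Natural Language Processing") AND ("Precipitation" OR "Rainfall" OR "Flood")
```

The same string can be pasted into the advanced-search field of databases such as Scopus or Web of Science, which accept parenthesized boolean queries of this shape.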

figure 1

Conceptual framework of the search terms used to query the databases for the literature review

This study focused on peer-reviewed publications satisfying two primary conditions: (a) applying at least one natural language processing technique to solve the problem stated in the study and reporting the results of the model used, not merely suggesting it; and (b) reporting results related to the assessment of precipitation-related extreme weather events. Papers that did not meet these conditions were excluded. For example, studies that focused solely on developing numerically based deep learning models were excluded, as was secondary research such as reviews, commentaries, and conceptual articles. The search was limited to English-language papers published until June 2022.

Two authors screened the publications simultaneously, deciding whether to include each one by consensus. First, we screened the publications by title and abstract and removed duplicates. We then finalized the selection by reading the full text of the remaining papers. To minimize selection bias, all discrepancies were resolved through discussion requiring consensus from both reviewers. For each paper, standardized information was recorded using a data abstraction form.

3.1 Search results

The flowchart of the article selection procedure for this systematic literature review is shown in Fig. 2. A total of 1225 documents were found in the initial search using the set of queries. We employed EndNote to manage filtering and duplicate removal, eliminating duplicates and all review, opinion, and perspective papers. Two authors then conducted a second filtering by reading titles and abstracts (n = 846). Three hundred sixteen (316) documents remained after screening against the inclusion criteria for full-text examination. An additional 281 articles outside the scope of the study were removed after reading the full text. For example, some studies suggested NLP as a solution but did not implement it; such studies could not be included because NLP was not demonstrated as a solution to the problem. Other examples include studies that examined the consequences of different extreme weather events (including rain-related issues) on other fields, such as construction or maritime transportation.

figure 2

PRISMA flow diagram of the search and selection process

Consequently, the final number of studies considered for the systematic review is 35, with consensus from both authors. We used a systematic approach to extract the information listed in Table 1 from each eligible article. Overall, 29 were journal articles, and nine were conference proceedings (Table 2).

3.2 NLP areas of application

The authors of the selected literature have explored numerous NLP topics, as shown in Table 1. In summary, researchers have used NLP in four areas: (1) social influence and trend analysis, (2) event impact assessment and mapping, (3) event detection, and (4) disaster resilience. Social influence and trend analysis is the most dominant topic (37% of all literature) and includes discussions of crowdsourcing, sentiment analysis, topic modeling, and citizen engagement. Researchers have also used NLP to study extreme events' impact assessment and mapping (25% of all literature), such as disaster response, event impact on infrastructure, and flood mapping. Event detection is another popular area of research (21% of all literature), in which authors used NLP to detect and monitor extreme weather events and design early warning systems. Lastly, researchers adopted NLP models for disaster resilience (17% of all literature).

Multiple data sources were used in the selected studies. Figure 3 shows the distribution of data sources used in the literature. They span social media (Twitter, Weibo, and Facebook); publications (newspapers, the Disaster Risk Reduction Knowledge Service, numerical climate and gauge data, weather bulletins, and a generic ontology); public encyclopedias and services (Chinese wiki and Google Maps); and posted photographs. Social media plays an essential role in data sourcing for the studies. For example, Twitter is used by 57.1% of the studies, which emphasizes both the role and the potential of this platform as a real-time and historical public data provider with features that support collecting more precise, complete, and unbiased datasets. Typically, researchers take the content of social media posts and evaluate it along with the geolocation data. Authors also used NLP to process other data sources such as weather bulletins, gauge data, online photographs, and newspapers. Data size varied from small datasets (dozens of photographs, hundreds of weather bulletins) to large ones (millions of tweets). Notably, when predictive modeling was the goal of a study, it was common practice to compare the accuracy of NLP findings against data from reliable sources.

figure 3

Pie chart of the data sources used within the literature

3.4 NLP tasks and models in the reviewed literature

All of the studies in this review applied at least one NLP task involving either syntactic or semantic analysis. Because the studies deal with extensive unstructured social media data, clustering was used to categorize tweets and understand what was being discussed on social platforms. Clustering proved a highly effective machine learning approach for identifying structure in unlabeled datasets. Studies used K-means and graph-based clustering.
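A minimal sketch of K-means clustering over term-frequency vectors illustrates how such tweet grouping works; the toy tweets, vocabulary, and deterministic seeding below are illustrative assumptions, not data or code from any reviewed study.

```python
from collections import Counter
import math

# Toy corpus with two obvious themes (flood reports vs. shelter requests).
tweets = [
    "flood water rising near river",
    "river flood warning water levels high",
    "need shelter and food for family",
    "shelter open food and blankets available",
]

# Shared vocabulary and term-frequency vectors.
vocab = sorted({w for t in tweets for w in t.split()})

def vectorize(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vecs, k, init_idx, iters=10):
    # Deterministic seeding for the sketch; production K-means uses
    # random restarts or k-means++ initialization instead.
    centroids = [list(vecs[i]) for i in init_idx]
    assign = [0] * len(vecs)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vecs]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign

vectors = [vectorize(t) for t in tweets]
print(kmeans(vectors, k=2, init_idx=(0, 2)))  # e.g. [0, 0, 1, 1]
```

In practice the same pipeline is run with a library implementation (e.g. scikit-learn's `KMeans`) over TF-IDF vectors of thousands of tweets, but the assign-then-recompute loop is the whole algorithm.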

In addition, pre-trained models (PTMs) for NLP are deep learning models trained on large datasets to perform specific NLP tasks. PTMs can learn universal language representations when trained on a large corpus, which aids in solving downstream NLP tasks and avoids training new models from scratch. Several studies in the literature used PTMs such as Bidirectional Encoder Representations from Transformers (BERT), spaCy NER, Stanford NLP, NeuroNER, FlauBERT, and CamemBERT, either for named entity recognition or for classification.
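The output shape these NER models produce (entity span, label, character offset) can be mimicked with a toy gazetteer tagger; this is only a stand-in for illustration, and the entity lists and example tweet are assumptions, not the pre-trained models named above.

```python
import re

# Toy gazetteer-based tagger: a stand-in that mimics the (span, label, offset)
# output shape of pre-trained NER models; the gazetteers are assumptions.
EVENTS = {"Hurricane Harvey", "Typhoon Haiyan"}
LOCATIONS = {"Houston", "Beijing", "New Orleans"}

def tag_entities(text):
    entities = []
    for name in sorted(EVENTS | LOCATIONS):
        label = "EVENT" if name in EVENTS else "LOC"
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), label, match.start()))
    return sorted(entities, key=lambda e: e[2])  # order by position in text

tweet = "Hurricane Harvey flooding reported across Houston tonight"
print(tag_entities(tweet))
# [('Hurricane Harvey', 'EVENT', 0), ('Houston', 'LOC', 42)]
```

A real pipeline replaces the gazetteer lookup with a model call (e.g. a fine-tuned BERT token classifier) but consumes the same span-label-offset triples downstream, for instance to geocode the `LOC` spans.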

Moreover, the automated extraction of the subjects discussed in vast amounts of text is another application addressed in the literature. Topic modeling is a statistical technique for discovering and separating these subjects, in an unsupervised manner, from massive volumes of documents. Authors used models such as Latent Dirichlet Allocation (LDA) and Correlation Explanation (CorEx) (Barker and Macleod 2019; Chen and Ji 2021; Karimiziarani et al. 2022; Xin et al. 2019; Zhou et al. 2021). Furthermore, other NLP subtasks such as tokenization, part-of-speech tagging, dependency parsing, lemmatization, and stemming were applied in several studies to deal with data-related problems.
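Two of those subtasks, tokenization and (crude) stemming, can be sketched in a few lines; the stop-word list and suffix rules here are toy assumptions, and real pipelines use proper lemmatizers such as those in NLTK or spaCy.

```python
import re

# Minimal preprocessing sketch: tokenization, stop-word removal, and crude
# suffix stripping. Stop-word list and suffix rules are toy assumptions.
STOPWORDS = {"the", "is", "in", "are", "and", "of", "a"}
SUFFIXES = ("ing", "ed", "s")

def tokenize(text):
    # Lowercase and split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("Rivers are flooding in the eastern districts"))
# ['river', 'flood', 'eastern', 'district']
```

Such a normalization pass is what typically feeds the document-term counts consumed by topic models like LDA.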

3.5 Study area

Studies utilizing NLP have been conducted in different countries. Most research concentrated on metropolitan cities like Beijing, China, and New York City, USA. This may be because these cities are heavily populated, which increases the likelihood of data availability (more social media users, for example). The USA leads all countries in the number of Twitter users as of January 2022, with more than 76.9 million, while 252 million Chinese users actively use Weibo daily. Moreover, both cities lie in the path of frequent tropical cyclones and may be affected by several types of flooding: coastal flooding due to storm surges, pluvial flooding due to heavy precipitation over a short time, or fluvial flooding due to proximity to a river. For example, on July 21, 2012, a flash flood struck the city of Beijing for over twenty hours. As a result, 56,933 people were evacuated within a day of the flooding, which caused 79 fatalities, at least 10 billion Yuan in damages, and destroyed at least 8,200 dwellings. Figure 4 shows the distribution of the studies by country. Some researchers compared data from various cities, and the analysis may be done at multiple scales, from a single town to a whole continent.

figure 4

Distribution of studies by country

3.6 Types of extreme weather events

Three types of extreme events were addressed in the surveyed literature: hurricanes and storms, typhoons, and flooding. Formed in the North Atlantic, the northeastern Pacific, the Caribbean Sea, or the Gulf of Mexico, hurricanes cause significant damage over wide, densely populated areas, which explains the focus on this type of extreme event and its presence on social media. More than 48% of the studies concentrated in one way or another on hurricanes. Hurricanes have hazardous impacts: storm surges and large waves produced by hurricanes pose the greatest threat to life and property along the coast (Rappaport et al. 2009). In emergencies, governments must invest resources (financial and human) to support the affected areas and populations and to help spread updates and warnings (Vanderford et al. 2007).

Flooding is a challenging and complex phenomenon, as it may occur at different spatial and temporal scales (Istomina et al. 2005). Floods can result from intense rainfall, ocean surges, rapid snowmelt, or the failure of dams or levees (Istomina et al. 2005). The most dangerous are flash floods, which combine extreme speed with a flood's devastating strength (Sene 2016). Because flooding is considered the deadliest type of severe weather, decision-makers must use every possible data source to confront it. Flooding was the second hazard addressed in the literature; floods vary in magnitude from a few inches of water to several feet and may come on quickly or build gradually. Nearly 42% of the reviewed literature covered flooding.

Three studies, representing 10% of the literature, covered typhoons, which develop in the northwestern Pacific and usually affect Asia. Typhoons can cause enormous casualties and economic losses, and governments and decision-makers have difficulty gathering data in typhoon crisis scenarios. At the same time, modern social media platforms like Twitter and Weibo offer access to almost real-time disaster-related information (Jiang et al. 2019).

4 Discussion

Hurricanes, storms, and floods are the extreme events most often addressed in the literature. The duration of weather hazards varies greatly, from a few hours for some powerful storms to years or decades for protracted droughts. Occurrences of weather hazards usually raise awareness and therefore trigger higher sensitivity to extreme events and a greater tendency of the public and Internet users to report them. Even short-lived catastrophes may leave a long-lasting mark in the public's mind and continue to be referred to online as reference events. Using ground-based sensors to understand the dynamics of weather hazards is often constrained by limited resources, leading to sparse networks and data scarcity. Thus, a better understanding and monitoring of these extreme events can only be ensured by expanding the dataset to other structured and unstructured data sources. In this regard, only through NLP can this kind of data be valorized and made available to weather modelers. This systematic review addressed a critical gap in the literature by exploring the applications of NLP in assessing extreme events.

4.1 Role of NLP in supporting extreme events assessment

4.1.1 Hurricanes

NLP can help learn from structured and unstructured textual data from reliable sources so that preventive and corrective measures can be taken in emergencies to support decision-making. We found that many different NLP models are used together with multiple data sources. For example, topic modeling was used by 12 studies to support hurricane and storm decision-making, with LDA and CorEx models applied to social media data (Facebook and Twitter). Vayansky et al. (2019) used sentiment analysis to measure changes in Twitter users' emotions during natural disasters. Their work can help the authorities limit damage from natural disasters, specifically hurricanes and storms, and, in addition to corrective measures for recovering from the disaster, adjust future response efforts accordingly (Vayansky et al. 2019). Yuan et al. (2021) investigated differences in sentiment polarity between racial/ethnic and gender groups, examined the themes of concern in their expressions, including the popularity of these themes and the sentiment toward them, and used the resulting disparities in disaster response to better understand the social aspects of disaster resilience. The findings can assist crisis response managers in identifying the most sensitive and vulnerable groups and targeting the appropriate demographic groups with catastrophe evolution reports and relief resources. On another note, Yuan et al. (2020) looked at how often people posted on social media and used the yearly average sentiment as a sentiment baseline. LDA was used to determine the sentiment and weights for various subjects in public discourse. A better understanding of the public's specific worries and panics as catastrophes progress may assist crisis response managers in creating and carrying out successful response methods.
In addition to protecting people’s lives, NLP can be used to monitor post-disaster infrastructure conditions. Chen and Ji ( 2021 ) used the CorEx topic model to capture infrastructure condition-related topics by incorporating domain knowledge into the correlation explanation and to look at spatiotemporal patterns of topic engagement levels for systematically sensing infrastructure functioning, damage, and restoration conditions. To help practitioners maintain essential infrastructure after catastrophes, Chen and Ji ( 2021 ) offered a systematic situational assessment of the infrastructure. Additionally, the suggested method looked at how people and infrastructure systems interact, advancing human-centered infrastructure management (Chen and Ji 2021 ).

Other studies used topic modeling twinned with other methods, such as clustering and named entity recognition (Barker and Macleod 2019; Fan et al. 2018; Shannag and Hammo 2019; Sit et al. 2019). This enabled the authors to develop more advanced analytical frameworks than a single-model-based pipeline. Sit et al. (2019) used an analytical framework for Twitter analysis that could recognize and classify tweets about disasters, identify impact locations and time frames, and determine the relative importance of each category of disaster-related information over time and geography. Throughout the disaster's temporal course, their analysis revealed possible places with significant densities of impacted people and infrastructure damage (Sit et al. 2019). The approach has enormous potential for use during a disaster for real-time damage and emergency information detection and for making informed judgments by analyzing the circumstances in affected areas (Sit et al. 2019). Barker and Macleod (2019) created a prototype national-scale Twitter data mining pipeline for better stakeholder situational awareness during flooding occurrences across Great Britain. By automatically detecting tweets using Paragraph Vectors and a regression-based classifier, the study can be implemented as a national-scale, real-time product responding to requests for better crisis situational awareness. In another study, Fan et al. (2018) detected infrastructure-related topics in tweets posted during disasters, and their evolution during hurricanes, by integrating several NLP and statistical models such as LDA and K-means clustering. The study made it possible to trace the progression of conditions during various crisis phases and to summarize key points (Fan et al. 2018).
The proposed framework’s analytics components can help decision-makers recognize infrastructure performance through text-based representation and give evidence for measures that can be taken immediately (Fan et al. 2018 ).

Topic modeling is not the only NLP technique beneficial for decision-making support; information extraction (IE) is another. Many hurricane-related social media reactions are expressed in natural language text, which is extremely difficult to use effectively in this format. IE extracts information from these unstructured textual sources, finds relevant entities (words and tokens related to the topic of interest), and classifies and stores them in a database (Grishman 2015). In this review, four studies used information extraction-related models to extract valuable information from unstructured text (Devaraj et al. 2020; C. Fan et al. 2020a, b; Zhou et al. 2022). Zhou et al. (2022) created VictimFinder models based on cutting-edge NLP algorithms, including BERT, to identify tweets asking for help rescuing people. The study presents a practical application promoting social media use for rescue operations in future disaster events: web apps can be created to offer near-real-time rescue request locations that emergency responders and volunteers may use as a guide for dispatching assistance, and the best model can also be integrated into GIS tools (Zhou et al. 2022). To analyze location-specific catastrophe circumstances, C. Fan et al. (2020a, b) suggested an integrated framework to parse social media data and evaluate location-event-actor meta-networks. The study's outcomes highlighted the potential of the proposed framework to enhance social sensing of crisis conditions and to prioritize relief and rescue operations based on the severity of events and local requirements (C. Fan et al. 2020a, b).

Devaraj et al. (2020) examined whether valuable information for first responders could be successfully extracted from public tweets during a hurricane. As social media use constantly increases, people now turn to platforms like Twitter to make urgent requests for help. The study shows that urgent requests posted on social media sites may be identified using machine learning models (Devaraj et al. 2020). As hurricanes develop, emergency services or other relevant relief parties might use these broad models to automatically identify pleas for assistance on social media in real time (Devaraj et al. 2020).

4.1.2 Typhoons

When a disaster strikes, as demonstrated in numerous cases, citizens can quickly organize themselves and start sharing disaster information (Jiang et al. 2019). Three studies within the review scope dealt with typhoons, using both topic modeling and information extraction approaches. Kitazawa and Hale (2021) studied how the general population reacts online to warnings of typhoons and heavy rains. The study suggests that such insights can assist authorities in creating more focused social media strategies that reach the public more rapidly and extensively than traditional communication channels, improve the circulation of information to the public at large, and gather more detailed disaster data (Kitazawa and Hale 2021). Lam et al. (2017) suggested a weakly supervised strategy for annotating tweets on typhoons when only a small annotated subset is available; governments and other concerned bodies can use such tweet classifiers to find additional information about a disaster on social media (Lam et al. 2017). To evaluate the degree of harm indicated by social media texts, Yuan et al. (2021) presented a model that focuses on the in-depth interpretation of social media texts while requiring less manual labor in a semi-supervised setting. The damage extent map developed by the authors largely matches the actual damage recorded by authorities, demonstrating that the suggested approach can correctly estimate typhoon damage with minimal manual labor (Yuan et al. 2021).

4.1.3 Flooding

Through 14 studies from this review's literature, authors developed several topic modeling-based frameworks and products that can help assess flood-related situations (Barker and Macleod 2019; Gründer-Fahrer et al. 2018; Rahmadan et al. 2020). Barker and Macleod (2019) presented a prototype social geodata machine learning pipeline that combined recent developments in word-embedding NLP with real-time environmental data at the national level to identify flood-related tweets throughout Great Britain. By automatically detecting tweets using Paragraph Vectors and a logistic regression-based classifier, the study supports requests for better crisis situational awareness (Barker and Macleod 2019). The approach is an important finding, as it holds considerable potential for application to other countries and other emergencies (Barker and Macleod 2019). In their investigation of the thematic and temporal structure of German social media communication, Gründer-Fahrer et al. (2018) looked at the types of content shared on social media during the event, the evolution of topics over time, and the use of temporal clustering techniques to automatically identify defining phases of communication. According to the study, social media material has significant potential for the factual, organizational, and psychological aspects of disasters and throughout all phases of the disaster management life cycle (Gründer-Fahrer et al. 2018). Within this methodological inquiry, the authors assert that topic model analysis, when paired with proper optimization approaches, showed great relevance for thematic and temporal social media analysis in crisis management (Gründer-Fahrer et al. 2018). Social media is heavily utilized for warnings and the dissemination of current information about many factual components of an event (such as weather, water levels, and traffic hindrances), and may help with situational awareness and the promptness of early warnings in the planning and response states.

NLP changes how we view social media: it becomes an ideal way to interact with the volunteer movement immediately and to shift it from its current stage of autonomous individual involvement to organized participation from the perspective of crisis management (Gründer-Fahrer et al. 2018). From another angle, Rahmadan et al. (2020) identified the subjects discussed during a flood crisis by applying LDA topic modeling and used a lexicon-based approach to examine the sentiment expressed by the public when floods strike. Related parties can utilize their work to design disaster management plans, map at-risk floodplains, assess causes, and monitor effects after a flood catastrophe (Rahmadan et al. 2020).
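A lexicon-based sentiment approach of the kind just described can be sketched in a few lines; the words and polarity scores below are illustrative assumptions (studies in this space use established resources such as VADER or language-specific lexicons), not the lexicon any reviewed study used.

```python
# Tiny illustrative polarity lexicon; words and scores are assumptions.
LEXICON = {"trapped": -2, "destroyed": -3, "safe": 2, "rescued": 3, "help": -1}

def lexicon_sentiment(text):
    # Sum the polarity of every known token; unknown tokens score zero.
    score = sum(LEXICON.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

print(lexicon_sentiment("family rescued and safe after flood"))  # ('positive', 5)
print(lexicon_sentiment("houses destroyed people trapped"))      # ('negative', -5)
```

Aggregating such per-post scores over time and location is what yields the public-sentiment trajectories that studies like Rahmadan et al. (2020) report.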

Despite the proven adverse financial, economic, and humanitarian effects of floods, databases of flood risk reduction measures are sparse in detail, limited in scope, or privately owned by commercial entities. Given that the amount of Internet data is constantly increasing, several studies in this review highlighted the emergence of information extraction methods for unstructured text (Kahle et al. 2022; Lai et al. 2022). For example, to extract information from newspapers, Lai et al. (2022) used NLP to build a hybrid Named Entity Recognition (NER) model that combines a domain-specific machine learning model, linguistic characteristics, and rule-based matching. The study's methodology builds on earlier comparable efforts by extending the geographical scope and extracting information from massive documents with little accuracy loss (Lai et al. 2022). The work offers new information that climate researchers can use to recognize and map flood patterns and to assess the efficacy of current flood models (Lai et al. 2022). Zhang et al. (2021) used a BERT-BiLSTM-CRF model in a social sensing strategy for identifying flooding sites by separating waterlogged places from common locations. The authors created a "City Portrait" using semantic data to depict the city's functional regions (Zhang et al. 2021), which would be very valuable for decision-makers using the same approach to validate regions with high flooding rates. In another study, Wang et al. (2020) used Computer Vision (CV) and NeuroNER methods to extract information from the visual and linguistic content of social media to create a deep learning-based framework for flood monitoring. The work can provide thorough situational awareness and create a passive hotline to coordinate rescue and search efforts (Wang et al. 2020).

4.2 Challenges of NLP in supporting extreme events assessment

NLP provides a fascinating new path with the potential to greatly support a better understanding of natural hazards. Nevertheless, the technique brings uncertainties: models, data, and hydrology applications are three possible sources of uncertainty when using NLP in extreme event assessment research. Several data-related challenges were reported in the reviewed literature. Several studies note that social media data are either noisy (containing fake news, disinformation, automated messaging, adverts, and unrelated cross-event subjects) or hard to obtain due to rate limitations at the source (Twitter REST API, Twitter Streaming API) (C. Fan et al. 2020a, b; Xin et al. 2019; Zhou et al. 2021). Because they frequently contain more slang, URLs, and linguistic variety than generic text or other social media posts, tweets are challenging to examine in a comparative context with other media (Alam et al. 2020; Xin et al. 2019).
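Part of that noise is mechanical and can be reduced with a simple cleaning pass before any model sees the text; the regexes below are a minimal sketch under the assumption of Twitter-style input, and real pipelines additionally handle emoji, spam accounts, and duplicated bot messages.

```python
import re

# Minimal noise-reduction pass for tweets: strips retweet markers, URLs, and
# @mentions, and keeps hashtag words without the '#'. A sketch only.
def clean_tweet(text):
    text = re.sub(r"^RT\s+", "", text)        # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"@\w+:?", "", text)        # user mentions (with colon)
    text = text.replace("#", "")              # keep the hashtag word itself
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_tweet("RT @nws: #flood warning for downtown http://t.co/abc123"))
# flood warning for downtown
```

Keeping the hashtag word (rather than dropping the whole tag) is a common choice, since tags like #flood often carry the most event-relevant signal in a short tweet.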

Moreover, several studies working with geotagged tweets reported their scarcity, and even when such tweets are available, the GPS precision of crowd-sourced data is on the order of meters, while Twitter-based data are often only accurate to the level of street names (Wang et al. 2018). Despite the exponential growth of social media-based research in different fields, the number of included studies is relatively low. Many factors, such as data issues, can explain this; for example, Twitter recently added more restrictions on data access, which may lie behind the slow growth of research using social media to assess the consequences of extreme weather events. Finally, data might lack critical elements, including disaster-related aspects (such as severity, length, magnitude, and kind of catastrophe) and non-disaster-related aspects (such as regional socioeconomic variations and day/night differences).

Despite the advances made in NLP, most common applications are still limited. While the end goal of NLP is for algorithms to derive meaning through computer logic and establish the links between words and grammar in human language, existing methodologies do not resolve natural language as well as humans do. Several named entity recognition applications require human interaction as a post-processing step to correct model output (Alam et al. 2020). In addition, several models performed poorly due to data scarcity and quality, leading to failure to detect fake and spam social media messages (Vayansky et al. 2019). Moreover, several studies reported low performance of NLP models in languages other than English (Gründer-Fahrer et al. 2018); more research is needed on this matter. On another note, several studies lacked the time to fully optimize and tune their models (de Bruijn et al. 2020). Finally, topic modeling algorithms such as Latent Dirichlet Allocation cannot produce reliable results when used with microblogging platforms like Twitter because of their unique features, particularly brief tweets (Shannag and Hammo 2019); LDA has been shown to perform well on long texts and less well on shorter ones.

Although these models can solve plenty of hydrology challenges, some issues remain unsolved and require further investigation. Most developed methods are not ready for operational use, despite their capability to communicate preventive and corrective states for extreme weather event assessment (Alam et al. 2020). In addition, public response, especially from those directly impacted by a natural hazard, might be relatively limited during disasters, when data are most needed, as Internet users will prioritize responding to the hazard and its aftermath over reporting online (Vayansky et al. 2019). It also remains challenging to obtain complete situational awareness for disaster management due to the unpredictable nature of natural catastrophes (Maulana and Maharani 2021). Furthermore, there is a lack of studies in several critical research areas, such as the geographical disparity in the spatial distribution of flood research at the global and intercontinental scales (Zhang and Wang 2022) and how the public's attitudes change over time when disasters occur (Reynard and Shirgaokar 2019).

4.3 Study limitations

Although the search strategy was designed to be systematic and comprehensive, it has some limitations. First, a language bias was present because only English-language studies were included; studies in other languages with English abstracts were excluded. Second, the retrieval method may have missed studies labeled with other NLP-related terms. For instance, keywords such as latent Dirichlet allocation (LDA), a statistical model used in NLP, were not among the search criteria, so studies indexed only under such terms may not have been retrieved. Additionally, several studies appeared relevant from the title and abstract but, upon full-text reading, turned out to mention NLP as a solution to their problem without implementing any model.

Moreover, several studies applied NLP models to problems unrelated to natural hazards, such as infrastructure or transportation logistics, placing them outside the scope of this study. This review excluded dissertations, theses, books, reports, and working papers, as only peer-reviewed journal articles and conference papers were included; quantity of literature was traded for quality here. Finally, we considered recent studies published after 2018.

5 Conclusion

NLP techniques hold huge potential to process and analyze natural language content. By leveraging NLP, unstructured textual data can be converted into structured data for further analysis. In this study, Natural Language Processing algorithms have proven their ability to support the assessment of hurricane and flood events. The benefits of NLP in evaluating extreme weather events include adding social media platforms, newspapers, and other text corpora to hydrological datasets as data sources, broadening study locations and scales, and lowering research expenses. NLP has many different applications. First, it can be used in the data collection phase: news articles and social media data can be collected using NLP techniques to monitor and study extreme weather events, validate numerical models' predictions, identify potential trends, and explore future risk management strategies inspired by the hazards assessed. With its proper use, meteorologists can provide more accurate and timely warnings to the public, helping to reduce the risk of injury or death due to extreme weather events. In addition, NLP enables faster decision-making by automating data analysis and report generation, which can help decision-makers quickly access the information they need to make informed decisions and take appropriate action in a timely manner.

According to this systematic evaluation of the literature, further research is needed to advance the use of NLP for analyzing extreme weather occurrences. Information extraction, topic modeling, categorization, and clustering are all examples of NLP modeling techniques that have been tested and assessed. Although this new potential is promising, hydrologists and meteorologists should maintain reasonable expectations of what NLP can achieve and recognize its limitations. Future studies should focus on methods to overcome data inadequacy, limited accessibility, non-representativeness, immature NLP approaches, and computing constraints.


The authors acknowledge the financial support from the Port Authority of New York and New Jersey.

Author information

Authors and Affiliations

Department of Civil, Environmental, and Ocean Engineering, Stevens Institute of Technology, 1 Castle Point Terrace, Hoboken, NJ, 07030, USA

Achraf Tounsi & Marouane Temimi


Contributions

All authors contributed to the study conception and design, material preparation, data collection, and analysis. Achraf Tounsi wrote the first draft of the manuscript and all authors commented on previous versions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Achraf Tounsi .

Ethics declarations

Conflict of interest.

The authors declare that they have no known competing interests for this publication.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Tounsi, A., Temimi, M. A systematic review of natural language processing applications for hydrometeorological hazards assessment. Nat Hazards 116, 2819–2870 (2023). https://doi.org/10.1007/s11069-023-05842-0


Received: 20 October 2022

Accepted: 28 January 2023

Published: 08 February 2023

Issue Date: April 2023

DOI: https://doi.org/10.1007/s11069-023-05842-0


  • Natural language processing
  • Extreme weather events
  • Text mining
  • Disaster management
  • Open access
  • Published: 03 June 2021

A systematic review of natural language processing applied to radiology reports

  • Arlene Casey 1 ,
  • Emma Davidson 2 ,
  • Michael Poon 2 ,
  • Hang Dong 3 , 4 ,
  • Daniel Duma 1 ,
  • Andreas Grivas 5 ,
  • Claire Grover 5 ,
  • Víctor Suárez-Paniagua 3 , 4 ,
  • Richard Tobin 5 ,
  • William Whiteley 2 , 6 ,
  • Honghan Wu 4 , 7 &
  • Beatrice Alex 1 , 8  

BMC Medical Informatics and Decision Making volume 21, Article number: 179 (2021)


Natural language processing (NLP) has a significant role in advancing healthcare and has been found to be key in extracting structured information from radiology reports. Understanding recent developments in NLP application to radiology is of significance but recent reviews on this are limited. This study systematically assesses and quantifies recent literature in NLP applied to radiology reports.

We conduct an automated literature search yielding 4836 results using automated filtering, metadata enriching steps and citation search combined with manual review. Our analysis is based on 21 variables including radiology characteristics, NLP methodology, performance, study, and clinical application characteristics.

We present a comprehensive analysis of the 164 publications retrieved, with publications in 2019 almost triple those in 2015. Each publication is categorised into one of 6 clinical application categories. Deep learning use increases over the period, but conventional machine learning approaches are still prevalent. Deep learning remains challenged when data is scarce, and there is little evidence of adoption into clinical practice. Despite 17% of studies reporting F1 scores greater than 0.85, it is hard to comparatively evaluate these approaches given that most of them use different datasets. Only 14 studies made their data available, 15 made their code available, and 10 externally validated their results.

Conclusions

Automated understanding of clinical narratives of the radiology reports has the potential to enhance the healthcare process and we show that research in this field continues to grow. Reproducibility and explainability of models are important if the domain is to move applications into clinical use. More could be done to share code enabling validation of methods on different institutional data and to reduce heterogeneity in reporting of study properties allowing inter-study comparisons. Our results have significance for researchers in the field providing a systematic synthesis of existing work to build on, identify gaps, opportunities for collaboration and avoid duplication.

Peer Review reports

Medical imaging examinations interpreted by radiologists in the form of narrative reports are used to support and confirm diagnosis in clinical practice. Being able to accurately and quickly identify the information stored in radiologists’ narratives has the potential to reduce workloads, support clinicians in their decision processes, triage patients to get urgent care or identify patients for research purposes. However, whilst these reports are generally considered more restricted in vocabulary than other electronic health records (EHR), e.g. clinical notes, it is still difficult to access this efficiently at scale [ 1 ]. This is due to the unstructured nature of these reports and Natural Language Processing (NLP) is key to obtaining structured information from radiology reports [ 2 ].

Earlier reviews show NLP applied to radiology reports to be a growing field [ 2 , 3 ]. In recent years there has been even more extensive growth in NLP research in general, and in deep learning methods in particular, which is not captured in the earlier reviews. A more recent review of NLP applied to radiology-related research exists, but it focuses on a single NLP technique, deep learning models [ 4 ]. Our paper provides a more comprehensive review, comparing and contrasting all NLP methodologies as they are applied to radiology.

Understanding and synthesising recent developments specific to NLP in the radiology research field is significant, as this will help researchers gain a broader understanding of the field and provide insight into methods and techniques, supporting and promoting new developments. Therefore, we carry out a systematic review of research output on NLP applications in radiology from 2015 onward, allowing for a more up-to-date analysis of the area. An additional listing of our synthesis of publications detailing their clinical and technical categories can be found in Additional file 1, and per-publication properties can be found in Additional file 2. Also different from existing work, we look at both the clinical application areas NLP is being applied in and the trends in NLP methods. We describe and discuss study properties, e.g. data size, performance, annotation details, quantifying these in relation to both the clinical application areas and NLP methods. Having a more detailed understanding of these properties allows us to make recommendations for future NLP research applied to radiology datasets, supporting improvements and progress in this domain.

Related work

Amongst pre-existing reviews in this area, [ 2 ] was the first that was both specific to NLP on radiology reports and systematic in methodology. Their literature search identified 67 studies published in the period up to October 2014. They examined the NLP methods used, summarised their performance, and extracted the studies’ clinical applications, which they assigned to five broad categories delineating their purpose. Since Pons et al.’s paper, several reviews have emerged with the broader remit of NLP applied to electronic health data, which includes radiology reports. [ 5 ] conducted a systematic review of NLP systems with a specific focus on coding free text into clinical terminologies and structured data capture. The systematic review by [ 6 ] specifically examined machine learning approaches to NLP (2015–2019) in more general clinical text data, and a further methodical review was carried out by [ 7 ] to synthesise literature on deep learning in clinical NLP (up to April 2019), although they did not follow the PRISMA guideline completely. With radiology reports as their particular focus, [ 3 ] published, the same year as Pons et al.’s review, an instructive narrative review outlining the fundamentals of NLP techniques applied in radiology. More recently, [ 4 ] published a systematic review focused on deep learning radiology-related research. They identified 10 relevant papers in their search (up to September 2019) and examined their deep learning models, comparing these with traditional NLP models; they also considered clinical applications but did not employ a specific categorisation. We build on this corpus of related work, most specifically Pons et al.’s. In our initial synthesis of clinical applications we adopt their application categories and expand upon them to reflect the nature of subsequent literature captured in our work. Additionally, we quantify and compare properties of the studies reviewed and provide a series of recommendations for future NLP research applied to radiology datasets in order to promote improvements and progress in this domain.

Our methodology followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [ 8 ], and the protocol is registered on protocols.io.

Eligibility for literature inclusion and search strategy

We included studies using NLP on radiology reports of any imaging modality and anatomical region for NLP technical development, clinical support, or epidemiological research. Exclusion criteria were: (1) language not English; (2) wrong publication type (e.g., case reports, reviews, conference abstracts, comments, patents, or editorials); (3) published before 2015; (4) uses radiology images only (no NLP); (5) not radiology reports; (6) no NLP results; (7) year out of range; (8) duplicate, already in the list of publications retrieved; (9) not available in full text.

We used Publish or Perish [ 9 ], a citation retrieval and analysis software program, to search Google Scholar. Google Scholar has similar coverage to other databases [ 10 ] and is easier to integrate into search pipelines. We conducted an initial pilot search following the process described here, but the search terms were too specific and restricted the number of publications; for example, we experimented with terms specific to medical imaging such as CT and MRI. Thirty-seven papers were found during the pilot search, but the same papers also appeared in our final search. We used the following search query, restricted to research articles published in English between January 2015 and October 2019: (“radiology” OR “radiologist”) AND (“natural language” OR “text mining” OR “information extraction” OR “document classification” OR “word2vec”) NOT patent. We automated the addition of publication metadata and applied filtering to remove irrelevant publications. These automated steps are described in Tables 1 and 2.
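For illustration, a query string of this form can be assembled programmatically; a minimal sketch (the helper function is ours, and it uses straight quotes rather than the typographic quotes shown above):

```python
def build_query():
    """Assemble the Google Scholar boolean query used in the search."""
    population = ["radiology", "radiologist"]
    topics = ["natural language", "text mining", "information extraction",
              "document classification", "word2vec"]

    def alt(terms):
        # quote each term and join alternatives with OR
        return " OR ".join('"%s"' % t for t in terms)

    return "(%s) AND (%s) NOT patent" % (alt(population), alt(topics))
```

Keeping the term lists in code makes it straightforward to re-run or extend the search reproducibly.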

In addition to the query search, we conducted a citation search [ 15 ]. The citation search compiled a list of publications comprising those that cite the Pons et al. review and the articles cited in it. We then used a snowballing method [ 16 ] to follow the forward citation branch for each publication in this list, i.e. finding every article that cites the publications in our list. The branching factor here is large, so we filtered at every stage and automatically added metadata. One hundred and seventy-one papers were identified through the snowball citation search, of which 84 were among the final 164 papers.
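Forward snowballing amounts to a breadth-first traversal of the citation graph with filtering at each stage. A sketch under stated assumptions: `get_citing` and `keep` are hypothetical stand-ins for the citation lookup service and the metadata filters, not any real API.

```python
from collections import deque

def snowball_forward(seed_ids, get_citing, keep, max_rounds=2):
    """Forward-citation snowballing from a list of seed publications.

    get_citing(pub_id) -> iterable of ids of papers citing pub_id (hypothetical)
    keep(pub_id) -> bool, applying metadata filters (year, language, topic)
    """
    seen = set(seed_ids)
    frontier = deque(seed_ids)
    results = set()
    for _ in range(max_rounds):
        next_frontier = deque()
        while frontier:
            pid = frontier.popleft()
            for citing in get_citing(pid):
                if citing in seen:
                    continue            # avoid revisiting papers
                seen.add(citing)
                if keep(citing):        # filter at every stage
                    results.add(citing)
                    next_frontier.append(citing)
        frontier = next_frontier        # follow the next citation branch
    return results
```

Filtering inside the loop, rather than after it, is what keeps the large branching factor manageable.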

Manual review of literature

Four reviewers (three NLP researchers [AG, DD and HD] and one epidemiologist [MTCP]) independently screened all titles and abstracts with the Rayyan online platform and discussed disagreements. Fleiss’ kappa [ 17 ] agreement between reviewers was 0.70, indicating substantial agreement [ 18 ]. After this screening process, each full-text article was reviewed by a team of eight (six NLP researchers and two epidemiologists) and double reviewed by an NLP researcher. We resolved any discrepancies by discussion in regular meetings.
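Fleiss’ kappa generalises chance-corrected inter-rater agreement to any fixed number of raters. A self-contained sketch of the statistic (a pure-Python helper of our own, not the implementation used in the study):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters assigning item i to category j (equal raters per item)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    total = n_items * n_raters
    # chance agreement: sum of squared marginal category proportions
    p_j = [sum(row[j] for row in ratings) / total for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    # observed agreement: mean fraction of agreeing rater pairs per item
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; the 0.70 reported above falls in the conventional "substantial" band.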

Data extraction for analysis

We extracted data on: primary clinical application and technical objective, data source(s), study period, radiology report language, anatomical region, imaging modality, disease area, dataset size, annotated set size, training/validation/test set size, external validation performed, domain expert used, number of annotators, inter-annotator agreement, NLP technique(s) used, best-reported results (recall, precision and F1 score), availability of dataset, and availability of code.

The literature search yielded 4836 possibly relevant publications, from which our automated exclusion process removed 4402; during both of our screening processes, a further 270 were removed, leaving 164 publications. See Fig. 1 for details of exclusions at each step.

Figure 1: PRISMA diagram for search publication retrieval

General characteristics

2015 and 2016 saw similar numbers of publications retrieved (22 and 21 respectively), with the volume increasing almost three-fold in 2019 (55), noting that 2019 only covers 10 months (Fig. 2). Imaging modality (Table 3) varied considerably, and 46 studies used reports from multiple modalities. Of studies focusing on a single modality, the most featured were CT scans (38), followed by MRI (16), X-Ray (8), Mammogram (5) and Ultrasound (4). Forty-seven studies did not specify scan modality. For the study samples (Table 4), 33 papers specified that they used consecutive patient images, 38 used non-consecutive image sampling, and 93 did not clearly specify their sampling strategy. The anatomical regions for scans varied (Table 5), with mixed being the highest, followed by Thorax and Head/neck. Disease categories are presented in Table 6, with the largest being Oncology. The majority of reports were in English (141) and a small number in other languages, e.g., Chinese (5), Spanish (4), German (3) (Table 7). Additional file 2, CSV format, provides a breakdown of the information in Tables 3, 4, 5, 6 and 7 per publication.

Clinical application categories

In synthesising the literature, each publication was classified by its primary clinical purpose. Pons’ work in 2016 categorised publications into 5 broad categories: Diagnostic Surveillance, Cohort Building for Epidemiological Studies, Query-based Case Retrieval, Quality Assessment of Radiological Practice and Clinical Support Services. We found some changes to this categorisation schema, and our categorisation consisted of six categories: Diagnostic Surveillance, Disease Information and Classification, Quality Compliance, Cohort/Epidemiology, Language Discovery and Knowledge Structure, and Technical NLP. The main difference is that we found no evidence for a category of Clinical Support Services, which described applications integrated into the workflow to assist clinicians. Despite the increase in the number of publications, very few were in clinical use, with more focus on the category of Disease Information and Classification. We describe each clinical application area in more detail below and, where applicable, how our categories differ from the earlier findings. A listing of all publications and their corresponding clinical application and technical category can be found in Additional file 1, MS Word format, and in Additional file 2 in CSV format. Table 8 shows the clinical application category by technical classification, and Fig. 2 shows the breakdown of clinical application category by publication year. There were more publications in 2019 compared with 2015 for all categories except Language Discovery & Knowledge Structure, which fell by approximately 25% (Fig. 2).

Figure 2: Clinical application of publication by year

Diagnostic surveillance

A large proportion of studies in this category focused on extracting disease information for patient or disease surveillance, e.g. investigating tumour characteristics [ 19 , 20 ]; changes over time [ 21 ]; worsening/progression or improvement/response to treatment [ 22 , 23 ]; identifying correct anatomical labels [ 24 ]; and organ measurements and temporality [ 25 ]. Studies also investigated pairing measurements between reports [ 26 ] and linking reports to monitor changes by providing an integrated view of consecutive examinations [ 27 ]. Studies focused specifically on breast imaging findings investigated aspects such as BI-RADS MRI descriptors (shape, size, margin) and final assessment categories (benign, malignant, etc.), e.g. [ 28 , 29 , 30 , 31 , 32 , 33 ]. Other studies focused on tumour information, e.g. for liver [ 34 ] and hepatocellular carcinoma (HCC) [ 35 , 36 ], and one study on extracting information relevant for structuring subdural haematoma characteristics in reports [ 37 ].

Studies in this category also investigated incidental findings, including on lung imaging [ 38 , 39 , 40 ], with [ 38 ] additionally extracting the nodule size; for trauma patients [ 41 ]; and looking for silent brain infarction and white matter disease [ 42 ]. Other studies focused on prioritising/triaging reports, detecting follow-up recommendations and linking a follow-up exam to the initial recommendation report, or on bio-surveillance of infectious conditions such as invasive mould disease.

Disease information and classification

Disease Information and Classification publications use reports to identify information that may be aggregated according to classification systems. These publications focused solely on classifying a disease occurrence or extracting information about a disease, with no focus on the overall clinical application. This category was not found in Pons’ work. Methods considered a range of conditions, including intracranial haemorrhage [ 43 , 44 ], aneurysms [ 45 ], brain metastases [ 46 ], and ischaemic stroke [ 47 , 48 ], and several classified types and severity of conditions, e.g. [ 46 , 49 , 50 , 51 , 52 ]. Studies focused on breast imaging considered aspects such as predicting lesion malignancy from BI-RADS descriptors [ 53 ], breast cancer subtypes [ 54 ], and extracting or inferring BI-RADS categories, such as [ 55 , 56 ]. Two studies focused on abdominal images and hepatocellular carcinoma (HCC) staging and CLIP scoring. Chest imaging reports were used to detect pulmonary embolism, e.g. [ 57 , 58 , 59 ], bacterial pneumonia [ 60 ], and Lung-RADS categories [ 61 ]. Functional imaging was also included, such as echocardiograms, from which measurements were extracted to evaluate heart failure, including left ventricular ejection fractions (LVEF) [ 62 ]. Other studies investigated classification of fractures [ 63 , 64 ] and abnormalities [ 65 ] and the prediction of ICD codes from imaging reports [ 66 ].

Language discovery and knowledge structure

Language Discovery and Knowledge Structure publications investigate the structure of language in reports and how it might be optimised to facilitate decision support and communication. Pons et al. reported on applications of query-based retrieval, which has similarities to Language Discovery and Knowledge Structure but is not the same: their category contains studies that retrieve cases and conditions that are not predefined and that in some instances could be used for research or educational purposes. Our category is broader and encompasses papers that investigated different aspects of language, including variability, complexity, simplification, and normalisation, to support extraction and classification tasks.

Studies explored lexicon coverage and methods to support language simplification for patients, drawing on sources such as the consumer health vocabulary [ 67 ] and the French lexical network (JDM) [ 68 ]. Other works studied the variability and complexity of report language, comparing free-text with structured reports and comparing across radiologists. Also investigated was how ontologies and lexicons could be combined with other NLP methods to represent knowledge that can support clinicians. This work included improving report reading efficiency [ 69 ]; finding similar reports [ 70 ]; normalising phrases to support classification and extraction tasks, such as entity recognition in Spanish reports [ 71 ]; imputing semantic classes for labelling [ 72 ]; supporting search [ 73 ]; and discovering semantic relations [ 74 ].

Quality and compliance

Quality and Compliance publications use reports to assess the quality and safety of practice and reporting, similar to Pons’ category. Works considered how patient indications for scans adhered to guidance, e.g., [ 75 , 76 , 77 , 78 , 79 , 80 ], protocol selection [ 81 , 82 , 83 , 84 , 85 ], or the impact of guideline changes on practice, e.g., [ 86 ]. Also investigated was diagnostic utilisation and yield, based on clinicians or on patients, which can be useful for hospital planning and for clinicians to study their work patterns, e.g., [ 87 ]. Other studies in this category looked at specific aspects of quality, such as classification of long bone fractures to support quality improvement in paediatric medicine [ 88 ], automatic identification of reports that have critical findings for auditing purposes [ 89 ], deriving a query-based quality measure to compare structured and free-text report variability [ 90 ], and a method to fix errors in gender or laterality in a report [ 91 ].

Cohort and epidemiology

This category is similar to that in Pons’ earlier review, but we treated the studies differently, attempting to differentiate papers that described methods for creating cohorts for research purposes from those that also reported the outcomes of an epidemiological analysis. Ten studies used NLP to create specific cohorts for research purposes, and six reported the performance of their tools. Of these papers, the majority (n = 8) created cohorts for specific medical conditions, including fatty liver disease [ 92 , 93 ], hepatocellular cancer [ 94 ], ureteric stones [ 95 ], vertebral fracture [ 96 ], traumatic brain injury [ 97 , 98 ], and leptomeningeal disease secondary to metastatic breast cancer [ 99 ]. Five papers identified cohorts focused on particular radiology findings, including ground glass opacities (GGO) [ 100 ], cerebral microbleeds (CMB) [ 101 ], pulmonary nodules [ 102 , 103 ], changes in the spine correlated with back pain [ 1 ], and radiological evidence of people having suffered a fall. One paper focused on identifying abnormalities of specific anatomical regions of the ear within an audiology imaging database [ 104 ], and another aimed to create a cohort of people with any rare disease within existing ontologies (the Orphanet Rare Disease Ontology, ORDO, and the Radiology Gamuts Ontology, RGO). Lastly, one paper took a different approach, screening reports to create a cohort of people with contraindications for MRI, seeking to prevent iatrogenic events [ 105 ]. Amongst the epidemiology studies there were various analytical aims, but they primarily focused on estimating the prevalence or incidence of conditions or imaging findings and looking for associations of these conditions/findings with specific population demographics, associated factors, or comorbidities.
The focus of one study differed in that it applied NLP to healthcare evaluation, investigating the association of palliative care consultations and measures of high-quality end-of-life (EOL) care [ 99 ].

Technical NLP

This category is for publications with a primary technical aim that is not focused on a radiology report outcome, e.g., detecting negation in reports, spelling correction [ 106 ], fact checking [ 107 , 108 ], methods for sample selection, and crowd-sourced annotation [ 109 ]. This category did not occur in Pons’ earlier review.

NLP methods in use

NLP methods capture the different techniques an author applied, broken down into rules, machine learning methods, deep learning, ontologies, lexicons, and word embeddings. We discriminate machine learning from deep learning, using the former to represent traditional machine learning methods.

Over half of the studies applied only one type of NLP method, and just over a quarter compared or combined methods in hybrid approaches. The remaining studies either used a bespoke proprietary system or focused on building ontologies or similarity measures (Fig. 3). Rule-based method use remains almost constant across the period, whereas use of machine learning decreases and use of deep learning rises, from five publications in 2017 to twenty-four publications in 2019 (Fig. 4).

Fig. 3 NLP method breakdown

Fig. 4 NLP method by year

A variety of machine learning classifier algorithms were used, with SVM and logistic regression the most common (Table 9). Recurrent neural network (RNN) variants were the most common type of deep learning architecture. RNN methods were split between long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), bi-directional gated recurrent unit (Bi-GRU), and standard RNN approaches. Four of these studies additionally added a conditional random field (CRF) for the final label generation step. Convolutional neural networks (CNN) were the second most common architecture explored. Eight studies additionally used an attention mechanism as part of their deep learning architecture. Other neural approaches included feed-forward neural networks, fully connected neural networks, the proprietary neural system IBM Watson [ 82 ], and Snorkel [ 110 ]. Several studies proposed combined architectures, e.g., [ 31 , 111 ].

NLP method features

Most rule-based and machine learning approaches used features based on bag-of-words, part-of-speech, term frequency, and phrases, with only two studies using word embeddings instead. Three studies used feature engineering with deep learning rather than word embeddings. Thirty-three studies used domain knowledge to support building features for their methods, such as developing lexicons or selecting terms and phrases. Comparison of embedding methods is difficult as many studies did not describe their embedding method. Of those that did, Word2Vec [ 112 ] was the most popular (n = 19), followed by GloVe embeddings [ 113 ] (n = 6), FastText [ 114 ] (n = 3), ELMo [ 115 ] (n = 1), and BERT [ 116 ] (n = 1). Ontologies or lexicon look-ups were used in 100 studies; however, even though publications increased over the period in real terms, 20% fewer studies employed ontologies or lexicons in 2019 compared to 2015. The most widely used resources were UMLS [ 117 ] (n = 15), RadLex [ 118 ] (n = 20), and SNOMED-CT [ 119 ] (n = 14). Most studies used these as features for normalising words and phrases for classification, but this was mainly those using rule-based or machine learning classifiers, with only six studies using ontologies as input to their deep learning architecture. Three of those investigated how existing ontologies can be combined with word embeddings to create domain-specific mappings, with authors pointing to this avoiding the need for large amounts of annotated data. Other approaches looked to extend existing medical resources using a frequent-phrases approach, e.g., [ 120 ]. Works also used the derived concepts and relations, visualising these to support activities such as report reading and report querying (e.g., [ 121 , 122 ]).
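The count-based features mentioned above can be illustrated with a minimal bag-of-words sketch. This is a toy example with invented report snippets; a real pipeline would add proper tokenisation, stop-word handling, and weighting such as TF-IDF.

```python
from collections import Counter

def bag_of_words(reports):
    """Build a shared vocabulary and per-report count vectors."""
    tokenised = [r.lower().split() for r in reports]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        # One count per vocabulary entry, in a fixed (sorted) order.
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

reports = [
    "no acute intracranial haemorrhage",
    "acute haemorrhage in the left frontal lobe",
]
vocab, vectors = bag_of_words(reports)
print(vocab)
print(vectors[0])
```

Vectors produced this way feed directly into traditional classifiers such as SVMs or logistic regression, in contrast to the dense embedding inputs used by the deep learning studies.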

Annotation and inter-annotator agreement

Eighty-nine studies used at least two annotators, 75 did not specify any annotation details, and only one study used a single annotator. Whilst 69 studies used a domain expert (a clinician or radiologist) for annotation, only 56 reported the inter-annotator agreement. Some studies mentioned annotation but did not report on agreement or annotators. Inter-annotator agreement values for kappa ranged from 0.43 to perfect agreement at 1. Whilst most studies reported agreement using Cohen’s kappa [ 123 ], some reported precision or percent agreement. Studies reported annotation data sizes differently, e.g., at the sentence or patient level. Some studies also considered ground truth labels from coding schemes, such as ICD or BI-RADS categories, as annotated data. Of the studies that detailed human annotation at the radiology report level, only 45 specified the inter-annotator agreement and/or the number of annotators. Annotated report numbers for these studies varied: 15 papers annotated fewer than 500 reports, 12 from 500 to under 1000, 15 from 1000 to under 3000, and 3 between 4000 and 8,288. Additional file 2 gives all annotation size information on a per-publication basis in CSV format.
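Cohen’s kappa corrects the observed agreement between two annotators for the agreement expected by chance from each annotator’s label distribution. A minimal sketch, using invented labels for ten reports:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical positive/negative labels for a finding in ten reports:
a = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
b = ["pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos", "neg", "pos"]
print(round(cohens_kappa(a, b), 2))
```

Here the raw agreement is 80%, but kappa is substantially lower because much of that agreement would be expected by chance, which is why kappa rather than percent agreement is the preferred report.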

Data sources and availability

Only 14 studies reported that their data is available, and 15 that their code is available. Most studies sourced their data from medical institutions; a number did not specify where their data came from; and some used publicly available datasets: MIMIC-III (n = 5), MIMIC-II (n = 1), MIMIC-CXR (n = 1), RadCore (n = 5), or STRIDE (n = 2). Four studies combined radiology reports with other electronic health record data, such as clinical notes or pathology reports.

Reporting of total data size differed across studies, with some giving percentages rather than exact sizes and others reporting numbers of sentences, reports, patients, or a mixture of these. Where an author was not clear on the type or size of the data reported, we marked this as unspecified. Thirteen studies did not report total data size. Data size summaries are given for studies reporting at the radiology report level, n = 135 or 82.32% of the studies (Table 10). The biggest variation in data size by NLP method is in studies that apply other methods or are rule-based. Machine learning also varies in size; however, the median value is lower than for rule-based methods. The median value for deep learning is considerably higher, at 5000 reports, than for machine learning or for studies that compare or create hybrid methods. Of the studies reporting radiology report numbers, 39.3% used over 10,000 reports, increasing to over 48% using more than 5000 reports. However, a small number of studies, 14%, used comparatively few radiology reports, fewer than 500 (Table 11).
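Where studies do report their splits, the underlying procedure is usually a shuffled partition of report identifiers into train, validation, and held-out test sets. A minimal sketch with hypothetical proportions (70/10/20) and a fixed seed for reproducibility:

```python
import random

def split_reports(report_ids, train=0.7, validation=0.1, seed=0):
    """Shuffle report identifiers and partition into train/validation/test."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * validation)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train_set, val_set, test_set = split_reports(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

Note that when several reports belong to one patient, splitting at the patient level rather than the report level avoids leaking a patient’s language style between train and test, which is one reason report-versus-patient reporting matters.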

NLP performance and evaluation measures

Performance metrics applied for evaluation of methods varied widely, with authors using precision (positive predictive value, PPV), recall (sensitivity), specificity, the area under the curve (AUC), or accuracy. We observed a wide variety in the evaluation methodology employed for test and validation datasets. Different approaches were taken to generating splits for testing and validation, including k-fold cross-validation. Table 12 summarises the number of studies reporting total data size and splits across train, validation, test, and annotation sets. This table covers all data types, i.e., reports, sentences, patients, or mixed. Eighty-two studies reported both training and test data splits, of which only 38 included a validation set. Only 10 studies validated their algorithm using an external dataset from another institution, another modality, or a different patient population. Additional file 2 gives all data size information on a per-publication basis in CSV format. The most widely used metrics for reporting performance were precision (PPV) and recall (sensitivity), reported in 47% of studies. However, even though many studies compared methods and reported the top-performing one, very few carried out significance testing on these comparisons. Issues of heterogeneity make it difficult and unrealistic to compare performance between the methods applied; hence, we use summary measures as a broad overview (Fig. 5). Reported performance varies, but both the mean and median F1 scores appear higher for methods using rule-based only or deep learning only approaches. Whilst differences between F1 scores for application areas are less discernible, Diagnostic Surveillance looks on average lower than the other categories.
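The metrics named above all derive from the binary confusion matrix. A small worked example, with invented counts for a classifier flagging reports positive for a finding:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision (PPV), recall (sensitivity), specificity, and F1
    from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, specificity, f1

# Hypothetical counts: 100 truly positive reports, 900 truly negative.
p, r, s, f1 = classification_metrics(tp=80, fp=10, fn=20, tn=890)
print(round(p, 3), round(r, 3), round(s, 3), round(f1, 3))
```

The example also shows why reporting a single number is insufficient: with a 9:1 negative-to-positive imbalance, accuracy here would be 97%, masking the fact that one in five positive reports is missed.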

Fig. 5 Application category and NLP method, mean and median summaries. The mean value is indicated by a vertical bar, the box shows error bars, and the asterisk is the median value

Discussion and future directions

Our work shows a considerable increase in the number of publications using NLP on radiology reports in recent years: we retrieved 164 publications, compared to the 67 retrieved in the earlier review of Pons et al. [ 2 ]. In this section we discuss and offer some insight into the observations and trends in how NLP is being applied to radiology, and make some recommendations that may benefit the field going forward.

Clinical applications and NLP methods in radiology

The clinical applications of the publications are similar to those in the earlier review of Pons et al., but whilst we observe an increase in research output, we also highlight that there appears to be even less focus on clinical application than in their review. Like many other fields applying NLP, the use of deep learning has increased, with RNN architectures the most popular; this is also observed in a review of NLP on clinical text [ 7 ]. However, although deep learning use has increased, rules and traditional machine classifiers are still prevalent and are often used as baselines against which to compare deep learning architectures. One reason traditional methods remain popular is their interpretability compared to deep learning models. Understanding the features that drive a model’s prediction can support decision-making in the clinical domain, but the complex layers of non-linear data transformations that deep learning is composed of do not easily support transparency [ 124 ]. This may also help explain why, in synthesising the literature, we observed less focus on discussing clinical application and more emphasis on the disease classification or information task alone. Advances in the interpretability of deep learning models are critical to their adoption in clinical practice.

Other challenges exist for deep learning, such as only having access to small or imbalanced datasets. Chen et al. [ 125 ] review deep learning methods within healthcare and point to these challenges resulting in poor performance, while the same datasets can perform well with traditional machine learning methods. Several studies highlighted this: when data was scarce or datasets imbalanced, they introduced hybrid approaches of rules and deep learning to improve performance, particularly in the Diagnostic Surveillance category. Yang et al. [ 126 ] observed rules performing better for some entity types, such as time and size, which are proportionally less frequent than other entities in their train and test sets; hence they combined a bidirectional LSTM and CRF with rules for entity recognition. Peng et al. [ 19 ] comment that rules and the neural architecture complement each other, with deep learning more balanced between precision and recall, but the rule-based method having higher precision and lower recall. The authors reason that this provides better performance, as rules can capture rare disease cases, particularly when multi-class labelling is needed, whilst deep learning architectures perform worse on instances with fewer data points.
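A common shape for the hybrid systems described above is precedence: a small set of high-precision rules labels the rare, explicitly worded cases, and a learned classifier handles everything else. The sketch below is illustrative only; the patterns, labels, and fallback classifier are invented for the example.

```python
import re

# Hypothetical high-precision rules for explicitly worded findings.
RULES = [
    (re.compile(r"\bno evidence of (pulmonary embolism|pe)\b"), "negative"),
    (re.compile(r"\b(pulmonary embolism|pe) (is|was) identified\b"), "positive"),
]

def hybrid_label(report, classifier):
    """A rule label takes precedence; otherwise defer to the learned model."""
    text = report.lower()
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return classifier(text)

# Stand-in for a trained statistical model.
fallback = lambda text: "positive" if "embol" in text else "negative"

print(hybrid_label("No evidence of pulmonary embolism.", fallback))
print(hybrid_label("Subsegmental embolic disease suspected.", fallback))
```

The precedence ordering reflects the trade-off the cited studies describe: rules contribute precision on patterns too rare for the model to learn, while the classifier supplies recall on everything the rules do not cover.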

In addition to its need for large-scale data, deep learning can be computationally costly. The use of pre-trained models and embeddings may alleviate some of this burden: pre-trained models often require only fine-tuning, which reduces computation cost, and language comprehension learned on other tasks is inherited from the parent model, meaning fewer domain-specific labelled examples may be needed [ 127 ]. The use of pre-trained information also supports generalisability; e.g., [ 58 ] show that their model trained on one dataset can generalise to other institutional datasets.

Embedding use has increased, as expected with the application of deep learning approaches, but many rule-based and machine classifier studies continue to use traditional count-based features, e.g., bag-of-words and n-grams. Recent evidence [ 128 ] suggests that continuing to use feature engineering with traditional machine learning methods does produce better performance on radiology reports than using domain-specific word embeddings.

Banerjee et al. [ 44 ] found little difference between a uni-gram approach and a Word2Vec embedding, hypothesising this was due to their narrow domain, intracranial haemorrhage. However, the NLP research field has seen a move towards bidirectional encoder representations from transformers (BERT) based embedding models that is not reflected in our analysis, with only one study using BERT-generated embeddings [ 46 ]. Embeddings from BERT are thought to be superior as they can deliver better contextual representations and result in improved task performance. Whilst more publications since our review period have used BERT-based embeddings with radiology reports, e.g., [ 127 , 129 ], not all outperform traditional methods [ 130 ]. Recent evidence shows that embeddings generated by BERT fail to show a generalisable understanding of negation [ 131 ], an essential factor in interpreting radiology reports effectively. Specialised BERT models have been introduced, such as ClinicalBERT [ 132 ] and BlueBERT [ 129 ]. BlueBERT has been shown to outperform ClinicalBERT on chest radiology [ 133 ], but more exploration of the performance gains versus the benefits of generalisability is needed for radiology text.

All NLP models have in common that they need large amounts of labelled data for model training [ 134 ]. Several studies [ 135 , 136 , 137 ] explored combining word embeddings and ontologies to create domain-specific mappings, and they suggest this can avoid a need for large amounts of annotated data. Additionally, [ 135 , 136 ] highlight that such combinations could boost coverage and performance compared to more conventional techniques for concept normalisation.

The number of publications using medical lexical knowledge resources is still relatively low, even though a recent trend in the general NLP field is to enhance deep learning with external knowledge [ 138 ]. This was also observed by [ 7 ], where only 18% of the deep learning studies in their review utilised knowledge resources. Although pre-training supports learning previously known facts, it could introduce unwanted bias, hindering performance. The inclusion of domain expertise through resources such as medical lexical knowledge may help reduce this unwanted bias [ 7 ]. Future exploration of how this domain expertise can be incorporated into deep learning architectures could improve performance when less labelled data is available.

Task knowledge

Knowledge about the disease area of interest, and how aspects of this disease are linguistically expressed, is useful and could promote better-performing solutions. Whilst [ 139 ] find high variability between radiologists, with metric values (e.g., numbers of syntactic and clinical terms based on ontology mapping) significantly greater for free-text than structured reports, [ 140 ], who look specifically at anatomical areas, find less evidence for variability. Zech et al. [ 141 ] suggest that the highly specialised nature of each imaging modality creates different sub-languages, and the ability to discover labels (i.e., disease mentions) reflects the consistency with which they are referred to: edema, for example, is referred to very consistently, whereas other labels, such as infarction/ischaemic, are not. Understanding the language and the context of entity mentions could help promote novel ideas on how to solve problems more effectively. For example, [ 35 ] discuss how the accuracy of predicting malignancy is affected by cues falling outside their window of consideration, and [ 142 ] observe problems of co-reference resolution within a report due to long-range dependencies. Both these studies use traditional NLP approaches, but we observed novel neural architectures being proposed to improve performance in similar tasks, specifically capturing long-range context and dependency learning, e.g., [ 31 , 111 ]. This understanding requires close cooperation between healthcare professionals and data scientists, which differs from some other fields where more disconnection is present [ 125 ].

Study heterogeneity, a need for reporting standards

Most studies reviewed could be described as proofs-of-concept and were not trialled in a clinical setting. Pons et al. [ 2 ] hypothesised that a lack of clinical application may stem from uncertainty around minimal performance requirements hampering implementations, from evidence-based practice requiring justification and transparency of decisions, and from the inability to compare to human performance, as human agreement is often unknown. These hypotheses are still valid, and we see little evidence that these problems have been solved.

Human annotation is generally considered the gold standard for measuring human performance, and whilst many studies reported that they used annotated data, reporting was inconsistent overall. Steps were undertaken to measure inter-annotator agreement (IAA), but in many studies this was not directly comparable to the evaluation of the NLP methods. The size of the data used to draw experimental conclusions is important, and accurate reporting of these measures is essential to ensure reproducibility and comparison in further studies. Reporting of training, test, and validation splits varied, with some studies not giving details and not using held-out validation sets.

Most studies use retrospective data from single institutions, but this can lead to a model over-fitting and thus not generalising well when applied in a new setting. Overcoming the problem of data availability is challenging due to privacy and ethics concerns, but essential to ensure that the performance of models can be investigated across institutions, modalities, and methods. Availability of data would allow agreed benchmarks to be developed within the field against which algorithm improvements can be measured. External validation of applied methods was extremely low, although this is likely due to the availability of external datasets. Making code available would enable researchers to report how external systems perform on their data; however, only 15 studies reported that their code is available. To compare systems, common datasets are needed to benchmark against.

Whilst reported precision and recall figures generally look high, more evidence is needed for accurate comparison to human performance. A wide variety of performance measures were used, with some studies reporting only one measure, e.g., accuracy or F1 score, likely representing the best performance obtained. Individual studies are often not directly comparable on such measures, but nonetheless clarity and consistency in reporting are desirable. Many studies making model comparisons did not carry out any significance testing on these comparisons.
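When two labelling systems are evaluated on the same test reports, a paired test such as McNemar’s is one simple option for the missing significance testing; it needs only the two off-diagonal counts of the paired confusion table. The sketch below uses invented disagreement counts; libraries such as statsmodels also provide this test directly.

```python
from math import erf, sqrt

def mcnemar(b, c):
    """McNemar's test with continuity correction for paired comparisons.

    b: items system A got right and system B got wrong.
    c: items system B got right and system A got wrong.
    Assumes b + c > 0. Returns the two-sided p-value of the
    chi-squared(1) statistic.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared(1) via the standard normal CDF.
    z = sqrt(chi2)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Hypothetical disagreements between a rule-based and a deep learning
# labeller on a shared test set:
print(round(mcnemar(b=12, c=30), 4))
```

Because the test conditions on the items where the two systems disagree, it is applicable even when the headline metrics (F1, accuracy) are computed on the full, identical test set.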

Progressing NLP in radiology

The value of NLP applied to radiology is clear: it can support clinicians in their decision-making, reduce workload, add value through automated coding of data, find missed diagnoses for triage, and monitor quality. However, in recent years the focus has been on labelling disease phenotypes or extracting disease information from reports, rather than on real-world clinical application of NLP within radiology. We believe this is mainly due to the difficulties of accessing data for research purposes. More support is needed to bring clinicians and NLP experts together to promote innovative thinking about how such work can benefit, and be trialled in, the clinical environment. The challenges in doing so are significant because of the need to work within safe environments that protect patient privacy. In terms of NLP methods, we observe that the general trends of NLP are applied within this research area, but we would emphasise that, as NLP moves further towards deep learning, it is particularly important in healthcare to consider how these methods can satisfy explainability. Explainability in artificial intelligence and NLP has become a hot topic in general, and it is now also being addressed in the healthcare sector [ 143 , 144 ]. Methodology is also impacted by data availability, with uncommon diseases often hard to predict with deep learning because data is scarce. If the practical and methodological challenges of data access, privacy, and less data-demanding approaches can be met, there is much potential to increase the value of NLP within radiology. The sharing of tools, practice, and expertise could also ease its real-world application.

To help move the field forward, enable more inter-study comparisons, and increase study reproducibility we make the following recommendations for research studies:

Clarity in reporting study properties is required: (a) data characteristics, including the size and type of dataset, should be detailed, e.g., the number of reports, sentences, or patients and, if patients, how many reports per patient; the training, test, and validation data split should be evident, as should the source of the data. (b) Annotation characteristics, including the methodology used to develop the annotation, should be reported, e.g., annotation set size and annotator details (how many, their expertise). (c) Performance reporting should include a range of metrics (precision, recall, F1, accuracy), not just one overall value.

Significance testing should be carried out when a comparison between methods is made.

Data and code availability are encouraged. While making data available will often be challenging due to privacy concerns, researchers should make code available to enable inter-study comparisons and external validation of methods.

Common datasets should be used to benchmark and compare systems.

Limitations of study

Publication search is subject to bias in search methods, and our search strategy inevitably missed some publications. Whilst we tried to be precise and objective during our review process, some of the data collection and the assignment of publications to categories was difficult to agree on and remained subjective; for example, many publications could have belonged to more than one category. One reason for this was how diverse in structure the content was, in some ways reflecting the different domains the papers were published in. It is also possible that certain keywords were missed in recording data elements due to the reviewers’ own biases and research experience.

Conclusion

This paper presents a systematic review of publications using NLP on radiology reports during the period 2015 to October 2019. We show there has been substantial growth in the field, particularly among researchers using deep learning methods. Whilst deep learning use has increased, as seen in NLP research in general, it faces challenges of lower performance when data is scarce or labelled data is unavailable, and it is not widely used in clinical practice, perhaps due to the difficulty of interpreting such models. Traditional machine learning and rule-based methods are, therefore, still widely used. Incorporating domain expertise such as medical lexical knowledge should be explored further to enhance performance when data is scarce. The clinical domain faces challenges of privacy and ethics in sharing data, but overcoming these would enable the development of benchmarks to measure algorithm performance and test model robustness across institutions. Commonly agreed datasets against which to compare the performance of tools would help support the community in inter-study comparisons and validation of systems. The work we present here has the potential to inform researchers about applications of NLP to radiology and to lead to more reliable and responsible research in the domain.

Availability of data and materials

All data generated or analysed during this study are included in this published article [and its supplementary information files].

Abbreviations

NLP: natural language processing

ICD: international classification of diseases

BI-RADS: Breast Imaging-Reporting and Data System

IAA: inter-annotator agreement

UMLS: unified medical language system

ELMo: embeddings from language models

BERT: bidirectional encoder representations from transformers

SVM: support vector machine

CNN: convolutional neural network

LSTM: long short-term memory

Bi-LSTM: bi-directional long short-term memory

Bi-GRU: bi-directional gated recurrent unit

CRF: conditional random field

GloVe: Global Vectors for Word Representation

References

Bates J, Fodeh SJ, Brandt CA, Womack JA. Classification of radiology reports for falls in an HIV study cohort. J Am Med Inform Assoc. 2016;23(e1):113–7. https://doi.org/10.1093/jamia/ocv155 .


Pons E, Braun LMM, Hunink MGM, Kors JA. Natural language processing in radiology: a systematic review. Radiology. 2016;279(2):329–43. https://doi.org/10.1148/radiol.16142770 .


Cai T, Giannopoulos AA, Yu S, Kelil T, Ripley B, Kumamaru KK, Rybicki FJ, Mitsouras D. Natural language processing technologies in radiology research and clinical applications. RadioGraphics. 2016;36(1):176–91. https://doi.org/10.1148/rg.2016150080 .


Sorin V, Barash Y, Konen E, Klang E. Deep learning for natural language processing in radiology-fundamentals and a systematic review. J Am Coll Radiol. 2020;17(5):639–48. https://doi.org/10.1016/j.jacr.2019.12.026 .

Kreimeyer K, Foster M, Pandey A, Arya N, Halford G, Jones SF, Forshee R, Walderhaug M, Botsis T. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. 2017;73:14–29. https://doi.org/10.1016/j.jbi.2017.07.012 .

Spasic I, Nenadic G. Clinical text data in machine learning: systematic review. JMIR Med Inform. 2020;8(3):17984. https://doi.org/10.2196/17984 .

Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, Soni S, Wang Q, Wei Q, Xiang Y, Zhao B, Xu H. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc. 2020;27(3):457–70. https://doi.org/10.1093/jamia/ocz200 .

Moher D, Shamseer L, Clarke M, Ghersi D, Liberati A, Petticrew M, Shekelle P, Stewart LA. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Syst Rev. 2015;4(1):1. https://doi.org/10.1186/2046-4053-4-1 .

Harzing AW. Publish or Perish. 2007. https://harzing.com/resources/publish-or-perish . Accessed 1 Nov 2019.

Gehanno J-F, Rollin L, Darmoni S. Is the coverage of Google Scholar enough to be used alone for systematic reviews? BMC Med Inform Decis Mak. 2013;13:7. https://doi.org/10.1186/1472-6947-13-7 .

Wilkinson LJ. REST API. Crossref. https://www.crossref.org/education/retrieve-metadata/rest-api/ . Accessed 26 Jan 2020.

Allen Institute for AI. Semantic Scholar | AI-powered research tool. https://api.semanticscholar.org/ . Accessed 26 Jan 2021.

Cornell University. arXiv.org e-Print archive. https://arxiv.org/ . Accessed 26 Jan 2021.

Bearden E. LibGuides: Unpaywall: home. https://library.lasalle.edu/c.php?g=982604&p=7105436 . Accessed 26 Jan 2021.

Briscoe S, Bethel A, Rogers M. Conduct and reporting of citation searching in Cochrane systematic reviews: a cross-sectional study. Res Synth Methods. 2020;11(2):169–80. https://doi.org/10.1002/jrsm.1355 .

Wohlin C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: Proceedings of the 18th international conference on evaluation and assessment in software engineering (EASE ’14), London, England, UK. New York: Association for Computing Machinery; 2014. https://doi.org/10.1145/2601248.2601268 .

Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378–82. https://doi.org/10.1037/h0031619 .

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. https://doi.org/10.2307/2529310 .

Peng Y, Yan K, Sandfort V, Summers RM, Lu Z. A self-attention based deep learning method for lesion attribute detection from CT reports. In: 2019 IEEE international conference on healthcare informatics (ICHI), pp. 1–5. IEEE Computer Society, Xi’an, China (2019). https://doi.org/10.1109/ICHI.2019.8904668 .

Bozkurt S, Alkim E, Banerjee I, Rubin DL. Automated detection of measurements and their descriptors in radiology reports using a hybrid natural language processing algorithm. J Digit Imaging. 2019;32(4):544–53. https://doi.org/10.1007/s10278-019-00237-9 .

Hassanpour S, Bay G, Langlotz CP. Characterization of change and significance for clinical findings in radiology reports through natural language processing. J Digit Imaging. 2017;30(3):314–22. https://doi.org/10.1007/s10278-016-9931-8 .

Kehl KL, Elmarakeby H, Nishino M, Van Allen EM, Lepisto EM, Hassett MJ, Johnson BE, Schrag D. Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA Oncol. 2019;5(10):1421–9. https://doi.org/10.1001/jamaoncol.2019.1800 .

Chen P-H, Zafar H, Galperin-Aizenberg M, Cook T. Integrating natural language processing and machine learning algorithms to categorize oncologic response in radiology reports. J Digit Imaging. 2018;31(2):178–84. https://doi.org/10.1007/s10278-017-0027-x .

Cotik V, Rodríguez H, Vivaldi J. Spanish named entity recognition in the biomedical domain. In: Lossio-Ventura JA, Muñante D, Alatrista-Salas H, editors. Information management and big data. Communications in computer and information science, vol. 898. Lima: Springer; 2018. p. 233–48. https://doi.org/10.1007/978-3-030-11680-4_23 .

Sevenster M, Buurman J, Liu P, Peters JF, Chang PJ. Natural language processing techniques for extracting and categorizing finding measurements in narrative radiology reports. Appl Clin Inform. 2015;6(3):600–10. https://doi.org/10.4338/ACI-2014-11-RA-0110 .

Sevenster M, Bozeman J, Cowhy A, Trost W. A natural language processing pipeline for pairing measurements uniquely across free-text CT reports. J Biomed Inform. 2015;53:36–48. https://doi.org/10.1016/j.jbi.2014.08.015 .

Oberkampf H, Zillner S, Overton JA, Bauer B, Cavallaro A, Uder M, Hammon M. Semantic representation of reported measurements in radiology. BMC Med Inform Decis Mak. 2016;16(1):5. https://doi.org/10.1186/s12911-016-0248-9 .

Liu Y, Zhu L-N, Liu Q, Han C, Zhang X-D, Wang X-Y. Automatic extraction of imaging observation and assessment categories from breast magnetic resonance imaging reports with natural language processing. Chin Med J. 2019;132(14):1673–80. https://doi.org/10.1097/CM9.0000000000000301 .

Gupta A, Banerjee I, Rubin DL. Automatic information extraction from unstructured mammography reports using distributed semantics. J Biomed Inform. 2018;78:78–86. https://doi.org/10.1016/j.jbi.2017.12.016 .

Castro SM, Tseytlin E, Medvedeva O, Mitchell K, Visweswaran S, Bekhuis T, Jacobson RS. Automated annotation and classification of BI-RADS assessment from radiology reports. J Biomed Inform. 2017;69:177–87. https://doi.org/10.1016/j.jbi.2017.04.011 .

Short RG, Bralich J, Bogaty D, Befera NT. Comprehensive word-level classification of screening mammography reports using a neural network sequence labeling approach. J Digit Imaging. 2019;32(5):685–92. https://doi.org/10.1007/s10278-018-0141-4 .

Lacson R, Goodrich ME, Harris K, Brawarsky P, Haas JS. Assessing inaccuracies in automated information extraction of breast imaging findings. J Digit Imaging. 2017;30(2):228–33. https://doi.org/10.1007/s10278-016-9927-4 .

Lacson R, Harris K, Brawarsky P, Tosteson TD, Onega T, Tosteson ANA, Kaye A, Gonzalez I, Birdwell R, Haas JS. Evaluation of an automated information extraction tool for imaging data elements to populate a breast cancer screening registry. J Digit Imaging. 2015;28(5):567–75. https://doi.org/10.1007/s10278-014-9762-4 .

Yim W-W, Kwan SW, Yetisgen M. Tumor reference resolution and characteristic extraction in radiology reports for liver cancer stage prediction. J Biomed Inform. 2016;64:179–91. https://doi.org/10.1016/j.jbi.2016.10.005 .

Yim W-W, Kwan SW, Yetisgen M. Classifying tumor event attributes in radiology reports. J Assoc Inform Sci Technol. 2017;68(11):2662–74. https://doi.org/10.1002/asi.23937 .

Yim W, Denman T, Kwan SW, Yetisgen M. Tumor information extraction in radiology reports for hepatocellular carcinoma patients. AMIA Summits Transl Sci Proc. 2016;2016:455–64.

Pruitt P, Naidech A, Van Ornam J, Borczuk P, Thompson W. A natural language processing algorithm to extract characteristics of subdural hematoma from head CT reports. Emerg Radiol. 2019;26(3):301–6. https://doi.org/10.1007/s10140-019-01673-4 .

Farjah F, Halgrim S, Buist DSM, Gould MK, Zeliadt SB, Loggers ET, Carrell DS. An automated method for identifying individuals with a lung nodule can be feasibly implemented across health systems. eGEMs. 2016;4(1):1254. https://doi.org/10.13063/2327-9214.1254 .

Karunakaran B, Misra D, Marshall K, Mathrawala D, Kethireddy S. Closing the loop – finding lung cancer patients using NLP. In: 2017 IEEE international conference on big data (big data), pp. 2452–61. IEEE, Boston, MA (2017). https://doi.org/10.1109/BigData.2017.8258203 .

Tan WK, Hassanpour S, Heagerty PJ, Rundell SD, Suri P, Huhdanpaa HT, James K, Carrell DS, Langlotz CP, Organ NL, Meier EN, Sherman KJ, Kallmes DF, Luetmer PH, Griffith B, Nerenz DR, Jarvik JG. Comparison of natural language processing rules-based and machine-learning systems to identify lumbar spine imaging findings related to low back pain. Acad Radiol. 2018;25(11):1422–32. https://doi.org/10.1016/j.acra.2018.03.008 .

Trivedi G, Hong C, Dadashzadeh ER, Handzel RM, Hochheiser H, Visweswaran S. Identifying incidental findings from radiology reports of trauma patients: an evaluation of automated feature representation methods. Int J Med Inform. 2019;129:81–7. https://doi.org/10.1016/j.ijmedinf.2019.05.021 .

Fu S, Leung LY, Wang Y, Raulli A-O, Kallmes DF, Kinsman KA, Nelson KB, Clark MS, Luetmer PH, Kingsbury PR, Kent DM, Liu H. Natural language processing for the identification of silent brain infarcts from neuroimaging reports. JMIR Med Inform. 2019;7(2):e12109. https://doi.org/10.2196/12109 .

Jnawali K, Arbabshirani MR, Ulloa AE, Rao N, Patel AA. Automatic classification of radiological report for intracranial hemorrhage. In: 2019 IEEE 13th international conference on semantic computing (ICSC), pp. 187–90. IEEE, Newport Beach, CA, USA (2019). https://doi.org/10.1109/ICOSC.2019.8665578 .

Banerjee I, Madhavan S, Goldman RE, Rubin DL. Intelligent Word embeddings of free-text radiology reports. In: AMIA annual symposium proceedings, pp. 411–20 (2017). Accessed 30 Oct 2020.

Kłos M, Żyłkowski J, Spinczyk D. Automatic classification of text documents presenting radiology examinations. In: Pietka E, Badura P, Kawa J, Wieclawek W, editors. Proceedings 6th international conference information technology in biomedicine (ITIB’2018). Advances in intelligent systems and computing, pp. 495–505. Springer (2018). https://doi.org/10.1007/978-3-319-91211-0_43 .

Deshmukh N, Gumustop S, Gauriau R, Buch V, Wright B, Bridge C, Naidu R, Andriole K, Bizzo B. Semi-supervised natural language approach for fine-grained classification of medical reports. arXiv:1910.13573 [cs.LG] (2019). Accessed 30 Oct 2020.

Kim C, Zhu V, Obeid J, Lenert L. Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke. PLoS ONE. 2019;14(2):e0212778. https://doi.org/10.1371/journal.pone.0212778 .

Garg R, Oh E, Naidech A, Kording K, Prabhakaran S. Automating ischemic stroke subtype classification using machine learning and natural language processing. J Stroke Cerebrovasc Dis. 2019;28(7):2045–51. https://doi.org/10.1016/j.jstrokecerebrovasdis.2019.02.004 .

Shin B, Chokshi FH, Lee T, Choi JD. Classification of radiology reports using neural attention models. In: 2017 international joint conference on neural networks (IJCNN), pp. 4363–70. IEEE, Anchorage, AK (2017). https://doi.org/10.1109/IJCNN.2017.7966408 .

Wheater E, Mair G, Sudlow C, Alex B, Grover C, Whiteley W. A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records. BMC Med Inform Decis Mak. 2019;19(1):184. https://doi.org/10.1186/s12911-019-0908-7 .

Gorinski PJ, Wu H, Grover C, Tobin R, Talbot C, Whalley H, Sudlow C, Whiteley W, Alex B. Named entity recognition for electronic health records: a comparison of rule-based and machine learning approaches. arXiv:1903.03985 [cs.CL] (2019). Accessed 30 Oct 2020.

Alex B, Grover C, Tobin R, Sudlow C, Mair G, Whiteley W. Text mining brain imaging reports. J Biomed Semant. 2019;10(1):23. https://doi.org/10.1186/s13326-019-0211-7 .

Bozkurt S, Gimenez F, Burnside ES, Gulkesen KH, Rubin DL. Using automatically extracted information from mammography reports for decision-support. J Biomed Inform. 2016;62:224–31. https://doi.org/10.1016/j.jbi.2016.07.001 .

Patel TA, Puppala M, Ogunti RO, Ensor JE, He T, Shewale JB, Ankerst DP, Kaklamani VG, Rodriguez AA, Wong STC, Chang JC. Correlating mammographic and pathologic findings in clinical decision support using natural language processing and data mining methods. Cancer. 2017;123(1):114–21. https://doi.org/10.1002/cncr.30245 .

Banerjee I, Bozkurt S, Alkim E, Sagreiya H, Kurian AW, Rubin DL. Automatic inference of BI-RADS final assessment categories from narrative mammography report findings. J Biomed Inform. 2019. https://doi.org/10.1016/j.jbi.2019.103137 .

Miao S, Xu T, Wu Y, Xie H, Wang J, Jing S, Zhang Y, Zhang X, Yang Y, Zhang X, Shan T, Wang L, Xu H, Wang S, Liu Y. Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches. Int J Med Inform. 2018;119:17–21. https://doi.org/10.1016/j.ijmedinf.2018.08.009 .

Dunne RM, Ip IK, Abbett S, Gershanik EF, Raja AS, Hunsaker A, Khorasani R. Effect of evidence-based clinical decision support on the use and yield of CT pulmonary angiographic imaging in hospitalized patients. Radiology. 2015;276(1):167–74. https://doi.org/10.1148/radiol.15141208 .

Banerjee I, Ling Y, Chen MC, Hasan SA, Langlotz CP, Moradzadeh N, Chapman B, Amrhein T, Mong D, Rubin DL, Farri O, Lungren MP. Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification. Artif Intell Med. 2019;97:79–88. https://doi.org/10.1016/j.artmed.2018.11.004 .

Chen MC, Ball RL, Yang L, Moradzadeh N, Chapman BE, Larson DB, Langlotz CP, Amrhein TJ, Lungren MP. Deep learning to classify radiology free-text reports. Radiology. 2017;286(3):845–52. https://doi.org/10.1148/radiol.2017171115 .

Meystre S, Gouripeddi R, Tieder J, Simmons J, Srivastava R, Shah S. Enhancing comparative effectiveness research with automated pediatric pneumonia detection in a multi-institutional clinical repository: a PHIS+ pilot study. J Med Internet Res. 2017;19(5):e162. https://doi.org/10.2196/jmir.6887 .

Beyer SE, McKee BJ, Regis SM, McKee AB, Flacke S, El Saadawi G, Wald C. Automatic Lung-RADS™ classification with a natural language processing system. J Thorac Dis. 2017;9(9):3114–22. https://doi.org/10.21037/jtd.2017.08.13 .

Patterson OV, Freiberg MS, Skanderson M, Fodeh SJ, Brandt CA, DuVall SL. Unlocking echocardiogram measurements for heart disease research through natural language processing. BMC Cardiovasc Disord. 2017;17(1):151. https://doi.org/10.1186/s12872-017-0580-8 .

Lee C, Kim Y, Kim YS, Jang J. Automatic disease annotation from radiology reports using artificial intelligence implemented by a recurrent neural network. Am J Roentgenol. 2019;212(4):734–40. https://doi.org/10.2214/AJR.18.19869 .

Fiebeck J, Laser H, Winther HB, Gerbel S. Leaving no stone unturned: using machine learning based approaches for information extraction from full texts of a research data warehouse. In: Auer S, Vidal M-E, editors. 13th international conference data integration in the life sciences (DILS 2018). Lecture Notes in Computer Science, pp. 50–8. Springer, Hannover, Germany (2018). https://doi.org/10.1007/978-3-030-06016-9_5 .

Hassanzadeh H, Kholghi M, Nguyen A, Chu K. Clinical document classification using labeled and unlabeled data across hospitals. In: AMIA annual symposium proceedings 2018, pp. 545–54 (2018). Accessed 30 Oct 2020.

Krishnan GS, Kamath SS. Ontology-driven text feature modeling for disease prediction using unstructured radiological notes. Comput Sist. 2019. https://doi.org/10.13053/cys-23-3-3238 .

Qenam B, Kim TY, Carroll MJ, Hogarth M. Text simplification using consumer health vocabulary to generate patient-centered radiology reporting: translation and evaluation. J Med Internet Res. 2017;19(12):e417. https://doi.org/10.2196/jmir.8536 .

Lafourcade M, Ramadier L. Radiological text simplification using a general knowledge base. In: 18th international conference on computational linguistics and intelligent text processing (CICLing 2017). CICLing 2017. Budapest, Hungary (2017). https://doi.org/10.1007/978-3-319-77116-8_46 .

Hong Y, Zhang J. Investigation of terminology coverage in radiology reporting templates and free-text reports. Int J Knowl Content Dev Technol. 2015;5:5–14. https://doi.org/10.5865/IJKCT.2015.5.1.005 .

Comelli A, Agnello L, Vitabile S. An ontology-based retrieval system for mammographic reports. In: 2015 IEEE symposium on computers and communication (ISCC), pp. 1001–6. IEEE, Larnaca (2015). https://doi.org/10.1109/ISCC.2015.7405644 .

Cotik V, Filippo D, Castano J. An approach for automatic classification of radiology reports in Spanish. Stud Health Technol Inform. 2015;216:634–8.

Johnson E, Baughman WC, Ozsoyoglu G. A method for imputation of semantic class in diagnostic radiology text. In: 2015 IEEE international conference on bioinformatics and biomedicine (BIBM), pp. 750–5. IEEE, Washington, DC (2015). https://doi.org/10.1109/BIBM.2015.7359780 .

Mujjiga S, Krishna V, Chakravarthi KJV. Identifying semantics in clinical reports using neural machine translation. In: Proceedings of the AAAI conference on artificial intelligence, vol. 33(01), pp. 9552–7 (2019). https://doi.org/10.1609/aaai.v33i01.33019552 . Accessed 30 Oct 2020.

Lafourcade M, Ramadier L. Semantic relation extraction with semantic patterns: experiment on radiology report. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). LREC 2016 proceedings. european language resources association (ELRA), Portorož, Slovenia (2016). https://hal.archives-ouvertes.fr/hal-01382320 .

Shelmerdine SC, Singh M, Norman W, Jones R, Sebire NJ, Arthurs OJ. Automated data extraction and report analysis in computer-aided radiology audit: practice implications from post-mortem paediatric imaging. Clin Radiol. 2019;74(9):733.e11–733.e18. https://doi.org/10.1016/j.crad.2019.04.021 .

Mabotuwana T, Hombal V, Dalal S, Hall CS, Gunn M. Determining adherence to follow-up imaging recommendations. J Am Coll Radiol. 2018;15(3, Part A):422–8. https://doi.org/10.1016/j.jacr.2017.11.022 .

Dalal S, Hombal V, Weng W-H, Mankovich G, Mabotuwana T, Hall CS, Fuller J, Lehnert BE, Gunn ML. Determining follow-up imaging study using radiology reports. J Digit Imaging. 2020;33(1):121–30. https://doi.org/10.1007/s10278-019-00260-w .

Bobbin MD, Ip IK, Sahni VA, Shinagare AB, Khorasani R. Focal cystic pancreatic lesion follow-up recommendations after publication of ACR white paper on managing incidental findings. J Am Coll Radiol. 2017;14(6):757–64. https://doi.org/10.1016/j.jacr.2017.01.044 .

Kwan JL, Yermak D, Markell L, Paul NS, Shojania KJ, Cram P. Follow up of incidental high-risk pulmonary nodules on computed tomography pulmonary angiography at care transitions. J Hosp Med. 2019;14(6):349–52. https://doi.org/10.12788/jhm.3128 .

Mabotuwana T, Hall CS, Tieder J, Gunn ML. Improving quality of follow-up imaging recommendations in radiology. In: AMIA annual symposium proceedings, vol. 2017, pp. 1196–204 (2018). Accessed 30 Oct 2020.

Brown AD, Marotta TR. A natural language processing-based model to automate MRI brain protocol selection and prioritization. Acad Radiol. 2017;24(2):160–6. https://doi.org/10.1016/j.acra.2016.09.013 .

Trivedi H, Mesterhazy J, Laguna B, Vu T, Sohn JH. Automatic determination of the need for intravenous contrast in musculoskeletal MRI examinations using IBM Watson’s natural language processing algorithm. J Digit Imaging. 2018;31(2):245–51. https://doi.org/10.1007/s10278-017-0021-3 .

Zhang AY, Lam SSW, Liu N, Pang Y, Chan LL, Tang PH. Development of a radiology decision support system for the classification of MRI brain scans. In: 2018 IEEE/ACM 5th international conference on big data computing applications and technologies (BDCAT), pp. 107–15 (2018). https://doi.org/10.1109/BDCAT.2018.00021 .

Brown AD, Marotta TR. Using machine learning for sequence-level automated MRI protocol selection in neuroradiology. J Am Med Inform Assoc. 2018;25(5):568–71. https://doi.org/10.1093/jamia/ocx125 .

Yan Z, Ip IK, Raja AS, Gupta A, Kosowsky JM, Khorasani R. Yield of CT pulmonary angiography in the emergency department when providers override evidence-based clinical decision support. Radiology. 2016;282(3):717–25. https://doi.org/10.1148/radiol.2016151985 .

Kang SK, Garry K, Chung R, Moore WH, Iturrate E, Swartz JL, Kim DC, Horwitz LI, Blecker S. Natural language processing for identification of incidental pulmonary nodules in radiology reports. J Am Coll Radiol. 2019;16(11):1587–94. https://doi.org/10.1016/j.jacr.2019.04.026 .

Brown AD, Kachura JR. Natural language processing of radiology reports in patients with hepatocellular carcinoma to predict radiology resource utilization. J Am Coll Radiol. 2019;16(6):840–4. https://doi.org/10.1016/j.jacr.2018.12.004 .

Grundmeier RW, Masino AJ, Casper TC, Dean JM, Bell J, Enriquez R, Deakyne S, Chamberlain JM, Alpern ER. Identification of long bone fractures in radiology reports using natural language processing to support healthcare quality improvement. Appl Clin Inform. 2016;7(4):1051–68. https://doi.org/10.4338/ACI-2016-08-RA-0129 .

Heilbrun ME, Chapman BE, Narasimhan E, Patel N, Mowery D. Feasibility of natural language processing-assisted auditing of critical findings in chest radiology. J Am Coll Radiol. 2019;16(9, Part B):1299–304. https://doi.org/10.1016/j.jacr.2019.05.038 .

Maros ME, Wenz R, Förster A, Froelich MF, Groden C, Sommer WH, Schönberg SO, Henzler T, Wenz H. Objective comparison using guideline-based query of conventional radiological reports and structured reports. In Vivo. 2018;32(4):843–9. https://doi.org/10.21873/invivo.11318 .

Minn MJ, Zandieh AR, Filice RW. Improving radiology report quality by rapidly notifying radiologist of report errors. J Digit Imaging. 2015;28(4):492–8. https://doi.org/10.1007/s10278-015-9781-9 .

Goldshtein I, Chodick G, Kochba I, Gal N, Webb M, Shibolet O. Identification and characterization of nonalcoholic fatty liver disease. Clin Gastroenterol Hepatol. 2020;18(8):1887–9. https://doi.org/10.1016/j.cgh.2019.08.007 .

Redman JS, Natarajan Y, Hou JK, Wang J, Hanif M, Feng H, Kramer JR, Desiderio R, Xu H, El-Serag HB, Kanwal F. Accurate identification of fatty liver disease in data warehouse utilizing natural language processing. Dig Dis Sci. 2017;62(10):2713–8. https://doi.org/10.1007/s10620-017-4721-9 .

Sada Y, Hou J, Richardson P, El-Serag H, Davila J. Validation of case finding algorithms for hepatocellular cancer from administrative data and electronic health records using natural language processing. Med Care. 2016;54(2):9–14. https://doi.org/10.1097/MLR.0b013e3182a30373 .

Li AY, Elliot N. Natural language processing to identify ureteric stones in radiology reports. J Med Imaging Radiat Oncol. 2019;63(3):307–10. https://doi.org/10.1111/1754-9485.12861 .

Tan WK, Heagerty PJ. Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data. arXiv:1904.00412 [stat.ME] (2019). Accessed 30 Oct 2020.

Yadav K, Sarioglu E, Choi H-A, Cartwright WB, Hinds PS, Chamberlain JM. Automated outcome classification of computed tomography imaging reports for pediatric traumatic brain injury. Acad Emerg Med. 2016;23(2):171–8. https://doi.org/10.1111/acem.12859 .

Mahan M, Rafter D, Casey H, Engelking M, Abdallah T, Truwit C, Oswood M, Samadani U. tbiExtractor: a framework for extracting traumatic brain injury common data elements from radiology reports. bioRxiv 585331 (2019). https://doi.org/10.1101/585331 . Accessed 05 Dec 2020.

Brizzi K, Zupanc SN, Udelsman BV, Tulsky JA, Wright AA, Poort H, Lindvall C. Natural language processing to assess palliative care and end-of-life process measures in patients with breast cancer with leptomeningeal disease. Am J Hosp Palliat Med. 2019;37(5):371–6. https://doi.org/10.1177/1049909119885585 .

Van Haren RM, Correa AM, Sepesi B, Rice DC, Hofstetter WL, Mehran RJ, Vaporciyan AA, Walsh GL, Roth JA, Swisher SG, Antonoff MB. Ground glass lesions on chest imaging: evaluation of reported incidence in cancer patients using natural language processing. Ann Thorac Surg. 2019;107(3):936–40. https://doi.org/10.1016/j.athoracsur.2018.09.016 .

Noorbakhsh-Sabet N, Tsivgoulis G, Shahjouei S, Hu Y, Goyal N, Alexandrov AV, Zand R. Racial difference in cerebral microbleed burden among a patient population in the mid-south United States. J Stroke Cerebrovasc Dis. 2018;27(10):2657–61. https://doi.org/10.1016/j.jstrokecerebrovasdis.2018.05.031 .

Gould MK, Tang T, Liu I-LA, Lee J, Zheng C, Danforth KN, Kosco AE, Di Fiore JL, Suh DE. Recent trends in the identification of incidental pulmonary nodules. Am J Respir Crit Care Med. 2015;192(10):1208–14. https://doi.org/10.1164/rccm.201505-0990OC .

Huhdanpaa HT, Tan WK, Rundell SD, Suri P, Chokshi FH, Comstock BA, Heagerty PJ, James KT, Avins AL, Nedeljkovic SS, Nerenz DR, Kallmes DF, Luetmer PH, Sherman KJ, Organ NL, Griffith B, Langlotz CP, Carrell D, Hassanpour S, Jarvik JG. Using natural language processing of free-text radiology reports to identify type 1 modic endplate changes. J Digit Imaging. 2018;31(1):84–90. https://doi.org/10.1007/s10278-017-0013-3 .

Masino AJ, Grundmeier RW, Pennington JW, Germiller JA, Crenshaw EB. Temporal bone radiology report classification using open source machine learning and natural language processing libraries. BMC Med Inform Decis Mak. 2016;16(1):65. https://doi.org/10.1186/s12911-016-0306-3 .

Valtchinov VI, Lacson R, Wang A, Khorasani R. Comparing artificial intelligence approaches to retrieve clinical reports documenting implantable devices posing MRI safety risks. J Am Coll Radiol. 2020;17(2):272–9. https://doi.org/10.1016/j.jacr.2019.07.018 .

Zech J, Forde J, Titano JJ, Kaji D, Costa A, Oermann EK. Detecting insertion, substitution, and deletion errors in radiology reports using neural sequence-to-sequence models. Ann Transl Med. 2019. https://doi.org/10.21037/atm.2018.08.11 .

Zhang Y, Merck D, Tsai EB, Manning CD, Langlotz CP. Optimizing the factual correctness of a summary: a study of summarizing radiology reports. arXiv:1911.02541 [cs.CL] (2019). Accessed 30 Oct 2020.

Steinkamp JM, Chambers C, Lalevic D, Zafar HM, Cook TS. Toward complete structured information extraction from radiology reports using machine learning. J Digit Imaging. 2019;32(4):554–64. https://doi.org/10.1007/s10278-019-00234-y .

Cocos A, Qian T, Callison-Burch C, Masino AJ. Crowd control: effectively utilizing unscreened crowd workers for biomedical data annotation. J Biomed Inform. 2017;69:86–92. https://doi.org/10.1016/j.jbi.2017.04.003 .

Ratner A, Hancock B, Dunnmon J, Goldman R, Ré C. Snorkel MeTaL: weak supervision for multi-task learning. In: Proceedings of the second workshop on data management for end-to-end machine learning (DEEM’18), vol. 3, pp. 1–4. ACM, Houston, TX, USA (2018). https://doi.org/10.1145/3209889.3209898 . Accessed 30 Oct 2020.

Zhu H, Paschalidis IC, Hall C, Tahmasebi A. Context-driven concept annotation in radiology reports: anatomical phrase labeling. In: AMIA summits on translational science proceedings, vol. 2019, pp. 232–41 (2019). Accessed 30 Oct 2020.

Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space (2013). http://arxiv.org/abs/1301.3781 . Accessed 7 Feb 2021.

Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–43 (2014).

Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A. Advances in pre-training distributed word representations. In: Proceedings of the international conference on language resources and evaluation (LREC 2018) (2018).

Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. arXiv:1802.05365 (2018).

Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018).

National Library of Medicine. Unified Medical Language System. https://www.nlm.nih.gov/research/umls/index.html . Accessed 7 Feb 2021.

RSNA. RadLex. http://radlex.org/ . Accessed 7 Feb 2021.

National Library of Medicine. SNOMED CT. https://www.nlm.nih.gov/healthit/snomedct/index.html . Accessed 7 Feb 2021.

Bulu H, Sippo DA, Lee JM, Burnside ES, Rubin DL. Proposing new RadLex terms by analyzing free-text mammography reports. J Digit Imaging. 2018;31(5):596–603. https://doi.org/10.1007/s10278-018-0064-0 .

Hassanpour S, Langlotz CP. Unsupervised topic modeling in a large free text radiology report repository. J Digit Imaging. 2016;29(1):59–62. https://doi.org/10.1007/s10278-015-9823-3 .

Zhao Y, Fesharaki NJ, Liu H, Luo J. Using data-driven sublanguage pattern mining to induce knowledge models: application in medical image reports knowledge representation. BMC Med Inform Decis Mak. 2018;18(1):61. https://doi.org/10.1186/s12911-018-0645-3 .

Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46. https://doi.org/10.1177/001316446002000104 .

Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2018;22(5):1589–604. https://doi.org/10.1109/JBHI.2017.2767063 .

Chen D, Liu S, Kingsbury P, Sohn S, Storlie CB, Habermann EB, Naessens JM, Larson DW, Liu H. Deep learning and alternative learning strategies for retrospective real-world clinical data. npj Digit Med. 2019;2(1):1–5. https://doi.org/10.1038/s41746-019-0122-0 .

Yang H, Li L, Yang R, Zhou Y. Towards automated knowledge discovery of hepatocellular carcinoma: extract patient information from Chinese clinical reports. In: Proceedings of the 2nd international conference on medical and health informatics. ICMHI ’18, pp. 111–6. ACM, New York, NY, USA (2018). https://doi.org/10.1145/3239438.3239445 . Accessed 30 Oct 2020.

Wood DA, Lynch J, Kafiabadi S, Guilhem E, Busaidi AA, Montvila A, Varsavsky T, Siddiqui J, Gadapa N, Townend M, Kiik M, Patel K, Barker G, Ourselin S, Cole JH, Booth TC. Automated labelling using an attention model for radiology reports of MRI scans (ALARM). arXiv:2002.06588 [cs.CV] (2020). Accessed 03 Dec 2020.

Ong CJ, Orfanoudaki A, Zhang R, Caprasse FPM, Hutch M, Ma L, Fard D, Balogun O, Miller MI, Minnig M, Saglam H, Prescott B, Greer DM, Smirnakis S, Bertsimas D. Machine learning and natural language processing methods to identify ischemic stroke, acuity and location from radiology reports. PLoS ONE. 2020;15(6):e0234908. https://doi.org/10.1371/journal.pone.0234908 .

Smit A, Jain S, Rajpurkar P, Pareek A, Ng A, Lungren M. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp. 1500–19. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.117 . https://www.aclweb.org/anthology/2020.emnlp-main.117 . Accessed 03 Dec 2020.

Grivas A, Alex B, Grover C, Tobin R, Whiteley W. Not a cute stroke: analysis of rule- and neural network-based information extraction systems for brain radiology reports. In: Proceedings of the 11th international workshop on health text mining and information analysis (2020).

Ettinger A. What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Trans Assoc Comput Linguist. 2020;8:34–48. https://doi.org/10.1162/tacl_a_00298 .

Alsentzer E, Murphy J, Boag W, Weng W-H, Jindi D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd clinical natural language processing workshop, pp. 72–8. Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019). https://doi.org/10.18653/v1/W19-1909 . https://www.aclweb.org/anthology/W19-1909 .

Smit A, Jain S, Rajpurkar P, Pareek A, Ng AY, Lungren MP. CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv:2004.09167 (2020).

Yasaka K, Abe O. Deep learning and artificial intelligence in radiology: current applications and future directions. PLoS Med. 2018;15(11):e1002707. https://doi.org/10.1371/journal.pmed.1002707 .

Percha B, Zhang Y, Bozkurt S, Rubin D, Altman RB, Langlotz CP. Expanding a radiology lexicon using contextual patterns in radiology reports. J Am Med Inform Assoc. 2018;25(6):679–85. https://doi.org/10.1093/jamia/ocx152 .

Tahmasebi AM, Zhu H, Mankovich G, Prinsen P, Klassen P, Pilato S, van Ommering R, Patel P, Gunn ML, Chang P. Automatic normalization of anatomical phrases in radiology reports using unsupervised learning. J Digit Imaging. 2019;32(1):6–18. https://doi.org/10.1007/s10278-018-0116-5 .

Banerjee I, Chen MC, Lungren MP, Rubin DL. Radiology report annotation using intelligent word embeddings: applied to multi-institutional chest CT cohort. J Biomed Inform. 2018;77:11–20. https://doi.org/10.1016/j.jbi.2017.11.012 .

Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing [review article]. IEEE Comput Intell Mag. 2018;13(3):55–75. https://doi.org/10.1109/MCI.2018.2840738 .

Donnelly LF, Grzeszczuk R, Guimaraes CV, Zhang W, Bisset GS III. Using a natural language processing and machine learning algorithm program to analyze inter-radiologist report style variation and compare variation between radiologists when using highly structured versus more free text reporting. Curr Probl Diagn Radiol. 2019;48(6):524–30. https://doi.org/10.1067/j.cpradiol.2018.09.005 .

Xie Z, Yang Y, Wang M, Li M, Huang H, Zheng D, Shu R, Ling T. Introducing information extraction to radiology information systems to improve the efficiency on reading reports. Methods Inf Med. 2019;58(2–03):94–106. https://doi.org/10.1055/s-0039-1694992 .

Zech J, Pain M, Titano J, Badgeley M, Schefflein J, Su A, Costa A, Bederson J, Lehar J, Oermann EK. Natural language-based machine learning models for the annotation of clinical radiology reports. Radiology. 2018;287(2):570–80. https://doi.org/10.1148/radiol.2018171093 .

Yim W, Kwan SW, Johnson G, Yetisgen M. Classification of hepatocellular carcinoma stages from free-text clinical and radiology reports. In: AMIA annual symposium proceedings, vol. 2017, pp. 1858–67 (2018). Accessed 30 Oct 2020.

Payrovnaziri SN, Chen Z, Rengifo-Moreno P, Miller T, Bian J, Chen JH, Liu X, He Z. Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc. 2020;27(7):1173–85. https://doi.org/10.1093/jamia/ocaa053 .

Dong H, Suárez-Paniagua V, Whiteley W, Wu H. Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation. J Biomed Inform. 2021. https://doi.org/10.1016/j.jbi.2021.103728 .


Acknowledgements

Not applicable.

Funding

This research was supported by the Alan Turing Institute, MRC, HDR-UK and the Chief Scientist Office. B.A., A.C., D.D., A.G. and C.G. have been supported by the Alan Turing Institute via Turing Fellowships (B.A., C.G.) and Turing project funding (EPSRC Grant EP/N510129/1). A.G. was also funded by an MRC Mental Health Data Pathfinder Award (MRC-MCPC17209). H.W. is an MRC/Rutherford Fellow, HDR UK (MR/S004149/1). H.D. is supported by the HDR UK National Phenomics Resource Project. V.S-P. is supported by the HDR UK National Text Analytics Implementation Project. W.W. is supported by a Scottish Senior Clinical Fellowship (CAF/17/01).

Author information

Authors and Affiliations

School of Literatures, Languages and Cultures (LLC), University of Edinburgh, Edinburgh, Scotland

Arlene Casey, Daniel Duma & Beatrice Alex

Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, Scotland

Emma Davidson, Michael Poon & William Whiteley

Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, Scotland

Hang Dong & Víctor Suárez-Paniagua

Health Data Research UK, London, UK

Hang Dong, Víctor Suárez-Paniagua & Honghan Wu

Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, Edinburgh, Scotland

Andreas Grivas, Claire Grover & Richard Tobin

Nuffield Department of Population Health, University of Oxford, Oxford, UK

William Whiteley

Institute of Health Informatics, University College London, London, UK

Edinburgh Futures Institute, University of Edinburgh, Edinburgh, Scotland

Beatrice Alex


Contributions

B.A., W.W. and H.W. conceptualised this study. D.D. carried out the search including automated filtering and designing meta-enriching steps. BA, AG, CG and RT advised on the automatic data collection method devised by DD. M.T.C.P, A.G., H.D. and D.D carried out the first stage review and A.C., E.D., V.S-P, M.T.C.P, A.G., H.D., B.A. and D.D. carried out the second-stage review. A.C. synthesised the data and wrote the main manuscript with contributions from all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Arlene Casey .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Publication list with application and technical categories.

Additional file 2.

Individual properties for every publication.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Casey, A., Davidson, E., Poon, M. et al. A systematic review of natural language processing applied to radiology reports. BMC Med Inform Decis Mak 21 , 179 (2021). https://doi.org/10.1186/s12911-021-01533-7


Received : 09 February 2021

Accepted : 17 May 2021

Published : 03 June 2021

DOI : https://doi.org/10.1186/s12911-021-01533-7


  • Natural language processing
  • Systematic review

BMC Medical Informatics and Decision Making

ISSN: 1472-6947


  • Review Article
  • Open access
  • Published: 21 December 2022

A survey on clinical natural language processing in the United Kingdom from 2007 to 2022

  • Honghan Wu   ORCID: orcid.org/0000-0002-0213-5668 1 ,
  • Minhong Wang 1 ,
  • Jinge Wu 1 , 2 ,
  • Farah Francis   ORCID: orcid.org/0000-0002-7979-6296 2 ,
  • Yun-Hsuan Chang 1 ,
  • Alex Shavick 3 ,
  • Hang Dong 2 , 4 ,
  • Michael T. C. Poon 2 ,
  • Natalie Fitzpatrick 1 ,
  • Adam P. Levine   ORCID: orcid.org/0000-0003-1333-9938 3 ,
  • Luke T. Slater 5 ,
  • Alex Handy 1 , 6 ,
  • Andreas Karwath 5 ,
  • Georgios V. Gkoutos   ORCID: orcid.org/0000-0002-2061-091X 5 ,
  • Claude Chelala 7 ,
  • Anoop Dinesh Shah   ORCID: orcid.org/0000-0002-8907-5724 1 ,
  • Robert Stewart   ORCID: orcid.org/0000-0002-4435-6397 8 , 9 ,
  • Nigel Collier 10 ,
  • Beatrice Alex 11 ,
  • William Whiteley   ORCID: orcid.org/0000-0002-4816-8991 2 ,
  • Cathie Sudlow 2 ,
  • Angus Roberts   ORCID: orcid.org/0000-0002-4570-9801 12 &
  • Richard J. B. Dobson   ORCID: orcid.org/0000-0003-4224-9245 1 , 12  

npj Digital Medicine, volume 5, Article number: 186 (2022)


  • Computational science
  • Translational research

Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects ( n  = 94; total funding £41.97 m) funded by UK funders or the European Union’s funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, that clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019–2022 was 80 times that of 2007–2010. However, effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and to enable deployments in real-world settings, so that clinical NLP’s great potential in care delivery can be realised. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.


Introduction

Free-text components of Electronic Health Records (EHRs) contain much of the valuable information that is essential to facilitate tailored care and personalised treatments for patients 1 , 2 , 3 . Much of this information is either unavailable in, or more comprehensive than, the structured components of EHRs 4 , 5 . Data such as signs or symptoms of disease, adverse drug reactions, lifestyle (e.g. smoking, alcohol consumption and living arrangements), family medical history, or key information describing disease subtypes are recorded with greater frequency and depth in free-text data 6 , 7 , 8 . To interrogate free text and unlock deep phenotypic data for research and care, Natural Language Processing (NLP) approaches 2 , 3 , 4 , 6 , 7 , 8 have been adopted to automate the extraction of such information at scale. Like any NLP task, clinical NLP must tackle the challenge of devising computer programmes that understand human spoken or written language, which constitutes some of the most challenging problems faced by artificial intelligence (AI). For those implementing or using clinical NLP, there are additional complications and challenges, which, on the flip side, are new opportunities for research and development.

Clinical NLP often encounters challenges with insufficient data for both supervised and unsupervised machine learning (ML). This ‘low-resource’ setting can be considered in three contexts. First, labelled data for supervised models are scarce and ‘expensive’, and are difficult to scale: annotators require medical expertise to evaluate clinical information and generate ground truth. Disagreements are prevalent and long-standing among clinical experts 9 , 10 , 11 . The annotation process often requires multiple clinician annotators, with senior clinicians, who often have other clinical commitments, adjudicating disagreements. Second, clinical NLP tasks are very likely to deal with highly imbalanced data, which is widely perceived as challenging for ML algorithms 12 . For example, an NLP study examining radiology reports of brain scans 13 reported the most frequent phenotype as Ischaemic Stroke ( n  = 2706 or 11.6%) and the least frequent as Meningioma Tumour ( n  = 10 or 0.4%). The third ‘low resource’ is computational: NLP systems often require capable computational environments, with software such as Python, libraries and open-source repositories, and hardware including graphics processing units. It is technically challenging to set up these computational requirements in trusted research environments (TREs), such as those within hospital networks, where clinical data is securely accessible.
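Class imbalance of the kind described above is commonly handled by weighting classes inversely to their frequency. The sketch below uses hypothetical counts that mirror the brain-scan example (the label names and the size of the ‘other’ class are illustrative assumptions, not the study’s dataset):

```python
from collections import Counter

# Hypothetical phenotype labels mirroring the imbalance described above.
labels = (["ischaemic_stroke"] * 2706
          + ["meningioma_tumour"] * 10
          + ["other"] * 20565)

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency ("balanced") class weights, as popularised by
# scikit-learn's class_weight="balanced": w_c = n / (k * count_c).
weights = {c: n / (k * counts[c]) for c in counts}

for c, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{c}: weight {w:.1f}")
```

The rarest phenotype receives a weight orders of magnitude larger than the majority class, so a classifier’s loss penalises mistakes on it correspondingly more.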

Clinical NLP is also knowledge-intensive: it needs to incorporate formalised knowledge that computers can understand. Domain knowledge has been shown to be important for understanding biomedical texts, such as in interpreting linguistic structures 14 . Medical text report classifications were also shown to benefit significantly from expert knowledge 15 . In terms of knowledge-based computation, a common feature of clinical NLP applications is the need to perform patient-level inferences, in addition to standard tasks such as identification of named entities or document classification; an example of this is the inference of subtypes of stroke based on named entities retrieved from text reports 8 .

The knowledge, commonly represented as ontologies, required for clinical decision-making falls at the intersection of many biomedical sciences, including epidemiology, genetics, pharmacology and diagnostics. The size and breadth of background knowledge needed to make inferences are great. However, clinical NLP benefits from the availability of massive knowledge resources that support biomedical science. Medical vocabularies such as SNOMED CT 16 and ICD-10 17 provide classifications of clinical concepts that include taxonomy and vocabulary. In addition to these features, biomedical ontologies provide formal semantics for a wide range of biomedical concepts and their inter-relations 18 , 19 . Despite the development of these knowledge resources, clinical knowledge at the patient level is largely not represented in a computer-usable form; for example, no existing ontology can inform an AI system that, while possible 20 , it is probably inconsistent to diagnose a patient with both type 1 and type 2 diabetes. Developing formal knowledge resources is a current challenge for enabling and improving clinical decision-making applications.

Lastly, access to patient-level free-text clinical data is controlled by information governance (IG) regulations 21 , such as the UK’s legal framework 22 , including the NHS Act 2006, the Health and Social Care Act 2012, the Data Protection Act and the Human Rights Act. These regulations are usually complex, and their interpretation and application vary, often resulting in defensive practices. For example, while it is widely acknowledged that it is difficult to comprehensively anonymise free-text data, there is much less consensus on how to perform text anonymisation at scale, what the proper evaluation procedures are, what level of performance is good enough, and how anonymisation fits within a framework that would ensure confidentiality according to the regulations. As a result, access to patient data is one of the biggest hurdles for clinical NLP. There has been progress in developing in-house NLP within large NHS organisations such as hospitals, but the IG challenges are greater for using data across NHS organisations.

These challenges (or opportunities) faced by clinical NLP are too great to be tackled by individuals working alone or in small research groups. Cross-organisation collaboration is key to addressing technical challenges, such as sharing data or models, yet the NLP community remains fragmented. Formalising patient-level inference knowledge at scale is only feasible as part of a community effort. Furthermore, national coordination is necessary to create reproducible streamlined procedures for facilitating access to free-text clinical data.

There is a large body of literature reviewing clinical NLP, providing useful summaries of the developments of technologies and applications, for example, on application domains 23 , 24 , on particular clinical questions 25 , 26 , on particular modalities 27 , 28 , 29 , or on methodologies 30 , 31 , 32 . However, healthcare services and their regulations (e.g., the above-mentioned IG policies) differ from country to country. Clinical NLP would particularly benefit from close collaborations and coordination initiatives at a national level. None of the existing reviews provides a comprehensive overview (including who and what, the developments and the gaps) for facilitating such national-level collaborations.

This article aims to facilitate an informed national effort to tackle grand clinical NLP challenges, through a network-based, timestamped and multifaceted review and analysis of the development of clinical NLP in the UK over the past 15 years. Specifically, the main objectives are to gain an understanding of the following key aspects:

Who: To identify the key stakeholders, including organisations (funders, universities, NHS Trusts and companies) and persons (researchers, students and developers) and how they are connected to each other to form the community.

What: To survey the applications, clinical questions, technologies and datasets the community has been working on.

Where: To uncover how the community has grown and how technologies and application domains have evolved over the years; in particular, to assess how the technologies have been used in real-world settings and how the technology maturity levels have changed.

Gaps: Importantly, we identify the gaps that require investment from funders, the barriers to unlocking the potential of clinical NLP and the future research directions.

The scope of this study is depicted in Fig. 1 and comprises two parts. The first is to conduct a community analysis of UK clinical NLP in the last 15 years. This is to reveal the key stakeholders, their connections and developments. The second is to conduct a literature review on the research outputs of the community to understand the technologies used, key application domains and their trends.

figure 1

a A UK community survey (the lower oval); and b a literature review of the community’s research outputs (the upper oval). *NHS—National Health Service in the UK; RL/ML/LLM—NLP technologies of rule-based, machine learning and large language models.

Clinical NLP community analysis results

For overall community developments, Fig. 2 illustrates graph representations of the clinical NLP landscape at three time points (five years apart): 2012, 2017 and 2022. It shows a steady trend of rapid and significant development in the community over the last 10 years. By 2012, there were only two funded projects involving four organisations with a total of £0.37 million funding. Five years later, by 2017, there were 27 projects and 50 organisations with a total funding of £10.35 million. The latest data collected in this study (by February 2022) shows there were 94 projects, 137 organisations and a total funding of £41.97 million. An interactive visualisation of the graph is available at https://observablehq.com/@626e582587f7e383/uk-clinical-nlp-landscaping-analysis#chart .

figure 2

The graphs contain four types of entities: projects, persons, organisations and funders. Each graph is constructed using data from projects with a start date earlier than or in the given year. Graph data is cumulative, meaning a later year’s data is a superset of that of previous years. The size of an organisation node indicates the total amount of funding, in pounds sterling, that it received.
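The cumulative snapshot construction described in the legend can be sketched as follows; the project records here are hypothetical placeholders, not the study’s data:

```python
# Hypothetical project records: (project_id, start_year, funder, organisations).
projects = [
    ("P1", 2010, "MRC",   ["University A"]),
    ("P2", 2015, "EPSRC", ["University A", "University B"]),
    ("P3", 2021, "NIHR",  ["University B", "NHS Trust X"]),
]

def snapshot(year):
    """Cumulative graph (as a set of edges) of projects started in or before `year`."""
    edges = set()
    for pid, start, funder, orgs in projects:
        if start <= year:
            edges.add((funder, pid))            # funder -> project link
            edges.update((pid, org) for org in orgs)  # project -> organisation links
    return edges

# Later snapshots are supersets of earlier ones, matching the cumulative property.
g2012, g2017, g2022 = snapshot(2012), snapshot(2017), snapshot(2022)
```

In the real analysis each snapshot would feed a graph library for layout and centrality computation; the edge-set representation is the minimal core of the idea.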

To identify the key stakeholders in the community, we ranked the nodes in the graph by their relative centrality scores based on the Eigenvector centrality measurement. Table 1 shows the three ranked lists of organisations stratified by type. The first part of the table (Table 1 a) lists the top 10 most influential organisations of all types; the second part (Table 1 b) lists the top NHS organisations, and Table 1 c lists the top 5 industry organisations. The top 10 most influential organisations are all universities. The combined influence (=24.62) of the top 5 universities is more than 3.7 times the sum of the influence of all NHS Trusts in the community and more than 4 times that of the top 5 industry institutions.

Most NHS and industry organisations have a relative centrality score larger than one (i.e., higher than the median centrality score), meaning they are involved in relatively highly influential projects.
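Eigenvector centrality scores a node highly when its neighbours are themselves highly scored, i.e. it is the principal eigenvector of the graph’s adjacency matrix. A minimal power-iteration sketch on a toy graph (node names are illustrative, not the study’s organisations):

```python
# Toy undirected graph: organisations linked through projects (hypothetical names).
edges = [("UCL", "P1"), ("P1", "UoE"), ("UoE", "P2"), ("P2", "KCL"), ("KCL", "P3")]

nodes = sorted({n for e in edges for n in e})
adj = {n: set() for n in nodes}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Power iteration: repeatedly multiply the score vector by the adjacency
# matrix and renormalise; this converges to the principal eigenvector.
score = {n: 1.0 for n in nodes}
for _ in range(200):
    new = {n: sum(score[m] for m in adj[n]) for n in nodes}
    norm = sum(v * v for v in new.values()) ** 0.5
    score = {n: v / norm for n, v in new.items()}

ranked = sorted(score, key=score.get, reverse=True)
print(ranked)  # most central nodes first
```

In practice a library routine such as networkx’s `eigenvector_centrality` does this with proper convergence checks; nodes in the middle of the toy chain score highest, just as well-connected organisations dominate Table 1.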

For individuals, Fig. 3 illustrates the histogram of absolute eigenvector scores of all persons in the community. It shows a likely long-tail distribution.

figure 3

The x -axis is the eigenvector centrality score and the y -axis (log scale) is the number of people with each score.

To reflect on technology take-up and maturity, we analysed the involvement of industry partners and deployment within health services, both of which are key indicators of the maturity of a technology.

Figure 4 shows the budget trends over the last 15 years, grouped by 3-year periods, for all projects, projects that involved the NHS and projects that involved industry. It shows a clear pattern whereby funding for clinical NLP in all three categories has increased significantly. It is particularly encouraging to see that NHS organisations’ involvement in this area has markedly increased in the last three years. Industry involvement increased more than 27-fold from the 2016–2019 period to the 2019–2022 period.

figure 4

Each tick on the x-axis is a 3-year period. The y-axis shows the total budget. The sums of NHS involved and industry involved project budgets are plotted alongside the budget of all projects across five 3-year periods.

To understand the interactions between groups in the community, it is important to know: (1) what the key subgroups are; and (2) how they are connected with each other.

From the 2022 snapshot in Fig. 2 , we observe that there are four natural clusters in the graph. The middle of the graph is the biggest cluster, containing research projects supported by UK research councils such as EPSRC, MRC, BBSRC and ESRC. The top left corner forms the second cluster, which is NIHR-funded projects. The NIHR funds health and social care research, which is supposed to be more translational than research in the main cluster. The third cluster is on the right and contains projects funded by Innovate UK. Such projects are sometimes led by industry and are intended to produce products ready for use by end customers, i.e., health service providers such as the NHS. The top right is the cluster of projects funded by EU Horizon 2020 (H2020) programmes. Overall, the four clusters are weakly connected with each other.

To quantify the strength of connections between subgroups within the community, we conducted a k -connectivity community analysis. Table 2 shows the results, where a sub-community is represented by a funder composed of its funded projects and associated persons and organisations. The community is connected; therefore, when k is 1, the whole graph constitutes one and only one connected component. When k  = 2, the Innovate UK and H2020 sub-communities are separated from the main component. When k  = 3, the whole subgraph of Innovate UK disappears, meaning the connectivity within its own cluster is also weak. The same applies to H2020 projects.

For the main cluster, where all other funders reside, the connectivity is not strong: BBSRC disconnected at k  = 5, ESRC at 6 and NIHR at 9. EPSRC and MRC form the core, which remains interconnected until k reaches 17.

It is worth mentioning that as of 1st January 2020, the graph of the whole community was composed of three separate components: H2020, Innovate UK and other funders. This means the community has existed as a single interlinked graph for just a little more than two years.
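The k-connectivity analysis above can be reproduced with networkx’s exact `k_components` routine. The sketch below uses a toy network, a tightly knit funder core plus a weakly attached cluster standing in for H2020; it illustrates the method, not the study’s actual graph:

```python
import networkx as nx

# Tightly knit core: funders and projects forming a 5-clique.
core = ["EPSRC", "MRC", "P1", "P2", "P3"]
g = nx.complete_graph(core)
# Weakly attached sub-community, reachable only through P4 and P1.
g.add_edges_from([("H2020", "P4"), ("P4", "P1")])

# For each k, k_components yields the maximal node sets that remain
# connected after removal of any k-1 nodes.
k_comps = nx.k_components(g)
```

At k = 1 the whole graph is one component; at k = 2 the weakly attached branch already drops out; the clique core survives up to k = 4. This mirrors Table 2, where the Innovate UK and H2020 sub-communities separate at k = 2 while the EPSRC/MRC core stays interconnected until k = 17.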

To depict the development of training the next generation of clinical NLP leaders, we extracted studentship projects (i.e., those funded via doctoral training programmes) to understand the trends of clinical NLP-related PhD projects in recent years. Figure 5 shows three snapshots of funded studentship projects in 2016, 2017 and 2021, respectively. The first project was funded by the MRC, led by Edinburgh and started in 2016. By October 2021, there were a total of 16 funded studentship projects identified, of which 10 were funded by EPSRC and 5 by MRC.

figure 5

The three figures (from left to right) show the networks of studentship projects and their associated entities (funders, organisations and persons) for 2016, 2017 and 2021 respectively. The entire 2021 network is too big to be shown fully at the same scale; therefore, a low-resolution overview is shown at the top right and a snapshot of it is displayed using the same scale as the other years.

Literature review results on publications

A total of 431 publications were extracted from the 94 projects identified in the community analysis above. A manual screening process, using the study criteria detailed in the methods section, identified 107 publications for review.

Table 3 lists the key characteristics of the 107 studies in the last 15 years, including 16 published during 2007–2012 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 ; 31 in 2012–2017 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , 75 , 76 , 77 , 78 , 79 ; and 60 published in 2017–2022 80 , 81 , 82 , 83 , 84 , 85 , 86 , 87 , 88 , 89 , 90 , 91 , 92 , 93 , 94 , 95 , 96 , 97 , 98 , 99 , 100 , 101 , 102 , 103 , 104 , 105 , 106 , 107 , 108 , 109 , 110 , 111 , 112 , 113 , 114 , 115 , 116 , 117 , 118 , 119 , 120 , 121 , 122 , 123 , 124 , 125 , 126 , 127 , 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 . More than 45% ( n  = 49) of these studies were international (involving at least one collaborator from a country other than the UK). There were a total of 23 collaborating countries or regions, with Japan ( n  = 12), the USA ( n  = 12) and Sweden ( n  = 11) being the top three most frequent collaborating countries.

Categorised by NLP tasks,

31.8% ( n  = 34) performed named entity recognition including extractions of phenotypic information 42 , 55 , 65 , 66 , 85 , 87 , 104 , diseases 50 , 56 , 84 , 89 , 133 , drug entities 53 , 95 , 115 , proteins or genes 39 , 68 , 107 and general concept extractions 52 , 74 , 76 , 81 , 82 , 88 , 90 , 92 , 96 , 103 , 109 , 113 , 122 , 124 , 131 , 137 .

27.1% ( n  = 29) performed text/document classification , including risk assessment classifications 34 , 48 , 49 , 91 , 97 , 99 , literature review 57 , 114 , 117 , 119 , 120 , drug-related 58 , 100 , 116 , 118 , randomised clinical trials 127 , 128 , 129 and generic classifications (such as classifying or clustering documents) 51 , 54 , 60 , 75 , 79 , 80 , 105 , 106 , 112 , 135 , 139 .

16.8% ( n  = 18) performed relation extraction including event extractions 35 , 37 , 59 , 71 , adverse drug reactions 64 , 67 , 69 and generic information extractions 36 , 38 , 40 , 61 , 62 , 70 , 108 , 110 , 121 , 125 , 136 .

13.1% ( n  = 14) performed information retrieval , including retrieval from EHRs 83 , 93 , 94 , 98 , 101 , 102 , 134 , 140 , literature data 47 , 111 and other types of data 43 , 44 , 45 , 72 , 138 .

Other types of tasks performed included entity normalisations 77 , 78 , 126 , temporal expressions 86 , 93 , 134 and natural language generation 63 .

Contextual mentions of phenotypes and diseases are particularly essential in clinical applications. Identifying positive and negated mentions, such as the patient has/has not got fever , is among the most studied contextual named entity recognition tasks 72 , 81 , 98 .
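A minimal rule-based sketch of such negation detection, in the spirit of NegEx-style cue-and-scope heuristics (the cue list and character window are simplifying assumptions, not the cited tools):

```python
import re

# Negation cues checked against a short window of text before the mention.
NEGATION_CUES = r"\b(no|not|without|denies|has not got)\b"

def is_negated(text, term, window=40):
    """Return True if `term` appears within `window` chars after a negation cue."""
    for m in re.finditer(NEGATION_CUES, text, re.IGNORECASE):
        tail = text[m.end(): m.end() + window]
        if term.lower() in tail.lower():
            return True
    return False

print(is_negated("the patient has not got fever", "fever"))  # True
print(is_negated("the patient has fever", "fever"))          # False
```

Real systems handle scope termination, double negation and a far richer cue vocabulary; this toy version only checks whether a cue occurs shortly before the mention.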

In terms of health categories, mental health was the most widely studied area 84 , 86 , 87 , 88 , 89 , 90 , 91 , 92 , 93 , 94 , 95 , 96 , 97 , 98 , 99 , 100 , 101 , 102 , 103 , 132 , 133 , 134 . It was followed by treatment 53 , 58 , 78 , 79 , 108 , 115 , 116 , 118 , 123 , among which drug-related (mostly adverse drug reactions) studies 53 , 58 , 115 , 118 were most common. Oncology 33 , 34 , 48 , 49 , 55 , 75 , 117 and cardiovascular diseases 62 , 65 , 66 , 83 were the next two most frequently studied areas following treatments. Other disease areas included infectious 42 , 43 , 82 , 96 , 133 , 135 , respiratory 56 , 82 , 96 , 133 , 135 and autoimmune 138 diseases. In particular, there were four studies on COVID-19 82 , 96 , 133 , 135 . The rest were studies that belong to the ‘general applicability’ category, meaning they were tools or models not designed for specific health categories or diseases. They have general utilities for particular scenarios that might be applicable to a wide range of clinical use cases 50 , 51 , 52 , 54 , 57 , 59 , 60 , 61 , 63 , 74 , 77 , 81 , 85 , 104 , 105 , 106 , 109 , 110 , 111 , 112 , 113 , 114 , 119 , 120 , 121 , 122 , 124 , 125 , 131 , 136 , 137 .

Of the 107 reviewed papers, 21 (19.6%) provided open access to their repositories, making them usable tools/software for the community. As for utility in real clinical settings, only 5.6% ( n  = 6) of studies were deployed or further developed on systems deployed in NHS environments 81 , 82 , 83 , 84 , 85 , 140 , of which 81 , 140 have been deployed as generic information retrieval or extraction platforms on near real-time EHRs of the respective NHS Trusts. Compared to other work, these deployed tools are all concept-linking tools for identifying a broad range of biomedical concepts using large terminologies, including SNOMED CT and UMLS. This makes them suitable for creating a generic platform that can support a wide range of disease areas and application domains.

We conducted further analysis to understand technical objectives vs. NLP tasks. To investigate the clinical application categories, we adapted the classification system from 28 and made slight changes to classify the studies into the following five technical objectives:

Disease information and classification . This is to use NLP for classifying a disease occurrence or extracting information about a disease with no focus on the overall clinical application. Studies in this category include 34 , 35 , 36 , 37 , 38 , 39 , 40 , 45 , 46 , 47 , 48 , 49 , 56 , 58 , 59 , 60 , 61 , 62 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 76 , 77 , 78 , 79 , 80 , 100 , 103 , 112 , 115 , 116 , 117 , 118 , 121 , 122 , 124 , 128 , 129 , 130 , 134 , 135 , 139 .

Language discovery and knowledge . This category studies how ontologies and lexicons could be combined with other NLP methods to represent knowledge that can support clinicians. Studies include 33 , 41 , 50 , 51 , 52 , 55 , 63 , 74 , 75 , 81 , 85 , 86 , 104 , 105 , 106 , 107 , 108 , 110 , 114 , 119 , 131 , 132 .

Diagnostic surveillance . This is to use NLP for extracting disease information for the patient or disease surveillance 64 , 82 , 83 , 84 , 87 , 88 , 89 , 90 , 91 , 92 , 93 , 99 , 102 , 125 , 126 , 133 , 136 , 138 .

Cohort building for epidemiological studies . The objective of this category is to create cohorts for research purposes or support the derivation or analysis of the outcomes of epidemiological analysis. Studies belonging to this category include 57 , 94 , 95 , 96 , 97 , 98 , 101 , 111 , 113 , 120 .

Technical NLP . Other studies include those mainly focused on the technical aspects of NLP, i.e., developing or applying NLP technologies for improving the understanding of clinical free-text. Nine studies belong to this category 42 , 43 , 44 , 53 , 54 , 109 , 123 , 127 , 137 .

Advances in three of the above technical objectives in particular ( Disease information and classification , Diagnostic surveillance and Cohort building for epidemiological studies ) offer a great opportunity for health systems to harness data from unstructured EHRs for better care. In addition, clinical NLP has great potential in (semi-)automated clinical coding for timely and more accurate auditing, surveillance and public health policing 141 . However, at the time of writing this review, developments in automated coding are still in their infancy in the UK.

Figure 6 illustrates a scatter plot of the NLP tasks against the technical objectives. It shows how different NLP technologies have been adopted to address different clinical questions. The largest combination is Text Classification with Disease information & classification, containing 16 studies 34, 48, 49, 58, 60, 79, 80, 100, 112, 116, 117, 118, 128, 129, 135, 139. Named Entity Recognition has been widely used across clinical applications: 10 studies in Disease information & classification 39, 56, 65, 66, 68, 76, 103, 115, 122, 124, 9 studies in Language discovery & knowledge 50, 52, 55, 74, 81, 85, 104, 107, 131 and 8 in Diagnostic surveillance 82, 84, 87, 88, 89, 90, 92, 133. In particular, some areas (represented by small circles in the figure) were clearly understudied, for example Text Classification for Diagnostic surveillance 91, 99, Entity Normalisation for Diagnostic surveillance 126 and Natural Language Generation in any application domain 63.

figure 6

The x-axis shows the categories of NLP tasks and the y-axis the technical objectives. The size of the circles denotes the number of publications.

In terms of NLP technologies and their trends, the pie chart in Fig. 7 summarises the different types of clinical NLP algorithms adopted by the 107 selected studies. When a study used multiple algorithm categories, we counted the algorithm type of its main or best-performing model.

figure 7

The main bar chart shows the changes in the use of different NLP algorithms over the last 15 years. The pie chart at the top left depicts the overall breakdown of algorithms across all research work analysed.

ML-based denotes tools using ML algorithms (excluding deep neural network methods). Of the studies, 48.1% used ML-based methods, including Support Vector Machines 34, 45, 49, 55, 80, 93, 104, 127, 128, 129, Bayesian methods 33, 34, 48, 58, 72, 97, Conditional Random Fields 33, 54, 56, Random Forests 72, 93, 119, Logistic Regression 72, 93, 97, Artificial Neural Networks 104, Decision Trees 72 and others 43, 57, 78, 83, 84, 85, 113, 117, 138.

Rule-based denotes the 18.9% of studies that used manually created rules for classification or extraction 37, 38, 44, 53, 74, 86, 94, 96, 98, 99, 101, 102, 103, 105, 107, 111, 115, 123, 134, 135.

DL-based denotes those using deep learning methods, accounting for 16.0% of the studies, including convolutional neural networks 75, 77, 79, 109, 110, 121, 130, recurrent neural networks 76, 77, 116, 121, 122, 124, long short-term memory networks 76, 116, 121, 122, 124 and transformers 112, 116.

Others denotes those studies where the algorithms were not clearly specified.

The bar chart in the figure shows the development trends of the different NLP algorithms used in the community. Traditional ML-based methods peaked around 2015–2016, with DL-based methods becoming increasingly popular thereafter. Rule-based methods started to decline in 2011 and remained at a low level of usage while ML-based methods were popular. Interestingly, they began to increase again in 2018, in both absolute number and percentage.

Domain knowledge utilisation is an essential component in many clinical NLP applications. To understand knowledge technologies, we extracted data from the selected studies to analyse how domain-specific knowledge was represented and utilised to facilitate clinical NLP tasks. We defined domain knowledge in a broad sense in this analysis, including domain-specific ontologies and terminologies (customised dictionaries), distributed representations learned from external corpora (such as dense vector representations of word semantics) and pretrained large language models (e.g., the BERT model and its variants). Figure 8 summarises the adopted knowledge techniques.

Ontologies: The clinical domain involves a wide range of domain-specific ontologies, from clinical terminologies to biological ontologies to literature classification systems. Overall, 55.9% of studies utilised ontologies, amongst which we identified the five most commonly used: the Unified Medical Language System (UMLS) was used by 16.8% (n = 18) of studies 36, 56, 61, 64, 65, 66, 69, 70, 81, 83, 84, 85, 89, 106, 113, 115, 130, 139; SNOMED CT by 6.5% (n = 7) 65, 66, 78, 82, 87, 115, 124; MeSH by 5.6% (n = 6) 61, 67, 104, 111, 115, 139; ChEBI by 4.7% (n = 5) 56, 62, 104, 113, 115; and UniProt by 2.8% (n = 3) 56, 69, 113.

Pretrained embeddings: Techniques such as word2vec 142 aim to learn dense vector representations (called embeddings) for words or larger constructs (such as phrases) from large external corpora, which capture 'transferable' (domain) language semantics for facilitating new tasks. The most used embedding model among the 107 studies was word2vec 76, 79, 121, 122, 136. The second most popular was FastText 79, 121, 122, 136. One study 136 used word2vec, FastText, ELMo, GloVe and Flair.
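As a minimal illustration of the intuition behind such embeddings, the sketch below compares cosine similarities between word vectors. The three-dimensional vectors here are made up for illustration; real models such as word2vec learn hundreds of dimensions from large corpora.

```python
import math

# Toy embeddings (hypothetical values, not learned by any real model).
emb = {
    "disease": [0.9, 0.1, 0.3],
    "illness": [0.85, 0.15, 0.35],
    "aspirin": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    # Cosine similarity: dot product of the vectors divided by their norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Near-synonyms end up close in the vector space, unrelated words far apart:
# cosine(disease, illness) is close to 1, cosine(disease, aspirin) is not.
```

This proximity structure is what makes embeddings 'transferable': a downstream clinical model can generalise from 'disease' to 'illness' without ever seeing the latter in its training labels.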

Customised dictionary: Ten studies used customised dictionaries, including two cancer studies 55, 75, two drug studies 53, 58, two mental health studies 88, 134, a multilingual study 51 and others 105, 114, 119.

ML-BERT: Large language models such as BERT, or techniques derived from them, were used by four studies: a study identifying cognitive impairments in schizophrenia 100, event extraction 112, a social media corpus study 125 and a pretrained biomedical entity representation 137.

Others: Some studies adopted hybrid methods, including using bag-of-words representations 34, utilising lexical structures 41, using the biological process subontology of the Gene Ontology (GO) 45, using multiple methods 77 including ontologies (SNOMED CT and SIDER) and word2vec, combining UMLS with a customised dictionary 50, and unspecified methods 63.

figure 8

The pie chart on the left shows the breakdown of representation techniques. For ontologies, the bar chart on the right depicts the five most frequently used ontologies in clinical NLP applications.

Table 4 summarises the types of datasets used by the studies. The majority of them (54.2%; n  = 58) used literature corpora 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 59 , 60 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , 75 , 76 , 80 , 104 , 105 , 106 , 107 , 109 , 111 , 112 , 113 , 114 , 115 , 117 , 119 , 120 , 121 , 127 , 128 , 129 , 130 , 131 , 139 . Eleven (10.3%) used social media data 58 , 64 , 77 , 78 , 79 , 89 , 125 , 126 , 135 , 136 , 138 .

In total, 31 (28.97%) studies used real-world EHR data. Amongst them (see Table 5), 21 (67.74%) used EHR data (CRIS) from the South London and Maudsley NHS Foundation Trust (SLaM) mental health hospital 81, 84, 85, 86, 87, 88, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 103, 132, 133, 134. Only four UK EHRs were utilised by the studies: apart from SLaM, they were from King's College Hospital (used by 81, 82, 83), Oxford Health NHS Foundation Trust (OHFT) (used by 101) and Camden & Islington Trust (used by 102). None of the UK-based EHRs was openly accessible, but all were described as being available to collaborators. All three openly accessible EHRs were from the US: i2b2 (Informatics for Integrating Biology & the Bedside) (used by 63, 65); n2c2 https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ (used by 116, 118); and MIMIC-III 143 (used by 81, 116, 118). There was one EHR from China, Jinhua People's Hospital (used by 63). The largest EHR dataset cited for NLP implementation was CRIS at SLaM, with reported sizes of 23.3 million documents and more than 400,000 people. The second largest was OHFT (31,391 people).

We conducted a detailed study of clinical NLP developments in the UK over the last 15 years (since 2007). A network analysis was conducted on the community dataset, covering funders, projects, people and organisations. A further literature review was carried out to analyse publications from the community. Results from the two analyses revealed multifaceted insights into the evolution of the UK NLP community and its related technical research and developments.

In terms of community developments and connections, clinical NLP has clearly developed rapidly in the UK. The visualisations of different timestamped snapshots (Fig. 2) show the community steadily expanding over the last 10 years. Analysis of community stakeholders revealed a consistent power-law distribution of influence across all types of entities (i.e., funders, organisations, persons and projects). This means that there are 'key players' among all types of entities. As for funders, the MRC and EPSRC play critical roles; their funded projects form the core of the community.

For organisations, the dominant influence of universities indicates that clinical NLP is still a research-dominated area in the UK. Meanwhile, NHS and industrial organisations have gained considerable influence in the community (see Table 1). These are promising signs that NLP technologies are starting to be taken up by industry and healthcare service providers. Such signs are further confirmed by analysis of the trends in funding sources involving these partners. In particular, industry involvement has increased from fewer than 1 in 15 projects between 2016 and 2019 to around 1 in 1.5 from 2019 onwards, indicating possibly increased technology maturity, or wider recognition of the potential, of clinical NLP in the last 3 years.

Another positive sign observed is the continuously increasing investment in training the next generation of NLP researchers. Since 2016, studentship projects have increased from just one to 16 across 14 institutions. Figure 5 reveals a pattern of continuously increasing studentships overall across different organisations, which is encouraging.

However, links between sub-communities appear to be weak. For example, projects funded by Innovate UK are very weakly linked with other funders and their funded projects/people (only two edges, to be specific). This means the connections between the academic and industry sub-communities are fragile. The NIHR and its funded projects, which are supposed to be more translational, also form their own cluster, with similarly weak connections to those funded by the MRC and EPSRC. Such weak connections might indicate that the translation from research to outputs that directly benefit health services is also weak and not streamlined. These sub-communities mostly work alone, which might indicate barriers to translating active research into mature technologies that support business or improve health services.

Our literature review of the 107 selected publications has revealed a strong growth pattern that echoes the expansion of the community from the above network analysis. Specifically, research publications doubled every 5 years in the last 15 years. The community has collaborated with more than 20 countries internationally.

On the aspect of applications and translations, while the studies as a whole covered a wide range of diseases, the majority focused on mental health or treatments. The main reason might be the lack of good coverage of coded data in these areas. In mental health, many symptoms and phenotypes are not routinely coded as structured data: for example, the quantification or qualification of cognitive impairment. For treatments (mainly drug-related studies), adverse reactions or events were the main information to be extracted, and these are also rarely coded routinely in a structured format. This means current research mainly utilises NLP to uncover under-coded information when it is needed across the EHR database as a whole (i.e., in samples too large for manual annotation or checking, and in clinical services where the imposition of structured instruments for routine information-gathering is not feasible or acceptable). The potential of free-text data for subtyping diseases (e.g., revealing the nuance of phenotypic representations) seems less exploited at the current time. This is an area where clinical NLP could maximise its utility for facilitating personalised medicine as and when in-depth information is demonstrated to have prognostic value.

Regarding technical objectives, the three categories of language discovery & knowledge, disease information & classification and technical NLP together constitute almost 74% of the studies. This means that only 26% of studies targeted problems classified as diagnostic surveillance or cohort building for epidemiological studies, both of which are more clinically actionable. This observation indicates that current studies are less translational in clinical practice, which reflects the findings of the community analysis. It is also reflected in the very low number (<6%) of clinical NLP systems deployed within NHS environments.

Such a low level of deployment might reflect the substantial challenges of translation to health systems. Among others, deploying NLP models on production EHR systems encounters additional technical challenges. For example, compared with research-oriented NLP, translational model development means moving from relatively small evaluation datasets to applications at scale across very large and diverse corpora, making high generalisability an essential requirement. In addition, these models might encounter a near-inevitable drop-off in performance from annotation-level to whole-patient-timeline-level evaluation, owing to shifts in data patterns over time or gradual changes in clinical practice. Further, there is the challenge of translating the application of NLP across large historic datasets into pipelines for real-time processing of clinical text within the EHR for individual-level feedback, as well as the utility challenges of communicating probabilistic clinical decision support where NLP models are not 100% accurate, and of finding case studies that make use of new capabilities (the 'solution in search of a problem' scenario common in data science). Lastly, but critically, integrating with health systems requires robustness, resilience, stability and flexibility: at a minimum, embedded NLP models must not crash or degrade clinical systems. Such engineering requirements for critical systems are usually not considered, and rarely evaluated, in the design and development of research-oriented NLP models.

Despite these challenges, we observed several exciting translational developments that have been embedded within real-world EHR systems or near real-time research copies of them. The CogStack 81, 140 text analytics framework has been deployed in more than five NHS Trusts across the UK, supporting data harmonisation 144, semantic search 81, risk detection and live alerting 145 and disease prevention 146. The deployment of text analytics capabilities within health systems has shown great potential for facilitating more efficient and cost-effective clinical trials 81, 147. Another operational development is the use of clinical NLP models to facilitate efficient medical coding 141: recently funded by the NIHR as an AI Award, colleagues at University College Hospital have been comparing approaches 148, 149 for automatically assigning ICD-10 codes to hospital admissions.

The main gap or barrier for clinical NLP in the UK appears to be impeded research access to real-world EHR data. First, there are no openly accessible free-text EHRs from the UK; all three openly accessible EHRs are from the US. While these are useful for model development and transfer learning (e.g., using pretrained language models), the significant differences between the US and UK healthcare systems (for example, UK discharge summaries are usually much shorter) mean that we risk developing models that are less representative of the UK system. Having open UK EHR datasets would allow the community to create benchmarks, train large language models and co-design novel solutions, all of which would greatly speed up the translation of research. Secondly, very few trusted research environments (TREs) across the UK are clinical NLP-ready. The UK now has one of the world's best TREs (managed by NHS Digital), hosting one of the world's best national-level health datasets, CVD-COVID-UK https://www.hdruk.ac.uk/projects/cvd-covid-uk-project/ . Another notable national initiative is OpenSAFELY 150. However, these TREs contained no free-text EHR components at the time of writing. Many local or regional TREs do not support the necessary software environments (e.g., Python or NLP libraries) due to security concerns, and/or they lack the computational resources to support scalable NLP. Thirdly, there are no shareable large language models trained on UK EHRs that could facilitate transfer learning across the community. Finally, it is worth mentioning the line of work on synthetic free-text health data generation 151 for alleviating the pain of data access; such approaches are in their infancy but could be a promising substitute.

The underlying reason behind the impeded research access is perhaps the lack of a streamlined, reproducible and certified process for making free-text EHRs research-ready. While there are regulations and guidelines for health data research access, their implementation for free-text data is very much dependent on the decisions and capacities of local IG committees (e.g., at NHS Trust level, or health board level in Scotland), which are frequently overstretched and likely to lack specific experience in dealing with free-text health data. A new process of this sort, if adopted, would need to lay out the whole pipeline for making data research-ready, implementing the steps of data sampling, preprocessing, annotation, anonymisation, validation, iterative improvement and final reporting. It would ideally be coordinated at a national level and draw on what is a healthily growing pool of experience and expertise.

Clinical NLP in the UK is part of a wider international research topic. A full quantitative comparison is outside of the scope of this current review, but we will consider a few points, mainly comparing clinical NLP in the United States (US) with the UK. The majority of clinical NLP is carried out on English language text, with only 10% of NLP papers in PubMed reporting the use of another language 152 . This reflects a broader issue in general NLP, where a small number of languages, first amongst them English, dominate the research literature and the available tools, corpora and representations 153 .

US researchers publish around six times the number of AI papers published by UK researchers 154, and it is reasonable to assume the same holds for sub-domains such as clinical NLP. This is understandable, given that the US has six times the gross domestic product of the UK 155 and five times the population 156. Unlike most other national clinical NLP efforts, however, the UK benefits more directly from US research by virtue of the common use of the English language. Despite this, there is a need for specific UK research: terminologies, healthcare systems and clinical cultures all differ.

Compared to the UK, the US has greater levels of clinical NLP in operational healthcare use, as opposed to pure NLP research or epidemiology, this being the result of differing policy pressures. In the US, the Patient Protection and Affordable Care Act 2010 157 , known as Obamacare, and its emphasis on capturing clinical information for meaningful use, has had a direct influence on work to extract as much useful information as possible from EHRs and from patient feedback (see for example 158 , 159 , 160 ). In the UK, despite the publication of several white papers encouraging and planning for the use of AI in the NHS (e.g. 161 , 162 ), there has never been a policy impetus as clear as that provided by Obamacare.

There is also a US/UK difference in terms of available resources, such as clinical corpora and community challenges centred on those corpora. In the US, several corpora are available under lightweight access agreements, most notably MIMIC 143, but also more specialised corpora such as THYME 163. Other corpora have been made available for community challenges, such as the series organised by i2b2 (e.g. 164). The UK's first semantically annotated corpus of EHR text was reported in 165. Interestingly, neither the papers reporting this corpus nor the MRC grant that funded it was picked up by the searches in the current study. A process was in place for making portions of this UK corpus available to researchers, but it was complex and went unused. EHR free text from the CRIS system 166 is available for use, but under much stricter conditions than the widely used US corpora. Consequently, there has been a complete absence of UK community challenges, with UK researchers instead participating in US challenges, together with the widespread use of US corpora in UK clinical NLP research.

To close some of the main gaps as a community, the Health Data Research UK National Text Analytics Project Consortium ( https://www.hdruk.ac.uk/projects/national-text-analytics-project/ ) has established working groups specifically to create a UK-wide free-text databank, and is piloting the sharing of MedCAT models for detecting SNOMED CT concepts with multiple secondary care hospitals in England and internationally, including University College Hospital, King's College Hospital, Guy's and St Thomas', Norfolk and Norwich, Manchester, South London and Maudsley and University Hospitals Birmingham. The model-sharing agreement and a description of community tools can be found on the HDRUK Gateway ( https://www.healthdatagateway.org/ ). To unlock clinical NLP's full potential for improving health services and patient care, many more initiatives like these are needed, with coordination, synergy and collaboration between all stakeholders. In particular, the connections between academia and health service providers need to be expanded and strengthened. Interlinking the UK clinical NLP community with its international counterparts is not only nice to have but essential for addressing many challenging clinical questions, such as better understanding rare diseases, for which no single country could offer sufficient statistical power in its data to reveal evidence. This brings new challenges, including cross-lingual clinical NLP 167 and federated NLP 168. All these gaps and challenges also open exciting opportunities for a better-interlinked community in the UK and beyond.

As shown in Fig. 1, the study comprises two parts: one studying the community using network analysis, and the other studying research and developments using a literature review. While the first is focused on the UK national level, the research outputs include those from the UK as part of international collaborations. This work is not a clinical study and no personally identifiable information was collected; thus, ethical approval was not required.

Information collection and data extraction

Figure 9 illustrates our two-step process to (1) retrieve relevant information from online data sources and (2) conduct data extraction to obtain all relevant data for later analysis.

figure 9

Step 1: Data were collected for funded clinical NLP projects by querying three searchable datasets from UK and EU funding bodies and downloading project data from UK charities such as British Heart Foundation and Cancer Research UK. Step 2: Data were extracted to obtain metadata of projects and their associated entities.

Step 1. Retrieving relevant clinical NLP projects

To identify the UK clinical NLP community, we first retrieved relevant projects funded by UK funding bodies (e.g., research councils and charities) and the European Union’s (EU) research and innovation funding programmes. The inclusion criteria were programmes that have (a) developed or applied NLP technologies; (b) solved a clinical, public health or life science research problem that is directly applicable to patient care and (c) involved at least one UK-based organisation.

We started with UK Research and Innovation (UKRI), which "is a non-departmental public body of the Government of the United Kingdom that directs research and innovation funding, funded through the science budget of the Department for Business, Energy and Industrial Strategy". UKRI provides an official Application Programming Interface (API), https://gtr.ukri.org/resources/api.html , which allows efficient, software-based querying and extraction of successful projects from nine UK-based funders, including the seven Research Councils, Innovate UK and Research England 169. Thirty-four combinations of keyword searches were used to query the web service, which returned 107 unique projects. A manual assessment was then conducted to remove irrelevant projects according to the inclusion criteria, leaving 71 relevant projects.
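A sketch of how one such keyword query could be issued programmatically is shown below. The endpoint path and the query/page/size parameter names follow the public Gateway to Research API documentation but should be treated as assumptions, and the keywords shown are illustrative rather than the study's actual 34 combinations.

```python
from urllib.parse import urlencode

# Assumed GtR-2 project-search endpoint (see https://gtr.ukri.org/resources/api.html).
GTR_PROJECTS = "https://gtr.ukri.org/gtr/api/projects"

def build_query_url(keywords, page=1, size=100):
    """Build a search URL for one keyword combination (hypothetical helper)."""
    # `q` is the full-text query; `p`/`s` select the results page and its size.
    params = {"q": " ".join(keywords), "p": page, "s": size}
    return f"{GTR_PROJECTS}?{urlencode(params)}"

url = build_query_url(["natural language processing", "clinical"])
# The URL would then be fetched with an `Accept: application/json` header,
# and project records de-duplicated by their project identifiers.
```

The actual retrieval codebase, including the full list of keyword combinations, is referenced later in the text.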

A similar process was conducted for the UK's National Institute for Health Research (NIHR), a UK government agency that finances research into health and care. Five keyword searches were used to query the NIHR's search service ( https://fundingawards.nihr.ac.uk/ ). Because the NIHR funds health research only, we could reduce the query combinations from 34 to 5 by using NLP-related keywords alone. The search returned 24 projects; after manual assessment, 18 were deemed relevant.

For projects funded by the European Union's research programmes, we obtained data on Horizon 2020-funded projects from https://cordis.europa.eu/projects , which contains all projects from 2014 to January 2022. The same set of UKRI keyword queries was applied to these projects' metadata, identifying six projects; after manual assessment, five were deemed relevant. To enable consistent downstream analysis, the funding amounts of these projects were converted from the original currency at a rate of €1 to £0.83 (as of 25th January 2022).

Searches of three UK-based charities (Wellcome Trust, Cancer Research UK and British Heart Foundation) did not find relevant projects. Some of these funders do not provide sufficient metadata (e.g., abstracts or summaries) for their funded projects, so it is possible that relevant projects were missed due to incomplete information.

To select projects that fit the inclusion/exclusion criteria, a total of 34 keyword combinations were used. We used broad terms for higher sensitivity followed by a manual second filtering step on query results. The automated retrieval codebase, including the full list of keyword combinations, is available at https://tinyurl.com/5fnvdvrh .

The data collection was finalised on 25th January 2022. Overall, we identified 94 relevant projects. The queries used and extraction scripts are available in a code base referenced at the end of this manuscript.

Step 2. Extracting project metadata and associated entities

From the identified projects, we further extended data extraction (see the right part of Fig. 9) to collect project metadata, including title, abstract, technical summary, start/end dates, funding amount, project categories and health categories. For each project, wherever possible, we also extracted its associated entities, including related persons (principal investigators, co-investigators, supervisors/students), organisations (lead organisations, collaborating organisations and their metadata), funders and project outputs (publications, software, datasets and others). In total, from 94 projects, we extracted 139 associated persons, 137 organisations and 431 publications. In particular, for the 137 organisations, we manually classified them into three categories: research , NHS (national health services) and industry .

Analysis methods

Community analysis.

To enable an analysis of the UK’s clinical NLP community, we created a network (or interactive graph) linking four types of entities: projects (also called grants), organisations, persons and funders. Links between these entities were directly extracted from the project metadata. The following analysis approaches were conducted.

(Timestamped and filtered network snapshots) This analysis reveals the evolution of the community from different perspectives, such as the number of projects, involved persons/organisations and funding budgets over the years and the trend of training the next generation of clinical NLP leaders. The metadata of linked entities (e.g., datetime or project categories) were used to create different snapshots of the network.

(Centrality analysis) To identify the 'key' stakeholders in the community, centrality analysis 170 was conducted to quantify node importance in the network. Five centrality measurements were implemented: degree, betweenness, closeness, eigenvector and PageRank. We report results based on eigenvector centrality scores (PageRank showed very similar results), which measure the 'influence' of nodes in a graph. In particular, we propose a relative centrality score (RCS) metric as an intuitive quantification of a node's influence among nodes of the same type. It is defined as Eq. (1), where NodesTypeOf(n) represents the set of nodes that have the same type as n. For example, a university with RCS = 3 would be very influential in the clinical NLP community: exactly three times more influential than the median node of its type.
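A stdlib-only sketch of this analysis on a toy, hypothetical graph follows: eigenvector centrality via power iteration, with the relative centrality score computed as a node's score divided by the median score of nodes of the same type (our reading of Eq. (1); the entities and edges below are made up, not the study's network).

```python
from statistics import median

# Hypothetical typed network: funders, projects, organisations, a person.
edges = [("MRC", "ProjA"), ("MRC", "ProjB"), ("EPSRC", "ProjB"),
         ("ProjA", "UnivX"), ("ProjB", "UnivX"), ("ProjB", "UnivY"),
         ("ProjA", "DrS")]
node_type = {"MRC": "funder", "EPSRC": "funder", "ProjA": "project",
             "ProjB": "project", "UnivX": "org", "UnivY": "org",
             "DrS": "person"}

adj = {n: set() for n in node_type}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Eigenvector centrality by power iteration: repeatedly sum neighbour
# scores, then normalise to unit length until the vector stabilises.
score = {n: 1.0 for n in adj}
for _ in range(200):
    new = {n: sum(score[m] for m in adj[n]) for n in adj}
    norm = sum(v * v for v in new.values()) ** 0.5
    score = {n: v / norm for n, v in new.items()}

def rcs(n):
    """Relative centrality score: node score / median score of its type."""
    same = [score[m] for m in adj if node_type[m] == node_type[n]]
    return score[n] / median(same)
```

For instance, UnivX (linked to two projects) ends up with RCS above 1 among organisations, whereas UnivY (linked to one) falls below it.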

(Connectivity analysis) This identifies clusters (or components) in a network and quantifies the strength of links between and within different clusters. It allows identification of the core of the community and, equally importantly, of the weak links among sub-communities. Specifically, we conducted a k-connectivity analysis 171.
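The 'weak links' that such an analysis surfaces can be illustrated with a toy example: for k = 2, the edges whose removal disconnects the graph (bridges) are exactly the links separating 2-edge-connected sub-communities. The sketch below uses Tarjan's bridge-finding algorithm on a hypothetical graph; it is our illustration, not the study's implementation.

```python
def find_bridges(adj):
    """Tarjan's bridge-finding via depth-first search (simple graphs only)."""
    disc, low, bridges, timer = {}, {}, [], [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                    # back edge to an ancestor
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:         # v's subtree cannot bypass (u, v)
                    bridges.append((u, v))

    for n in adj:
        if n not in disc:
            dfs(n, None)
    return bridges

# Two tightly knit clusters (triangles) joined by a single edge: an analogue
# of two sub-communities connected by one fragile collaboration link.
edges = [("A", "B"), ("B", "C"), ("C", "A"),
         ("D", "E"), ("E", "F"), ("F", "D"),
         ("C", "D")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
bridges = find_bridges(adj)  # only the inter-cluster edge C–D is a bridge
```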

(Force-directed graph visualisation) This provides an overall representation of the community that enables both inspection of individual entities and illustration of the nature of clusters. Technically, a force-directed visualisation 172 of the network was implemented to make it accessible in a browser-based, interactive form.
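For illustration, a minimal Fruchterman–Reingold-style layout is sketched below; the study's browser-based visualisation 172 is a more sophisticated implementation, and the toy graph and parameters here are ours.

```python
import math
import random

def force_layout(nodes, edges, iters=200, seed=7):
    """Toy force-directed layout: pairwise repulsion, edge attraction, cooling."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))        # ideal pairwise distance
    for t in range(iters):
        disp = {n: [0.0, 0.0] for n in nodes}
        for i, u in enumerate(nodes):       # repulsion between every node pair
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
                disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
        for u, v in edges:                  # attraction along edges
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
            disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
        temp = 0.1 * (1 - t / iters)        # cooling: shrink the max step
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            step = min(d, temp)
            pos[n][0] += dx / d * step
            pos[n][1] += dy / d * step
    return pos

# Hypothetical mini-network: a funder, two projects and an organisation.
pos = force_layout(["F1", "P1", "P2", "O1"],
                   [("F1", "P1"), ("F1", "P2"), ("P1", "O1")])
```

After the iterations, linked entities settle near each other while unlinked ones are pushed apart, which is what makes clusters visible in the interactive view.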

Literature review on research outputs

We conducted a literature review of all publications from the community over the last 15 years to obtain a comprehensive understanding of the research and development of clinical NLP.

(Information source) We selected relevant publications from the 431 publications extracted from outputs of the above-mentioned 94 projects.

(Eligibility criteria) The inclusion criteria were: (1) develops or applies NLP technologies; (2) applied in health or life science domains, including genetics; (3) full articles, including research papers, preprints, conference publications, theses and book chapters. Exclusion criteria were: (1) animal studies; (2) not full papers (e.g., posters); (3) review articles; (4) articles not accessible. After the screening process, 107 publications were included for final data extraction and review. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart of the publication screening and selection process is illustrated in Fig. 10. Two reviewers (J.W. and M.W.) first screened 20 studies independently and achieved full agreement. Thereafter, screening of the remaining studies was performed by the two reviewers independently.

figure 10

We started with 431 extracted publications, of which 361 had sufficient information for screening. The title/abstract screen removed a further 202 papers deemed irrelevant, leaving 159 publications for eligibility assessment against the inclusion/exclusion criteria on their full text. After this final check, 107 publications were included in the final review.

(Data extraction) Five reviewers (A.S., F.F., J.W., M.W. and Y.C.) carried out data extraction independently, following a defined protocol. Although independent review carried a risk of bias, this was mitigated by a single reviewer (M.W.) randomly selecting and double-checking a subset of each reviewer's results. From these papers, information was extracted on 10 dimensions: (1) publication metadata, including title, authors, publication year and article type; (2) international collaborators, defined as the countries of co-authors; (3) dataset information, including data categories (EHR, social media, literature and others), data source, public availability and data size; (4) health category, including disease areas as defined by the Clinical Data Interchange Standards Consortium (https://www.cdisc.org/standards/therapeutic-areas/disease-area) and disease specification; (5) NLP task types, including named entity recognition, entity normalisation, information retrieval, relation extraction, natural language generation, text classification, temporal expression extraction, word sense disambiguation and other information extraction; (6) NLP algorithm category, including rule-based, ML (not using deep neural networks), deep learning and others; (7) application category as defined in 28 ; (8) knowledge representation techniques, including ontologies, customised dictionaries, pretrained word embeddings, large language models such as BERT models 173 and others; (9) availability of code base and pretrained models; (10) deployment and testing in clinical settings. Missing data were marked as ‘N/A’ during data extraction.
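The 10-dimension protocol above maps naturally onto a per-paper record. The sketch below is a hypothetical schema of our own shorthand, not the authors' actual data dictionary; as in the text, 'N/A' marks missing values.

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    metadata: dict         # (1) title, authors, publication year, article type
    collaborators: list    # (2) countries of co-authors
    dataset: dict          # (3) category (EHR/social media/...), source, availability, size
    health_category: str   # (4) CDISC disease area and disease specification
    nlp_tasks: list        # (5) e.g. named entity recognition, relation extraction
    algorithm: str         # (6) rule-based / ML / deep learning / other
    application: str       # (7) application category
    knowledge_rep: list    # (8) ontologies, dictionaries, embeddings, BERT-style models
    artefacts: str         # (9) availability of code base and pretrained models
    deployment: str        # (10) deployment and testing in clinical settings

# One hypothetical record with several missing fields marked 'N/A'
record = ExtractionRecord(
    metadata={"title": "N/A", "year": "N/A", "type": "research paper"},
    collaborators=["UK"],
    dataset={"category": "EHR", "public": "N/A", "size": "N/A"},
    health_category="N/A",
    nlp_tasks=["named entity recognition"],
    algorithm="deep learning",
    application="N/A",
    knowledge_rep=["ontology"],
    artefacts="N/A",
    deployment="N/A",
)
```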

Data availability

The data for network analyses and the code for visualising the results are made available at https://observablehq.com/@626e582587f7e383/uk-clinical-nlp-landscaping-analysis .

Code availability

The code for automatically retrieving projects and their associated metadata is available at https://tinyurl.com/5fnvdvrh .

Murdoch, T. B. & Detsky, A. S. The inevitable application of big data to health care. J. Am. Med. Assoc. 309 , 1351–1352 (2013).

Zhang, D., Yin, C., Zeng, J., Yuan, X. & Zhang, P. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med. Inform. Decis. Mak. 20 , 1–11 (2020).

Vest, J. R., Grannis, S. J., Haut, D. P., Halverson, P. K. & Menachemi, N. Using structured and unstructured data to identify patients’ need for services that address the social determinants of health. Int. J. Med. Inform. 107 , 101–106 (2017).

Wu, H. et al. SemEHR: a general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research. J. Am. Med. Inform. Assoc. 25 , 530–537 (2018).

Kharrazi, H. et al. The value of unstructured electronic health record data in geriatric syndrome case identification. J. Am. Geriatr. Soc. 66 , 1499–1507 (2018).

Garg, R., Oh, E., Naidech, A., Kording, K. & Prabhakaran, S. Automating ischemic stroke subtype classification using machine learning and natural language processing. J. Stroke Cerebrovasc. Dis. 28 , 2045–2051 (2019).

Shah, A. D. et al. Natural language processing for disease phenotyping in UK primary care records for research: a pilot study in myocardial infarction and death. J. Biomed. Semant. 10 , 1–10 (2019).

Rannikmäe, K. et al. Developing automated methods for disease subtyping in UK biobank: an exemplar study on stroke. BMC Med. Inform. Decis. Mak. 21 , 1–9 (2021).

Fratiglioni, L., Grut, M., Forsell, Y., Viitanen, M. & Winblad, B. Clinical diagnosis of Alzheimer’s disease and other dementias in a population survey: Agreement and causes of disagreement in applying diagnostic and statistical manual of mental disorders, revised third edition, criteria. Arch. Neurol. 49 , 927–932 (1992).

Wilson, M. E. et al. Prevalence of disagreement about appropriateness of treatment between ICU patients/surrogates and clinicians. Chest 155 , 1140–1147 (2019).

Bertrand, P.-M. et al. Disagreement between clinicians and score in decision-making capacity of critically ill patients. Crit. Care Med. 47 , 337–344 (2019).

Japkowicz, N. & Stephen, S. The class imbalance problem: a systematic study. Intell. Data Anal. 6 , 429–449 (2002).

Gorinski, P. J. et al. Named entity recognition for electronic health records: a comparison of rule-based and machine learning approaches. Preprint at arXiv https://doi.org/10.48550/arXiv.1903.03985 (2019).

Rindflesch, T. C. & Fiszman, M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J. Biomed. Inform. 36 , 462–477 (2003).

Wilcox, A. B. & Hripcsak, G. The role of domain knowledge in automating medical text report classification. J. Am. Med. Inform. Assoc. 10 , 330–338 (2003).

Donnelly, K. et al. SNOMED-CT: the advanced terminology and coding system for eHealth. In Medical and Care Compunetics 3 , vol. 121 of Studies in Health Technology and Informatics , 279–290 (IOS Press, 2006).

World Health Organization. International statistical classification of diseases and related health problems. ICD-10 (World Health Organization, Geneva, Switzerland, 2016), fifth edn.

Rubin, D. L., Shah, N. H. & Noy, N. F. Biomedical ontologies: a functional perspective. Brief. Bioinforma. 9 , 75–90 (2008).

Hoehndorf, R., Dumontier, M. & Gkoutos, G. V. Evaluation of research in biomedical ontologies. Brief. Bioinforma. 14 , 696–712 (2013).

Khawandanah, J. Double or hybrid diabetes: a systematic review on disease prevalence, characteristics and risk factors. Nutr. Diabetes 9 , 1–9 (2019).

Jones, K. H. et al. Toward the development of data governance standards for using clinical free-text data in health research: position paper. J. Med. Internet Res. 22 , e16760 (2020).

NHS England. About Information Governance . https://www.england.nhs.uk/ig/about/ (2022).

Kreimeyer, K. et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J. Biomed. Inform. 73 , 14–29 (2017).

Koleck, T. A., Dreisbach, C., Bourne, P. E. & Bakken, S. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J. Am. Med. Inform. Assoc. 26 , 364–379 (2019).

Sheikhalishahi, S. et al. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med. Inform. 7 , e12239 (2019).

Velupillai, S. et al. Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. J. Biomed. Inform. 88 , 11–19 (2018).

Davidson, E. M. et al. The reporting quality of natural language processing studies: systematic review of studies of radiology reports. BMC Med. Imaging 21 , 1–13 (2021).

Casey, A. et al. A systematic review of natural language processing applied to radiology reports. BMC Med. Inform. Decis. Mak. 21 , 1–18 (2021).

Pons, E., Braun, L. M., Hunink, M. M. & Kors, J. A. Natural language processing in radiology: a systematic review. Radiology 279 , 329–343 (2016).

Wang, Y. et al. Clinical information extraction applications: a literature review. J. Biomed. Inform. 77 , 34–49 (2018).

Wu, S. et al. Deep learning in clinical natural language processing: a methodical review. J. Am. Med. Inform. Assoc. 27 , 457–470 (2020).

Spasic, I. & Nenadic, G. Clinical text data in machine learning: systematic review. JMIR Med. Inform. 8 , e17984 (2020).

Guo, Y. et al. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinform. 12 https://doi.org/10.1186/1471-2105-12-69 (2011).

Korhonen, A., Silins, I., Sun, L. & Stenius, U. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature. BMC Bioinform. 10 https://doi.org/10.1186/1471-2105-10-303 (2009).

Miwa, M., Thompson, P., McNaught, J., Kell, D. B. & Ananiadou, S. Extracting semantically enriched events from biomedical literature. BMC Bioinform. 13 https://doi.org/10.1186/1471-2105-13-108 (2012).

Wang, X. et al. Automatic extraction of angiogenesis bioprocess from text. Bioinformatics 27 , 2730–2737 (2011).

Miwa, M., Thompson, P. & Ananiadou, S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics 28 , 1759–1765 (2012).

Tsuruoka, Y., Miwa, M., Hamamoto, K., Tsujii, J. & Ananiadou, S. Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics 27 , i111–i119 (2011).

Wang, X. et al. Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature. BMC Bioinform. 12 https://doi.org/10.1186/1471-2105-12-s8-s11 (2011).

Krallinger, M. et al. The protein-protein interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinform. 12 https://doi.org/10.1186/1471-2105-12-s8-s3 (2011).

Thompson, P. et al. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinform. 12 https://doi.org/10.1186/1471-2105-12-397 (2011).

Ananiadou, S. et al. Named entity recognition for bacterial type IV secretion systems. PLoS ONE 6 , e14780 (2011).

Pyysalo, S. et al. Overview of the ID, EPI and REL tasks of BioNLP shared task 2011. BMC Bioinform. 13 https://doi.org/10.1186/1471-2105-13-s11-s2 (2012).

Sasaki, Y., Wang, X. & Ananiadou, S. Extracting secondary bio-event arguments with extraction constraints. Comput. Intell. 27 , 702–721 (2011).

Pyysalo, S. et al. Event extraction across multiple levels of biological organization. Bioinformatics 28 , i575–i581 (2012).

Thompson, P., Nawaz, R., McNaught, J. & Ananiadou, S. Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinform. 12 https://doi.org/10.1186/1471-2105-12-393 (2011).

Thompson, P., Iqbal, S. A., McNaught, J. & Ananiadou, S. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinform. 10 https://doi.org/10.1186/1471-2105-10-349 (2009).

Lewin, I., Silins, I., Korhonen, A., Hogberg, J. & Stenius, U. A new challenge for text mining: cancer risk assessment. Proc. ISMB BioLINK Spec. Interest Group Text. Data Min. 20 , 1–4 (2008).

Ali, I. et al. Grouping chemicals for health risk assessment: a text mining-based case study of polychlorinated biphenyls (PCBs). Toxicol. Lett. 241 , 32–37 (2016).

Thompson, P. et al. Text mining the history of medicine. PLoS ONE 11 , e0144717 (2016).

Bollegala, D., Kontonatsios, G. & Ananiadou, S. A cross-lingual similarity measure for detecting biomedical term translations. PLoS ONE 10 , e0126196 (2015).

Miwa, M. & Ananiadou, S. Adaptable, high recall, event extraction system with minimal configuration. BMC Bioinform. 16 https://doi.org/10.1186/1471-2105-16-s10-s7 (2015).

Korkontzelos, I., Piliouras, D., Dowsey, A. W. & Ananiadou, S. Boosting drug named entity recognition using an aggregate classifier. Artif. Intell. Med. 65 , 145–153 (2015).

Rak, R., Batista-Navarro, R. T., Carter, J., Rowley, A. & Ananiadou, S. Processing biological literature with customizable web services supporting interoperable formats. Database 2014 , bau064–bau064 (2014).

Baker, S. et al. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 32 , 432–440 (2015).

Batista-Navarro, R., Carter, J. & Ananiadou, S. Argo: enabling the development of bespoke workflows and services for disease annotation. Database 2016 , baw066 (2016).

Howard, B. E. et al. SWIFT-review: a text-mining workbench for systematic review. Syst. Rev. 5 https://doi.org/10.1186/s13643-016-0263-z (2016).

Alvaro, N. et al. Crowdsourcing twitter annotations to identify first-hand experiences of prescription drug use. J. Biomed. Inform. 58 , 280–287 (2015).

Ananiadou, S., Thompson, P., Nawaz, R., McNaught, J. & Kell, D. B. Event-based text mining for biology and functional genomics. Brief Funct. Genomics 14 , 213–230 (2014).

Mu, T., Goulermas, J. Y., Korkontzelos, I. & Ananiadou, S. Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities. J. Assoc. Inf. Sci. Technol. 67 , 106–133 (2014).

Xu, Y. et al. Anatomical entity recognition with a hierarchical framework augmented by external resources. PLoS ONE 9 , e108396 (2014).

Fu, X., Batista-Navarro, R., Rak, R. & Ananiadou, S. Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows. J. Biomed. Semant. 6 , 8 (2015).

Xu, Y. et al. Bilingual term alignment from comparable corpora in English discharge summary and Chinese discharge summary. BMC Bioinform. 16 https://doi.org/10.1186/s12859-015-0606-0 (2015).

Korkontzelos, I. et al. Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts. J. Biomed. Inform. 62 , 148–158 (2016).

Alnazzawi, N., Thompson, P., Batista-Navarro, R. & Ananiadou, S. Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. BMC Med. Inform. Decis. Mak. 15 https://doi.org/10.1186/1472-6947-15-s2-s3 (2015).

Alnazzawi, N., Thompson, P. & Ananiadou, S. Mapping phenotypic information in heterogeneous textual sources to a domain-specific terminological resource. PLoS ONE 11 , e0162287 (2016).

Le, H.-Q., Tran, M.-V., Dang, T. H., Ha, Q.-T. & Collier, N. Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction. Database 2016 , baw102 (2016).

Landeghem, S. V. et al. Large-scale event extraction from literature with multi-level gene normalization. PLoS ONE 8 , e55814 (2013).

Miwa, M. et al. A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text. Bioinformatics 29 , i44–i52 (2013).

Pyysalo, S. & Ananiadou, S. Anatomical entity mention recognition at literature scale. Bioinformatics 30 , 868–875 (2013).

Miwa, M., Pyysalo, S., Ohta, T. & Ananiadou, S. Wide coverage biomedical event extraction using multiple partially overlapping corpora. BMC Bioinform. 14 https://doi.org/10.1186/1471-2105-14-175 (2013).

Nawaz, R., Thompson, P. & Ananiadou, S. Negated bio-events: analysis and identification. BMC Bioinform. 14 https://doi.org/10.1186/1471-2105-14-14 (2013).

Mihăilă, C., Ohta, T., Pyysalo, S. & Ananiadou, S. BioCause: Annotating and analysing causality in the biomedical domain. BMC Bioinform. 14 https://doi.org/10.1186/1471-2105-14-2 (2013).

Miwa, M., Thompson, P., Korkontzelos, Y. & Ananiadou, S. Comparable study of event extraction in newswire and biomedical domains. In 25th International Conference on Computational Linguistics (2014).

Baker, S., Korhonen, A. & Pyysalo, S. Cancer hallmark text classification using convolutional neural networks. In Proc. Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016) , 1–9 (2016).

Limsopatham, N. & Collier, N. Learning orthographic features in bi-directional LSTM for biomedical named entity recognition. In Proc. Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016) , 10–19 (2016).

Limsopatham, N. & Collier, N. Normalising medical concepts in social media texts by learning semantic representation. In Proc. 54th annual meeting of the association for computational linguistics (volume 1: long papers) , 1014–1023 (2016).

Limsopatham, N. & Collier, N. Adapting phrase-based machine translation to normalise medical terms in social media messages. In Proc. the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015) , 1675–1680 (2015).

Limsopatham, N. & Collier, N. Modelling the combination of generic and target domain embeddings in a convolutional neural network for sentence classification (Association for Computational Linguistics, 2016).

Larsson, K. et al. Text mining for improved exposure assessment. PLoS ONE 12 , e0173132 (2017).

Wu, H. et al. SemEHR: a general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research. J. Am. Med. Inform. Assoc. 25 , 530–537 (2018).

Carr, E. et al. Evaluation and improvement of the national early warning score (NEWS2) for COVID-19: a multi-hospital study. BMC Med. 19 https://doi.org/10.1186/s12916-020-01893-3 (2021).

Bean, D. M. et al. Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data. PLoS ONE 14 , e0225625 (2019).

Kugathasan, P. et al. Association of physical health multimorbidity with mortality in people with schizophrenia spectrum disorders: using a novel semantic search system that captures physical diseases in electronic patient records. Schizophr. Res. 216 , 408–415 (2020).

Wu, H. et al. Efficient reuse of natural language processing models for phenotype-mention identification in free-text electronic medical records: a phenotype embedding approach. JMIR Med. Inform. 7 , e14782 (2019).

Viani, N. et al. Temporal information extraction from mental health records to identify duration of untreated psychosis. J. Biomed. Semant. 11 https://doi.org/10.1186/s13326-020-00220-2 (2020).

Jackson, R. et al. Knowledge discovery for deep phenotyping serious mental illness from electronic mental health records. F1000Research 7 , 210 (2018).

Ramu, N., Kolliakou, A., Sanyal, J., Patel, R. & Stewart, R. Recorded poor insight as a predictor of service use outcomes: cohort study of patients with first-episode psychosis in a large mental healthcare database. BMJ Open 9 , e028929 (2019).

Abdollahyan, M., Smeraldi, F., Patel, R. & Bessant, C. Investigating comorbidity of mental and physical disorders in online health forums. In Proc. 3rd International Conference on Applications of Intelligent Systems (ACM, 2020). https://doi.org/10.1145/3378184.3378195 .

Rogers, J. P. et al. Catatonia: demographic, clinical and laboratory associations. Psychol. Med. 1–11 https://doi.org/10.1017/s0033291721004402 (2021).

Chesney, E. et al. The impact of cigarette smoking on life expectancy in schizophrenia, schizoaffective disorder and bipolar affective disorder: an electronic case register cohort study. Schizophr. Res. 238 , 29–35 (2021).

Colling, C. et al. Predicting high-cost care in a mental health setting. BJPsych Open 6 https://doi.org/10.1192/bjo.2019.96 (2020).

Viani, N. et al. A natural language processing approach for identifying temporal disease onset information from mental healthcare text. Sci. Rep. 11 https://doi.org/10.1038/s41598-020-80457-0 (2021).

Irving, J. et al. Gender differences in clinical presentation and illicit substance use during first episode psychosis: a natural language processing, electronic case register study. BMJ Open 11 , e042949 (2021).

Wesley, E. W. et al. Gender disparities in clozapine prescription in a cohort of treatment-resistant schizophrenia in the south London and Maudsley case register. Schizophr. Res. 232 , 68–76 (2021).

Patel, R. et al. Impact of the COVID-19 pandemic on remote mental healthcare and prescribing in psychiatry: an electronic health record study. BMJ Open 11 , e046365 (2021).

Bhavsar, V. et al. The association between neighbourhood characteristics and physical victimisation in men and women with mental disorders. BJPsych Open 6 https://doi.org/10.1192/bjo.2020.52 (2020).

Downs, J. et al. Negative symptoms in early-onset psychosis and their association with antipsychotic treatment failure. Schizophr. Bull. 45 , 69–79 (2018).

Irving, J. et al. Using natural language processing on electronic health records to enhance detection and prediction of psychosis risk. Schizophr. Bull. 47 , 405–414 (2020).

Mascio, A. et al. Cognitive impairments in schizophrenia: a study in a large clinical sample using natural language processing. Front. Digit. Health 3 https://doi.org/10.3389/fdgth.2021.711941 (2021).

McDonald, K. et al. Prevalence and incidence of clinical outcomes in patients presenting to secondary mental health care with mood instability and sleep disturbance. Eur. Psychiatry 63 https://doi.org/10.1192/j.eurpsy.2020.39 (2020).

Werbeloff, N. et al. The Camden and Islington research database: Using electronic mental health records for research. PLoS ONE 13 , e0190703 (2018).

Viani, N. et al. Time expressions in mental health records for symptom onset extraction. In Proc. Ninth International Workshop on Health Text Mining and Information Analysis (Association for Computational Linguistics, 2018). https://doi.org/10.18653/v1/w18-5621 .

Baker, S. et al. Cancer hallmarks analytics tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer. Bioinformatics 33 , 3973–3981 (2017).

Chiu, B. et al. A neural classification method for supporting the creation of BioVerbNet. J. Biomed. Semant. 10 https://doi.org/10.1186/s13326-018-0193-x (2019).

Chiu, B., Pyysalo, S., Vulić, I. & Korhonen, A. Bio-SimVerb and bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine. BMC Bioinform. 19 https://doi.org/10.1186/s12859-018-2039-z (2018).

Pyysalo, S. et al. LION LBD: a literature-based discovery system for cancer biology. Bioinformatics 35 , 1553–1561 (2018).

Crichton, G., Guo, Y., Pyysalo, S. & Korhonen, A. Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches. BMC Bioinform. 19 https://doi.org/10.1186/s12859-018-2163-9 (2018).

Crichton, G., Pyysalo, S., Chiu, B. & Korhonen, A. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinform. 18 https://doi.org/10.1186/s12859-017-1776-8 (2017).

Crichton, G., Baker, S., Guo, Y. & Korhonen, A. Neural networks for open and closed literature-based discovery. PLoS ONE 15 , e0232891 (2020).

Butters, O. W., Wilson, R. C., Garner, H. & Burton, T. W. Y. PUblications metadata augmentation (PUMA) pipeline. F1000Research 9 , 1095 (2020).

Trieu, H.-L. et al. DeepEventMine: end-to-end neural nested event extraction from biomedical texts. Bioinformatics 36 , 4910–4917 (2020).

Soto, A. J., Przybyła, P. & Ananiadou, S. Thalia: semantic search engine for biomedical abstracts. Bioinformatics 35 , 1799–1801 (2018).

Zerva, C., Batista-Navarro, R., Day, P. & Ananiadou, S. Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 33 , 3784–3792 (2017).

Thompson, P. et al. Annotation and detection of drug effects in text for pharmacovigilance. J. Cheminform. 10 https://doi.org/10.1186/s13321-018-0290-y (2018).

Christopoulou, F., Tran, T. T., Sahu, S. K., Miwa, M. & Ananiadou, S. Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods. J. Am. Med. Inform. Assoc. 27 , 39–46 (2019).

Soto, A. J., Zerva, C., Batista-Navarro, R. & Ananiadou, S. LitPathExplorer: a confidence-based visual text analytics tool for exploring literature-enriched pathway models. Bioinformatics 34 , 1389–1397 (2017).

Ju, M., Nguyen, N. T. H., Miwa, M. & Ananiadou, S. An ensemble of neural models for nested adverse drug events and medication extraction with subwords. J. Am. Med. Inform. Assoc. 27 , 22–30 (2019).

Shardlow, M. et al. Identification of research hypotheses and new knowledge from scientific literature. BMC Med. Inform. Decis. Mak. 18 https://doi.org/10.1186/s12911-018-0639-1 (2018).

Kontonatsios, G. et al. A semi-supervised approach using label propagation to support citation screening. J. Biomed. Inform. 72 , 67–76 (2017).

Le, H. et al. Large-scale exploration of neural relation classification architectures. https://www.repository.cam.ac.uk/handle/1810/288012 (2020).

Prokhorov, V., Pilehvar, M. & Collier, N. Generating knowledge graph paths from textual definitions using sequence-to-sequence models. https://www.repository.cam.ac.uk/handle/1810/291464 (2019).

Alvaro, N., Miyao, Y. & Collier, N. TwiMed: Twitter and PubMed comparable corpus of drugs, diseases, symptoms, and their relations. JMIR Public Health Surveill. 3 , e24 (2017).

Kartsaklis, D., Pilehvar, M. & Collier, N. Mapping text to knowledge graph entities using multi-sense LSTMs. https://www.repository.cam.ac.uk/handle/1810/287907 (2020).

Basaldella, M., Liu, F., Shareghi, E. & Collier, N. COMETA: A corpus for medical entity linking in the social media. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, 2020). https://doi.org/10.18653/v1/2020.emnlp-main.253 .

Elkaref, M. & Hassan, L. A joint training approach to tweet classification and adverse effect extraction and normalization for SMM4h 2021. In Proc. Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task (Association for Computational Linguistics, 2021). https://doi.org/10.18653/v1/2021.smm4h-1.16 .

Marshall, I. J., Noel-Storr, A., Kuiper, J., Thomas, J. & Wallace, B. C. Machine learning for identifying randomized controlled trials: an evaluation and practitioner’s guide. Res. Synth. Methods 9 , 602–614 (2018).

Wallace, B. C. et al. Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. J. Am. Med. Inform. Assoc. 24 , 1165–1168 (2017).

Thomas, J. et al. Machine learning reduced workload with minimal risk of missing studies: development and evaluation of a randomized controlled trial classifier for Cochrane reviews. J. Clin. Epidemiol. 133 , 140–151 (2021).

Singh, G., Marshall, I. J., Thomas, J., Shawe-Taylor, J. & Wallace, B. C. A neural candidate-selector architecture for automatic structured clinical text annotation. In Proc. 2017 ACM on Conference on Information and Knowledge Management (ACM, 2017). https://doi.org/10.1145/3132847.3132989 .

Beck, T. et al. Auto-corpus: a natural language processing tool for standardising and reusing biomedical literature. Preprint at bioRxiv https://doi.org/10.1101/2021.01.08.425887 (2021).

Viani, N., Patel, R., Stewart, R. & Velupillai, S. Generating positive psychosis symptom keywords from electronic health records. In Conference on Artificial Intelligence in Medicine in Europe , 298–303 (Springer, 2019).

Patel, R. et al. Impact of the COVID-19 pandemic on remote mental healthcare and prescribing in psychiatry: an electronic health record study. BMJ Open 11 , e046365 (2021).

Viani, N. et al. Annotating temporal relations to determine the onset of psychosis symptoms. In MedInfo , 418–422 (2019).

Patel, R., Smeraldi, F., Abdollahyan, M., Irving, J. & Bessant, C. Investigating mental and physical disorders associated with COVID-19 in online health forums. BMJ Open 11 , e056601 (2021).

Basaldella, M. & Collier, N. Bioreddit: Word embeddings for user-generated biomedical NLP. In Proc. Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019) , 34–38 (2019).

Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. Preprint at arXiv https://doi.org/10.48550/arXiv.2010.11784 (2020).

Vivekanantham, A., Belousov, M., Hassan, L., Nenadic, G. & Dixon, W. G. Patient discussions of glucocorticoid-related side effects within an online health community forum. Ann. Rheum. Dis. 79 , 1121–1122 (2020).

Singh, G., Sabet, Z., Shawe-Taylor, J. & Thomas, J. Constructing artificial data for fine-tuning for low-resource biomedical text tagging with applications in pico annotation. In Explainable AI in Healthcare and Medicine , 131–145 (Springer, 2021).

Jackson, R. et al. CogStack: experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital. BMC Med. Inform. Decis. Mak. 18 , 1–13 (2018).

Dong, H. et al. Automated clinical coding: what, why, and where we are. npj Digit. Med. 5 , 159 (2022).

Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv https://doi.org/10.48550/arXiv.1301.3781 (2013).

Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 1–9 (2016).

Noor, K. et al. Deployment of a free-text analytics platform at a UK national health service research hospital: CogStack at University College London Hospitals. JMIR Med. Inform. 10 , e38122 (2022).

Wang, T. et al. Implementation of a real-time psychosis risk detection and alerting system based on electronic health records using CogStack. J. Vis. Exp. e60794 (2020).

Braithwaite, T. et al. 212 Preventing blindness for patients with optic disc swelling: improving care using transformative new technology (2022).

Tissot, H. C. et al. Natural language processing for mimicking clinical trial recruitment in critical care: a semi-automated simulation based on the leopards trial. IEEE J. Biomed. Health Inform. 24 , 2950–2959 (2020).

Kraljevic, Z. et al. Multi-domain clinical natural language processing with MedCAT: the medical concept annotation toolkit. Artif. Intell. Med. 117 , 102083 (2021).

Dong, H., Suárez-Paniagua, V., Whiteley, W. & Wu, H. Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation. J. Biomed. Inform. 116 , 103728 (2021).

Williamson, E. J. et al. OpenSAFELY: factors associated with COVID-19 death in 17 million patients. Nature 584 , 430 (2020).

Brekke, P. H., Rama, T., Pilán, I., Nytrø, Ø. & Øvrelid, L. Synthetic data for annotation and extraction of family history information from clinical text. J. Biomed. Semant. 12 , 1–11 (2021).

Névéol, A., Dalianis, H., Velupillai, S., Savova, G. & Zweigenbaum, P. Clinical natural language processing in languages other than English: opportunities and challenges. J. Biomed. Semant. 9 , 1–13 (2018).

Joshi, P., Santy, S., Budhiraja, A., Bali, K. & Choudhury, M. The state and fate of linguistic diversity and inclusion in the NLP world. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL2020) , 6282–6293 (2020).

Savage, N. The race to the top among the world’s leaders in artificial intelligence. Nature 588 , S102 (2020).

The World Bank. GDPs of All Countries and Economies . https://data.worldbank.org/indicator/NY.GDP.MKTP.CD (2022). Accessed 03 October 2022.

The World Bank. Populations of All Countries and Economies . https://data.worldbank.org/indicator/SP.POP.TOTL (2022). Accessed 03 October 2022.

US Congress. HR 3590: Patient Protection and Affordable Care Act. In 111th Congress , vol. 2010 (2009).

Nawab, K., Ramsey, G. & Schreiber, R. Natural language processing to extract meaningful information from patient experience feedback. Appl. Clin. Inform. 11 , 242–252 (2020).

Woller, B. et al. Natural language processing performance for the identification of venous thromboembolism in an integrated healthcare system. Clin. Appl. Thromb. Hemost. 27 , 10760296211013108 (2021).

Lineback, C. M. et al. Prediction of 30-day readmission after stroke using machine learning and natural language processing. Front. Neurol. 1069 (2021).

Joshi, I. & Morley, J. Artificial intelligence: how to get it right. Putting policy into practice for safe data-driven innovation in health and care (NHSX, London, 2019).

Topol, E. et al. The Topol Review. Preparing the healthcare workforce to deliver the digital future . 1–48 (2019).

Styler, W. F. et al. Temporal annotation in the clinical domain. Trans. Assoc. Comput. Linguist. 2 , 143–154 (2014).

Uzuner, Ö., South, B. R., Shen, S. & DuVall, S. L. 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc. 18 , 552–556 (2011).

Roberts, A. et al. Building a semantically annotated corpus of clinical texts. J. Biomed. Inform. 42 , 950–966 (2009).

Stewart, R. et al. The south London and Maudsley NHS foundation trust biomedical research centre (slam brc) case register: development and descriptive data. BMC Psychiatry 9 , 1–12 (2009).

Wu, S. & Dredze, M. Beto, bentz, becas: the surprising cross-lingual effectiveness of bert. In Proc of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 833–844 (2019).

Liu, M. et al. Federated learning meets natural language processing: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2107.12603 (2021).

Research, U. & Innovation. UKRI—Our councils. https://www.ukri.org/councils/ (2022). Accessed 05 April 2022.

Borgatti, S. P. & Everett, M. G. A graph-theoretic perspective on centrality. Soc. Netw. 28 , 466–484 (2006).

Penrose, M. D. On k-connectivity for a geometric random graph. Random Struct. Algorithms 15 , 145–164 (1999).

Fruchterman, T. M. & Reingold, E. M. Graph drawing by force-directed placement. Software 21 , 1129–1164 (1991).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , 4171–4186 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).

Download references

Acknowledgements

This work was supported by the UK’s National Text Analytics project, which is funded by Health Data Research UK and the Medical Research Council. HW is supported by the Medical Research Council (MR/S004149/2), the National Institute for Health Research (NIHR) (NIHR202639), the British Council (UCL-NMU-SEU International Collaboration) and the University of Edinburgh (The Advanced Care Research Centre Programme). RS is part-funded by (i) the NIHR Biomedical Research Centre at the South London and Maudsley NHS Foundation Trust and King’s College London; (ii) the National Institute for Health Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) at King’s College Hospital NHS Foundation Trust; (iii) the DATAMIND HDR UK Mental Health Data Hub (MRC grant MR/W014386).

The authors thank other members of the Health Data Research UK’s National Text Analytics Project, whose names are not listed in the author list, for their valuable support, input and suggestions.

Author information

Authors and affiliations.

Institute of Health Informatics, University College London, London, UK

Honghan Wu, Minhong Wang, Jinge Wu, Yun-Hsuan Chang, Natalie Fitzpatrick, Alex Handy, Anoop Dinesh Shah & Richard J. B. Dobson

Usher Institute, University of Edinburgh, Edinburgh, UK

Jinge Wu, Farah Francis, Hang Dong, Michael T. C. Poon, William Whiteley & Cathie Sudlow

Research Department of Pathology, UCL Cancer Institute, University College London, London, UK

Alex Shavick & Adam P. Levine

Department of Computer Science, University of Oxford, Oxford, UK

Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK

Luke T. Slater, Andreas Karwath & Georgios V. Gkoutos

University College London Hospitals NHS Trust, London, UK

Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK

Claude Chelala

Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King’s College London, London, UK

Robert Stewart

South London and Maudsley NHS Foundation Trust, London, UK

Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK

Nigel Collier

Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK

Beatrice Alex

Department of Biostatistics & Health Informatics, King’s College London, London, UK

Angus Roberts & Richard J. B. Dobson


Contributions

H.W., R.D., A.R. and C.S. conceptualised this study. H.W. conducted the data extraction and data analysis, and drafted the first version of the manuscript. A.S., F.F., J.W., M.W. and Y.C. conducted the screening and data extraction for the literature review. H.D., M.P., N.F., A.L., L.S., A.H., A.K., G.G., C.C., A.S., R.S., N.C., B.A., W.W., A.R. and R.D. revised the paper. All authors reviewed and approved the paper.

Corresponding author

Correspondence to Honghan Wu.

Ethics declarations

Competing interests.

R.S. declares research support received in the last 3 years, from Janssen, GSK and Takeda. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Wu, H., Wang, M., Wu, J. et al. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. npj Digit. Med. 5, 186 (2022). https://doi.org/10.1038/s41746-022-00730-6


Received: 20 June 2022

Accepted: 29 November 2022

Published: 21 December 2022

DOI: https://doi.org/10.1038/s41746-022-00730-6


This article is cited by

From admission to discharge: a systematic review of clinical natural language processing along the patient journey.

  • Katrin Klug
  • Katharina Beckh
  • Sven Giesselbach

BMC Medical Informatics and Decision Making (2024)

A critical assessment of using ChatGPT for extracting structured data from clinical notes

  • Jingwei Huang
  • Donghan M. Yang

npj Digital Medicine (2024)

Narrative review of recent developments and the future of penicillin allergy de-labelling by non-allergists

  • Neil Powell
  • Michael Blank
  • Jonathan Sandoe

npj Antimicrobials and Resistance (2024)


Int J Environ Res Public Health

A Narrative Literature Review of Natural Language Processing Applied to the Occupational Exposome

Annika M. Schoene

1 Department of Computer Science, University of Manchester, Manchester M13 9PL, UK

Ioannis Basinas

2 Department of Health Science, University of Manchester, Manchester M13 9PL, UK; [email protected] (I.B.); [email protected] (M.v.T.)

Martie van Tongeren

Sophia Ananiadou

Associated Data

Not applicable.

Abstract

The evolution of the Exposome concept revolutionised research in exposure assessment and epidemiology by introducing the need for a more holistic approach to exploring the relationship between the environment and disease. At the same time, further and more dramatic changes have occurred in the working environment, adding to its already dynamic nature. Natural Language Processing (NLP) refers to a collection of methods for identifying, reading, extracting and ultimately transforming large collections of language. In this work, we aim to give an overview of how NLP has successfully been applied thus far in Exposome research. Methods: We conducted a literature search on PubMed, Scopus and Web of Science for scientific articles published between 2011 and 2021. We used both quantitative and qualitative methods to screen papers and provide insights into the inclusion and exclusion criteria. We outline our approach for article selection and provide an overview of our findings, followed by a more detailed discussion of selected articles. Results: Overall, 6420 articles were screened for suitability for this review, of which 37 articles are reviewed in depth. Finally, we discuss future avenues of research and outline challenges in existing work. Conclusions: Our results show that (i) the number of published articles applying NLP to exposure and epidemiology research has increased over time, (ii) most work uses existing NLP tools and (iii) traditional machine learning is the most popular approach.

1. Introduction

Natural Language Processing is an area of research within Artificial Intelligence (AI) that is concerned with giving computers the ability to understand natural language (spoken and written) in the same way a human could [1]. Knowledge of computational linguistics (rule-based modelling of human language), statistics, machine learning and deep learning is used, either individually or in combination, to achieve this goal [2]. The term Exposome was first introduced by [3], who defined an area of research that takes systematic measurements of the exposures (e.g., occupational, physical environment or socio-economic factors) that a person encounters throughout life (pre-birth until death) and that affect their health outcomes [3]. However, the term Exposome itself has not yet been fully integrated into all areas of exposure research, and the term ‘exposure research’ is often used to refer to the same or similar concepts [4]. At the same time, text mining and NLP techniques are increasingly applied in a variety of exposure-related research areas. Whilst there are a variety of surveys and literature reviews on NLP and its various subtasks [5, 6, 7], there is no review of NLP and text mining techniques used in the field of occupational and environmental exposure research. This review fills that gap by describing existing tools based on NLP and text mining techniques that have been applied in occupational and environmental exposure research. For this, we utilise a hybrid approach combining classical and automatic reviewing methods with RobotAnalyst [8], a recently developed web-based software system that combines text mining and machine learning algorithms. Papers published in the PubMed, Scopus and Web of Science (WoS) databases are screened and reviewed to answer the following research questions:

  • What are the most common text mining and NLP approaches used in exposure assessment research?
  • What resources are used for this task?
  • What are the most common NLP methods used?
  • What are the main challenges and future directions of research?

2. Review Methodology

In this literature review, a search was conducted in three scientific literature databases. We include only peer-reviewed articles available in full; our search returned 6420 articles, of which 5957 were selected for pre-screening after duplicates were removed. In Figure 1, we show the article selection process for this review, where for each search on the three different platforms (PubMed, Scopus and Web of Science), we used the following query terms:

Figure 1. Overview of the article selection process used in this narrative literature review.

  • (“natural language processing” OR “text mining” OR “text-mining” OR “text and data mining” OR ontology OR lexic* OR corpus OR corpora) AND (exposome OR exposure OR socioexposome OR (“risk factor” AND (“work” OR “occupational” OR “environmental*”)))
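For reproducibility, a boolean query of this shape can be assembled programmatically rather than pasted by hand across the three platforms; a minimal sketch (the `any_of` helper and the exact grouping are our own, not part of the published protocol):

```python
# Build the boolean search query used across PubMed, Scopus and Web of Science.
# The grouping mirrors the query terms listed above.

def any_of(terms):
    """OR-join a list of terms, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

nlp_terms = any_of([
    "natural language processing", "text mining", "text-mining",
    "text and data mining", "ontology", "lexic*", "corpus", "corpora",
])

risk_factor = f'("risk factor" AND {any_of(["work", "occupational", "environmental*"])})'
exposure_terms = "(" + " OR ".join(
    ["exposome", "exposure", "socioexposome", risk_factor]) + ")"

query = f"{nlp_terms} AND {exposure_terms}"
print(query)
```

Keeping the query in one place makes it straightforward to re-run the same search on each platform, or to adjust a single term set without editing three copies by hand.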

Pre-screening was performed as a two-step process. First, to reduce human workload, we utilised RobotAnalyst [8] to identify 998 full papers. RobotAnalyst is a web-based and freely available software system that utilises both text mining and machine learning methods to categorise and rank references by their relevance (free access to RobotAnalyst can be requested to reproduce this work: http://www.nactem.ac.uk/robotanalyst/ (accessed on 2 November 2021)). The system uses an iterative classification process which makes decisions based on the abstract of each reference. Next, we manually screened the titles and abstracts of those papers using the inclusion and exclusion criteria outlined below, which were provided by two experts in occupational exposure. Based on these criteria, we identified 80 papers that specifically focused on text mining and/or natural language processing in the field of exposure research. Next, the full papers were reviewed for their relevance to occupational exposure and their usage of NLP or text mining methods. Finally, the full texts of 40 papers were retrieved and reviewed in full, resulting in a total of 37 articles that fulfilled our defined inclusion and exclusion criteria.
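The iterative prioritisation behind screening tools such as RobotAnalyst can be illustrated, at a much smaller scale, by repeatedly ranking unscreened abstracts against the vocabulary of already-included ones. The toy sketch below uses keyword-overlap scoring on hypothetical abstracts; it is not the actual RobotAnalyst algorithm:

```python
from collections import Counter

def tokens(text):
    """Crude tokeniser: lowercase words longer than three characters."""
    return [w for w in text.lower().split() if len(w) > 3]

def rank_unscreened(included, unscreened):
    """Rank unscreened abstracts by keyword overlap with included ones."""
    profile = Counter()
    for doc in included:
        profile.update(tokens(doc))

    def score(doc):
        return sum(profile[w] for w in tokens(doc))

    return sorted(unscreened, key=score, reverse=True)

# Hypothetical abstracts: screening starts from one relevant seed paper.
included = ["text mining of occupational exposure reports"]
unscreened = [
    "dietary exposure in laboratory mice",
    "text mining methods for exposure assessment",
    "a history of the printing press",
]
ranked = rank_unscreened(included, unscreened)
print(ranked[0])  # the most relevant-looking abstract is surfaced first
```

In a real active-learning loop, each newly screened abstract would be added to `included` (or an exclusion set) and the ranking recomputed, so reviewer effort is concentrated on likely-relevant papers.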

Inclusion criteria:

  • Original work;
  • Study exposures concerning humans;
  • Study occupational and/or environmental exposures of humans, such as airborne agents (e.g., particulates or substances and biological agents (viruses)), stressors, psycho-social and physical (e.g., muscle-skeletal) exposures as well as workplace accidents;
  • Have their full texts available;
  • Are written in English;
  • Focus on text mining or natural language processing and their texts containing a method, experiments and result section.

Exclusion criteria:

  • Studied animal or plant exposures;
  • Studied drug, nutrition or dietary exposures on humans;
  • Written in another language than English;
  • Commentaries, opinion papers or editorials.
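The criteria above can be encoded as a simple screening filter over per-study metadata flags, which helps keep manual screening decisions consistent; a sketch in which the field names are illustrative, not taken from the review protocol:

```python
def passes_screening(study):
    """Apply the inclusion and exclusion criteria listed above."""
    include = (
        study["original_work"]
        and study["human_exposures"]
        and study["occupational_or_environmental"]
        and study["full_text_available"]
        and study["language"] == "English"
        and study["uses_nlp_or_text_mining"]
    )
    exclude = (
        study["animal_or_plant_exposures"]
        or study["drug_or_dietary_exposures"]
        or study["article_type"] in {"commentary", "opinion", "editorial"}
    )
    return include and not exclude

# A hypothetical study record that satisfies every criterion.
candidate = {
    "original_work": True, "human_exposures": True,
    "occupational_or_environmental": True, "full_text_available": True,
    "language": "English", "uses_nlp_or_text_mining": True,
    "animal_or_plant_exposures": False, "drug_or_dietary_exposures": False,
    "article_type": "article",
}
print(passes_screening(candidate))  # → True
```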

3. Results

In the following section, we summarise the findings of this literature review, focusing on the types of resources used, computational methods and existing NLP tools. In Figure 2, we show the number of papers published each year, where we can observe an increase in publications over time. We also categorise each paper in Table 1 based on NLP tools used, resources and computational method. Finally, we give a brief overview of the literature reviews and qualitative research in this area.

Figure 2. Number of NLP papers applied to occupational exposure research published each year from 2010 to 2021.

Table 1. A categorisation of each paper based on tools used, resources and computational methods.

NLP tools used:
  • NLTK: [ , , , , ], [ ]
  • Other: [ , , , , ], [ , , ], [ , , , ], [ , , ]
  • Not declared: [ , , , , , ], [ , , , , ], [ , , ]

Resources:
  • Scientific literature: [ , , , , , ], [ , , , ], [ , , ], [ , , , , , ], [ , , , ]
  • Existing database: [ , , , , ]
  • Twitter: [ , , ]
  • EHR: [ , , ]
  • Accident reports: [ , , ]

Computational methods:
  • Machine learning: [ , , , ], [ , , , ], [ , , , , ], [ , , , , ], [ , , , ]
  • Knowledge based: [ , , , , ]
  • Database creation and fusion: [ , , , , , ], [ , , ]
  • Rule-based algorithms: [ , ]
  • A.   Resources

Different types of resources are used, the most common being the existing scientific literature (see Figure 3). Other data sources include databases, social media platforms, electronic health records and accident reports (see Table 1).

Figure 3. A chart showing the different types of resources used in the selected articles.

  • B.   Computational Methods

Overall, four main categories of computational approach are used: machine learning, knowledge-based approaches, database creation and fusion, and rule-based algorithms. Figure 4 shows the split of computational approaches found in this review.

Figure 4. A chart showing the computational methods utilised in the selected articles.

  • C.   Existing NLP tools

A number of different existing NLP preprocessing tools are used (see Figure 5), with NLTK [49] the most commonly used for preprocessing textual data. Given the vast number of different NLP tools used in the remaining studies, we summarise these tools as ‘Other’. It should be noted, however, that a large number of studies did not declare which text mining tool was used in their work.

Figure 5. A chart showing a summary of the different types of NLP tools in each article.

3.1. Machine Learning Methods

Ref. [9] proposes a contactless clinical decision support system to diagnose patients with COVID-19 and monitor quarantine progression using Electronic Health Records. Relevant keywords are extracted from unstructured text using NLTK, and the results are added to a searchable database. The final steps of this work include integrating the system with cloud services and visualisation to make results accessible to clinicians. The work by [28] proposes a computational approach for mapping the impact of climate change on global health via the scientific literature. A total of 3730 papers are labelled manually and subsequently fed into an SVM (Support Vector Machine) to classify the unlabelled documents into the different label categories. Next, topic modelling is used to analyse and visualise the content of the literature. The authors of [15] propose using the scientific literature on PubMed to assess the impact of environmental exposures from early life, applying different unsupervised learning methods (e.g., LDA (Latent Dirichlet Allocation)) to gain insight into the different topics. The work by [29] models the impact of COPD (chronic obstructive pulmonary disease) from smoking using Adverse Outcome Pathways generated from the scientific literature. The literature is collected and filtered from PubMed to create a corpus, which is then clustered using the text mining approach proposed by [50]. Research by [10] classifies aviation incident reports into two categories to improve safety using an LSTM (Long Short-Term Memory) with attention. A total of 200,000 reports are preprocessed using NLTK, and word vectors are generated using ULMFiT (Universal Language Model Fine-tuning for Text Classification) [51]. Ref. [12] extracts information from the scientific literature to evaluate the impact of human exposure to electromagnetic fields, where topic modelling is used to generate domain-specific lexicons.
Work by [42] develops a computational literature review approach for in utero exposure to environmental pollutants, aiming to identify multiple chemicals and their health effects and to reduce the burden of manual literature reviews. The titles and abstracts of 54,134 papers are clustered using the DoCTER software [16]. The authors of [30] propose a network-based predictive model to assess chemical toxicity for risk assessment of environmental pollutants. The Registry of Toxic Effects of Chemical Substances (RTECS) database [52] is used, where chemicals were annotated with an identifier representing their structure. Work by [13] introduces a supervised machine learning approach to complement a previous manual literature retrieval for the Exposome-Explorer database [53], where an extensive variety of machine learning algorithms are evaluated using Scikit-learn [54]. Ref. [48] uses multivariable logistic regression to classify the spread of household transmission of COVID-19 in healthcare workers. As part of this work, term-frequency inverse document frequency (tf-idf) matrices are used to match confirmed cases by residential address. The authors of [17] use Chinese accident reports for safety risk analysis in the construction industry, where a software tool called ROST is used to preprocess the documents and perform cluster and network structural analysis. Research conducted by [14] develops a corpus of over 3500 abstracts that were manually annotated by an Exposome expert for chemical exposures according to a taxonomy. The taxonomy is based on 32 nodes and was split into two categories: biomonitoring and exposure routes. Finally, the data were fed into an SVM (Support Vector Machine) to classify unseen documents.
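The tf-idf representation mentioned above weights each term by its frequency in a document, discounted by how many documents contain it: w(t, d) = tf(t, d) · log(N / df(t)). A minimal sketch from first principles, on toy documents rather than any study's data:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenised documents."""
    n = len(docs)
    df = Counter()               # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)        # raw term frequency within the document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["exposure", "report", "injury"],
    ["exposure", "chemical"],
    ["chemical", "injury", "injury"],
]
w = tfidf(docs)
# "exposure" appears in 2 of 3 documents, so its idf is log(3/2) ≈ 0.405
print(round(w[0]["exposure"], 3))
```

Terms that occur in every document get weight zero (log(1) = 0), which is why tf-idf suppresses uninformative boilerplate words while highlighting terms that distinguish one document from another.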
The authors of [11] analyse the sentiment of tweets collected from a specific geolocation (Texas counties along I-20) to determine whether there is a link between CVD (cardiovascular disease) rates and risk factors mentioned in the tweets. A voting classifier is used to classify the sentiment of each tweet as positive or negative, achieving an accuracy of 73.69%. Ref. [31] developed an ensemble classifier, called SOCcer, to map job titles to standard occupational classification (SOC) codes. For this, a variety of publicly available resources were used to match job titles and tasks to the US SOC-2010 code, resulting in a knowledge base of around 62,000 linked jobs. To train the ensemble classifier, job descriptions from a bladder cancer study were used as training data, whereas the algorithm was evaluated on job titles associated with personal airborne measurements taken during inspections. Research conducted by [18] collected data using Twitter’s API for ‘asthma’, and both manual (e.g., expert annotation and evaluation) and automatic analysis (e.g., topic modelling) were conducted to identify health-related tweets. One of the dominant topics identified by experts was environmental influences and references to triggers of asthma. The work by [22] uses text mining to assess chemical health risks, where PubMed abstracts are used to identify the mode of action (MOA) of carcinogens. For this work, they use the previously developed CRAB tool [55], which uses a bag-of-words approach to convert abstracts into vectors. Then, an SVM classifier with a Jensen–Shannon divergence (JSD) kernel is trained to categorise the abstracts into a predefined taxonomy. The work by [23] develops a ranking algorithm to automatically recommend scientific abstracts for curation at the CTD (Comparative Toxicogenomics Database [56]).
This is completed by screening each abstract and assigning a document relevancy score (DRS), where 3583 articles from PubMed are used for this task. To analyse each abstract, a variety of text mining tools and approaches are used, including ABNER [57], MetaMap [58] and Oscar3 [59] for gene/protein recognition and chemical recognition, respectively. Finally, a ranking algorithm is developed that sorts abstracts by curation relevance. The authors of [24] introduce a new method to classify biomedical documents for curation using the Comparative Toxicogenomics Database (CTD). A total of 1059 previously collected articles are annotated for entities (e.g., genes, chemicals, diseases and their respective interactions), and manual abstract annotation is performed for chemicals relevant to the CTD. Finally, the documents are classified using an SVM. The authors of [25] use 225 electric power causality accident reports from China to identify factors that contribute to personal injury. TF-IDF is used to obtain word frequencies in each document, and the results are subsequently visualised using word clouds. The results are then used to extract key information on the dangers described in the reports. Our results also show that the majority of papers in this section utilise existing literature or databases to extract new information or classify unseen documents into existing categories. Classification experiments are performed using a wide variety of existing supervised machine learning algorithms (e.g., SVM or logistic regression). At the same time, new information is commonly uncovered and visualised using unsupervised learning methods (e.g., LSA or PCA). NLTK is a commonly used tool for preprocessing textual data, but other NLP tools are also utilised that may be more suitable for different languages or domains (e.g., ROST or CRAB).
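Among the ensemble approaches discussed in this section (e.g., the voting classifier in [11]), the voting step itself reduces to a majority count over the base models' labels; a generic toy sketch, not any study's actual implementation:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most base classifiers.

    predictions: one label per base classifier. Ties are resolved by
    first-seen order in Counter, so an odd number of voters is safest.
    """
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base classifiers labelling the same tweet.
votes = ["positive", "negative", "positive"]
print(majority_vote(votes))  # → positive
```

The appeal of hard voting is that the base models need not share a feature representation: a lexicon-based scorer, an SVM and a neural model can all contribute one vote each.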

3.2. Knowledge-Based Methods

Ref. [43] investigates Adverse Outcome Pathways (AOPs) of pesticide exposure based on scientific literature collected on PubMed. For this, the recently developed AOP-helpFinder [60] is extended, becoming AOP-helpFinder 2. The following properties were added: (i) the ability to automatically process and screen abstracts from PubMed, (ii) linking stressors with a dictionary of events and (iii) calculating scores based on the position and weighted score of all event types. The tool is then evaluated by applying it to screen for a list of pesticides with unknown long-term exposure effects on human health. Research conducted by [44] utilises a linguistic analysis of 261 scientific abstracts related to the ‘Exposome’ to gain insight into the current range of exposome research. A literature search was performed, and the analysis was conducted using a combination of TerMine [61] and NLTK [49] to extract multi-word terms and compute word frequency counts. The second part of this analysis uses over 500 biomedical ontologies provided by the National Center for Biomedical Ontology to automatically map abstracts to relevant ontologies. This work was subsequently extended by [62], who use topic modelling and ontology analysis to provide an updated overview of knowledge representation tools relevant to exposure research. The work by [21] creates a new semantic resource for exposures, which is evaluated both in a clinical setting and on scientific literature. The resource contains manual annotations derived from clinical notes together with knowledge from the Unified Medical Language System (UMLS) to find exposome concepts. Ref. [20] uses five corpora of epidemiological studies with different exposures and outcomes to extract exposure-related information that can aid systematic reviews and other summaries.
In this work, a rule-based system built with GATE [63] is used that relies on dictionaries, where a total of 21 dictionaries were manually created with domain-specific exposures and outcomes. Research conducted by [19] uses rule-based patterns to analyse 60 PubMed abstracts in the obesity domain for six semantic concepts (study design, population, exposures, outcomes, covariates and effect size). Fourteen separate dictionaries containing terms related to these six semantic concepts are created using a variety of tools [64, 65]. Research conducted by [27] enhances the existing METLIN Exposome database to include over 950,000 unique small molecules. As part of this work, IBM Watson [66] is utilised, whose NLP approach is based on both rules (e.g., dictionaries) and machine learning. The authors of [40] developed a rule-based SES (socioeconomic status) algorithm (https://github.com/vserch/SES (accessed on 12 November 2021)) in the Ruby programming language to analyse Electronic Health Records. In this work, the effects of socioeconomic factors (e.g., mortality, education, occupation) on the overall health of minorities are examined to ensure that these factors will be considered as exposures in future work. In summary, we found that common knowledge sources are dictionaries, lists and ontologies, often derived from existing literature or clinical notes. Interestingly, no single preferred text mining tool emerges across these studies, and therefore a large variety of different NLP tools are utilised.
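The dictionary-driven systems in this section share one core operation: matching document text against curated term lists, one per semantic category. A minimal sketch of that operation (the category names and terms below are illustrative, not drawn from the 21 dictionaries in [20]):

```python
# Hypothetical dictionaries mapping a semantic category to known terms.
DICTIONARIES = {
    "exposure": {"asbestos", "noise", "benzene"},
    "outcome": {"asthma", "mesothelioma", "hearing loss"},
}

def match_concepts(text):
    """Return which dictionary terms occur in the text, per category."""
    lowered = text.lower()
    return {
        category: sorted(t for t in terms if t in lowered)
        for category, terms in DICTIONARIES.items()
    }

hits = match_concepts("Occupational asbestos exposure is linked to mesothelioma.")
print(hits)  # → {'exposure': ['asbestos'], 'outcome': ['mesothelioma']}
```

Production systems such as GATE gazetteers add tokenisation, case rules and annotation offsets on top of this lookup, but the essential design choice is the same: knowledge lives in editable term lists rather than in trained model weights.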

3.3. Database Creation and Fusion

One of the most popular databases created is the Comparative Toxicogenomics Database (CTD), which was developed in 2004 and is updated annually [45]. Generally speaking, this resource is made up of three components: (i) a third-party database that contains data from external sources (e.g., MeSH), (ii) a manually curated database of data screened by scientists and (iii) a public web application that combines data from the curation database and the third-party database. The resource’s aim is to provide content that relates chemical exposures to human health, to give better insight into diseases influenced by the environment. Research by [33] created an updated human exposome database for predicting the biotransformation of chemicals, using literature mining to support the manual identification of scientific articles. For this work, PubMed was queried using several keywords related to the exposome (e.g., human exposome, drinking water, air, and disinfection or combustion by-products), where most selected publications were review articles covering environmental matrices (e.g., the indoor air exposome, dust exposome or waterborne chemicals). The work by [34] uses the text mining approach proposed by [36] to generate a new database of organic pollutants in China. The database is based on 2799 scientific publications and includes a total of 112,878 records. Research conducted by [46] uses the AOP-helpFinder tool as proposed by [36] to screen a PubMed corpus for exposure to endocrine-disrupting chemicals. The authors of [35] utilise text mining in combination with integrative systems biology to support decision making on the usage of BPF (bisphenol F) in manufacturing and thereby circumvent adverse outcome pathways (AOPs). To establish a connection between environmental exposures (e.g., to BPF) and health effects, a variety of existing literature and databases such as PubMed, ToxCast, CompTox and AOP-wiki are used.
In this work, a previously proposed text mining tool called AOP-helpFinder [60] is used to analyse abstracts for links between chemical substances and AOPs. The corpus for this work was developed using both automatic and manual searches. First, an automatic search of PubMed was conducted using the AOP-helpFinder tool to identify links between BPF and AOPs. Then, TOXLINE [67] was searched from the year 2017 onwards for articles containing BPF and synonyms of BPF in a toxicological context. The authors of [47] present an update of the environmental exposure to nanomaterials database, using NLP to retrieve information from textual data and integrate it into the database. The first step in this work is to use OpenNLP (https://opennlp.apache.org/ (accessed on 19 November 2021)) to preprocess and prepare a corpus of 10 scientific articles related to environmental risk assessment. An ontology called the EXPOSEO ontology is subsequently developed and used to map the extracted information onto concepts that can be integrated into the existing database. The work by [36] uses text mining to create a list of all chemicals related to ‘blood-associated chemicals’, which is then used to create a Blood Exposome Database. Several keywords were used to query PubMed; the results were then checked manually to remove false positives, and a phrase exclusion list was created. The final set of 1,085,023 literature abstracts (https://exposome1.fiehnlab.ucdavis.edu/download/pmid_title_abstract_sb.zip (accessed on 19 November 2021)) was then linked to chemicals based on chemical synonyms, existing links between PubChem and PubMed, and mining of supplementary tables for chemical synonyms using R (code in R: https://github.com/barupal/exposome (accessed on 19 November 2021)). As a result, new blood chemicals were discovered in the literature. A similar approach for assessing cancer hazards was used by [68] using the PubMed literature.
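The chemical-linking step described above — connecting abstracts to chemicals via synonym lists — can be sketched as a reverse synonym index. The chemical names and synonyms below are illustrative, not taken from the Blood Exposome Database:

```python
# Hypothetical synonym table: canonical chemical -> known synonyms.
SYNONYMS = {
    "bisphenol A": ["bpa", "bisphenol a"],
    "caffeine": ["caffeine", "1,3,7-trimethylxanthine"],
}

# Invert into synonym -> canonical name for fast lookup.
INDEX = {syn: name for name, syns in SYNONYMS.items() for syn in syns}

def link_chemicals(abstract):
    """Return canonical chemicals whose synonyms appear in the abstract."""
    lowered = abstract.lower()
    return sorted({name for syn, name in INDEX.items() if syn in lowered})

found = link_chemicals("Serum BPA and caffeine were measured in blood samples.")
print(found)  # → ['bisphenol A', 'caffeine']
```

A real implementation would tokenise the text rather than substring-match (so that a short synonym like "bpa" cannot fire inside an unrelated word) and would draw the synonym table from a resource such as PubChem.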
The work by [ 69 ] uses a three-step process to update the Comparative Toxicogenomics Database (CTD) with exposure data from the scientific literature sourced on PubMed. A variety of resources are used to extract vocabularies, including but not limited to MeSH [ 70 ], Gene Ontology [ 71 ] and NCBI Gene [ 72 ]; these provide vocabularies for chemical and anatomy words, disease terms, biological processes and geographic locations. Finally, the data are integrated into the CTD, creating 49 new tables that contain 239 columns. Research by [ 37 ] proposes a new database called the Toxin and Toxin Target Database (T3DB), which consolidates multi-disciplinary data on toxic compound exposure. A taxonomy of compounds is generated using a classifier to categorise compounds into groups, and an ontology of chemical entities is then developed. In a nutshell, we find that there is a need for, and high usage of, databases that hold domain-specific knowledge for exposure research. Furthermore, most databases outlined in this review are generated using literature mining or existing databases, where the information commonly retrieved includes chemicals, anatomy words, disease terms, biological processes and geographic locations.
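Integrating extracted records into relational tables, as described for the CTD update above, can be sketched with a minimal schema. The table, columns and records below are hypothetical and do not reflect the actual CTD schema, which comprises 49 tables:

```python
import sqlite3

# Hypothetical, simplified exposure table held in memory
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE exposure (
        chemical TEXT,
        disease  TEXT,
        location TEXT
    )
""")

# Toy records as they might come out of a literature-mining pipeline
records = [
    ("lead", "cognitive impairment", "USA"),
    ("benzene", "leukaemia", "China"),
]
conn.executemany("INSERT INTO exposure VALUES (?, ?, ?)", records)

# Once integrated, the data can be queried by any of the extracted fields
rows = conn.execute(
    "SELECT chemical, disease FROM exposure WHERE location = ?", ("China",)
).fetchall()
print(rows)  # → [('benzene', 'leukaemia')]
```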

3.4. Literature Reviews and Qualitative Research

Ref. [ 38 ] conducts a review of existing ontologies relevant to external exposome research and argues for the future development of semantic standards. This argument is driven by the variation among exposome resources, whose differences include but are not limited to variables having the same or similar names but measuring different exposures. The work by [ 26 ] produces a systematic literature review on transport-related injury, where the first reviewer used traditional methods and the second reviewer utilised text mining techniques to perform the same review. The text mining portion of this work uses WordStat [ 73 ] and QDA Miner [ 74 ], and literature screening was conducted in Abstrackr [ 75 ]. Research by [ 39 ] investigates how the public reacted to reports of increased lead levels in school drinking water. Both a quantitative and a qualitative evaluation were performed, which found that (i) the majority of tweets were by news agencies and people holding public office, and (ii) the three most prominent themes of tweets were information sharing, health concerns and socio-demographic disparities. Overall, we have found that there is only a small number of existing reviews that include the use of NLP methods and tools in exposure research. In addition, mixed methods are utilised to better gauge public opinion on exposure-related health concerns.

4. Discussion

There are a number of challenges remaining in the field of NLP applied to occupational exposure research. In the following section, we outline some challenges and opportunities for future work in this area:

  • Data volume and quality: Whilst there has been some use of unsupervised machine learning methods (e.g., clustering via LDA) in the selected studies, the majority use supervised machine learning. One downside is that the latter approach requires human-annotated data, which usually requires expert knowledge and is therefore a time-consuming and costly process. To overcome this issue, semi-supervised or unsupervised learning methods might be explored, because they require either significantly less annotated training data or none at all. An example is the use of topic modelling techniques to cluster jobs and exposures from the existing literature. Another opportunity lies in using semi-supervised Named Entity Recognition to increase the coverage of annotated literature.

This also includes Transformer-based methods [ 78 ] (e.g., large pre-trained language models such as BERT [ 79 ]), which have made a significant impact on the field of NLP in recent years and could prove useful in NLP for occupational exposure research. This type of deep learning method is based on attention [ 80 ], which has been shown to improve results in a variety of other domains that utilise NLP (e.g., healthcare). These advances could be used to improve tasks such as Named Entity Recognition (NER) [ 81 ] or Relation Extraction (RE) [ 82 ] in occupational exposure research, which have so far relied on traditional machine learning only. Both tasks could prove useful in this context for automatically identifying key concepts (e.g., types of exposures, jobs or work environments) as well as how they relate to one another (e.g., a particular role being tied to a specific workplace). Other advances could be made through the use of unsupervised methods, which thus far have also relied on traditional machine learning only. More recent methods such as Neural Topic Models (NTMs) have become increasingly popular for tasks including document summarisation and text generation [ 83 ], due to their flexibility and capability. These methods could be applied to occupational exposure research to uncover new topics and concepts at a larger scale, or to draw new connections between exposures and work environments. Similarly, NTM methods could be coupled with pre-trained language models to further boost performance and produce more accurate representations of new topics [ 83 ].

  • Extrapolating existing research to other domains of exposure research: Most of the research explored in this review is specific to a particular type of exposure, to databases, or to the enhancement of literature reviews. The domain specificity and the differing needs and requirements of each type of exposure therefore make it hard to extrapolate existing works to other fields, or to link and scale up existing approaches.

5. Conclusions

In this work, we have manually reviewed 37 papers relevant to NLP applied to occupational exposure research. Our results show that (i) there has been an increase in the number of articles published in this area, (ii) most work uses existing NLP tools, and (iii) traditional machine learning is the most popular approach. Furthermore, we have outlined challenges and opportunities for future research that could further advance the field.

Acknowledgments

This research was made possible by the support of the EPHOR (Exposome Project for Health and Occupational Research) consortium.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
AOP	Adverse Outcome Pathways
BERT	Bidirectional Encoder Representations from Transformers
CTD	Comparative Toxicogenomics Database
DRS	Document relevancy score
LDA	Latent Dirichlet Allocation
LSA	Latent semantic analysis
LSTM	Long Short-Term Memory
NER	Named Entity Recognition
NLP	Natural Language Processing
NLTK	Natural Language Toolkit
NTM	Neural topic models
PCA	Principal component analysis
RE	Relation Extraction
SVM	Support Vector Machine
TF-IDF	Term frequency–inverse document frequency
UMLS	Unified Medical Language System

Funding Statement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 874703.

Author Contributions

A.M.S., I.B., M.v.T. and S.A. contributed to the conception and design of the literature review (e.g., selecting keywords). A.M.S. and S.A. retrieved relevant papers and completed pre-screening. I.B. and M.v.T. performed a final screening of the selected papers. A.M.S. wrote the manuscript and I.B., M.v.T. and S.A. provided feedback and corrections on individual sections. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement, Informed Consent Statement, Data Availability Statement and Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
