Discover relevant research today
Advance your research field in the open
Reach new audiences and maximize your readership
ScienceOpen puts your research in the context of millions of publications.
For Publishers
ScienceOpen offers content hosting, context building and marketing services for publishers. See our tailored offerings:
- For academic publishers to promote journals and interdisciplinary collections
- For open access journals to host journal content in an interactive environment
- For university library publishing to develop new open access paradigms for their scholars
- For scholarly societies to promote content with interactive features
For Institutions
ScienceOpen offers state-of-the-art technology and a range of solutions and services:
- For faculties and research groups to promote and share your work
- For research institutes to build up your own branding for OA publications
- For funders to develop new open access publishing paradigms
- For university libraries to create an independent OA publishing environment
For Researchers
Make an impact and build your research profile in the open with ScienceOpen
- Search and discover relevant research in over 97 million Open Access articles and article records
- Share your expertise and get credit by publicly reviewing any article
- Publish your poster or preprint and track usage and impact with article- and author-level metrics
- Create a topical Collection to advance your research field
Create a Journal powered by ScienceOpen
Launching a new open access journal or an open access press? ScienceOpen now provides full end-to-end open access publishing solutions – embedded within our smart interactive discovery environment. A modular approach allows open access publishers to pick and choose among a range of services and design the platform that fits their goals and budget.
What can a Researcher do on ScienceOpen?
ScienceOpen provides researchers with a wide range of tools to support their research – all for free. Here is a short checklist to make sure you are getting the most out of the technological infrastructure and content that we have to offer.
ScienceOpen on the Road
Upcoming events.
- 15 June – Scheduled Server Maintenance, 13:00 – 01:00 CEST
Past Events
- 20 – 22 February – ResearcherToReader Conference
- 09 November – Webinar for the Discoverability of African Research
- 26 – 27 October – Attending the Workshop on Open Citations and Open Scholarly Metadata
- 18 – 22 October – ScienceOpen at Frankfurt Book Fair
- 27 – 29 September – Attending OA Tage, Berlin
- 25 – 27 September – ScienceOpen at Open Science Fair
- 19 – 21 September – OASPA 2023 Annual Conference
- 22 – 24 May – ScienceOpen sponsoring Pint of Science, Berlin
- 16 – 17 May – ScienceOpen at 3rd AEUP Conference
- 20 – 21 April – ScienceOpen attending Scaling Small: Community-Owned Futures for Open Access Books
What is ScienceOpen?
- Smart search and discovery within an interactive interface
- Researcher promotion and ORCID integration
- Open evaluation with article reviews and Collections
- Business model based on providing services to publishers
Some of our partners:
Open Science at NASA
NASA is making a long-term commitment to building an inclusive open science community over the next decade. Open-source science is a commitment to the open sharing of software, data, and knowledge (algorithms, papers, documents, ancillary information) as early as possible in the scientific process.
Open Principles
The principles of open-source science are to make publicly funded scientific research transparent, inclusive, accessible, and reproducible. Advances in technology, including collaborative tools and cloud computing, help enable open-source science, but technology alone is insufficient. Open-source science requires a culture shift to a more inclusive, transparent, and collaborative scientific process, which will increase the pace and quality of scientific progress.
Open Science Facts
- Open Transparent Science: Scientific processes and results should be open such that they are reproducible by members of the community.
- Open Inclusive Science: Process and participants should welcome participation by and collaboration with diverse people and organizations.
- Open Accessible Science: Data, tools, software, documentation, and publications should be accessible to all (FAIR).
- Open Reproducible Science: Scientific process and results should be open such that they are reproducible by members of the community.
Learn More About Open Science
NASA’s Transform to Open Science (TOPS) initiative helps people understand and implement open science practices in their own work. This initiative created Open Science 101, a free online training course to give researchers, academics, and the public a practical working knowledge of open science principles.
Why Do Open Science?
- Broadens participation and fosters greater collaboration in scientific investigations by lowering the barriers to entry into scientific exploration
- Generates greater impact and more citations to scientific results
Open Science Features and Events
NASA AI, Open Science Advance Natural Disaster Research and Recovery
NASA's artificial intelligence weather models and open data practices help researchers monitor hurricanes and other disasters.
NASA Funds Open-Source Software Underpinning Scientific Innovation
NASA awarded $15.6 million to 15 projects supporting open-source tools, frameworks, and libraries used by the NASA science community.
GeneLab Chats with Fiona Samson on Her Latest Publication
High school student Fiona Samson co-authored a research paper about accelerated aging during space travel using NASA's open space biology data.
Pioneer of Change: America Reyes Wang Makes NASA Space Biology More Open
Meet NASA's Space Biology Biospecimen Sharing Program lead, America Reyes Wang, who facilitates vital space biology research through open science.
Explore Open Science at NASA
Transform to Open Science (TOPS)
Provides the visibility, advocacy, and community resources to support and enable the shift to open science.
Open Science Funding
NASA supports open science through calls for new innovative programs, supplements to existing awards, and sustainability of software.
The NASA Strategy for Open Science
The Strategy for Data Management and Computing for Groundbreaking Science 2019-2024 was developed through community input and guides NASA’s approach to open science.
Scientific Information Policy
The information produced as part of NASA’s scientific research activities represents a significant public investment. Learn more about how and when it should be shared.
Office of the Chief Science Data Officer
The Office of the Chief Science Data Officer (OCSDO) works to advance transformative open science as part of its activities.
Open Earth Science
Learn about openly available science data and services provided by NASA's Earth Science Data Systems (ESDS) Program.
Science Mission Directorate Science Data
The Science Data Portal provides a comprehensive list of NASA science data repositories.
The fundamentals of open access and open research
What is open access and open research?
Open access (OA) refers to the free, immediate, online availability of research outputs such as journal articles or books, combined with the rights to use these outputs fully in the digital environment. OA content is open to all, with no access fees.
Open research goes beyond the boundaries of publications to consider all research outputs – from data to code and even open peer review. Making all outputs of research as open and accessible as possible means research can have a greater impact, and help to solve some of the world’s greatest challenges.
How can I publish my work open access?
As the author of a research article or book, you have the ability to ensure that your research can be accessed and used by the widest possible audience. Springer Nature supports immediate Gold OA as the most open, least restrictive form of OA: authors can choose to publish their research article in a fully OA journal, a hybrid or transformative journal, or as an OA book or OA chapter.
Alternatively, where articles, books or chapters are published via the subscription route, Springer Nature allows authors to archive the accepted version of their manuscript on their own personal website or their funder’s or institution’s repository, for public release after an embargo period (Green OA). Find out more.
Why should I publish OA?
What are Creative Commons licences?
Open access works published by Springer Nature are published under Creative Commons licences. These provide an industry-standard framework to support re-use of OA material. Please see Springer Nature’s guide to licensing, copyright and author rights for journal articles and books and chapters for further information.
How do I pay for open access?
As costs are involved in every stage of the publication process, authors are asked to pay an open access fee in order for their article to be published open access under a Creative Commons licence. Springer Nature offers a free open access support service to make it easier for our authors to discover and apply for funding to cover article processing charges (APCs) and/or book processing charges (BPCs). Find out more.
What is open data?
We believe that all research data, including research files and code, should be as open as possible and want to make it easier for researchers to share the data that support their publications, making them accessible and reusable. Find out more about our research data services and policies.
What is a preprint?
A preprint is a version of a scientific manuscript posted on a public server prior to formal peer review. Once posted, the preprint becomes a permanent part of the scientific record, citable with its own unique DOI. Early sharing is recommended as it offers an opportunity to receive feedback on your work, claim priority for a discovery, and help research move faster. In Review is one of the most innovative preprint services available, offering real-time updates on your manuscript’s progress through peer review. Discover In Review and its benefits.
What is open peer review?
Open peer review refers to the process of making peer reviewer reports openly available. Many publishers and journals offer some form of open peer review, including BMC, which was one of the first publishers to open up peer review in 1999. Find out more.
Blog posts on open access from "The Source"
How to publish open access with fees covered
Could you publish open access with fees covered under a Springer Nature open access agreement?
Celebrating our 2000th open access book
We are proud to celebrate the publication of our 2000th open access book. Take a look at how we achieved this milestone.
Why is Gold OA best for researchers?
Explore the advantages of Gold OA, by reading some of the highlights from our white paper "Going for Gold".
How researchers are using open data in 2022
How are researchers using open data in 2022? Read this year’s State of Open Data Report, providing insights into the attitudes, motivations and challenges of researchers towards open data.
Ready to publish?
A pioneer of open access publishing, BMC is committed to innovation and offers an evolving portfolio of some 300 journals.
Got a discovery you're ready to share with the world? Publish your findings quickly and with integrity, never letting good research go to waste.
Open research is at the heart of Nature Research. Our portfolio includes Nature Communications , Scientific Reports and many more.
Springer offers a variety of open access options for journal articles and books across a number of disciplines.
Palgrave Macmillan is committed to developing sustainable models of open access for the HSS disciplines.
Apress is dedicated to meeting the information needs of developers, IT professionals, and tech communities worldwide.
Discover more tools and resources along with our author services
Author services
Early Career Resource Center
Journal Suggester
Using Your ORCID ID
The Transfer Desk
Tutorials and educational resources.
How to Write a Manuscript
How to submit a journal article manuscript
Nature Masterclasses
Berkeley School of Information
Open Source Research Methods, Safety, and Tools
This module describes how open source intelligence (open source investigative techniques, or OSINT) can be used for research. While OSINT uses publicly available sources of information to learn about an organization, an individual, or their contexts, there are risks that students and partners may face when gathering that information. This module will discuss safety precautions and tools to effectively organize collected information.
Learning Objectives
- Learn how open source information can be used to protect civil society from cyberattacks.
- Understand common OSINT techniques & sources for security researchers.
- Determine when and how to begin collecting open source information.
Pre-Readings
- See Course Readings for “Open Source Research Methods, Safety, and Tools”
- Citizen Clinic Virtual Identities guide
- Citizen Clinic Virtual Private Network guide
- OSINT-gathering activities cannot be attributed to the collector or collecting organization.
- Sources of OSINT information or methods of collection do not need to be protected.
- OSINT is easily accessible.
- OSINT can require a degree of technical expertise.
- Public media sources: news reports, printed magazines, and newspapers
- Internet (Web 2.0) sources: archives, social media, blogs, discussion groups
- Public government data: hearings, budgets, directories, and other public records
- Professional and academic publications: papers, theses, dissertations, and journals
- Commercial data: corporate databases, financial and industrial assessments
- Grey data: public but hard to get, such as conference promotional material, business documents, unpublished works, and technical reports
Starting points:
- Direct from target websites
- Nihad Hassan’s OSINT.Link: https://osint.link
- OSINT Framework: https://osintframework.com/
- Bellingcat Online Investigation Toolkit: http://bit.ly/bcattools
OSINT is an iterative process of methodically collecting, archiving, analyzing, and re-examining available data. Provide examples of taking a piece of information (such as a domain name) and uncovering additional pieces of information (such as the real name of a website owner); one such pivot is sketched below.
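As one illustration of such a pivot, the minimal Python sketch below starts from a placeholder domain name and gathers related data points (IP address, WHOIS registration details, reverse DNS). It is not part of the original module; it assumes the system `whois` command-line tool is installed, and the domain and printed fields are purely illustrative.

```python
# Illustrative OSINT "pivot": start from a domain name and collect related data points.
# Assumes the stdlib `socket` module and the system `whois` command are available.
import socket
import subprocess

domain = "example.org"  # hypothetical starting point

# Pivot 1: resolve the domain to an IP address.
ip_address = socket.gethostbyname(domain)
print(f"{domain} resolves to {ip_address}")

# Pivot 2: query WHOIS for registration details (registrar, dates, registrant).
# Many registrant fields are redacted today, but registrar and creation date often remain.
whois_record = subprocess.run(["whois", domain], capture_output=True, text=True)
for line in whois_record.stdout.splitlines():
    if any(key in line.lower() for key in ("registrar:", "creation date", "registrant")):
        print(line.strip())

# Pivot 3: reverse-resolve the IP, which can reveal the hosting provider.
try:
    hostname, _, _ = socket.gethostbyaddr(ip_address)
    print(f"{ip_address} reverse-resolves to {hostname}")
except socket.herror:
    print("No reverse DNS record found")
```

Each answer (registrar, hosting provider, creation date) becomes a new starting point for the next round of collection.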
Key questions for this work:
- How to organize collected information?
- How to keep track of where you’ve been?
- How to stay safe?
Tools for organization and archiving:
How and where can your OSINT activities be tracked?
- Browser History
- Router / Access Point
- Internet Service Provider
- Sites you directly visit (HTTP vs HTTPS)
- 3rd Party Sites (Lightbeam extension on Firefox)
How can we protect our open source investigations?
- Virtual Private Networks
- The Onion Router, a.k.a. Tor (or Orbot on mobile)
- Using a common browser (https://panopticlick.eff.org/)
- Using Incognito / Private Mode
- HTTPS Everywhere
- Privacy Badger
- Brave “Shields Up”
- “Burner” devices / virtual machines
- Virtual identities, profiles & user accounts
- Maintain separation between searches / sessions
- Smart defaults (e.g., DuckDuckGo instead of Google for searches)
- Critical Thinking (avoid inadvertent connections)
Google Dorking (Advanced Search Queries): https://exposingtheinvisible.org/guides/google-dorking/
- What information are you seeking?
- Where are you going to look?
- What tools / resources do you need? (including burner accounts)
- How will you protect yourself?
- How will you protect your partner(s)?
- How will you protect your investigation?
- Record the date, URL, and search terms at minimum (a minimal logging sketch follows this list)
- The type of investigation dictates your archiving needs (do you need to capture the entire site? A screenshot?)
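The sketch below is one possible way to keep that minimum record; it is not part of the original module. The CSV file name and the extra `archive_link` and `screenshot` columns are hypothetical choices.

```python
# Minimal collection log: one CSV row per source consulted, capturing at least
# the date, URL, and search terms. File name and extra fields are illustrative.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("osint_collection_log.csv")  # hypothetical location
FIELDS = ["timestamp_utc", "url", "search_terms", "archive_link", "screenshot"]

def log_source(url, search_terms, archive_link="", screenshot=""):
    """Append one collection record; create the file with a header if needed."""
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
            "url": url,
            "search_terms": search_terms,
            "archive_link": archive_link,
            "screenshot": screenshot,
        })

log_source("https://example.org/report", "site:example.org annual report")
```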
Each team will share elements of their plan with the class.
- What is the information?
- Why does it matter? How could it be used?
- Where (and when) did the information come from?
- If defensive (i.e., about a partner), is there a way to mitigate or prevent the information from being used in an attack?
- If offensive (i.e., about a threat), what are the next steps? Are there immediate actions that should take place?
OpenScholar: The open-source A.I. that’s outperforming GPT-4o in scientific research
Scientists are drowning in data. With millions of research papers published every year, even the most dedicated experts struggle to stay updated on the latest findings in their fields.
A new artificial intelligence system, called OpenScholar, is promising to rewrite the rules for how researchers access, evaluate, and synthesize scientific literature. Built by the Allen Institute for AI (Ai2) and the University of Washington, OpenScholar combines cutting-edge retrieval systems with a fine-tuned language model to deliver citation-backed, comprehensive answers to complex research questions.
“Scientific progress depends on researchers’ ability to synthesize the growing body of literature,” the OpenScholar researchers wrote in their paper. But that ability is increasingly constrained by the sheer volume of information. OpenScholar, they argue, offers a path forward—one that not only helps researchers navigate the deluge of papers but also challenges the dominance of proprietary AI systems like OpenAI’s GPT-4o.
How OpenScholar’s AI brain processes 45 million research papers in seconds
At OpenScholar’s core is a retrieval-augmented language model that taps into a datastore of more than 45 million open-access academic papers. When a researcher asks a question, OpenScholar doesn’t merely generate a response from pre-trained knowledge, as models like GPT-4o often do. Instead, it actively retrieves relevant papers, synthesizes their findings, and generates an answer grounded in those sources.
This ability to stay “grounded” in real literature is a major differentiator. In tests using a new benchmark called ScholarQABench, designed specifically to evaluate AI systems on open-ended scientific questions, OpenScholar excelled. The system demonstrated superior performance on factuality and citation accuracy, even outperforming much larger proprietary models like GPT-4o.
One particularly damning finding involved GPT-4o’s tendency to generate fabricated citations—hallucinations, in AI parlance. When tasked with answering biomedical research questions, GPT-4o cited nonexistent papers in more than 90% of cases. OpenScholar, by contrast, remained firmly anchored in verifiable sources.
The grounding in real, retrieved papers is fundamental. The system uses what the researchers describe as a “self-feedback inference loop” that “iteratively refines its outputs through natural language feedback, which improves quality and adaptively incorporates supplementary information.”
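To make the idea concrete, here is a toy sketch of a retrieval-augmented answer loop with self-feedback. It is not OpenScholar’s code: the `retrieve` and `generate` functions below are placeholders standing in for a dense retriever over the 45-million-paper datastore and the fine-tuned language model.

```python
# Conceptual sketch of retrieval-augmented answering with a self-feedback loop.
# NOT OpenScholar's implementation: retrieval and generation are toy placeholders.

TOY_CORPUS = [
    {"id": "paper-1", "text": "Active learning reduces screening workload in reviews."},
    {"id": "paper-2", "text": "Retrieval grounding lowers citation hallucination rates."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Placeholder retriever: naive keyword overlap instead of a dense index."""
    score = lambda p: sum(w in p["text"].lower() for w in query.lower().split())
    return sorted(TOY_CORPUS, key=score, reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for a language-model call; returns a canned string here."""
    return f"[draft answer conditioned on: {prompt[:60]}...]"

def answer_with_feedback(question: str, rounds: int = 2) -> str:
    """Retrieve, draft an answer citing sources, then iteratively refine it."""
    passages = retrieve(question)
    citations = ", ".join(p["id"] for p in passages)
    draft = generate(f"Question: {question}\nSources: {citations}")
    for _ in range(rounds):
        # The "self-feedback" step: critique the draft, fetch extra evidence,
        # then regenerate the answer grounded in the enlarged citation set.
        feedback = generate(f"Critique this draft for unsupported claims: {draft}")
        passages += retrieve(feedback)
        citations = ", ".join(sorted({p["id"] for p in passages}))
        draft = generate(f"Revise using sources [{citations}]: {draft}\nFeedback: {feedback}")
    return draft

print(answer_with_feedback("Does retrieval grounding reduce hallucinated citations?"))
```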
The implications for researchers, policy-makers, and business leaders are significant. OpenScholar could become an essential tool for accelerating scientific discovery, enabling experts to synthesize knowledge faster and with greater confidence.
Inside the David vs. Goliath battle: Can open source AI compete with Big Tech?
OpenScholar’s debut comes at a time when the AI ecosystem faces a growing tension between closed, proprietary systems and the rise of open-source alternatives like Meta’s Llama. Models like OpenAI’s GPT-4o and Anthropic’s Claude offer impressive capabilities, but they are expensive, opaque, and inaccessible to many researchers. OpenScholar flips this model on its head by being fully open-source.
The OpenScholar team has released not only the code for the language model but also the entire retrieval pipeline, a specialized 8-billion-parameter model fine-tuned for scientific tasks, and a datastore of scientific papers. “To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints,” the researchers wrote in their blog post announcing the system.
This openness is not just a philosophical stance; it’s also a practical advantage. OpenScholar’s smaller size and streamlined architecture make it far more cost-efficient than proprietary systems. For example, the researchers estimate that OpenScholar-8B is 100 times cheaper to operate than PaperQA2, a concurrent system built on GPT-4o.
This cost-efficiency could democratize access to powerful AI tools for smaller institutions, underfunded labs, and researchers in developing countries.
Still, OpenScholar is not without limitations. Its datastore is restricted to open-access papers, leaving out paywalled research that dominates some fields. This constraint, while legally necessary, means the system might miss critical findings in areas like medicine or engineering. The researchers acknowledge this gap and hope future iterations can responsibly incorporate closed-access content.
The new scientific method: When AI becomes your research partner
The OpenScholar project raises important questions about the role of AI in science. While the system’s ability to synthesize literature is impressive, it is not infallible. In expert evaluations, OpenScholar’s answers were preferred over human-written responses 70% of the time, but the remaining 30% highlighted areas where the model fell short—such as failing to cite foundational papers or selecting less representative studies.
These limitations underscore a broader truth: AI tools like OpenScholar are meant to augment, not replace, human expertise. The system is designed to assist researchers by handling the time-consuming task of literature synthesis, allowing them to focus on interpretation and advancing knowledge.
Critics may point out that OpenScholar’s reliance on open-access papers limits its immediate utility in high-stakes fields like pharmaceuticals, where much of the research is locked behind paywalls. Others argue that the system’s performance, while strong, still depends heavily on the quality of the retrieved data. If the retrieval step fails, the entire pipeline risks producing suboptimal results.
But even with its limitations, OpenScholar represents a watershed moment in scientific computing. While earlier AI models impressed with their ability to engage in conversation, OpenScholar demonstrates something more fundamental: the capacity to process, understand, and synthesize scientific literature with near-human accuracy.
The numbers tell a compelling story. OpenScholar’s 8-billion-parameter model outperforms GPT-4o while being orders of magnitude smaller. It matches human experts in citation accuracy where other AIs fail 90% of the time. And perhaps most tellingly, experts prefer its answers to those written by their peers.
These achievements suggest we’re entering a new era of AI-assisted research, where the bottleneck in scientific progress may no longer be our ability to process existing knowledge, but rather our capacity to ask the right questions.
The researchers have released everything—code, models, data, and tools—betting that openness will accelerate progress more than keeping their breakthroughs behind closed doors.
In doing so, they’ve answered one of the most pressing questions in AI development: Can open-source solutions compete with Big Tech’s black boxes?
The answer, it seems, is hiding in plain sight among 45 million papers.
Open Knowledge Maps
Your guide to scientific knowledge
Map a research topic with AI (beta)
Get an overview - Find documents - Identify concepts
Our mission
Revolutionising discovery
Open Knowledge Maps is the world's largest AI-based search engine for scientific knowledge. We dramatically increase the visibility of research findings for science and society alike.
Learn more about us
Open and nonprofit
We are a charitable non-profit organization based on the principles of open science. Our aim is to create an inclusive, sustainable and equitable infrastructure that can be used by anyone.
Check out our team
A sustainable platform
We propose to fund Open Knowledge Maps in a collective effort. Organizations are invited to become supporting members and co-create the platform with us.
What users and supporters say
We joined Open Knowledge Maps as a Supporting Member because it is an innovative tool for literature search and we are eager to support the further development of Open Knowledge Maps.
Dr. David Johann, Head of Group Knowledge Management, ETH Library, ETH Zurich
I love how OKMaps breaks down the papers into clusters allowing me to identify themes in the literature and focus on papers that are most pertinent for my work.
Girija Goyal, ReFigure Co-Founder, Staff Scientist at Wyss Institute for Biologically Inspired Engineering at Harvard University, USA
Open Knowledge Maps is a considerable reinforcement in the areas of open science & open access, which are central to our research services.
Dr. Andrea Hacker, Open Access and Bern Open Publishing (BOP), University Library Bern
Now that science gets more and more open, we need ways to visualize it in a relevant way. That's why I support OKMaps.
Jean-Claude Burgelman, Professor of Open Science at VUB, Editor in Chief at Frontiers Policy Labs
Open Knowledge Maps is one of these initiatives we consider to be a visionary innovator in the field of discovery in open spaces.
Prof. Dr. Klaus Tochtermann, Director, ZBW
Education and Knowledge empower people, and everybody should have access to them, it is great to have tools like Open Knowledge Maps empowering people around the world.
Mari Plaza, Data Scientist
Custom integrations
Would you like to complement your services with our AI-based discovery tools? Using our Custom Services, organisations are able to embed Open Knowledge Maps components in their own discovery systems.
Explore live case studies
Social Media
Some people look at social media and see a mess of cat videos, over-sharing aunts and anonymous tough guys. Our team sees an untapped resource for research.
Social media research is often the best way to get real-time insight into critical situations. Whether it’s a conflict area or a weather emergency, social-media monitoring can give clarity into the specifics of the situation and the resources that are most needed on the ground.
Online Databases
Another component of open-source research that doesn’t come to mind for most people is commercial or financial databases. There are many state-run databases around the world, as well as privately owned databases that offer public access.
These databases can be an excellent source of historical records and other information that provides additional context to our research. An excellent one we tap into on a regular basis is the Armed Conflict Location & Event Data (ACLED) project, which provides a rich source of conflict data over time. Each event is categorized and geolocated for easy analysis. Other good sources include the Global Terrorism Database, curated by the University of Maryland as part of the National Consortium for the Study of Terrorism and Responses to Terrorism (START), the World Bank Open Data project, and the UN High Commissioner for Refugees (UNHCR)’s data, including population statistics and maps.
Have a problem that open source research could help solve? Contact Tesla Today
Finding Information Isn’t Enough. Expertise and Language Skills Turn Research Into Insight.
Once you’ve rounded up all your sources, you’re good to go, right? Well, not quite.
Finding the information is only part of the work to be done for proper open-source research. In order to turn all the articles, database queries and social media posts you found into information that can actually be used, you need to take it to the next level. That requires the ability to understand the content in the original language as well as a real-world understanding of the subject areas.
Understanding the Language
One of the most important pieces of the puzzle, for us, is language. All the other sources we’ve listed are incredibly valuable…if you can read the language. But what happens when you need insight into an area where the locals speak a language or dialect that you’re clueless about?
In order to get a complete picture of the task at hand, you need a nuanced understanding of the news reports, blogs and social media posts coming out of that region. That’s one reason our team includes a number of multilingual researchers who are fluent in Arabic, Farsi, Dari, Russian, French, Spanish and other languages.
Real-World Understanding of Open-Source Research
Now that you have all your information pulled into one big pile and can read it, the most important task begins: knowing what is actually good and useful information. This is where subject-matter experts who know the regions and topics that they study come into play. You need teams who know the history, involved parties, and politics of a particular subject so they can weed through the mass of information to surface what is actually useful.
The End Game? A Complete Picture
As exciting as research can be (at least to a researcher!), it’s never an end in itself. Our clients are working to address critical needs around the world, with real-life implications. By pursuing open-source research, they are making sure they’ve exhausted all available resources to get a complete understanding and make the best possible decisions.
When it comes down to it, open-source research has the same goal as everything we do here at Tesla: Providing clarity and context so you can advance your mission with confidence.
See Open-Source Research in Action
See what we can do for you.
Put your institutional knowledge to work.
At Tesla Government, we work with our federal government clients to offload the burden of data management and create a clear path to productivity. Tap into our decades of expertise and learn more.
Center for Security and Emerging Technology
Into the Jungle: Best Practices for Open-Source Researchers
Ryan Fedasiuk
The goal of this guide is to acquaint researchers and analysts with tools, resources, and best practices to ensure security when collecting or accessing open-source information.
Open sources on the internet present numerous potential hazards both to users and the information they access. Navigating these hazards requires habitual vigilance.
There are three main considerations when collecting open-source information online. In order of priority, they are:
- Protecting your devices, network, and files from malware.
- Archiving your sources for posterity.
- Masking your activities from intrusive onlookers.
The Cardinal Rules of Open-Source Investigations
- Always assume the source has been compromised and could present a privacy risk.
- Always stay connected to a VPN.
- Never download files locally.
- Whenever possible, access only the cached or archived versions of web pages.
- Whenever appropriate, archive sources immediately.
- Whenever in doubt, scan before you click.
Resources, Tools, and Best Practices
1. Virtual Private Networks
A virtual private network (VPN) can secure your network by masking your internet protocol (IP) address and encrypting information that is transmitted from your device. Most VPN services will let you select a server through which to route internet traffic. This has the added benefit of camouflaging your IP address. For a faster connection, choose a server located near you. For a slower connection likely to raise fewer eyebrows, choose a connection based near the entity that you are researching. For China, that might be Hong Kong, Taiwan, or Singapore—or, use a service that allows you to tunnel directly beneath the Great Firewall. For Russia, the Baltic states are good options. For North Korea, consider VPN servers based in Seoul.
There are many options and considerations when choosing a VPN: price, number of servers, connection speed, whether the service keeps logs of your browsing activity, and saturation—whether the government whose files you are browsing has blocked many of the service’s possible connection nodes.
2. Cached Web Pages
A safer way to access any web page is to access Google’s cached version of that page, rather than visiting the website directly. A cached page is a past version of the website in question, which Google’s search engine accessed and saved internally while creating search results and previews. Not every web page is cached, but you will find that most web pages have this option.
To access the cached version of a web page, either type cache:[URL] directly into your browser’s navigation bar, or click on the three dots next to a Google search result to see more information about the page. The bottom-right corner of the ensuing pop-up will include a button that says “Cached.” Click on it to access the cached page.
The cached version of a web page will have a banner at the top identifying it as Google’s cached copy.
Accessing the cached version of a web page is not foolproof. It is still possible for a website owner to track which IP address is viewing a cached webpage, through certain embedded images and other elements. Accessing the text-only version of a cached page, or the HTML source code, can mitigate some of these risks, and will allow you to more quickly find information on web pages that are slow to load.
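For reference, here is a minimal sketch of building the cache query and the text-only cache URL for a page. The webcache.googleusercontent.com endpoint and the `strip`/`vwsrc` parameters reflect the commonly documented format rather than anything in this guide, and Google’s cache availability changes over time, so treat the exact URLs as an assumption to verify.

```python
# Sketch: build the cache query and text-only / source-view cache URLs for a page.
# Endpoint and parameters follow the commonly documented format; verify before relying on them.
from urllib.parse import quote

target = "https://example.org/report.html"  # hypothetical target page

cache_query   = f"cache:{target}"  # paste this into the browser navigation bar
text_only_url = f"https://webcache.googleusercontent.com/search?q=cache:{quote(target, safe='')}&strip=1"
source_url    = f"https://webcache.googleusercontent.com/search?q=cache:{quote(target, safe='')}&vwsrc=1"

print(cache_query)
print(text_only_url)   # text-only rendering, fewer embedded images and trackers
print(source_url)      # raw HTML source of the cached copy
```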
Cached web pages are especially useful for previewing documents that you would otherwise have to download directly onto your computer—something that you should avoid if at all possible. For example, take an .xls spreadsheet file hosted by the Cyberspace Affairs Commission of China.
Just clicking on this Google search result would normally result in the file being automatically downloaded to your computer—a disaster. Grappling with auto-download links is a never-ending challenge when collecting open-source information from foreign websites.
A safer (and faster) way of getting at the information is to access the cached version of the web page that is hosting the file. Rather than downloading something and opening it in Excel, Google’s cache transforms it into a web page that you can view in your browser.
This strategy works for all common file types: .doc, .pdf, .xls, and .xlsx, among others, but will sometimes cause errors in file formatting (especially PDFs).
3. Archive Services
Archiving sources is incredibly important. Within days or even hours of publishing research, sources of information frequently disappear, and original website links are frequently broken. But there are several reasons why you might want to archive a website, beyond ensuring future access to the material:
- Archive services can serve a similar function to a cached web page, allowing you to view a safer version of the page. (It is also possible to archive a Google-cached web page, rather than the original source, for layered protection.)
- Some archive services, like the Wayback Machine (discussed below), will tell you if someone else has already archived the page, which can be useful to know.
- Some archive services will generate unique links and display the exact time stamps for when they were generated. This can be helpful in plagiarism disputes and/or tracking project timelines.
In particular, two free archiving services are embraced by the open-source community. These include:
- The Internet Archive (Wayback Machine): https://web.archive.org/save/
- Archive Today: https://archive.vn/
It is often worth archiving particularly valuable documents across more than one archive service.
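Snapshot requests to the Wayback Machine can also be scripted against the /save/ endpoint listed above. The sketch below is an assumption-laden illustration, not an official client: the example URL is hypothetical, and the snapshot-location header behaviour should be confirmed against the service’s current behaviour.

```python
# Hedged sketch: request an on-demand snapshot from the Wayback Machine's /save/ endpoint.
import requests

def archive_to_wayback(url: str) -> str:
    """Ask the Wayback Machine to capture `url`; return the snapshot URL if reported."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    resp.raise_for_status()
    # The snapshot path is often reported via the Content-Location header;
    # otherwise fall back to the final URL after redirects.
    loc = resp.headers.get("Content-Location")
    return "https://web.archive.org" + loc if loc else resp.url

snapshot = archive_to_wayback("https://example.org/report.html")  # hypothetical URL
print("Archived copy:", snapshot)
```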
Please note that most archive services will “ping” the website with a U.S.-based IP address. This can ruin your attempts to remain stealthy, for example, with a China- or Russia-based VPN. Please also note it may be possible for website owners to retroactively break archive links you have already established. For these reasons, web-based digital archive services may not always be the best option.
To maintain maximum privacy, security, and long-term access, it is often worthwhile to save local copies of web pages as PDFs to your computer, then upload them to cloud storage or an external hard drive. This is not the same thing as downloading a PDF from the website itself—which you should avoid if at all possible. Rather, when you are viewing a web page, follow these steps:
- First, attempt to “print” the web page by opening the print interface (press CTRL+P on a PC; CMD+P on a Mac).
- Then, instead of actually printing it out, change the destination to “Save as PDF.”
- Finally, consider duplicating the saved file to external flash drives or uploading it to a secure web cloud like Google Drive.
4. URL and File Scanners
Sometimes, there will be a potentially valuable source of information that resists archiving and has no cached web page. It’s a gamble to directly access these kinds of links. But you can exercise due diligence: Whenever in doubt, scan before you click.
VirusTotal is a free service that scans files and URLs for malware by checking them against dozens of antivirus software services, including well-known consumer brands like AVG, BitDefender, and Kaspersky.
VirusTotal collects information about the files and URLs uploaded to its database and serves as a testing platform for antivirus vendors: it has access to 79+ antivirus services because it provides diagnostic information from user-generated scans back to those vendors. It does not require users to have an account.
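For scripted checks, VirusTotal also exposes a public v3 web API (unlike the website, the API requires a free API key). The sketch below follows the documented submit-then-fetch pattern; the API key and suspect URL are placeholders, and endpoint details should be confirmed against VirusTotal’s current documentation.

```python
# Hedged sketch: scan a suspicious URL through VirusTotal's v3 API (free API key required).
import requests

API_KEY = "YOUR_VT_API_KEY"              # placeholder
SUSPECT = "http://example.org/file.xls"  # hypothetical auto-download link

headers = {"x-apikey": API_KEY}

# Submit the URL for analysis, then fetch the aggregated verdicts.
submit = requests.post("https://www.virustotal.com/api/v3/urls",
                       headers=headers, data={"url": SUSPECT}, timeout=30)
submit.raise_for_status()
analysis_id = submit.json()["data"]["id"]

report = requests.get(f"https://www.virustotal.com/api/v3/analyses/{analysis_id}",
                      headers=headers, timeout=30)
report.raise_for_status()
stats = report.json()["data"]["attributes"]["stats"]
print(f"malicious: {stats['malicious']}, suspicious: {stats['suspicious']}, harmless: {stats['harmless']}")
```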
5. Browser Sandbox
If you are sitting down for an extended session of information-hunting, it is best to do all of your searching inside a virtual sandbox (or virtual machine, VM). There are several applications that can create a firewall around programs and applications you choose to run, such as web browsers like Google Chrome and Firefox.
A web browser session run inside the sandbox will close when the sandbox is closed. Any files downloaded from the browser will remain inside the sandbox, and can be wiped when the sandbox is closed, without being saved to your actual computer. You can still give permission to transfer individual files out of the sandbox.
There are different sandbox options available for PC or Mac users, but many are free, open-source, and relatively lightweight applications. A popular virtual sandbox application for PC users is Sandboxie. For Mac users, consider Oracle’s VirtualBox.
6. Antivirus Software
If you are conducting open-source research, it behooves you to have a subscription to high-quality antivirus software. However, if you do not already have antivirus on your computer, there are some free options worth downloading and running regularly:
- Malwarebytes offers free, relatively lightweight, on-demand malware scans. It can be run in conjunction with other antivirus software products.
- Bitdefender is often cited as a high-quality, free antivirus software.
- Free trials are offered by paid subscription products, including Norton, McAfee, AVG, and Kaspersky.
If at any point you break one of the six cardinal rules outlined in this guide, or accidentally click on an auto-download link, it’s worth running a quick Malwarebytes scan. But remember—the best practice when conducting open-source investigations is to assume compromise. If a state wants to track your browsing and research activity, it will surely be able to do so.
Governments and media publications everywhere are starting to embrace the value of open-source investigations. Even in relatively closed societies, there is an unmined ocean of data capable of informing business and policy decisions. Recent studies have highlighted the utility of budget documents, purchasing orders, geospatial imagery, social media posts, government records, and elite biographies in understanding states’ geopolitical ambitions and military capabilities. Armed with these tips and tricks, where will you venture next?
In October 2024, CSET was provided a Canadian French translation of this article, prepared by the Government of Canada. We are attaching it here for our audience in case it is useful.
Download Translation
Open and free content on JSTOR and Artstor
Our partnerships with libraries and publishers help us make content discoverable and freely accessible worldwide
Search open content on JSTOR
Explore our growing collection of Open Access journals
Early Journal Content (articles published prior to the last 95 years in the United States, or prior to the last 143 years if initially published internationally) is freely available to all
Even more content is available when you register to read – millions of articles from nearly 2,000 journals
Thousands of Open Access ebooks are available from top scholarly publishers, including Brill, Cornell University Press, University College London, and University of California Press – at no cost to libraries or users.
This includes Open Access titles in Spanish:
- Collaboration with El Colegio de México
- Partnership with the Latin American Council of Social Sciences
Images and media
JSTOR hosts a growing number of public collections , including Artstor’s Open Access collections , from museums, archives, libraries, and scholars worldwide.
Research reports
A curated set of more than 34,000 research reports from more than 140 policy institutes selected with faculty, librarian, and expert input.
Resources for librarians
Open content title lists:
- Open Access Journals (xlsx)
- Open Access Books (xlsx)
- JSTOR Early Journal Content (xlsx)
- Research Reports (txt)
Open Access ebook resources for librarians
Library-supported collections
Shared Collections : We have a growing corpus of digital special collections published on JSTOR by our institutional partners.
Reveal Digital : A collaboration with libraries to fund, source, digitize and publish open access primary source collections from under-represented voices.
JSTOR Daily
JSTOR Daily is an online publication that contextualizes current events with scholarship. All of our stories contain links to publicly accessible research on JSTOR. We’re proud to publish articles based in fact and grounded by careful research and to provide free access to that research for all of our readers.
How to do open research: 5 basic principles
Opensource.com
Some folks at UNICEF asked me to help them articulate a process for how to make their research projects (usually “is this program we want to do a feasible one?” or “what was the impact of this program we did?”) into open content ones. Here’s what I wrote them back. There are some pretty basic things that a researcher can do to make their work into an open content project. Here are a few.
1. Radical real-time transparency. Release all work in an editable format under a Creative Commons license as soon as it’s made. I’ll elaborate on each of those points in a bit more detail:
1a. Release all work. This means not just the finished/polished products, but the rough drafts, the incoherent notes, and the random scribblings as well. You can put disclaimers of “these are the rough things” at the top, and you don’t need to do announcements of the release of all your low-level work (except in weekly summaries) but they will let other people dig as far as they could possibly want to go on your activity in the space.
1b. In an editable format. No PDFs; use wiki pages, plaintext in a version control repository, or something similar. Word (or better yet, .odt) files are marginally acceptable, but force you to become a merging bottleneck; it’s best to get as close as possible to people being able to edit not just the material, but also each other’s edits, themselves.
1c. Under a Creative Commons license. Use the same license as the final paper. UNICEF chose the CC-BY-SA license, which is good; the key point is to avoid the “noncommercial” and “no derivatives” restrictions, which are the non-open Creative Commons license variants. Remixability for all purposes is vital.
1d. As soon as it’s made. This means what it sounds like; push it as you do it, not after the fact as “background material” accompanying the finished paper. If you want people to help you along your journey, they need to know as accurately as possible where you are right now.
2. Make work findable. Have a central place where people can easily read the current status of the project in 1 minute or less, and where they can quickly navigate to all the materials you’ve created for it. The specific structure/format isn’t as important as having a clear structure to begin with; pick a schema and stick with it.
3. Make participation as low-barrier as possible. Whenever possible, don’t require logins or account creation. If you must use authentication of some sort, think about what accounts the people you want as collaborators are already likely to have (facebook? twitter/identi.ca? wikipedia? github?) and what platforms they’re already likely to be familiar with (do they know version control? word processing? English?) and in general try to make it possible for someone to go from “stumbled across your project” to “made a contribution” in as few seconds and clicks as possible.
4. Update in a regular rhythm. Weekly is usually good, but for some projects it may make sense to cycle more quickly or slowly. For those who need a rule of thumb, I’ll semi-arbitrarily say that you should have at least 5 updates throughout the life of your project, so a 2-month project might have weekly updates, a 2-week project would have daily updates, a 1-day project might have hourly updates, but a 1-year project might have bimonthly updates (though weekly updates will drive more participation). Pick a schedule, announce it, and stick to it; this is something that should be on the front of your “participation” homepage (from #2, “make work findable”) so that new people coming in know when the “next thing” is coming up that they can jump in on.
5. Reach out in backchannel to bring people to the public space. Email, go to conferences, tweet/dent, blog, sit down at coffee shops, go to marketplaces… go where the people are, and engage with them in their spaces as long as it takes for you to help them feel comfortable coming to yours. Basically, private conversations are necessary, but they’re necessary as a means towards the end of bringing people into a public and collaborative space. It’s like opening a new physical location for something like a bar or a library; you want everyone to end up in your space interacting with each other, so you go out and have individual conversations with them aimed towards getting them there.
This article was originally posted on Mel's blog.
- Open access
- Published: 01 February 2021
An open source machine learning framework for efficient and transparent systematic reviews
- Rens van de Schoot (orcid.org/0000-0001-7736-2091)¹,
- Jonathan de Bruin (orcid.org/0000-0002-4297-0502)²,
- Raoul Schram²,
- Parisa Zahedi (orcid.org/0000-0002-1610-3149)²,
- Jan de Boer (orcid.org/0000-0002-0531-3888)³,
- Felix Weijdema (orcid.org/0000-0001-5150-1102)³,
- Bianca Kramer (orcid.org/0000-0002-5965-6560)³,
- Martijn Huijts (orcid.org/0000-0002-8353-0853)⁴,
- Maarten Hoogerwerf (orcid.org/0000-0003-1498-2052)²,
- Gerbrich Ferdinands (orcid.org/0000-0002-4998-3293)¹,
- Albert Harkema (orcid.org/0000-0002-7091-1147)¹,
- Joukje Willemsen (orcid.org/0000-0002-7260-0828)¹,
- Yongchao Ma (orcid.org/0000-0003-4100-5468)¹,
- Qixiang Fang (orcid.org/0000-0003-2689-6653)¹,
- Sybren Hindriks¹,
- Lars Tummers (orcid.org/0000-0001-9940-9874)⁵ &
- Daniel L. Oberski (orcid.org/0000-0001-7467-2297)¹,⁶
Nature Machine Intelligence volume 3, pages 125–133 (2021)
83k Accesses | 343 Citations | 164 Altmetric
- Computational biology and bioinformatics
- Computer science
- Medical research
A preprint version of the article is available at arXiv.
To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool to accelerate the step of screening titles and abstracts. For many tasks—including but not limited to systematic reviews and meta-analyses—the scientific literature needs to be checked systematically. Scholars and practitioners currently screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that active learning can yield far more efficient reviewing than manual reviewing while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice.
With the emergence of online publishing, the number of scientific manuscripts on many topics is skyrocketing 1 . All of these textual data present opportunities to scholars and practitioners while simultaneously confronting them with new challenges. Scholars often develop systematic reviews and meta-analyses to develop comprehensive overviews of the relevant topics 2 . The process entails several explicit and, ideally, reproducible steps, including identifying all likely relevant publications in a standardized way, extracting data from eligible studies and synthesizing the results. Systematic reviews differ from traditional literature reviews in that they are more replicable and transparent 3 , 4 . Such systematic overviews of literature on a specific topic are pivotal not only for scholars, but also for clinicians, policy-makers, journalists and, ultimately, the general public 5 , 6 , 7 .
Given that screening the entire research literature on a given topic is too labour intensive, scholars often develop quite narrow searches. Developing a search strategy for a systematic review is an iterative process aimed at balancing recall and precision 8 , 9 ; that is, including as many potentially relevant studies as possible while simultaneously limiting the total number of studies retrieved. The vast number of publications in the field of study often leads to a relatively precise search, with the risk of missing relevant studies. The process of systematic reviewing is error prone and extremely time intensive 10 . In fact, if the literature of a field is growing faster than the amount of time available for systematic reviews, adequate manual review of this field then becomes impossible 11 .
The rapidly evolving field of machine learning has aided researchers by allowing the development of software tools that assist in developing systematic reviews 11 , 12 , 13 , 14 . Machine learning offers approaches to overcome the manual and time-consuming screening of large numbers of studies by prioritizing relevant studies via active learning 15 . Active learning is a type of machine learning in which a model can choose the data points (for example, records obtained from a systematic search) it would like to learn from and thereby drastically reduce the total number of records that require manual screening 16 , 17 , 18 . In most so-called human-in-the-loop 19 machine-learning applications, the interaction between the machine-learning algorithm and the human is used to train a model with a minimum number of labelling tasks. Unique to systematic reviewing is that not only do all relevant records (that is, titles and abstracts) need to be seen by a researcher, but an extremely diverse range of concepts also needs to be learned, thereby requiring flexibility in the modelling approach as well as careful error evaluation 11 . In the case of systematic reviewing, the algorithm(s) are interactively optimized for finding the most relevant records, instead of finding the most accurate model. The term researcher-in-the-loop was introduced 20 as a special case of human-in-the-loop with three unique components: (1) the primary output of the process is a selection of the records, not a trained machine learning model; (2) all records in the relevant selection are seen by a human at the end of the process 21 ; (3) the use case requires a reproducible workflow and complete transparency 22 .
Existing tools that implement such an active learning cycle for systematic reviewing are described in Table 1 ; see the Supplementary Information for an overview of all of the software that we considered (note that this list was based on a review of software tools 12 ). However, existing tools have two main drawbacks. First, many are closed source applications with black box algorithms, which is problematic as transparency and data ownership are essential in the era of open science 22 . Second, to our knowledge, existing tools lack the necessary flexibility to deal with the large range of possible concepts to be learned by a screening machine. For example, in systematic reviews, the optimal type of classifier will depend on variable parameters, such as the proportion of relevant publications in the initial search and the complexity of the inclusion criteria used by the researcher 23 . For this reason, any successful system must allow for a wide range of classifier types. Benchmark testing is crucial to understand the real-world performance of any machine learning-aided system, but such benchmark options are currently mostly lacking.
In this paper we present an open source machine learning-aided pipeline with active learning for systematic reviews called ASReview. The goal of ASReview is to help scholars and practitioners to get an overview of the most relevant records for their work as efficiently as possible while being transparent in the process. The open, free and ready-to-use software ASReview addresses all concerns mentioned above: it is open source, uses active learning and allows multiple machine learning models. It also has a benchmark mode, which is especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline. Although we focus this paper on systematic reviews, ASReview can handle any text source.
In what follows, we first present the pipeline for manual versus machine learning-aided systematic reviews. We then show how ASReview has been set up and how ASReview can be used in different workflows by presenting several real-world use cases. We subsequently demonstrate the results of simulations that benchmark performance and present the results of a series of user-experience tests. Finally, we discuss future directions.
Pipeline for manual and machine learning-aided systematic reviews
The pipeline of a systematic review without active learning traditionally starts with researchers performing a comprehensive search in multiple databases 24 , using free-text words as well as controlled vocabulary to retrieve potentially relevant references. The researcher then typically verifies that the key papers they expect to find are indeed included in the search results. The researcher downloads a file of records containing the text to be screened into a reference manager; in the case of systematic reviewing, this file contains the titles and abstracts (and potentially other metadata such as the authors’ names, journal name and DOI) of potentially relevant references. Ideally, two or more researchers then screen the records’ titles and abstracts on the basis of the eligibility criteria established beforehand 4 . After all records have been screened, the full texts of the potentially relevant records are read to determine which of them will ultimately be included in the review. Most records are excluded in the title and abstract phase. Typically, only a small fraction of the records belong to the relevant class, making title and abstract screening an important bottleneck in the systematic reviewing process 25 . For instance, a recent study analysed 10,115 records and excluded 9,847 after title and abstract screening, a drop of more than 95% 26 . ASReview therefore focuses on this labour-intensive step.
The research pipeline of ASReview is depicted in Fig. 1 . The researcher starts with a search exactly as described above and subsequently uploads a file containing the records (that is, metadata containing the text of the titles and abstracts) into the software. Prior knowledge is then selected, which is used to train the first model and to present the first record to the researcher. As screening is a binary classification problem, the reviewer must select at least one key record to include and at least one to exclude on the basis of background knowledge. More prior knowledge may result in improved efficiency of the active learning process.
Fig. 1 caption: the symbols indicate whether an action is taken by a human, by a computer, or whether both options are available.
A machine learning classifier is trained to predict study relevance (labels) from a representation of the record-containing text (feature space) on the basis of the prior knowledge. We have purposefully chosen not to include an author name or citation network representation in the feature space to prevent authority bias in the inclusions. In the active learning cycle, the software presents one new record to be screened and labelled by the user. The user’s binary label (1 for relevant versus 0 for irrelevant) is subsequently used to train a new model, after which a new record is presented to the user. This cycle continues until a user-specified stopping criterion has been reached. The user then has a file with (1) records labelled as either relevant or irrelevant and (2) unlabelled records ordered from most to least probably relevant, as predicted by the current model. This set-up helps the user to move through a large database much more quickly than in the manual process, while the decision process simultaneously remains transparent.
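To make the cycle concrete, the following minimal sketch implements a researcher-in-the-loop screening loop using generic scikit-learn components (TF–IDF features, a naive Bayes classifier and certainty-based sampling). It is an illustration of the technique rather than ASReview’s internal code; the names texts, prior and ask_reviewer are placeholders for the record texts, the prior knowledge and the human labelling step.

# Minimal sketch of an active learning screening cycle (illustrative, not ASReview's internal code).
# `texts` is a list of title+abstract strings; `prior` maps record indices to labels
# (1 = relevant, 0 = irrelevant) and must contain at least one of each class;
# `ask_reviewer(index, text)` returns the human's 1/0 decision for one record.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def screen(texts, prior, ask_reviewer, n_queries=100):
    X = TfidfVectorizer().fit_transform(texts)          # document-term matrix, built once
    labels = dict(prior)                                 # record index -> 0/1
    for _ in range(n_queries):
        pool = [i for i in range(len(texts)) if i not in labels]
        if not pool:
            break
        train_idx = list(labels)
        model = MultinomialNB().fit(X[train_idx], [labels[i] for i in train_idx])
        proba = model.predict_proba(X)[:, 1]             # P(relevant) for every record
        query = max(pool, key=lambda i: proba[i])        # certainty-based: most likely relevant
        labels[query] = ask_reviewer(query, texts[query])
    # final model ranks the remaining, unlabelled records from most to least probably relevant
    train_idx = list(labels)
    model = MultinomialNB().fit(X[train_idx], [labels[i] for i in train_idx])
    proba = model.predict_proba(X)[:, 1]
    ranking = sorted((i for i in range(len(texts)) if i not in labels), key=lambda i: -proba[i])
    return labels, ranking

The ordered ranking returned at the end corresponds to the second output described above: the unlabelled records sorted by the current model's predicted relevance.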
Software implementation for ASReview
The source code 27 of ASReview is available open source under an Apache 2.0 license, including documentation 28 . Compiled and packaged versions of the software are available on the Python Package Index 29 or Docker Hub 30 . The free and ready-to-use software ASReview implements oracle, simulation and exploration modes. The oracle mode is used to perform a systematic review with interaction by the user, the simulation mode is used for simulation of the ASReview performance on existing datasets, and the exploration mode can be used for teaching purposes and includes several preloaded labelled datasets.
The oracle mode presents records to the researcher, who classifies them. Multiple file formats are supported: (1) RIS files, as used by digital libraries such as IEEE Xplore, Scopus and ScienceDirect; the citation managers Mendeley, RefWorks, Zotero and EndNote also support the RIS format. (2) Tabular datasets with the .csv, .xlsx and .xls file extensions. CSV files should be comma separated and UTF-8 encoded; for CSV files, the software accepts a set of predetermined labels in line with the ones used in RIS files. Each record in the dataset should hold the metadata on, for example, a scientific publication. The mandatory metadata is text, for example the titles or abstracts of scientific papers. If available, both are used to train the model, but at least one is needed. An advanced option splits the titles and abstracts in the feature-extraction step and weights the two feature matrices independently (for TF–IDF only). Other metadata such as author, date, DOI and keywords are optional and not used for training the models. When using ASReview in the simulation or exploration mode, an additional binary variable is required to indicate historical labelling decisions. This column, which is automatically detected, can also be used in the oracle mode as background knowledge for a previous selection of relevant papers before entering the active learning cycle. If such a column is unavailable, the user has to select at least one relevant record, which can be identified by searching the pool of records. At least one irrelevant record should also be identified; the software allows the user to search for specific records or presents random records, which are most likely to be irrelevant given the extremely imbalanced data.
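As an illustration of the kind of validation applied to a tabular input file, the short sketch below checks for a usable text column and an optional label column. The column names used here ("title", "abstract", "included") are assumptions made for the example; the exact names accepted by ASReview are listed in the documentation 28 .

# Sketch of a minimal check on a tabular dataset (CSV must be comma separated and UTF-8 encoded).
# Column names are illustrative assumptions, not necessarily those accepted by ASReview.
import pandas as pd

def load_records(path):
    df = pd.read_csv(path, encoding="utf-8")
    text_cols = [c for c in ("title", "abstract") if c in df.columns]
    if not text_cols:
        raise ValueError("At least a title or an abstract column is required for training.")
    labels = df["included"] if "included" in df.columns else None   # optional historical decisions
    return df[text_cols], labels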
The software has a simple yet extensible default model: a naive Bayes classifier, TF–IDF feature extraction, a dynamic resampling balance strategy 31 and certainty-based sampling 17 , 32 for the query strategy. These defaults were chosen on the basis of their consistently high performance in benchmark experiments across several datasets 31 . Moreover, the low computation time of these default settings makes them attractive in applications, given that the software should be able to run locally. Users can change the settings, shown in Table 2 , and technical details are described in our documentation 28 . Users can also add their own classifiers, feature extraction techniques, query strategies and balance strategies.
ASReview has a number of implemented features (see Table 2 ). First, there are several classifiers available: (1) naive Bayes; (2) support vector machines; (3) logistic regression; (4) neural networks; (5) random forests; (6) LSTM-base, which consists of an embedding layer, an LSTM layer with one output, a dense layer and a single sigmoid output node; and (7) LSTM-pool, which consists of an embedding layer, an LSTM layer with many outputs, a max pooling layer and a single sigmoid output node. The feature extraction techniques available are Doc2Vec 33 , embedding LSTM, embedding with IDF or TF–IDF 34 (the default is unigram, with the option to run n-grams while other parameters are set to the defaults of Scikit-learn 35 ) and sBERT 36 . The available query strategies for the active learning part are (1) random selection, ignoring model-assigned probabilities; (2) uncertainty-based sampling, which chooses the most uncertain record according to the model (that is, closest to 0.5 probability); (3) certainty-based sampling (max in ASReview), which chooses the record most likely to be included according to the model; and (4) mixed sampling, which uses a combination of random and certainty-based sampling.
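The difference between the query strategies can be illustrated by how each one orders a pool of unlabelled records given the model's predicted probabilities of relevance; the probabilities below are hypothetical values used only for illustration.

# Illustrative orderings for a hypothetical vector of predicted relevance probabilities.
import numpy as np

proba = np.array([0.91, 0.08, 0.52, 0.33, 0.76])         # hypothetical model output for five records

certainty_order   = np.argsort(-proba)                    # 'max': most likely relevant first
uncertainty_order = np.argsort(np.abs(proba - 0.5))       # most uncertain (closest to 0.5) first
random_order      = np.random.permutation(len(proba))     # ignores the model entirely
# mixed sampling interleaves draws from certainty_order and random_order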
There are several balance strategies that rebalance and reorder the training data. This is necessary because the data are typically extremely imbalanced. We have implemented the following balance strategies: (1) full sampling, which uses all of the labelled records; (2) undersampling the irrelevant records so that the included and excluded records are in some particular ratio (closer to one); and (3) dynamic resampling, a novel method similar to undersampling in that it decreases the imbalance of the training data 31 . However, in dynamic resampling, the number of irrelevant records is decreased, whereas the number of relevant records is increased by duplication such that the total number of records in the training data remains the same. The ratio between relevant and irrelevant records is not fixed over iterations, but is dynamically updated depending on the number of labelled records, the total number of records and the ratio between relevant and irrelevant records. Details on all of the described algorithms can be found in the code and documentation referred to above.
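A simplified sketch of the rebalancing idea is given below: irrelevant records are subsampled and relevant records are duplicated so that the size of the training set is preserved. For simplicity the sketch assumes a fixed target ratio, whereas dynamic resampling in ASReview updates this ratio during the active learning cycle as described above.

# Simplified sketch of rebalancing by duplication and subsampling (fixed ratio assumed;
# the actual dynamic resampling strategy updates the ratio during the active learning cycle).
# Assumes `relevant_idx` is non-empty.
import random

def rebalance(relevant_idx, irrelevant_idx, target_ratio=0.5, seed=0):
    rng = random.Random(seed)
    n_total = len(relevant_idx) + len(irrelevant_idx)      # keep the training set size constant
    n_rel = max(1, int(target_ratio * n_total))
    n_irr = n_total - n_rel
    rel = [relevant_idx[i % len(relevant_idx)] for i in range(n_rel)]    # duplicate relevant records
    irr = rng.sample(irrelevant_idx, min(n_irr, len(irrelevant_idx)))    # subsample irrelevant records
    train = rel + irr
    rng.shuffle(train)
    return train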
By default, ASReview converts the records’ texts into a document-term matrix; terms are converted to lowercase and no stop words are removed (although this can be changed). As the document-term matrix is identical in each iteration of the active learning cycle, it is generated in advance of model training and stored in the (active learning) state file. Each row of the document-term matrix can easily be requested from the state file. Records are internally identified by their row number in the input dataset. In oracle mode, the record that is selected to be classified is retrieved from the state file, and the record text and other metadata (such as title and abstract) are retrieved from the original dataset (from the file or the computer’s memory). ASReview can run on a local computer, or on a (self-hosted) local or remote server. Data (all records and their labels) remain on the user’s computer. Data ownership and confidentiality are crucial and no data are processed or used in any way by third parties. This is unique in comparison with some of the existing systems, as shown in the last column of Table 1 .
Real-world use cases and high-level function descriptions
Below we highlight a number of real-world use cases and high-level function descriptions for using the pipeline of ASReview.
ASReview can be integrated in classic systematic reviews or meta-analyses. Such reviews or meta-analyses entail several explicit and reproducible steps, as outlined in the PRISMA guidelines 4 . Scholars identify all likely relevant publications in a standardized way, screen the retrieved publications to select eligible studies on the basis of defined eligibility criteria, extract data from eligible studies and synthesize the results. ASReview fits into this process, particularly in the abstract screening phase. ASReview does not replace the initial step of collecting all potentially relevant studies. As such, results from ASReview depend on the quality of the initial search process, including the selection of databases 24 and the construction of comprehensive searches using keywords and controlled vocabulary. However, ASReview can be used to broaden the scope of the search (by keyword expansion or by omitting limitations in the search query), resulting in a larger number of initial papers and limiting the risk of missing relevant papers during the search (that is, placing more emphasis on recall than on precision).
Furthermore, many reviewers nowadays move towards meta-reviews when analysing very large literature streams, that is, systematic reviews of systematic reviews 37 . This can be problematic as the various reviews included could use different eligibility criteria and are therefore not always directly comparable. Due to the efficiency of ASReview, scholars using the tool could conduct the study by analysing the papers directly instead of using the systematic reviews. Furthermore, ASReview supports the rapid update of a systematic review. The included papers from the initial review are used to train the machine learning model before screening of the updated set of papers starts. This allows the researcher to quickly screen the updated set of papers on the basis of decisions made in the initial run.
As an example case, let us look at the current literature on COVID-19 and the coronavirus. An enormous number of papers are being published on COVID-19. It is very time consuming to manually find relevant papers (for example, to develop treatment guidelines). This is especially problematic as urgent overviews are required. Medical guidelines rely on comprehensive systematic reviews, but the medical literature is growing at a breakneck pace and the quality of the research is not universally adequate for summarization into policy 38 . Such reviews must entail adequate protocols with explicit and reproducible steps, including identifying all potentially relevant papers, extracting data from eligible studies, assessing potential for bias and synthesizing the results into medical guidelines. Researchers need to screen (tens of) thousands of COVID-19-related studies by hand to find relevant papers to include in their overview. Using ASReview, this can be done far more efficiently: by selecting key papers that match their (COVID-19) research question in the first step, researchers start the active learning cycle, which then presents the most relevant COVID-19 papers for their research question next. A plug-in was therefore developed for ASReview 39 , which contains three datasets that are updated automatically whenever a new version is released by the owners of the data: (1) the CORD-19 database, developed by the Allen Institute for AI, containing all publications on COVID-19 and other coronavirus research (for example, SARS and MERS) from PubMed Central, the WHO COVID-19 database of publications, the preprint servers bioRxiv and medRxiv, and papers contributed by specific publishers 40 . The CORD-19 dataset is updated daily by the Allen Institute for AI, and the plug-in mirrors these daily updates. (2) In addition to the full dataset, we automatically construct a daily subset of the database with studies published after December 1st, 2019 to search for relevant papers published during the COVID-19 crisis. (3) A separate dataset of COVID-19-related preprints, containing metadata of preprints from over 15 preprint servers across disciplines, published since January 1st, 2020 41 . The preprint dataset is updated weekly by the maintainers and then automatically updated in ASReview as well. As this dataset is not readily available to researchers through regular search engines (for example, PubMed), its inclusion in ASReview provides added value to researchers interested in COVID-19 research, especially those who want a quick way to screen preprints specifically.
Simulation study
To evaluate the performance of ASReview on a labelled dataset, users can employ the simulation mode. As an example, we ran simulations based on four labelled datasets with version 0.7.2 of ASReview. All scripts to reproduce the results in this paper can be found on Zenodo ( https://doi.org/10.5281/zenodo.4024122 ) 42 , whereas the results are available at OSF ( https://doi.org/10.17605/OSF.IO/2JKD6 ) 43 .
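Conceptually, the simulation mode replays the active learning cycle on a fully labelled dataset, with the stored labels standing in for the human reviewer, and records how quickly the relevant records are found. The sketch below illustrates this idea by reusing the screen function from the earlier sketch; it simplifies the choice of prior knowledge (the first relevant and first irrelevant records, whereas the simulations reported here selected them at random) and is not the packaged simulation mode itself.

# Conceptual sketch of simulation mode: the known labels replace the human oracle,
# so screening can be replayed and the discovery of relevant records tracked.
def simulate(texts, true_labels, n_prior_relevant=1, n_prior_irrelevant=1):
    relevant = [i for i, y in enumerate(true_labels) if y == 1]
    irrelevant = [i for i, y in enumerate(true_labels) if y == 0]
    # simplification: take the first labelled records as prior knowledge
    prior = {i: 1 for i in relevant[:n_prior_relevant]}
    prior.update({i: 0 for i in irrelevant[:n_prior_irrelevant]})

    screened = []                                          # order in which records are screened
    def oracle(idx, _text):
        screened.append(idx)
        return true_labels[idx]

    screen(texts, prior, oracle, n_queries=len(texts))
    # cumulative number of relevant records found after each screening step
    found = [sum(true_labels[i] for i in screened[:k + 1]) for k in range(len(screened))]
    return screened, found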
First, we analysed the performance for a study systematically describing studies that performed viral metagenomic next-generation sequencing in common livestock such as cattle, small ruminants, poultry and pigs 44 . Studies were retrieved from Embase ( n = 1,806), Medline ( n = 1,384), Cochrane Central ( n = 1), Web of Science ( n = 977) and Google Scholar ( n = 200, the top relevant references). After deduplication this led to 2,481 studies obtained in the initial search, of which 120 were inclusions (4.84%).
A second simulation study was performed on the results of a systematic review of studies on fault prediction in software engineering 45 . Studies were obtained from the ACM Digital Library, IEEE Xplore and the ISI Web of Science. Furthermore, a snowballing strategy and a manual search were conducted, resulting in a total of 8,911 publications, of which 104 were included in the systematic review (1.2%).
A third simulation study was performed on a review of longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of post-traumatic stress assessed after trauma exposure 46 , 47 ; 5,782 studies were obtained by searching PubMed, Embase, PsycINFO and Scopus and through a snowballing strategy in which both the references and the citations of the included papers were screened. Thirty-eight studies were included in the review (0.66%).
A fourth simulation study was performed on the results for a systematic review on the efficacy of angiotensin-converting enzyme inhibitors, from a study collecting various systematic review datasets from the medical sciences 15 . The collection is a subset of 2,544 publications from the TREC 2004 Genomics Track document corpus 48 . This is a static subset from all MEDLINE records from 1994 through 2003, which allows for replicability of results. Forty-one publications were included in the review (1.6%).
Performance metrics
We evaluated the four datasets using three performance metrics. We first assess the work saved over sampling (WSS), which is the percentage reduction in the number of records needed to screen that is achieved by using active learning instead of screening records at random. WSS is measured at a given level of recall of relevant records, for example 95%, indicating the reduction in screening effort at the cost of failing to detect 5% of the relevant records. For some researchers it is essential that all relevant literature on the topic is retrieved; this requires a recall of 100% (that is, WSS@100%). We also report the proportion of relevant references found after screening the first 10% of the records (RRF@10%). This is a useful metric for getting a quick overview of the relevant literature.
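For readers who wish to compute these metrics on their own simulation output, the following sketch gives one common formulation of WSS at a given recall level and of RRF@10%, operating on the screening order produced by a simulated run (for example, the screened list returned by the sketch above). It follows the definitions given here and is not the exact evaluation code used for the results below.

# One common formulation of the metrics described above, computed from the order in which
# records were screened (excluding prior records). `true_labels` holds the known 1/0 decisions.
def wss(screened, true_labels, recall=0.95):
    n = len(true_labels)
    n_relevant = sum(true_labels)
    found = 0
    for k, idx in enumerate(screened, start=1):
        found += true_labels[idx]
        if found >= recall * n_relevant:
            # fraction of records left unscreened, minus the recall loss accepted
            return (n - k) / n - (1.0 - recall)
    return 0.0

def rrf(screened, true_labels, fraction=0.10):
    k = int(fraction * len(true_labels))
    found = sum(true_labels[idx] for idx in screened[:k])
    return found / sum(true_labels)                        # share of relevant records in the first 10%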
For every dataset, 15 runs were performed with one random inclusion and one random exclusion (see Fig. 2 ). The classical review performance with randomly found inclusions is shown by the dashed line. The average work saved over sampling at 95% recall for ASReview is 83%, ranging from 67% to 92%. Hence, 95% of the eligible studies will be found after screening only between 8% and 33% of the studies. Furthermore, the proportion of relevant abstracts found after reading 10% of the abstracts ranges from 70% to 100%. In short, our software would have saved many hours of work.
a – d , Results of the simulation study for a study systematically reviewing studies that performed viral metagenomic next-generation sequencing in common livestock ( a ), a systematic review of studies on fault prediction in software engineering ( b ), longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of post-traumatic stress assessed after trauma exposure ( c ) and a systematic review on the efficacy of angiotensin-converting enzyme inhibitors ( d ). Fifteen runs (shown with separate lines) were performed for every dataset, with only one random inclusion and one random exclusion. The classical review performances with randomly found inclusions are shown by the dashed lines.
Usability testing (user experience testing)
We conducted a series of user experience tests to learn from end users how they experience the software and implement it in their workflow. The study was approved by the Ethics Committee of the Faculty of Social and Behavioral Sciences of Utrecht University (ID 20-104).
Unstructured interviews
The first user experience (UX) test—carried out in December 2019—was conducted with an academic research team in a substantive research field (public administration and organizational science) that has conducted various systematic reviews and meta-analyses. It was composed of three university professors (ranging from assistant to full) and three PhD candidates. In one 3.5 h session, the participants used the software and provided feedback via unstructured interviews and group discussions. The goal was to provide feedback on installing the software and testing the performance on their own data. After these sessions we prioritized the feedback in a meeting with the ASReview team, which resulted in the release of v.0.4 and v.0.6. An overview of all releases can be found on GitHub 27 .
A second UX test was conducted with four experienced researchers developing medical guidelines based on classical systematic reviews, and two experienced reviewers working at a pharmaceutical non-profit organization who work on updating reviews with new data. In four sessions, held in February to March 2020, these users tested the software following our testing protocol. After each session we implemented the feedback provided by the experts and asked them to review the software again. The main feedback was about how to upload datasets and select prior papers. Their feedback resulted in the release of v.0.7 and v.0.9.
Systematic UX test
In May 2020 we conducted a systematic UX test. Two groups of users were distinguished: inexperienced users and experienced users who had already used ASReview. Owing to the COVID-19 lockdown, the usability tests were conducted via video calling, a set-up known as human-moderated remote testing 49 , in which one person gives instructions to the participant and one person observes. During the tests, one person (SH) asked the questions and helped the participant with the tasks, while the other person, a user experience professional at the IT department of Utrecht University (MH), observed and took notes.
To analyse the notes, we used thematic analysis, a method for analysing data by dividing the information into themes that each have a distinct meaning 50 , using the NVivo 12 software 51 . When something went wrong, the text was coded as a showstopper; when something did not go smoothly, the text was coded as doubtful; and when something went well, the subject was coded as superb. The features the participants requested for future versions of the ASReview tool were discussed with the lead engineer of the ASReview team and were submitted to GitHub as issues or feature requests.
The answers to the quantitative questions can be found at the Open Science Framework 52 . The participants ( N = 11) rated the tool with a grade of 7.9 (s.d. = 0.9) on a scale from one to ten (Table 2 ). The inexperienced users on average rated the tool with an 8.0 (s.d. = 1.1, N = 6). The experienced users on average rated the tool with a 7.8 (s.d. = 0.9, N = 5). The participants described the usability test with words such as helpful, accessible, fun, clear and obvious.
The UX tests resulted in the releases v0.10 and v0.10.1 and in the major release v0.11, a major revision of the graphical user interface. The documentation has been upgraded to make installing and launching ASReview more straightforward. We made setting up a project, selecting a dataset and finding past knowledge more intuitive and flexible. We also added a project dashboard with information on progress and advanced settings.
Continuous input via the open source community
Finally, the ASReview development team receives continuous feedback from the open science community about, among other things, the user experience. In every new release we implement features listed by our users. Recurring UX tests are performed to keep up with the needs of users and improve the value of the tool.
We designed a system to accelerate the step of screening titles and abstracts to help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible. Our system uses active learning to train a machine learning model that predicts relevance from texts using a limited number of labelled examples. The classifier, feature extraction technique, balance strategy and active learning query strategy are flexible. We provide an open source software implementation, ASReview, together with a comparison against state-of-the-art systems and an evaluation across a wide range of real-world systematic reviewing applications. Based on our experiments, ASReview provides default parameter settings that exhibited good performance on average across the applications we examined. However, we stress that in practical applications these defaults should be carefully examined; for this purpose, the software provides a simulation mode to users. We encourage users and developers to perform further evaluation of the proposed approach in their application, and to take advantage of the open source nature of the project by contributing further developments.
Drawbacks of machine learning-based screening systems, including our own, remain. First, although the active learning step greatly reduces the number of manuscripts that must be screened, it also prevents a straightforward evaluation of the system’s error rates without further onerous labelling. Providing users with an accurate estimate of the system’s error rate in the application at hand is therefore a pressing open problem. Second, although, as argued above, the use of such systems is not limited in principle to reviewing, no empirical benchmarks of actual performance in these other situations yet exist to our knowledge. Third, machine learning-based screening systems automate the screening step only; although the screening step is time-consuming and a good target for automation, it is just one part of a much larger process, including the initial search, data extraction, coding for risk of bias, summarizing results and so on. Although some other works, similar to our own, have looked at (semi-)automating some of these steps in isolation 53 , 54 , to our knowledge the field is still far removed from an integrated system that would truly automate the review process while guaranteeing the quality of the produced evidence synthesis. Integrating the various tools that are currently under development to aid the systematic reviewing pipeline is therefore a worthwhile topic for future development.
Possible future research could also focus on the performance of identifying full-text articles with different document lengths and domain-specific terminologies, or even other types of text, such as newspaper articles and court cases. When the selection of past knowledge is not possible on the basis of expert knowledge, alternative methods could be explored. For example, unsupervised learning or pseudolabelling algorithms could be used to improve training 55 , 56 . In addition, as the NLP community pushes forward the state of the art in feature extraction methods, these can easily be added to our system as well. In all cases, performance benefits should be carefully evaluated using benchmarks for the task at hand. To this end, common benchmark challenges should be constructed that allow for an even comparison of the various tools now available. To facilitate such a benchmark, we have constructed a repository of publicly available systematic reviewing datasets 57 .
The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We invite the community to contribute to open source projects such as our own, as well as to common benchmark challenges, so that we can provide measurable and reproducible improvement over current practice.
Data availability
The results described in this paper are available at the Open Science Framework ( https://doi.org/10.17605/OSF.IO/2JKD6 ) 43 . The answers to the quantitative questions of the UX test can be found at the Open Science Framework (OSF.IO/7PQNM) 52 .
Code availability
All code to reproduce the results described in this paper can be found on Zenodo ( https://doi.org/10.5281/zenodo.4024122 ) 42 . All code for the software ASReview is available under an Apache 2.0 license ( https://doi.org/10.5281/zenodo.3345592 ) 27 , is maintained on GitHub 63 and includes documentation ( https://doi.org/10.5281/zenodo.4287120 ) 28 .
Bornmann, L. & Mutz, R. Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. J. Assoc. Inf. Sci. Technol. 66 , 2215–2222 (2015).
Gough, D., Oliver, S. & Thomas, J. An Introduction to Systematic Reviews (Sage, 2017).
Cooper, H. Research Synthesis and Meta-analysis: A Step-by-Step Approach (SAGE Publications, 2015).
Liberati, A. et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. J. Clin. Epidemiol. 62 , e1–e34 (2009).
Boaz, A. et al. Systematic Reviews: What have They Got to Offer Evidence Based Policy and Practice? (ESRC UK Centre for Evidence Based Policy and Practice London, 2002).
Oliver, S., Dickson, K. & Bangpan, M. Systematic Reviews: Making Them Policy Relevant. A Briefing for Policy Makers and Systematic Reviewers (UCL Institute of Education, 2015).
Petticrew, M. Systematic reviews from astronomy to zoology: myths and misconceptions. Brit. Med. J. 322 , 98–101 (2001).
Lefebvre, C., Manheimer, E. & Glanville, J. in Cochrane Handbook for Systematic Reviews of Interventions (eds. Higgins, J. P. & Green, S.) 95–150 (John Wiley & Sons, 2008); https://doi.org/10.1002/9780470712184.ch6 .
Sampson, M., Tetzlaff, J. & Urquhart, C. Precision of healthcare systematic review searches in a cross-sectional sample. Res. Synth. Methods 2 , 119–125 (2011).
Wang, Z., Nayfeh, T., Tetzlaff, J., O’Blenis, P. & Murad, M. H. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE 15 , e0227742 (2020).
Marshall, I. J. & Wallace, B. C. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst. Rev. 8 , 163 (2019).
Harrison, H., Griffin, S. J., Kuhn, I. & Usher-Smith, J. A. Software tools to support title and abstract screening for systematic reviews in healthcare: an evaluation. BMC Med. Res. Methodol. 20 , 7 (2020).
O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M. & Ananiadou, S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst. Rev. 4 , 5 (2015).
Wallace, B. C., Trikalinos, T. A., Lau, J., Brodley, C. & Schmid, C. H. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinf. 11 , 55 (2010).
Cohen, A. M., Hersh, W. R., Peterson, K. & Yen, P.-Y. Reducing workload in systematic review preparation using automated citation classification. J. Am. Med. Inform. Assoc. 13 , 206–219 (2006).
Kremer, J., Steenstrup Pedersen, K. & Igel, C. Active learning with support vector machines. WIREs Data Min. Knowl. Discov. 4 , 313–326 (2014).
Miwa, M., Thomas, J., O’Mara-Eves, A. & Ananiadou, S. Reducing systematic review workload through certainty-based screening. J. Biomed. Inform. 51 , 242–253 (2014).
Settles, B. Active Learning Literature Survey (Minds@UW, 2009); https://minds.wisconsin.edu/handle/1793/60660
Holzinger, A. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Inform. 3 , 119–131 (2016).
Van de Schoot, R. & De Bruin, J. Researcher-in-the-loop for Systematic Reviewing of Text Databases (Zenodo, 2020); https://doi.org/10.5281/zenodo.4013207
Kim, D., Seo, D., Cho, S. & Kang, P. Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec. Inf. Sci. 477 , 15–29 (2019).
Nosek, B. A. et al. Promoting an open research culture. Science 348 , 1422–1425 (2015).
Kilicoglu, H., Demner-Fushman, D., Rindflesch, T. C., Wilczynski, N. L. & Haynes, R. B. Towards automatic recognition of scientifically rigorous clinical research evidence. J. Am. Med. Inform. Assoc. 16 , 25–31 (2009).
Gusenbauer, M. & Haddaway, N. R. Which academic search systems are suitable for systematic reviews or meta‐analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 11 , 181–217 (2020).
Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 7 , e012545 (2017).
de Vries, H., Bekkers, V. & Tummers, L. Innovation in the Public Sector: a systematic review and future research agenda. Public Adm. 94 , 146–166 (2016).
Van de Schoot, R. et al. ASReview: Active Learning for Systematic Reviews (Zenodo, 2020); https://doi.org/10.5281/zenodo.3345592
De Bruin, J. et al. ASReview Software Documentation 0.14 (Zenodo, 2020); https://doi.org/10.5281/zenodo.4287120
ASReview PyPI Package (ASReview Core Development Team, 2020); https://pypi.org/project/asreview/
Docker container for ASReview (ASReview Core Development Team, 2020); https://hub.docker.com/r/asreview/asreview
Ferdinands, G. et al. Active Learning for Screening Prioritization in Systematic Reviews—A Simulation Study (OSF Preprints, 2020); https://doi.org/10.31219/osf.io/w6qbg
Fu, J. H. & Lee, S. L. Certainty-enhanced active learning for improving imbalanced data classification. In 2011 IEEE 11th International Conference on Data Mining Workshops 405–412 (IEEE, 2011).
Le, Q. V. & Mikolov, T. Distributed representations of sentences and documents. Preprint at https://arxiv.org/abs/1405.4053 (2014).
Ramos, J. Using TF–IDF to determine word relevance in document queries. In Proc. 1st Instructional Conference on Machine Learning Vol. 242, 133–142 (ICML, 2003).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using siamese BERT-networks. Preprint at https://arxiv.org/abs/1908.10084 (2019).
Smith, V., Devane, D., Begley, C. M. & Clarke, M. Methodology in conducting a systematic review of systematic reviews of healthcare interventions. BMC Med. Res. Methodol. 11 , 15 (2011).
Wynants, L. et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. Brit. Med. J . 369 , 1328 (2020).
Van de Schoot, R. et al. Extension for COVID-19 Related Datasets in ASReview (Zenodo, 2020). https://doi.org/10.5281/zenodo.3891420 .
Lu Wang, L. et al. CORD-19: The COVID-19 open research dataset. Preprint at https://arxiv.org/abs/2004.10706 (2020).
Fraser, N. & Kramer, B. Covid19_preprints (FigShare, 2020); https://doi.org/10.6084/m9.figshare.12033672.v18
Ferdinands, G., Schram, R., Van de Schoot, R. & De Bruin, J. Scripts for ‘ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews’ (Zenodo, 2020); https://doi.org/10.5281/zenodo.4024122
Ferdinands, G., Schram, R., van de Schoot, R. & de Bruin, J. Results for ‘ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews’ (OSF, 2020); https://doi.org/10.17605/OSF.IO/2JKD6
Kwok, K. T. T., Nieuwenhuijse, D. F., Phan, M. V. T. & Koopmans, M. P. G. Virus metagenomics in farm animals: a systematic review. Viruses 12 , 107 (2020).
Hall, T., Beecham, S., Bowes, D., Gray, D. & Counsell, S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38 , 1276–1304 (2012).
van de Schoot, R., Sijbrandij, M., Winter, S. D., Depaoli, S. & Vermunt, J. K. The GRoLTS-Checklist: guidelines for reporting on latent trajectory studies. Struct. Equ. Model. Multidiscip. J. 24 , 451–467 (2017).
van de Schoot, R. et al. Bayesian PTSD-trajectory analysis with informed priors based on a systematic literature search and expert elicitation. Multivar. Behav. Res. 53 , 267–291 (2018).
Cohen, A. M., Bhupatiraju, R. T. & Hersh, W. R. Feature generation, feature selection, classifiers, and conceptual drift for biomedical document triage. In Proc. 13th Text Retrieval Conference (TREC, 2004).
Vasalou, A., Ng, B. D., Wiemer-Hastings, P. & Oshlyansky, L. Human-moderated remote user testing: protocols and applications. In 8th ERCIM Workshop, User Interfaces for All Vol. 19 (ERCIM, 2004).
Joffe, H. in Qualitative Research Methods in Mental Health and Psychotherapy: A Guide for Students and Practitioners (eds Harper, D. & Thompson, A. R.) Ch. 15 (Wiley, 2012).
NVivo v. 12 (QSR International Pty, 2019).
Hindriks, S., Huijts, M. & van de Schoot, R. Data for UX-test ASReview - June 2020. OSF https://doi.org/10.17605/OSF.IO/7PQNM (2020).
Marshall, I. J., Kuiper, J. & Wallace, B. C. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J. Am. Med. Inform. Assoc. 23 , 193–201 (2016).
Nallapati, R., Zhou, B., dos Santos, C. N., Gulcehre, Ç. & Xiang, B. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. 20th SIGNLL Conference on Computational Natural Language Learning 280–290 (Association for Computational Linguistics, 2016).
Xie, Q., Dai, Z., Hovy, E., Luong, M.-T. & Le, Q. V. Unsupervised data augmentation for consistency training. Preprint at https://arxiv.org/abs/1904.12848 (2019).
Ratner, A. et al. Snorkel: rapid training data creation with weak supervision. VLDB J. 29 , 709–730 (2020).
Systematic Review Datasets (ASReview Core Development Team, 2020); https://github.com/asreview/systematic-review-datasets
Wallace, B. C., Small, K., Brodley, C. E., Lau, J. & Trikalinos, T. A. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. In Proc. 2nd ACM SIGHIT International Health Informatics Symposium 819–824 (Association for Computing Machinery, 2012).
Cheng, S. H. et al. Using machine learning to advance synthesis and use of conservation and environmental evidence. Conserv. Biol. 32 , 762–764 (2018).
Yu, Z., Kraft, N. & Menzies, T. Finding better active learners for faster literature reviews. Empir. Softw. Eng . 23 , 3161–3186 (2018).
Ouzzani, M., Hammady, H., Fedorowicz, Z. & Elmagarmid, A. Rayyan—a web and mobile app for systematic reviews. Syst. Rev. 5 , 210 (2016).
Przybyła, P. et al. Prioritising references for systematic reviews with RobotAnalyst: a user study. Res. Synth. Methods 9 , 470–488 (2018).
ASReview: Active learning for Systematic Reviews (ASReview Core Development Team, 2020); https://github.com/asreview/asreview
Acknowledgements
We would like to thank the Utrecht University Library, focus area Applied Data Science, and departments of Information and Technology Services, Test and Quality Services, and Methodology and Statistics, for their support. We also want to thank all researchers who shared data, participated in our user experience tests or who gave us feedback on ASReview in other ways. Furthermore, we would like to thank the editors and reviewers for providing constructive feedback. This project was funded by the Innovation Fund for IT in Research Projects, Utrecht University, the Netherlands.
Author information
Authors and affiliations.
Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands
Rens van de Schoot, Gerbrich Ferdinands, Albert Harkema, Joukje Willemsen, Yongchao Ma, Qixiang Fang, Sybren Hindriks & Daniel L. Oberski
Department of Research and Data Management Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands
Jonathan de Bruin, Raoul Schram, Parisa Zahedi & Maarten Hoogerwerf
Utrecht University Library, Utrecht University, Utrecht, the Netherlands
Jan de Boer, Felix Weijdema & Bianca Kramer
Department of Test and Quality Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands
Martijn Huijts
School of Governance, Faculty of Law, Economics and Governance, Utrecht University, Utrecht, the Netherlands
Lars Tummers
Department of Biostatistics, Data management and Data Science, Julius Center, University Medical Center Utrecht, Utrecht, the Netherlands
Daniel L. Oberski
Contributions
R.v.d.S. and D.O. originally designed the project, with later input from L.T. J.d.Br. is the lead engineer and software architect and supervises the code base on GitHub. R.S. coded the algorithms and simulation studies. P.Z. coded the very first version of the software. J.d.Bo., F.W. and B.K. developed the systematic review pipeline. M.Huijts led the UX tests, supported by S.H. M.Hoogerwerf developed the architecture of the produced (meta)data. G.F. conducted the simulation study together with R.S. A.H. performed the literature search comparing the different tools together with G.F. J.W. designed all the artwork and helped with formatting the manuscript. Y.M. and Q.F. are responsible for the preprocessing of the metadata under the supervision of J.d.Br. R.v.d.S., D.O. and L.T. wrote the paper with input from all authors. Each co-author has written parts of the manuscript.
Corresponding author
Correspondence to Rens van de Schoot .
Ethics declarations
Competing interests.
The authors declare no competing interests.
Additional information
Peer review information Nature Machine Intelligence thanks Jian Wu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Overview of software tools supporting systematic reviews.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
Reprints and permissions
About this article
Cite this article.
van de Schoot, R., de Bruin, J., Schram, R. et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell 3 , 125–133 (2021). https://doi.org/10.1038/s42256-020-00287-7
Download citation
Received : 04 June 2020
Accepted : 17 December 2020
Published : 01 February 2021
Issue Date : February 2021
DOI : https://doi.org/10.1038/s42256-020-00287-7
This article is cited by
What shapes statistical and data literacy research in K-12 STEM education? A systematic review of metrics and instructional strategies
- Anja Friedrich
- Saskia Schreiter
- Sarah Malone
International Journal of STEM Education (2024)
Semi-automated title-abstract screening using natural language processing and machine learning
- Maximilian Pilz
- Samuel Zimmermann
- Johannes A. Vey
Systematic Reviews (2024)
Assessing the health status of migrants upon arrival in Europe: a systematic review of the adverse impact of migration journeys
- Cristina Canova
- Lucia Dansero
- Isabella Rosato
Globalization and Health (2024)
A systematic review, meta-analysis, and meta-regression of the prevalence of self-reported disordered eating and associated factors among athletes worldwide
- Hadeel A. Ghazzawi
- Lana S. Nimer
- Haitham Jahrami
Journal of Eating Disorders (2024)
Systematic review using a spiral approach with machine learning
- Amirhossein Saeidmehr
- Piers David Gareth Steel
- Faramarz F. Samavati
Show Your Work. Share Your Work. Advance Science. That’s Open Science.
At the Center for Open Science, we believe an open exchange of ideas accelerates scientific progress toward solving our most persistent problems.
Instagram Data Access Pilot for Well-being Research
Save the Date: Metascience 2025 Conference in London
Global Flourishing Study Releases Wave One Open Research Data
Join our mailing list
Keep up on what's happening at the Center for Open Science, get new product updates, and learn more about open science with our blog and video updates.
About the Center for Open Science
We envision a future scholarly community in which the process, content, and outcomes of research are openly accessible by default.
Research Software Tools
We maintain the Open Science Framework (OSF) to help researchers conduct research more rigorously, and manage and share their work more openly.
Community Action and Culture Change
From individuals to institutions, we educate, train, and support research communities and their affiliates toward greater adoption of open and reproducible practices.
Metascience and Research
We conduct research on research practices to understand areas of inefficiency, and to evaluate interventions to improve.
The open research lifecycle
Conduct your own open science.
Share Data, Materials, or Code
Making your work openly visible to other researchers invites collaboration, allows others to discover and build on your work, and facilitates replication.
Share a Paper or Preprint
Sharing a paper or preprint on OSF accelerates scholarly communication, feedback that can improve the work, and discoverability of research. It only takes a few minutes.
Register Your Research
Create a time-stamped registration of any OSF project, or begin a preregistration prior to your study to increase the transparency and quality of your research.
Support the open science movement
When you support COS, you’re helping to change the culture toward more open, rigorous, reproducible research. Support open science with a donation today .
Spread the Message
Our blog and YouTube channel provide excellent information about open research best practices. Share these resources, or become an advocate by joining the COS Ambassador program .
Join Our Team
COS has an amazing team of researchers, designers, developers, and communicators that help move open science forward. View our openings .
Responsible stewards of your support
COS has earned top recognition from Charity Navigator and Candid (formerly GuideStar) for our financial transparency and accountability to our mission. COS and the OSF were also awarded SOC2 accreditation in 2023 after an independent assessment of our security and procedures by the American Institute of CPAs (AICPA) .
We invite all of our sponsors, partners, and members of the community to learn more about how our organization operates, our impact, our financial performance, and our nonprofit status.