A suggested data structure for transparent and repeatable reporting of bibliographic searching

Abstract: Academic searching is integral to research activities. We search (1) to retrieve specific information, (2) to expand our knowledge iteratively, and (3) to collate a representative and unbiased selection of the literature. Rigorous searching methods are vital for the reliable, repeatable and unbiased searches needed for the second and third forms of searching (exploratory and systematic searching, respectively), which form a core part of evidence syntheses. Despite broad awareness of the importance of transparency in reporting search activities in evidence syntheses, the importance of searching has only recently been highlighted and made the explicit focus of reporting guidance (PRISMA-S). Ensuring bibliographic searches are reported in a way that is transparent enough to allow for full repeatability or evaluation is challenging for a number of reasons. Here, we detail these reasons and provide for the first time a standardised data structure for transparent and comprehensive reporting of search histories. This data structure was produced by a group of international experts in informatics and library sciences. We explain how the data structure was produced and describe its components in detail. We also demonstrate its practical applicability in tools designed to support literature review authors and explain how it can help to improve interoperability across tools used to manage literature reviews. We call on the research community and developers of reference and review management tools to embrace the data structure to facilitate adequate reporting of academic searching in an effort to raise the standard of evidence syntheses globally.

Thirdly, we may have developed a conceptual model of a topic and know what we want to search for, but wish to search in an unbiased, procedural way to obtain a potentially relevant evidence base that we can then read and screen for relevant information.
This type of searching is referred to as 'systematic searching' (Gusenbauer & Haddaway, 2020; Jansen & Rieh, 2010) and is integral to systematic reviews and evidence-informed decision-making that aim to summarise large bodies of evidence in a reliable and robust way.
Exploratory searching is vital for planning systematic searches, and for conducting scoping reviews and other forms of syntheses (bringing together scientific information) that aim to improve understanding of the nature of an evidence base. For both exploratory and systematic searching, researchers often want or need to show the methods they used to search for information, so that they can demonstrate any efforts to reduce bias and increase comprehensiveness in their results. This transparency and the resultant repeatability are integral to robust evidence syntheses (Lefebvre et al., 2019; Page et al., 2021) and evidence-informed decision-making (Eden et al., 2011).
Despite the broad awareness of the importance of transparency in reporting search activities in evidence syntheses, such as systematic reviews and systematic maps (Koffel & Rethlefsen, 2016; Maggio et al., 2011; Mullins et al., 2014; Rader et al., 2014; Yoshii et al., 2009), it is only recently that the importance of searching has been highlighted and made the explicit focus of reporting guidance (PRISMA-S; Rethlefsen et al., 2021). Previous efforts focusing on transparency in reporting systematic reviews had only limited focus on the details of searching (PRISMA 2009; Moher et al., 2009), and such details were far from allowing full repeatability. This lack of search history transparency in evidence syntheses prevents full assessment of the quality of the searches: the inclusion of librarians as co-authors in systematic reviews has been shown to be correlated with higher quality searching (Rethlefsen et al., 2015; Schellinger et al., 2021), but a lack of detail prevents any assessment of conduct quality.
Searches for academic information typically revolve around searches of bibliographic databases (defined as a data set of bibliographic information that, for a given search strategy on a given date and time, would return a fixed and identical set of results) (Eden et al., 2011; Higgins et al., 2018), such as Scopus (http://www.scopus.com), but also often involve a suite of alternative sources and methods, including citation searching (Wright et al., 2014). Grey literature searches are, by their very nature, diverse and highly topic specific: typically the websites of tens of organisations and other repositories are manually searched for potentially relevant documents, with idiosyncratic/varied procedures that are not easily reported in a standardised format (Canadian Agency for Drugs and Technologies in Health, 2018). Academic searches of bibliographic databases, however, are far more consistent. Despite this, to date there have been only limited attempts to provide a standard way of reporting searches of bibliographic databases (e.g., Gulhane, 2009; Bethel et al., 2021; de Jonge & Lein, 2015; Lyon et al., 2014).
FIGURE 1 Schematic demonstrating the difference between platforms and databases, highlighting that different institutions may also subscribe to different extents of individual databases. The people on the left of the image represent different institutional subscriptions; the central square icons represent access platforms; the columns represent the year ranges (volume in colour) available for each database according to the users' institutional subscription (colours). Multi-coloured columns indicate databases accessible through different institutional subscriptions. Platforms may provide access to multiple databases. Different date ranges for a database may be provided to different institutional subscriptions or via different platforms. Some databases may be accessible via multiple platforms.
Ensuring bibliographic searches are reported in a way that is transparent enough to allow for full repeatability or evaluation is not easy. There are a variety of reasons why transparent, repeatable searches are challenging to ensure:
1. Researchers use both multi-database search platforms and individual databases. These systems differ in how the database (i.e., the specific, incrementally updated collection of bibliographic data) is searched (occasionally simultaneously in combination with other databases) and how the platform providing the search facility performs the search (see Figure 1). This diversity in search systems causes widespread misunderstandings regarding what a database is (i.e., what is a repeatable single resource). For example, many researchers believe that Web of Science is a database or that Web of Science Core Collection is a set of fixed databases. In fact, Web of Science is a platform through which many different databases can be searched, whilst Web of Science Core Collection is a set of between one and seven databases (Clarivate, 2022), and the time spans available to any user depend on their institutional subscription. Web of Science Core Collection is therefore not a repeatable single database, although it is often referred to as such in systematic reviews (Liu, 2019).
2. Partly because of the diversity in search systems described above, and partly because of the complex nature of bibliographic searching, it is not immediately clear what information is sufficient for repeatability. Many search settings are set by default (e.g., lemmatization and stemming), but many others must be selected (e.g., date restrictions). There are further settings which are established at a subscription level, and users may be unaware of them (e.g., the default Boolean/Phrase search mode in EBSCO is customisable by a host institution [EBSCO Industries, 2022]). As a result, many systematic review authors do not provide sufficient information to allow the searches to be precisely repeated.
3. Broadly speaking, there is a lack of awareness, use, and enforcement of reporting standards in systematic reviews. This applies to all aspects of review methods, but is particularly evident in search methods (de Kock et al., 2021; Page et al., 2016; Sargeant et al., 2021). The PRISMA reporting standards have been supplemented recently by the PRISMA-S extension for reporting searches (Rethlefsen et al., 2021), but these standards are not adopted by all journals, and are rarely enforced (Koffel & Rethlefsen, 2016; Nascimento et al., 2020; Tam et al., 2017).
4. Search strings (the collections of terms entered together into search facilities) are often long and complex, and it is easy for authors to introduce errors when they report their search strings and full search strategies if they transcribe text. Copying and pasting directly is less error prone, but not infallible. No standard file type exists for reporting search histories, so these data must typically be manually collated and reported in a review, a process that is at high risk of induced transcription errors (Sampson & McGowan, 2006).
5. Review authors must select from a myriad of possible places and ways to store search histories (and also search results) during conduct and when reporting their methods. In our experience, authors use a variety of tools to develop and track search histories including text document files (e.g., in Microsoft Word), spreadsheets (e.g., Google Sheets), digital notebooks (e.g., Microsoft OneNote), search history files exported from platforms (e.g., Web of Science), and review management tools (e.g., EPPI-Reviewer).
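The platform/database distinction in point 1 (and Figure 1) can be made concrete with a small sketch. The class names, the function `same_resource`, and the platform, database and institution names below are all hypothetical and chosen only for illustration; the year ranges are invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Coverage:
    """Year range of one database available under one subscription."""
    database: str
    first_year: int
    last_year: int


@dataclass
class PlatformAccess:
    """What one institution can actually search through one platform."""
    platform: str
    institution: str
    coverage: list[Coverage] = field(default_factory=list)

    def describe(self) -> list[str]:
        # One human-readable line per database reachable via this platform.
        return [
            f"{self.platform} > {c.database} ({c.first_year}-{c.last_year})"
            for c in self.coverage
        ]


def same_resource(a: PlatformAccess, b: PlatformAccess) -> bool:
    """True only if both searches ran on the same platform over the
    same databases with the same year ranges."""
    return a.platform == b.platform and set(a.coverage) == set(b.coverage)


# Hypothetical subscriptions: same platform name, different contents.
uni_a = PlatformAccess("Platform X", "University A", [
    Coverage("Database A", 1990, 2022),
    Coverage("Database B", 2000, 2022),
])
uni_b = PlatformAccess("Platform X", "University B", [
    Coverage("Database A", 2005, 2022),
])

print(uni_a.describe())
print(same_resource(uni_a, uni_b))
```

Under these assumptions, reporting only "Platform X" would describe both searches identically even though they ran over different data, which is why the platform name alone does not identify a repeatable resource.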
As mentioned above, none of these search history storage systems uses a standard data format, making reporting complicated and unclear, and providing no means of interoperability. Search history data cannot be exported from one system and uploaded to another without manual transcription or copying-and-pasting, which are error prone (see point 4 above). Reporting search histories in supplementary files is not an ideal means of storing information because: they are not typically peer-reviewed or checked before publication; they are typically not protected by a guarantee to be archived permanently; there is no requirement for ensuring text is digitised (text files are sometimes converted to flat images that cannot be searched or copied, and digital PDFs may be poorly digitised); and they are not discoverable independently (i.e., each file is not indexed in search engines such as Google Scholar as a separate entity), so may be particularly hard to find. Furthermore, interlibrary loans (ILL) do not cover provision of supplementary files, making them yet more inaccessible to many readers.
Because of the need to transcribe or copy-and-paste searches into bibliographic databases, the repeatability of searches is limited by reporting accuracy (how correct) and precision (how rich), and is further hampered by the degree of digitisation and digital accessibility (i.e., how easily text/data can be extracted and reused without the need for transcription).
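As a rough illustration of how reporting accuracy could be checked once search strings are available as machine-readable text, the sketch below compares the string that was run with the string that was reported, token by token, using Python's standard `difflib` module. The function name `transcription_diff` and both search strings are hypothetical.

```python
import difflib


def transcription_diff(original: str, reported: str) -> list[str]:
    """List token-level differences between the search string that was
    run and the one that was reported."""
    a, b = original.split(), reported.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record what was replaced, deleted or inserted.
            changes.append(f"{tag}: {' '.join(a[i1:i2])!r} -> {' '.join(b[j1:j2])!r}")
    return changes


# Hypothetical example: a truncation symbol is lost during transcription.
run_string = '(forest* OR woodland*) AND (biodiversity OR "species richness")'
reported_string = '(forest OR woodland*) AND (biodiversity OR "species richness")'

for change in transcription_diff(run_string, reported_string):
    print(change)
```

A check of this kind is only possible when both versions exist as extractable text, which is the digitisation problem described above.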
In sum, the systems used by review authors to store and report their bibliographic searches are not designed for transparent and repeatable reporting. This could be remedied by establishing a standard data format for reporting bibliographic search activities: this standard should specify what information should be reported (e.g., which data fields), and how it should be formatted (to allow for digitisation and unambiguous human/machine readability).
Going beyond this, a standard file type could be developed that would allow search history information to be readily and efficiently transferred from one search system to another (e.g., from Scopus to PubMed) and between different review management and reporting tools (e.g., EPPI Reviewer and Rayyan): this in turn would facilitate repeatability (allowing a third party to repeat and/or evaluate the original searches precisely). Such interoperability goes beyond tools that translate copied-and-pasted search strings between databases (e.g., Polyglot; http://sr-accelerator.com/#/polyglot) and allows for search histories to be sent from one database to another with no need for manual intervention or curation of the data itself.
We believe there is a suite of significant benefits from such a standardised file type for reporting search strategies. Firstly, it would allow the development of search archives that transparently store searches in publicly accessible and searchable repositories. The records in such a repository could be readily reused, evaluated, incrementally developed or amended, and cited, reducing research waste and improving research efficiency. In this way, search records could be open to public scrutiny and constructive feedback, further providing opportunities for improvement and learning. Secondly, it would support more complete reporting of search activities in evidence syntheses by setting expectations of which data fields to report. Thirdly, it would support interoperability and reusability of searches. Fourthly, it would facilitate evaluation and verification of search activities and error checking before, during and after searches and protocol/review publication. Fifthly, it could facilitate the creation of validated search filters/hedges, by supporting repositories of standard searches that could be incrementally refined. Finally, we believe that such repositories would allow for improved crediting of search specialists involved in designing and conducting searches by creating citable records that can be used to demonstrate impact.

| Objectives
Here, we present a suggested data structure that reports all details necessary to allow full repeatability of bibliographic database searches. The data structure was produced collaboratively by a group of specialists in information and library science and evidence synthesis methodology. This data structure outlines what information should be reported, how it should be presented, and suggests a way that this information can be encoded in a data file that would facilitate digital evaluation, reuse and interoperability. We believe this data structure would be of greatest use to developers producing review management tools and search history repositories, but also to keen systematic review authors wishing to ensure their methods are reported to a high level of detail.

| METHODS
We sought to assemble a diverse group of international experts from a range of professional backgrounds. We identified 19 experts and invited them to join the Advisory Group: 16 people responded positively and joined an online workshop introducing the project and its aims.
The Advisory Group was invited to comment, using a Google form, on a draft data structure that had been prepared by NRH and MLR. The draft structure consisted of five columns: item name; data example; textual description; requirement (compulsory or optional); and notes/comments. Members of the Advisory Group were asked to provide comments as one of three types: amendments to existing text; addition of items; exclusion of items.
The feedback was collated and the draft data structure adjusted accordingly. We present here the final proposed data structure, with the modifications following feedback from the Advisory Group described in detail in the Supporting Information. The data structure is presented in Table 1. Each item is accompanied by an example provided in JSON format (a text file format that is readable by humans and machines, providing a nested and hierarchical structure beyond what is possible with flat spreadsheets), a description, optional/compulsory status, and comments.
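To illustrate what "nested and hierarchical" means in practice, the following sketch parses a small JSON fragment with Python's standard `json` module and then flattens it into the single row a spreadsheet would require. The sub-field keys `name`, `ORCID`, `email`, `language`, `lemmatization` and `spellchecking` follow the examples in Table 1; the top-level keys `authors` and `settings`, the helper `flatten`, and all values are assumptions made for illustration.

```python
import json

record_text = """
{
  "authors": [
    {"name": "Author, A", "ORCID": "0000-0000-0000-0000",
     "email": "a.author@example.org"},
    {"name": "Author, B"}
  ],
  "language": "en",
  "settings": {"lemmatization": "on", "spellchecking": "Suggest"}
}
"""

record = json.loads(record_text)

# Nested access: each author is its own small structured record.
author_names = [author["name"] for author in record["authors"]]


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into one row of path -> scalar pairs,
    i.e. what a flat spreadsheet would need to hold the same data."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else k, v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    row = {}
    for path, child in items:
        row.update(flatten(child, path))
    return row


flat = flatten(record)
for column, cell in flat.items():
    print(column, "=", cell)
```

The flattened form needs a separate column for every author sub-field, and the number of columns changes with the number of authors, whereas the nested form keeps one `authors` field of variable length.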
Here, we justify the inclusion and formatting of each item:
1. Authors: due to the flexibility of a JSON format, each field may contain nested structured data. Here, the author field can contain the author name, ORCID identifier and email address for each author. This field corresponds to authors of the search strategy and is intended to provide acknowledgement and credit to search specialists. Authorship should be decided based on clearly defined and widely accepted definitions of co-authorship, for example, by adapting the CRediT authorship statement from the high-profit publisher Elsevier (https://www.elsevier.com/authors/policies-and-guidelines/credit-author-statement).
Authorship on a search record (i.e., a record that documents a specific search history) should not be used as justification for removing a search specialist from a review: such behaviour would be unethical at best.
3. String name: this is an optional 'tag' for internal purposes, for example labelling a substring as 'intervention' or 'outcome'.

Table 1 (excerpt). Columns: item; example; description; requirement; notes.
Authors. Example: "name": "Felton, A", "name": "Lindbladh, M", "ORCID": "0000-0002-3635-6354", "email": "m.seedre@mail.com". Requirement: Compulsory. Notes: Names compulsory, email and ORCID identifiers optional.
Data entry date. Notes: Can be populated automatically from search history export.
Language. Example: "language": "en". Description: Language limitations applied. Requirement: Optional. Notes: Can be populated automatically from search history export.
Settings. Example: "lemmatization": "on", "spellchecking": "Suggest". Description: Other settings, including content restrictions (e.g., excluding conference proceedings). Requirement: Optional. Notes: Can be populated automatically from search history export.
Quality assurance. Example: "appears in published protocol". Requirement: Optional. Description: Categorical description of the type of search: "exploratory search", "appears in published protocol", "appears in published review", "peer-reviewed", "validated...

13. Language: this refers to the optional specification of record language that may be specified when searching, not already specified in the search string itself.
14. Settings: this field contains any other optional settings (e.g., lemmatisation or term expansion) not already specified in the search string itself.
15. Quality assurance: this refers to the type of quality assurance provided to the search strategy and may be one of the

| Suggested file format
The data structure proposed above could be encoded within a standardised file type, for example, a JavaScript Object Notation (JSON) file. JSON files lend themselves well to this form of data structure for several reasons, including that: these files are specifically designed for transmitting information between software and over the internet; the file contents are coding language independent, self-explanatory, and readily understandable by human and machine (Wehner et al., 2014); data structures are nested and hierarchical, meaning that a single field can contain further two-dimensional datasets (e.g., a single field labelled 'authors' can contain multiple sub-fields for 'author names', 'emails' and 'affiliations').
A proposal for a data structure for a JSON file for search histories need only contain a set of standard field labels and a specification of which fields contain subfields (nested data). We suggest this structure in Table 1 and provide an example file text in JSON format in Box 1. If these labels and structure were to be adopted across platforms and software, search histories could be shared and reused digitally without impediment.
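A tool adopting such a structure might check incoming records against the agreed field labels before accepting them. The sketch below is one possible approach and covers only fields that can be recovered from the Table 1 excerpt in this text; the top-level key spellings (`authors`, `language`, `settings`, `quality_assurance`), the two constant sets and the function `check_record` are assumptions, and a real implementation would need the full list of items.

```python
# Field labels assumed for illustration; only "authors" is known to be
# compulsory from the Table 1 excerpt in this text.
COMPULSORY_FIELDS = {"authors"}
OPTIONAL_FIELDS = {"language", "settings", "quality_assurance"}


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in a search record (empty if none)."""
    problems = []
    present = set(record)
    for name in sorted(COMPULSORY_FIELDS - present):
        problems.append(f"missing compulsory field: {name}")
    for name in sorted(present - COMPULSORY_FIELDS - OPTIONAL_FIELDS):
        problems.append(f"unrecognised field: {name}")
    # Within the nested authors field, names are compulsory while
    # ORCID identifiers and emails are optional.
    for i, author in enumerate(record.get("authors", [])):
        if not author.get("name"):
            problems.append(f"authors[{i}]: name is compulsory")
    return problems


good = {
    "authors": [{"name": "Author, A", "ORCID": "0000-0000-0000-0000"}],
    "language": "en",
    "settings": {"lemmatization": "on"},
}
bad = {
    "authors": [{"email": "b.author@example.org"}],
    "langauge": "en",  # misspelt label, reported as unrecognised
}

print(check_record(good))
print(check_record(bad))
```

Because the labels are shared, the same check could run unchanged in any tool that imports or exports search histories.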

| CONCLUSIONS
Searching for information is arguably the most important step of any evidence synthesis, since it must be conducted in a way so as to maximise comprehensiveness and minimise bias in the returned set of final search results (Rethlefsen et al., 2021). To be sure that searches have been performed correctly, it is necessary for review authors to accurately and completely report their search activities. Bibliographic database searching is a cornerstone in the vast majority of evidence syntheses, contributing the majority of evidence in most reviews (Haddaway & Westgate, 2019). To date, however, most evidence syntheses do not report their searches in sufficient detail to allow repeatability or evaluation (Abbott et al., 2022; de Kock et al., 2021; Koffel & Rethlefsen, 2016; Maggio et al., 2011; Mullins et al., 2014; Yoshii et al., 2009). Therefore, there is clearly a need for efforts to improve the reporting of evidence synthesis search strategies, particularly for bibliographic database searching.
Reporting standards for evidence syntheses are necessary and important, but alone may be insufficient to encourage authors to report searches in a fully transparent and repeatable manner. Given the rapid increase in publication of evidence syntheses (demonstrated by the recent explosion of systematic reviews on COVID-19; Abbott et al., 2022; Dotto et al., 2021), there is an urgent need to improve reporting of these reviews.
As tools develop, novel access points show promise in increasing the transparency and replicability of searches, namely APIs (application programming interfaces), which allow a query to be sent and the results from a database/platform search to be received within a programme or software (e.g., Scopus API in R [Muschelli, 2019]). However, given that many platforms provide different contents depending on subscription, and that this information is not coded within an API query, API code alone is insufficiently transparent and replicable: a data standard is still required.
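The gap between an API query and a repeatable search record can be sketched as follows. The example does not call any real API; it only wraps a query string with the contextual information that, as argued above, the query itself does not carry. The function `document_api_search`, every field label, and the platform, database and year values are hypothetical.

```python
import json
from datetime import datetime, timezone


def document_api_search(query, platform, databases, coverage_years, run_at=None):
    """Bundle an API query with the context needed to repeat it:
    where it ran, over which databases and year ranges, and when."""
    run_at = run_at or datetime.now(timezone.utc)
    return {
        "search_string": query,
        "access_method": "API",
        "platform": platform,
        "databases": list(databases),
        # Year ranges depend on the institutional subscription and are
        # not part of the query, so they are recorded explicitly.
        "coverage_years": {db: list(years) for db, years in coverage_years.items()},
        "search_date": run_at.isoformat(),
    }


entry = document_api_search(
    query='TITLE-ABS-KEY(forest* AND biodiversity)',  # hypothetical query
    platform="Platform X",
    databases=["Database A"],
    coverage_years={"Database A": (1990, 2022)},
    run_at=datetime(2022, 1, 31, 12, 0, tzinfo=timezone.utc),
)

print(json.dumps(entry, indent=2))
```

Two users sending the identical query from institutions with different subscriptions would produce records that differ in `coverage_years`, making the difference visible to anyone trying to repeat the search.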
Here, we have proposed a set of fields and a file type for reporting bibliographic search strategies in a transparent and repeatable manner. We believe this standard will support: greater transparency and repeatability; a reduction in typographical errors; greater, more efficient and more accurate reuse of search strategies; development of repositories of search strategies for clearer sharing and crediting of searches; learning and awareness raising about the nuance of search strategy reporting; and better acknowledgement and crediting of search specialists.
We hope that developers of review management tools and search strategy repositories will employ this data structure to assist in transparent reporting of search histories by their users.
We suggest the standardised data file in JSON format may be a useful interoperable format to employ. We encourage the community to extend and develop the data standard as necessary. We hope that keen systematic reviewers may also use the data structure already in their own search strategy reporting where they do not have a suitable repository or review management tool. We believe adoption of this standard by tool developers would be a modest investment in the rigour of systematic reviews broadly. Such integration in easy-to-use and particularly Open Source tools and repositories would allow users to report searches transparently with minimum effort. Furthermore, we would hope that a search history repository (such as www.searchrxiv.org) that properly gave credit for the hard work of librarians in designing and conducting searches would drastically reduce barriers to uptake of this data structure.
Finally, we call for further discussion of these data standards and adoption of a standardised file type for ensuring transparency, consistency, and interoperability of academic search histories. HADDAWAY ET AL.