WO2023096833A1 - Intelligent and focused searching and mining of clinical trial data

Info

Publication number: WO2023096833A1
Application number: PCT/US2022/050405
Authority: WO
Inventors: Benjamin Pamandanan FERNANDEZ; Kai Liu; Mehek MOHAN; Vignesh PRABHAKAR; Tristan Renbourne TAGER; Joseph George WAITE; Daniel Minha CHUN; Jennifer Renee CRAWFORD
Original assignee: Genentech, Inc.
Priority date: 2021-11-24
Filing date: 2022-11-18
Publication date: 2023-06-01

Abstract

The present disclosure relates to techniques for intelligent and focused searching and mining of clinical trial data. Particularly, aspects are directed to receiving, by a query processing system, a query that is at least in part a natural language query received based on user input at a computing device. The query processing system identifies scope data within the query, protocol documents that match the scope data, and key words within the query. The query processing system determines a pipeline for processing the query using the key words. The query processing system executes the pipeline by inputting the query and the protocol documents into the pipeline, executing a set of rules and/or model associations of the pipeline on the protocol documents using data from the query, and obtaining search results. The query processing system provides the search results as an answer to the query for presentation at the computing device.

Description

INTELLIGENT AND FOCUSED SEARCHING AND MINING OF CLINICAL TRIAL DATA

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of and priority to U.S. Provisional Application No. 63/282,749, filed on November 24, 2021, which is hereby incorporated by reference in its entirety for all purposes.

FIELD

[0002] The present disclosure relates to clinical trial data management, and in particular to intelligent and focused searching and mining of clinical trial data.

BACKGROUND

[0003] In healthcare, clinical trials help answer questions related to medical treatments and interventions. For example, clinical trials for new drug treatments determine efficacy, safety, dosage, side effects, and other information related to using a drug treatment. Subjects enroll in a clinical trial and have health data collected throughout the clinical trial; analysis of the clinical trial data determines its outcome. A protocol document is created to outline how the clinical trial is to be conducted. For example, the protocol document can include study objectives, design, methods, assessment types, collection schedules, statistical considerations for analyzing the data, and steps for protecting subjects and obtaining quality data. Portions of a protocol document may be usable in multiple clinical trials. But many users overlook the huge amount of data available within a protocol document that can be used in downstream processes, for example, to examine enrollment rates for trials with similar inclusion/exclusion criteria, dropout rates in studies with specific clinical procedures, the most common reasons for protocol amendments, and how commonly a particular procedure is used per therapeutic area.

SUMMARY

[0004] In various embodiments, a computer-implemented method is provided that involves receiving, by a query processing system, a query that is at least in part a natural language query received based on user input at a computing device. The query may further be at least in part defined by predefined query parameters. The method further involves identifying, by the query processing system, scope data within the query and identifying, using a full text search, protocol documents within structured clinical trial data that match the scope data. The query processing system identifies key words within the query and determines, using the key words, a pipeline from a plurality of pipelines for processing the query. The plurality of pipelines are alternative sets of rules and/or model associations that are used to process the query. The method further involves executing, by the query processing system, the pipeline, which involves inputting the query and the protocol documents into the pipeline, executing a set of rules and/or model associations of the pipeline on the protocol documents using data from the query, and obtaining search results based on executing the sets of rules and/or the model associations. The search results are provided as an answer to the query for presentation at the computing device.

[0005] In some embodiments, identifying the scope data involves (i) identifying drug names, protocol names, or a combination thereof within the natural language query, (ii) receiving a selection of the predefined query parameters from one or more drop down boxes comprising a plurality of predefined query parameters, or (iii) a combination thereof.

[0006] In some embodiments, identifying the scope data further involves identifying additional predefined query parameters associated with the predefined query parameters based on the selection of the predefined query parameters from the one or more drop down boxes.

[0007] In some embodiments, identifying the key words involves (i) removing the scope data from the natural language query to generate a modified natural language query, and (ii) identifying, using a full text search, words or roots of the words from the modified natural language query in a predefined data structure comprising the key words.

[0008] In some embodiments, determining the pipeline involves determining the pipeline from the predefined data structure based on a mapping between the key words and the pipeline.

[0009] In some embodiments, the plurality of pipelines include an inclusion or exclusion pipeline, a tabular pipeline, a linguistics pipeline, and a substring pipeline.

[0010] In some embodiments, the key words include inclusion or exclusion characteristics, and the pipeline is determined to be the inclusion or exclusion pipeline based on the inclusion or exclusion characteristics.

[0011] In some embodiments, executing the set of rules and/or model associations involves determining, using a named entity recognition model, one or more entity elements from the natural language query, retrieving, using a full text search, a first subset of the search results comprising the one or more entity elements that occur in content of the protocol documents based on the inclusion or exclusion characteristics, and retrieving, using a knowledge graph comprising entities and associated medical data, a second subset of the search results comprising medical data, associated with the one or more entity elements, that occur in the content of the protocol documents based on the inclusion or exclusion characteristics.

[0012] In some embodiments, the knowledge graph includes the entities and the associated medical data as a hierarchical data structure, and the knowledge graph is used to identify protocol documents with medical data of a lower or higher level than the one or more entity elements.

[0013] In some embodiments, providing the search results involves displaying, by the query processing system on the computing device, sub portions of each protocol document from the first subset of search results and the second subset of search results.

[0014] In some embodiments, each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further involves receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context for the inclusion or exclusion characteristics.

[0015] In some embodiments, the method further involves receiving, by the query processing system, user feedback on the search results, and retraining the named entity recognition model based on the user feedback.

[0016] In some embodiments, the key words include drug-related information, and the pipeline is determined to be the linguistics pipeline based on the drug-related information.

[0017] In some embodiments, executing the set of rules and/or model associations involves determining, using a neural network-based model for natural language processing, vector representations for each word in the query based on context, calculating, using the neural network-based model, an embedding for the query based on the vector representations for each word in the query, retrieving, using a semantic search, the search results from the protocol documents based on the embedding for the query, and generating a natural language answer to the query based on the search results and the embedding for the query. Generating the natural language answer to the query involves including the scope data, from the query, in the natural language answer and including one or more relevant terms associated with the scope data that occur in the search results or are derived from the search results, in the natural language answer.

[0018] In some embodiments, providing the search results comprises displaying, by the query processing system on the computing device, the natural language answer and sub portions of each protocol document within the search results that support the natural language answer.

[0019] In some embodiments, each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further involves receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context and support for the natural language answer.

[0020] In some embodiments, the method further involves receiving, by the query processing system, user feedback on the search results, and retraining the neural network-based model based on the user feedback.

[0021] In some embodiments, the key words include trial visit characteristics, and the pipeline is determined to be the tabular pipeline based on the trial visit characteristics.

[0022] In some embodiments, executing the set of rules and/or model associations involves retrieving, using a full text search, the search results from tables within the protocol documents based on the query and generating a natural language answer to the query based on the search results and the query. Generating the natural language answer to the query involves including the scope data, from the query, in the natural language answer and including one or more relevant terms associated with the scope data that occur in the search results or are derived from the search results, in the natural language answer.

[0023] In some embodiments, providing the search results to the query involves displaying, by the query processing system on the computing device, the natural language answer and sub portions of the tables in each protocol document within the search results that support the natural language answer.

[0024] In some embodiments, displaying the sub portions of the tables involves highlighting rows and columns that support the natural language answer and embedding the rows and/or columns with additional discoverable information.

[0025] In some embodiments, each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further comprises receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context and support for the natural language answer.

[0026] In some embodiments, the key words include words associated with a particular character, and the pipeline is determined to be the substring pipeline based on the words associated with the particular character.

[0027] In some embodiments, executing the set of rules and/or model associations involves parsing the query into substrings and retrieving, using a substring search, the search results from the protocol documents based on the substrings in the query.

[0028] In some embodiments, providing the search results to the query involves displaying, by the query processing system on the computing device, sub portions of each protocol document within the search results that include the substring.

[0029] In some embodiments, each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further comprises receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context for the substring.

[0030] In some embodiments, the method further involves generating a new protocol document based on the search results. Generating the new protocol document involves authoring a portion of the new protocol document using information obtained through the search results or text obtained within the search results.

[0031] In some embodiments, a system is provided that includes one or more data processors and a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein.

[0032] In some embodiments, a computer-program product is provided that is tangibly embodied in a non-transitory machine-readable storage medium and that includes instructions configured to cause one or more data processors to perform part or all of one or more methods disclosed herein.

[0033] Some embodiments of the present disclosure include a system including one or more data processors. In some embodiments, the system includes a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein. Some embodiments of the present disclosure include a computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein.

[0034] The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention as claimed has been specifically disclosed by embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] The present disclosure is described in conjunction with the appended figures:

[0036] FIG. 1 depicts a block diagram of a computing environment for intelligent and focused searching and mining of clinical trial data according to various embodiments;

[0037] FIG. 2 depicts a flowchart of an example of a process for searching and mining clinical trial data according to various embodiments;

[0038] FIG. 3 depicts a flowchart of a process for using an inclusion and exclusion pipeline according to various embodiments;

[0039] FIGS. 4A and 4B depict examples of a user interface for executing an inclusion and exclusion pipeline according to various embodiments;

[0040] FIG. 5 depicts a flowchart of a process for using a linguistics pipeline according to various embodiments;

[0041] FIG. 6 depicts a flowchart illustrating a process for using a tabular pipeline according to various embodiments;

[0042] FIGS. 7A-7D depict examples of a user interface for executing a tabular pipeline according to various embodiments;

[0043] FIG. 8 depicts a flowchart illustrating a process for using a substring pipeline according to various embodiments; and

[0044] FIG. 9 depicts a block diagram of another computing environment for intelligent and focused searching and mining of clinical trial data according to various embodiments.

[0045] In the appended figures, similar components and/or features can have the same reference label. Further, various components of the same type can be distinguished by following the reference label by a dash and a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.

DETAILED DESCRIPTION

I. Overview

[0046] The present disclosure describes techniques for searching and retrieving clinical trial information from structured clinical trial data. More specifically, embodiments of the present disclosure provide techniques for receiving, by a query processing system, a query regarding clinical trial information and generating a search result based on an execution of various pipelines.

[0047] Protocol documents for various purposes, such as a clinical trial, may be difficult to analyze when evaluating a clinical protocol for present clinical trials, and may be difficult to reuse when developing a clinical protocol for future clinical trials. For example, since clinical trials may differ from one another only by minor changes, it can be tedious and time consuming for a clinical trial coordinator to generate a new protocol document for each clinical trial. In any given section of a protocol document, the information may be substantially different protocol to protocol, but the overall structure of the protocol document may not shift much. However, the content of one section being similar to the content of that section in one protocol (‘A’) does not necessarily mean that the content of another section is similar to the content of that other section in protocol (‘A’); for that other section, another protocol (‘B’) may have comparatively more similar content. As such, traditional organization and search strategies of grouping past-used protocols together by disease area, treatment type, hospital, subject age, or other characteristics may not provide a meaningful benefit in identifying related protocol documents and/or constructing new protocol documents. It may also be time consuming for the clinical trial coordinator to manually search the protocol documents for a particular piece of data.

[0048] To address these limitations and problems, the techniques for clinical trial health data searching and mining in the present disclosure utilize an intelligent and focused approach to generate a search result for a query about a clinical trial. This technique is intended to receive, by a query processing system, a query, identify scope data and key words within the query, and execute a pipeline associated with the key words to generate an answer to the query. The scope data may be drug categories, drug names, protocol categories, or protocol names indicated within the query. The key words may be associated with a pipeline of a set of rules and/or model associations. For example, there may be a tabular pipeline, a linguistics pipeline, an inclusion or exclusion pipeline, and a substring pipeline. The search results can provide the answer to the query. Some or all of the search results may then be easily analyzed or reused in a new protocol document.

[0049] One illustrative embodiment of the present disclosure is directed to a method that includes receiving, by a query processing system, a query. The query is at least in part a natural language query that is received based on user input at a computing device. The method further includes identifying, by the query processing system, scope data within the query and identifying, using a full text search, protocol documents within structured clinical trial data that match the scope data. The method further includes identifying, by the query processing system, key words within the query, and determining, by the query processing system using the key words, a pipeline from multiple pipelines for processing the query. Each of the pipelines is an alternative set of rules and/or model associations that is used to process the query. The method further includes executing, by the query processing system, the pipeline by inputting the query and the protocol documents into the pipeline, executing a set of rules and/or model associations of the pipeline on the protocol documents using data from the query, and obtaining search results based on executing the sets of rules and/or the model associations. The method further includes providing the search results as an answer to the query for presentation at the computing device.

[0050] In some instances, the search results include hyperlinks to each protocol document included in the search results. A user may select a hyperlink, and in response, the query processing system can display an entire protocol document associated with the hyperlink. Additionally or alternatively, the method may include generating a new protocol document based on the search results. The query processing system, or an authoring tool in communication with the query processing system, can author a portion of the new protocol document using information obtained through the search results or text obtained within the search results. This can reduce manual effort involved in authoring a new protocol document.

II. Example Computing Environment

[0051] FIG. 1 depicts a block diagram of a computing environment 100 for intelligent and focused searching and mining of clinical trial data according to various embodiments. In the illustrated embodiment, computing environment 100 receives and processes a query 105 using a query processing system 110. The query 105 can be a natural language query that is based on user input at a computing device in communication with the query processing system 110. For example, the query 105 may be a question related to inclusion and exclusion criteria, a drug, a protocol document, or any other information about a clinical trial. The query 105 may be further based on predefined query parameters. For example, the user may select a drug or protocol document from a list of predefined drugs or protocol documents, respectively. The query processing system 110 processes the query 105 to generate search results 150, which may reside on a computing system that is local to or remote from the query processing system 110. The components of the computing environment 100 communicate via a communication network. The communication network can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of available protocols, including without limitation TCP/IP (transmission control protocol/Internet protocol), SNA (systems network architecture), IPX (Internet packet exchange), AppleTalk®, and the like. Merely by way of example, network(s) can be a local area network (LAN), networks based on Ethernet, Token-Ring, a wide-area network (WAN), the Internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, a public switched telephone network (PSTN), an infra-red network, a wireless network (e.g., a network operating under any of the Institute of Electrical and Electronics Engineers (IEEE) 802.11 suite of protocols, Bluetooth®, and/or any other wireless protocol), and/or any combination of these and/or other networks.

[0052] A scope detector 115 of the query processing system 110 identifies scope data 120 within the query 105. The scope data 120 defines an area or subject matter of the query 105. The scope detector 115 can identify therapeutics, diseases, drug names, protocol names, or a combination thereof within the query 105 to identify the scope data 120. Scope data 120 includes the drug names and the protocol names. The scope data 120 may additionally be broader than a drug name or protocol name, such as a drug category or protocol category. As an example, the scope data 120 may be an anti-nausea drug category. The scope detector 115 identifies the scope data 120 within the query 105 based on identifying diseases, therapeutics, drug names, protocol names, or a combination thereof within the natural language query, based on a selection of the predefined query parameters from one or more drop down boxes, or based on a combination thereof.

[0053] The query processing system 110 uses a full text search to identify protocol documents 130 within structured clinical trial data 125 that match the scope data 120. The structured clinical trial data 125 can include protocol documents and other documents such as case report forms for many clinical trials. The documents are structured, meaning the information is organized in such a way that it is searchable and retrievable. A protocol document describes how the clinical trial is to be conducted. For example, the protocol document can include objectives, a design, methodology, and organization of the clinical trial. The full text search of the structured clinical trial data 125 identifies the protocol documents 130 that match the scope data 120. For example, the query 105 may be associated with scope data 120 of the drug penicillin, and the query processing system 110 can identify five protocol documents in the structured clinical trial data 125 that are associated with penicillin.
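The matching of protocol documents to scope data described above can be illustrated with a minimal sketch. The disclosure does not specify a search engine or a data model, so the `ProtocolDocument` class, the `match_protocols` function, the sample documents, and the rule that a document must contain every scope term are all assumptions made for illustration; a practical system would more likely use an inverted index than a linear scan.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProtocolDocument:
    """Hypothetical stand-in for one protocol document in the structured data."""
    doc_id: str
    text: str


def match_protocols(scope_terms: List[str],
                    documents: List[ProtocolDocument]) -> List[ProtocolDocument]:
    """Return documents whose text contains every scope term, ignoring case.

    Requiring all terms (rather than any term) is an assumption of this sketch.
    """
    lowered = [term.lower() for term in scope_terms]
    return [doc for doc in documents
            if all(term in doc.text.lower() for term in lowered)]


# Hypothetical sample data, not taken from the disclosure.
docs = [
    ProtocolDocument("P-001", "Subjects receiving penicillin are excluded."),
    ProtocolDocument("P-002", "Visit schedule for the ondansetron arm."),
]
matched = match_protocols(["penicillin"], docs)
```

Here `matched` holds only the document that mentions the scope term, which then becomes the input to whichever pipeline is selected.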

[0054] In some instances, a key word detector 135 of the query processing system 110 identifies key words 140 within the query 105. The key words 140 can be words in the query 105 other than the scope data 120 that provide a context for the query 105 with respect to the scope data 120. The key words 140 are words related to information available in the protocol documents 130. For example, the key words 140 include inclusion or exclusion criteria, trial visit characteristics, drug-related information, or words associated with a particular character. The drug-related information may be contextual terms related to drugs and exclude drug names and drug categories. For example, the drug-related information may be terms related to how a drug works, a chemical composition of a drug, and the like. To identify the key words 140, the key word detector 135 removes the scope data 120 from the natural language query to generate a modified natural language query. The key word detector 135 then uses a full text search to identify words and roots of the words from the modified natural language query in a predefined data structure that includes the key words. For example, the natural language query of the query 105 may be “Is a patient taking penicillin excluded?” The scope data 120 can be for the drug penicillin, which can be removed, resulting in a modified natural language query of “Is a patient excluded?” Based on a search using the modified natural language query, the key word detector 135 can determine that “exclude”, the root word of “excluded”, is a key word in the predefined data structure.
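As a rough illustration of the key word detection just described, the sketch below removes scope terms from a query and then looks for known roots in the remaining words. The contents of `KEY_WORD_ROOTS`, the function names, and the use of prefix matching as a stand-in for root-word (stem) lookup are assumptions, not details from the disclosure. Note also that this sketch removes only the scope term itself, so the word "taking" remains in the modified query, unlike in the example above.

```python
import re
from typing import List

# Hypothetical predefined data structure holding key word roots.
KEY_WORD_ROOTS = {"exclude", "include", "visit"}


def remove_scope(query: str, scope_terms: List[str]) -> str:
    """Delete each scope term from the query and collapse leftover whitespace."""
    for term in scope_terms:
        query = re.sub(rf"\b{re.escape(term)}\b", " ", query, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", query).strip()


def find_key_words(modified_query: str) -> List[str]:
    """Return roots found in the query, in order of first appearance.

    A word matches a root when it starts with that root; this is a crude
    substitute for real stemming and is an assumption of the sketch.
    """
    found: List[str] = []
    for word in re.findall(r"[a-z]+", modified_query.lower()):
        for root in sorted(KEY_WORD_ROOTS):
            if word.startswith(root) and root not in found:
                found.append(root)
    return found


modified = remove_scope("Is a patient taking penicillin excluded?", ["penicillin"])
key_words = find_key_words(modified)
```

With the sample roots above, the query about penicillin yields the single key word root "exclude".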

[0055] Using the key words 140, the query processing system 110 determines a pipeline from multiple pipelines 145(1-N) for processing the query 105. Each of the pipelines 145(1-N) can be an alternate set of rules and/or model associations that is used to process the query 105. Each of the key words 140 is mapped to a pipeline in the predefined data structure, so the query processing system 110 can determine the pipeline based on the mapping. The pipelines 145(1-N) may include an inclusion or exclusion pipeline, a tabular pipeline, a linguistics pipeline, and a substring pipeline. The inclusion or exclusion pipeline can be mapped to key words for inclusion or exclusion criteria, the tabular pipeline can be mapped to key words for trial visit characteristics, the linguistics pipeline can be mapped to key words for drug-related information, and the substring pipeline can be mapped to key words associated with the particular character. Each of these pipelines is described further below.
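The key-word-to-pipeline mapping can be pictured as a simple lookup table. In the sketch below, the table entries, the pipeline name strings, and the `select_pipeline` function are hypothetical. The disclosure does not say what happens when key words map to different pipelines, so the majority vote used here is purely an assumption. The substring pipeline is left out because the text does not give concrete key words for it.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical mapping from key word roots to pipeline names.
KEY_WORD_TO_PIPELINE = {
    "exclude": "inclusion_exclusion",
    "include": "inclusion_exclusion",
    "visit": "tabular",
    "mechanism": "linguistics",
}


def select_pipeline(key_words: List[str],
                    default: Optional[str] = None) -> Optional[str]:
    """Pick the pipeline that most key words map to.

    Majority voting is an assumption of this sketch; ties go to the
    pipeline whose key word appears first in the input.
    """
    votes = Counter(KEY_WORD_TO_PIPELINE[word]
                    for word in key_words if word in KEY_WORD_TO_PIPELINE)
    if not votes:
        return default
    return votes.most_common(1)[0][0]
```

For instance, `select_pipeline(["exclude"])` returns the inclusion or exclusion pipeline name, and an empty key word list falls back to the `default` argument.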

[0056] In some instances, the query processing system 110 executes the pipeline determined using the key words 140. For example, the query processing system 110 can determine the key words 140 correspond to the pipeline 145(1) and execute the pipeline 145(1). Executing the pipeline 145(1) involves inputting the query 105 and the protocol documents 130 into the pipeline 145(1). Then, the query processing system 110 executes a set of rules and/or model associations of the pipeline 145(1) using data from the query 105. Search results 150 are obtained based on executing the set of rules and/or the model associations. The search results 150 are provided as an answer to the query 105 for presentation at the computing device of the user. For example, for the query 105 of “Is a patient taking penicillin excluded?” with a particular protocol document identified, the search results 150 may indicate whether a patient taking penicillin is included in the clinical trial or not. Sub portions of each of the protocol documents 130 within the search results 150 may be displayed on the computing device. Each search result may include a hyperlink having a uniform resource identifier to each protocol document. The user may provide input to the query processing system 110 of a selection of a hyperlink for a search result and the query processing system 110 can display an entire protocol document associated with the search result on the computing device.
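One way to picture pipeline execution is as a dispatch to a function that takes the query terms and the matched protocol documents and returns search results, each carrying a sub portion of the document and a link to the full document. The sketch below shows this for a simplified substring search only. The `SearchResult` fields, the `PIPELINES` registry, the 20-character context window, and the `https://example.invalid/...` link format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchResult:
    """Hypothetical search result: a document sub portion plus a link."""
    doc_id: str
    snippet: str
    uri: str


def substring_pipeline(query_terms: List[str],
                       documents: Dict[str, str]) -> List[SearchResult]:
    """Return one result per document containing any query term as a substring."""
    results = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        for term in query_terms:
            pos = lowered.find(term.lower())
            if pos >= 0:
                # Keep up to 20 characters of context on each side of the match.
                start = max(0, pos - 20)
                end = min(len(text), pos + len(term) + 20)
                uri = f"https://example.invalid/protocols/{doc_id}"  # placeholder
                results.append(SearchResult(doc_id, text[start:end], uri))
                break
    return results


# Hypothetical registry of pipeline names to callables.
PIPELINES: Dict[str, Callable[[List[str], Dict[str, str]], List[SearchResult]]] = {
    "substring": substring_pipeline,
}


def execute(pipeline_name: str, query_terms: List[str],
            documents: Dict[str, str]) -> List[SearchResult]:
    """Run the selected pipeline on the matched protocol documents."""
    return PIPELINES[pipeline_name](query_terms, documents)


results = execute("substring", ["penicillin"], {
    "P-001": "Patients taking penicillin are excluded from enrollment.",
    "P-002": "No antibiotic restrictions apply.",
})
```

Selecting the link in a result would correspond to displaying the entire protocol document for context.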

[0057] In some instances, the query processing system 110 can generate a new protocol document based on the search results 150. The query processing system 110 can author a portion of the new protocol document using information obtained through the search results 150 or text obtained within the search results 150. For example, a user may be creating a new clinical trial and may provide the query 105 to determine aspects of a similar clinical trial. The user can then indicate to the query processing system 110 that information in the search results 150 is to be included in a new protocol document for the new clinical trial. In response, the query processing system 110 can include the information in the new protocol document.

III. Example Methods of Pipelines

[0058] FIGS. 2 and 3 illustrate processes for searching clinical trial data. The processes depicted in FIGS. 2 and 3 are implemented by the architecture, systems, and techniques depicted in FIG. 1. Individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

[0059] The processes and/or operations depicted in FIGS. 2 and 3 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processor cores), hardware, or combinations thereof. The software may be stored in a memory (e.g., on a memory device, on a non-transitory computer-readable storage medium). The particular series of processing steps in FIGS. 2 and 3 is not intended to be limiting. Other sequences of steps may also be performed according to alternative embodiments. For example, in alternative embodiments the steps outlined above may be performed in a different order. Moreover, the individual steps illustrated in FIGS. 2 and 3 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.

[0060] FIG. 2 depicts a flowchart of an example of a process 200 for searching and mining clinical trial data according to various embodiments. The clinical trial data is searched by a query processing system (e.g., the query processing system 110 in FIG. 1).

[0061] At block 205, a query 105 is received. The query processing system 110 can receive the query 105 based on user input at a computing device. The query 105 can be based on a natural language query and/or the query 105 may be defined by predefined query parameters (e.g., selecting predefined query parameters via radio buttons or drop down boxes). In certain instances, the query 105 is a natural language question about one or more clinical trials.

[0062] At block 210, scope data 120 is identified. The scope data 120 is identified by the scope detector 115, which can identify drug categories, therapeutics, diseases, drug names, protocol categories, and/or protocol names within the natural language query of the query 105, or which may receive a selection of the predefined query parameters from one or more drop down boxes. The scope detector 115 may have a list of predefined terms that can be included in the scope data 120. To identify the scope data 120 within the natural language query, the scope detector 115 can determine that a word in the natural language query matches a word in the list of predefined terms. Additionally or alternatively, the drop down boxes for the predefined query parameters may include a list of drug names and a list of protocol documents that can be selected by a user to identify the scope data 120.
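By way of a non-limiting illustration, the term matching at block 210 might be sketched as follows. The term list, the function name `detect_scope`, and the dictionary layout of the scope data are hypothetical and are not taken from the disclosure; an actual scope detector 115 may match terms differently.

```python
import re

# Hypothetical list of predefined scope terms (term -> scope type).
# The disclosure does not enumerate the list; these entries are illustrative.
PREDEFINED_SCOPE_TERMS = {
    "ibuprofen": "drug_name",
    "fenebrutinib": "drug_name",
    "antibiotics": "drug_category",
}


def detect_scope(natural_language_query, selected_parameters=None):
    """Return scope data as a dict mapping each scope term to its type."""
    scope = {}
    # Match each word of the natural language query against the term list.
    for word in re.findall(r"[a-z0-9]+", natural_language_query.lower()):
        if word in PREDEFINED_SCOPE_TERMS:
            scope[word] = PREDEFINED_SCOPE_TERMS[word]
    # Selections made in drop down boxes are taken as scope data directly.
    for kind, value in (selected_parameters or {}).items():
        scope[value] = kind
    return scope
```

In this sketch, a query that names a drug in free text and a query that selects the drug from a drop down box both yield the same kind of scope entry.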

[0063] At block 215, protocol documents 130 are identified using the scope data 120. The query processing system 110 accesses structured clinical trial data 125 and performs a full text search to identify the protocol documents 130 that match the scope data 120. For example, the query processing system 110 can identify protocol documents that are associated with the drug names and/or protocol names of the scope data 120.
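As a minimal sketch of block 215, and assuming the structured clinical trial data 125 can be represented as a list of records, the full text match against the scope data 120 might look like the following. The record fields and the function name `identify_protocol_documents` are hypothetical, and requiring every scope term to match is an assumption; an implementation could equally accept a match on any one term.

```python
def identify_protocol_documents(scope_terms, records):
    """Return the records whose text contains every scope term (case-insensitive)."""
    terms = [term.lower() for term in scope_terms]
    matches = []
    for record in records:
        # Search across all fields of the record, as a simple full text search.
        haystack = " ".join(str(value) for value in record.values()).lower()
        if all(term in haystack for term in terms):
            matches.append(record)
    return matches


# Hypothetical structured records for illustration only.
RECORDS = [
    {"protocol_name": "AA0001:v:1", "drug_name": "ibuprofen"},
    {"protocol_name": "BB0002:v:1", "drug_name": "fenebrutinib"},
]
```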

[0064] At block 220, key words 140 are identified. The key words 140 include inclusion or exclusion criteria, trial visit characteristics, drug-related information, or words associated with a particular character. To identify the key words 140, the key word detector 135 removes the scope data 120 from the natural language query to generate a modified natural language query. The key word detector 135 then uses a full text search to identify words and roots of the words from the modified natural language query in a predefined data structure that includes the key words 140.

[0065] At block 225, a pipeline is determined using the key words 140. The query processing system 110 determines the pipeline from multiple pipelines 145(1-N), each of which is an alternative set of rules and/or model associations that are used to process the query 105. For example, the pipelines 145(1-N) may include an inclusion or exclusion pipeline, a tabular pipeline, a linguistics pipeline, and a substring pipeline. The query processing system 110 determines the pipeline from the predefined data structure based on a mapping between the key words 140 and the pipeline.
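The key word detection and pipeline determination described above might be sketched as follows. This is a simplified illustration under stated assumptions: the mapping `KEY_WORD_TO_PIPELINE`, the pipeline labels, and the naive suffix-stripping `root` helper are hypothetical stand-ins for the predefined data structure and root matching, which the disclosure does not specify in detail.

```python
import re

# Hypothetical predefined data structure mapping key word roots to pipelines.
KEY_WORD_TO_PIPELINE = {
    "exclud": "inclusion_exclusion",
    "includ": "inclusion_exclusion",
    "visit": "tabular",
    "mechanism": "linguistics",
}


def root(word):
    """Naive suffix stripping, used here only as a stand-in for root matching."""
    for suffix in ("ed", "es", "ing", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def detect_key_words(natural_language_query, scope_terms):
    """Remove scope terms, then keep words whose root is a known key word."""
    words = re.findall(r"[a-z]+", natural_language_query.lower())
    modified_query = [w for w in words if w not in scope_terms]
    return [w for w in modified_query if root(w) in KEY_WORD_TO_PIPELINE]


def determine_pipeline(key_words):
    """Return the pipeline mapped to the first recognized key word, if any."""
    for word in key_words:
        return KEY_WORD_TO_PIPELINE[root(word)]
    return None
```

Returning the first mapped pipeline is a design simplification; a query whose key words map to different pipelines would need a tie-breaking rule that this sketch does not attempt.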

[0066] At block 230, the pipeline is executed on the protocol documents 130 using data from the query 105. Executing the pipeline involves inputting the query 105 and the protocol documents 130 into the pipeline. Then, the query processing system 110 executes a set of rules and/or model associations of the pipeline using the data from the query 105. The data may include vector representations of each word in the query 105, full text of the query 105, substrings of the query 105, or entity elements from the natural language query.

[0067] At block 235, search results 150 are provided. The search results 150 are an answer to the query 105 and are obtained based on executing the set of rules and/or model associations of the pipeline. The query processing system 110 can provide the search results 150 for presentation at the computing device. In some instances, the search results 150, or text from the search results 150, are used in authoring a new protocol document.

[0068] FIG. 3 depicts a flowchart of a process 300 for using an inclusion or exclusion pipeline according to various embodiments. The inclusion and exclusion pipeline can be one of the pipelines 145(1-N) that are executed by the query processing system 110. The query processing system 110 may have previously determined the pipeline is the inclusion or exclusion pipeline based on the key words 140 including inclusion or exclusion characteristics.

[0069] At block 305, one or more entity elements are determined from a query 105. The query processing system 110 can use a named entity recognition model to determine the one or more entity elements. Named-entity recognition involves identifying and categorizing entities in natural-language text. The named entity recognition model can be trained to receive the natural language query of the query 105 and identify entity elements within the natural language query. For example, the one or more entity elements may be medically-relevant terms (e.g., terms of height, weight, pregnancy, antibiotics, etc.) within the natural language query of the query 105.

[0070] At block 310, a first subset of search results 150 is retrieved using the one or more entity elements in a full text search. The query processing system 110 searches protocol documents 130 previously determined to be associated with scope data 120 within the query 105 to identify the one or more entity elements. For example, the query 105 may be “are pregnant women included?” for clinical trials associated with a particular drug. The entity element for the query can be “pregnant”, and the pipeline can identify instances in which “pregnant”, or a corresponding word, occurs in content of the protocol documents 130. These instances can be the first subset of the search results 150.
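As one hedged illustration of block 310, the first subset might be retrieved by scanning the criteria of each protocol document for the entity elements. The sketch assumes the entity elements have already been produced by a named entity recognition model, which is not reproduced here, and the data layout and the name `first_subset` are hypothetical.

```python
import re


def first_subset(entity_elements, protocol_documents):
    """Return criteria whose text contains a word starting with an entity element."""
    results = []
    for protocol_name, criteria in protocol_documents.items():
        for criterion_type, text in criteria:
            lowered = text.lower()
            # Match at a word boundary so "pregnant" does not match mid-word.
            if any(
                re.search(r"\b" + re.escape(entity.lower()), lowered)
                for entity in entity_elements
            ):
                results.append(
                    {"protocol": protocol_name, "type": criterion_type, "text": text}
                )
    return results


# Hypothetical protocol content for illustration only.
DOCUMENTS = {
    "AA0001:v:1": [
        ("exclusion", "Pregnant or breastfeeding women"),
        ("inclusion", "Age 18 years or older"),
    ]
}
```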

[0071] At block 315, a second subset of the search results 150 is retrieved using the one or more entity elements and a knowledge graph. The second subset of the search results 150 can include medical data associated with the one or more entity elements that occur in the content of the protocol documents 130 based on the inclusion or exclusion characteristics. The knowledge graph can include entities and associated medical data as a hierarchical data structure. For example, therapeutic areas and drug categories may be at the highest level of the knowledge graph, and each of the therapeutic areas and drug categories can connect to lower levels with related terms of a higher specificity. Each branch can further have additional connections to other related branches of the knowledge graph. The knowledge graph may be used to identify protocol documents with medical data of a lower or higher level than the one or more entity elements. The query processing system 110 can search the knowledge graph to find the scope data 120, determine related and unrelated medical data in the context of the protocol documents 130, and provide the search results 150 based on the determination of the related and unrelated medical data. As an example, the query 105 may be “is a patient taking penicillin excluded?”. The query processing system 110 can determine from the protocol documents 130 that a patient taking antibiotics is excluded. The pipeline can then determine, based on the knowledge graph, that penicillin is an instance of antibiotics, so a patient taking penicillin is excluded.
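The penicillin example above can be illustrated with a small sketch in which the knowledge graph is reduced to a child-to-parent mapping. The graph contents, the extra "anti-infectives" level, and the function names are assumptions made for illustration; the disclosed knowledge graph may also carry cross-links between branches that this sketch omits.

```python
# Hypothetical hierarchical knowledge graph, stored as child -> parent.
KNOWLEDGE_GRAPH = {
    "penicillin": "antibiotics",
    "amoxicillin": "antibiotics",
    "antibiotics": "anti-infectives",
}


def broader_terms(term):
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    seen = set()
    while term in KNOWLEDGE_GRAPH and term not in seen:
        seen.add(term)  # guard against accidental cycles in the graph
        term = KNOWLEDGE_GRAPH[term]
        chain.append(term)
    return chain


def second_subset(entity_element, criteria):
    """Return criteria that mention a broader term of the entity element."""
    hits = []
    for broader in broader_terms(entity_element):
        for criterion_type, text in criteria:
            if broader in text.lower():
                hits.append((criterion_type, text, broader))
    return hits
```

With a criterion such as "Patients taking antibiotics within 14 days of screening" tagged as an exclusion, `second_subset("penicillin", ...)` returns that criterion because "antibiotics" is a broader term of "penicillin".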

[0072] The query processing system 110 can display the first subset and the second subset of the search results 150 at a computing device of a user that provided the query 105. Sub portions of each protocol document from the first subset of the search results 150 and the second subset of the search results 150 may be displayed. The sub portions can correspond to the relevant portions of the protocol documents, such as the particular text referring to an inclusion or an exclusion of a particular patient group. Each search result can include a hyperlink having a uniform resource identifier to each protocol document. The query processing system 110 can receive input from the user regarding selection of a hyperlink for a search result and display an entire protocol document associated with the search result to provide context for the inclusion or exclusion characteristics. The query processing system 110 may additionally receive feedback on the search results 150 and retrain the named entity recognition model based on the user feedback. For example, repeated negative feedback for a particular search result may trigger the query processing system 110 to retrain the named entity recognition model.

[0073] FIGS. 4A and 4B depict examples of a user interface 400 for executing an inclusion and exclusion pipeline according to various embodiments. In FIG. 4A, a query processing system (e.g., query processing system 110 in FIG. 1) receives a query 405 (e.g., query 105 in FIG. 1) of “is a patient with urticaria excluded” with selections of a primary drug of ibuprofen and a protocol document named AA0001:v:1. The “is a patient with urticaria excluded” can be a natural language query, and the drug name and protocol document name can be predefined query parameters since the drug name and protocol document name were selected from a drop down menu. The query processing system 110 can determine that scope data 410 (e.g., scope data 120 in FIG. 1) for the query 405 is the drug ibuprofen, and the protocol document AA0001:v:1. The query processing system 110 can also determine that key words 415 within the query 405 are “urticaria” and “excluded”, and that the “excluded” key word corresponds to an inclusion or exclusion characteristic. As a result, the query processing system 110 executes an inclusion or exclusion pipeline on the protocol document for the query 405.

[0074] In FIG. 4B, search results 420 (e.g., search results 150 in FIG. 1) for the query 405 are displayed at the user interface 400. The search results 420 are three criteria from the inclusion or exclusion criteria found by the inclusion or exclusion pipeline in the protocol document. Each search result includes an indication of a type of criterion (inclusion or exclusion), text associated with the criterion, and a name of the protocol document associated with the criterion. Since the protocol document was specified in the query 405, the protocol document name is the same for each search result. A user may be able to select the protocol name to access the entire protocol document. Additionally, a user can provide feedback, by selecting either a thumbs up or a thumbs down button, to indicate whether the search result is accurate in the context of the query 405.

[0075] FIGS. 5 and 6 illustrate additional processes for searching clinical trial data. The processes depicted in FIGS. 5 and 6 are implemented by the architecture, systems, and techniques depicted in FIG. 1. Individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

[0076] The processes and/or operations depicted in FIGS. 5 and 6 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processor cores), hardware, or combinations thereof. The software may be stored in a memory (e.g., on a memory device, on a non-transitory computer-readable storage medium). The particular series of processing steps in FIGS. 5 and 6 is not intended to be limiting. Other sequences of steps may also be performed according to alternative embodiments. For example, in alternative embodiments the steps outlined above may be performed in a different order. Moreover, the individual steps illustrated in FIGS. 5 and 6 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.

[0077] FIG. 5 depicts a flowchart of a process 500 for using a linguistics pipeline according to various embodiments. The linguistics pipeline is one of the pipelines 145(1-N) that can be executed by the query processing system 110. The query processing system 110 may have previously determined the pipeline is the linguistics pipeline based on the key words 140 including drug-related information. For example, the query 105 may be “what is the mechanism of action of fenebrutinib”, and the key words 140 can be “mechanism of action”. Since the key words 140 are drug related, the query processing system 110 determines the linguistics pipeline is to be executed. As another example, the query 105 may be about who is the medical monitor for studies from multiple protocol documents for the same drug, so the key words 140 can be determined to be drug-related information.

[0078] At block 505, vector representations are determined for each word in the query 105. The query processing system 110 can use a neural network-based model for natural language processing to determine the vector representations. For example, a transformer-based natural language processing neural network model, such as a Bidirectional Encoder Representations from Transformers (BERT) based neural network model, may be used. The vector representations for each word can be based on context. The neural network-based model can receive the query 105, generate tokens for each word in the query 105, and generate the vector representations based on the tokens.

[0079] At block 510, an embedding is calculated for the query 105 based on the vector representations. The neural network-based model can calculate the embedding for the query 105. In an example, layers of the neural network-based model can include transformer encoders that receive the vector representations. Each output per token from each layer can be the embedding. Additionally or alternatively, the embedding may be a combination of the vector representations for each word in the query 105.

[0080] At block 515, search results 150 are retrieved based on the embedding. The query processing system 110 can use a semantic search to determine matches between the embedding for the query 105 and the protocol documents 130. Each match can be included in the search results 150. For example, if the query 105 is about the mechanism of action for fenebrutinib, each match between the embedding of the query 105 and the protocol documents 130 indicating the mechanism of action of fenebrutinib can be retrieved.
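The flow of blocks 505 to 515 might be sketched as follows. To keep the example self-contained, a small table of toy static vectors stands in for the transformer-based model, so the vectors here are not contextual as a BERT-based model's would be. The vector values, the mean-pooling choice, the similarity threshold, and all function names are assumptions made for illustration.

```python
import math

# Toy static word vectors standing in for a neural network-based model.
TOY_VECTORS = {
    "mechanism": [1.0, 0.0, 0.0],
    "action": [0.9, 0.1, 0.0],
    "inhibits": [0.8, 0.2, 0.0],
    "visit": [0.0, 1.0, 0.0],
    "schedule": [0.0, 0.9, 0.1],
}


def embed(text):
    """Combine per-word vectors into one embedding by averaging (mean pooling)."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def semantic_search(query, passages, threshold=0.8):
    """Return passages whose embedding is similar to the query embedding."""
    query_embedding = embed(query)
    scored = [(cosine(query_embedding, embed(p)), p) for p in passages]
    return [p for score, p in sorted(scored, reverse=True) if score >= threshold]
```

In this sketch, a query about the mechanism of action matches a passage containing "inhibits" even though the two share no recognized words, which is the behavior that distinguishes a semantic search from a full text search.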

[0081] At block 520, a natural language answer is generated. The natural language answer is generated based on the search results 150 and the embedding for the query 105. To generate the natural language answer, the query processing system 110 includes scope data 120, from the query 105, in the natural language answer. The query processing system 110 also includes one or more relevant terms associated with the scope data 120 that occur in the search results 150 or are derived from the search results 150, in the natural language answer. For example, the natural language answer may be “the mechanism of action of fenebrutinib is inhibiting myeloid and B-cell activation”.

[0082] In some instances, the query processing system 110 can display the natural language answer and sub portions of each protocol document within the search results 150 that support the natural language answer at a computing device of a user that provided the query 105. The sub portions can correspond to the relevant portions of the protocol documents. Each search result can include a hyperlink having a uniform resource identifier to each protocol document. The query processing system 110 can receive input from the user regarding selection of a hyperlink for a search result and display an entire protocol document associated with the search result to provide context and support for the natural language answer. The query processing system 110 may additionally receive feedback on the search results 150 and retrain the neural network-based model based on the user feedback. For example, repeated negative feedback for a particular search result may trigger the query processing system 110 to retrain the neural network-based model.

[0083] FIG. 6 depicts a flowchart illustrating a process 600 for using a tabular pipeline according to various embodiments. The tabular pipeline is one of the pipelines 145(1-N) that can be executed by the query processing system 110. The query processing system 110 may have previously determined the pipeline is the tabular pipeline based on the key words 140 including trial visit characteristics. For example, the query 105 may be “which study visits measure pregnancy” for a specific drug name and protocol name. The key words 140 may include “study visits”, which the query processing system 110 can determine to be a trial visit characteristic, and therefore the tabular pipeline is to be executed.

[0084] At block 605, search results 150 from tables are retrieved. The tables are within the protocol documents 130 that the query processing system 110 previously determined to be associated with scope data 120 of the query 105. As an example, the tables may be schedule of assessments (SOA) tables for the clinical trials. The pipeline can determine matches between terms in the query 105 and the tables. For example, the term may be “pregnancy”, and the search results 150 can include an indication of each trial visit in which a pregnancy test is given to the patient based on determining the term is present in the table.

[0085] At block 610, a natural language answer is generated. The query processing system 110 generates the natural language answer to the query 105 based on the search results 150 and the query 105. To generate the natural language answer, the query processing system 110 includes the scope data 120, from the query 105, in the natural language answer and includes one or more relevant terms associated with the scope data 120 that occur in the search results 150 or are derived from the search results 150, in the natural language answer.

[0086] At block 615, the natural language answer and portions of the tables are provided as the search results 150. The portions of the tables can be sub portions of the tables in each protocol document within the search results 150 that support the natural language answer. For example, the sub portions may be the rows and columns associated with giving the patient the pregnancy tests. The query processing system 110 can display the search results 150 on the computing device. Displaying the search results 150 may involve highlighting rows and columns that support the natural language answer and embedding the rows and/or columns with additional discoverable information. For example, footnotes in cells of the tables may be embedded, such that interaction by the user with a cell displays the corresponding footnote. Additionally, each search result can include a hyperlink having a uniform resource identifier to each protocol document. The query processing system 110 can receive input from the user regarding selection of a hyperlink for a search result and display an entire protocol document associated with the search result to provide context and support for the natural language answer.
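A simplified sketch of blocks 605 to 615 is given below, assuming a schedule of assessments table can be represented as rows of per-visit marks. The table contents, visit names, and function names are hypothetical, and the answer template only mirrors the phrasing style of the example answer; it does not insert the scope data 120 as the described system may.

```python
# Hypothetical schedule of assessments (SOA) table: one mark per visit column.
VISITS = ["Screening", "Day 1", "Week 4", "Week 8"]
SOA_TABLE = {
    "Pregnancy test": ["x", "x", "", "x"],
    "Vital signs": ["x", "x", "x", "x"],
}


def search_table(term, table, visits):
    """Return, per matching row, the visits (columns) in which it is marked."""
    hits = {}
    for assessment, marks in table.items():
        if term.lower() in assessment.lower():
            hits[assessment] = [v for v, mark in zip(visits, marks) if mark]
    return hits


def natural_language_answer(hits):
    """Build a short answer sentence from the matching rows and columns."""
    parts = [
        f"{assessment.lower()} is done in {len(visit_list)} study visits"
        for assessment, visit_list in hits.items()
    ]
    return "; ".join(parts)
```

The row names and visit lists returned by `search_table` correspond to the rows and columns that a user interface could highlight in support of the answer.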

[0087] FIGS. 7A-7D depict examples of a user interface 700 for executing a tabular pipeline according to various embodiments. In FIG. 7A, a query processing system (e.g., query processing system 110 in FIG. 1) receives a query 705 (e.g., query 105 in FIG. 1) of “which study visits measure pregnancy” with selections of a primary drug of ibuprofen and a protocol document named AA0001:v:1. The “which study visits measure pregnancy” is a natural language query, and the drug name and protocol document name can be predefined query parameters since the drug name and protocol document name were selected from a drop down menu. The query processing system 110 can determine that scope data 710 (e.g., scope data 120 in FIG. 1) for the query 705 is the drug ibuprofen, and the protocol document AA0001:v:1. The query processing system 110 can also determine that key words 715 within the query 705 are “pregnancy” and “study visits”, and that the “study visits” key word corresponds to a trial visit characteristic. As a result, the query processing system 110 executes a tabular pipeline on the protocol document for the query 705.

[0088] In FIG. 7B, search results 720 (e.g., search results 150 in FIG. 1) for the query 705 are displayed at the user interface 700. The search results 720 include a natural language answer and rows and columns from an SOA table of the protocol document that are associated with the patient receiving a pregnancy test. The rows and columns that support the natural language answer to the query 705 are highlighted. As illustrated, the natural language answer provided is “pregnancy test is done in 8 study visits”. The row for “pregnancy test” and the columns associated with each visit in which a pregnancy test is given are highlighted.

[0089] In FIG. 7C, the user interface 700 with an embedding of a cell with additional discoverable information is illustrated. The cell is part of the search results 720 and includes a footnote about laboratory tests prior to randomization and dosing. A user interaction of hovering over the cell results in the additional discoverable information being displayed.

[0090] In FIG. 7D, the entire SOA table is displayed at the user interface 700. A user may select a hyperlink, or, as illustrated in FIG. 7C, a “Toggle Full Table” link in the search results 720, to display the entire SOA table. The row associated with the search results 720 for the query 705 remains highlighted within the entire SOA table so that the user may be able to gain context or support information for the natural language answer.

[0091] FIG. 8 depicts a flowchart illustrating a process 800 for using a substring pipeline according to various embodiments. The process 800 depicted in FIG. 8 is implemented by the architecture, systems, and techniques depicted in FIG. 1. Individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

[0092] The processes and/or operations depicted in FIG. 8 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processor cores), hardware, or combinations thereof. The software may be stored in a memory (e.g., on a memory device, on a non-transitory computer-readable storage medium). The particular series of processing steps in FIG. 8 is not intended to be limiting. Other sequences of steps may also be performed according to alternative embodiments. For example, in alternative embodiments the steps outlined above may be performed in a different order. Moreover, the individual steps illustrated in FIG. 8 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.

[0093] The substring pipeline is one of the pipelines 145(1-N) that can be executed by the query processing system 110. The query processing system 110 may have previously determined the pipeline is the substring pipeline based on the key words 140 including words associated with a particular character (e.g., quotation marks). For example, the query 105 may include “oprazole” in quotation marks, which can indicate “oprazole” is the key word. Accordingly, the query processing system 110 determines the substring pipeline is to be executed.

[0094] At block 805, the query 105 is parsed into substrings. Each substring can be a set of characters between two instances of the particular character. For example, the query processing system 110 can identify the particular character in the query 105 and determine that any characters between two instances of the particular character form a substring.

[0095] At block 810, search results 150 are retrieved from protocol documents 130 based on the substrings in the query 105. The query processing system 110 may have previously determined the protocol documents 130 associated with scope data 120 of the query 105. The query processing system 110 uses a substring search to determine matches between the substrings and content of the protocol documents 130. Any matches are included in the search results 150.
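Blocks 805 and 810 might be sketched as follows, taking the straight double quotation mark as the particular character. The function names, the case-insensitive comparison, and the size of the returned context snippet are assumptions made for illustration rather than details of the disclosure.

```python
import re


def parse_substrings(query, delimiter='"'):
    """Return every run of characters enclosed between two delimiter characters."""
    d = re.escape(delimiter)
    return re.findall(f"{d}([^{d}]+){d}", query)


def substring_search(substrings, protocol_documents, context=20):
    """Return (protocol name, substring, surrounding text) for each match."""
    results = []
    for protocol_name, text in protocol_documents.items():
        for substring in substrings:
            start = text.lower().find(substring.lower())
            if start != -1:
                end = start + len(substring)
                snippet = text[max(0, start - context): end + context]
                results.append((protocol_name, substring, snippet))
    return results
```

For example, the substring “oprazole” would match a protocol document that mentions lansoprazole or pantoprazole, since both drug names contain that run of characters.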

[0096] In some instances, the query processing system 110 can display the search results 150 and sub portions of each protocol document within the search results 150 that include the substring at a computing device of a user that provided the query 105. Each search result can include a hyperlink having a uniform resource identifier to each protocol document. The query processing system 110 can receive input from the user regarding selection of a hyperlink for a search result and display an entire protocol document associated with the search result to provide context for the substring.

IV. Another Example Computing Environment

[0097] FIG. 9 depicts a block diagram of another computing environment 900 for intelligent and focused searching and mining of clinical trial data according to various embodiments. The computing environment 900 includes a search engine 905, one or more servers 910(1-N), one or more databases 915(1-N), and a computing device 920. The components of the computing environment 900 may communicate via a communication network 925. The communication network 925 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of available protocols, including without limitation TCP/IP (transmission control protocol/Internet protocol), SNA (systems network architecture), IPX (Internet packet exchange), AppleTalk®, and the like. Merely by way of example, network(s) can be a local area network (LAN), networks based on Ethernet, Token-Ring, a wide-area network (WAN), the Internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, a public switched telephone network (PSTN), an infra-red network, a wireless network (e.g., a network operating under any of the Institute of Electrical and Electronics Engineers (IEEE) 802.11 suite of protocols, Bluetooth®, and/or any other wireless protocol), and/or any combination of these and/or other networks.

[0098] A user can provide a query (e.g., query 105 in FIG. 1) to the search engine 905 via the computing device 920. The user can access the search engine 905 through a software application, such as through a web page of a browser. The computing device 920 generates user interfaces, such as those depicted in FIGS. 4A, 4B, and 7A-7D, for interacting with the search engine 905 via the browser. Based on the query 105, the servers 910(1-N) can execute an appropriate pipeline. Each of the servers 910(1-N) may be a computing device that includes one or more data processors and a non-transitory computer-readable storage medium that stores instructions that, when executed by the one or more data processors, cause the one or more data processors to perform computing operations. For example, the servers 910(1-N) can include a query processing system (e.g., query processing system 110 in FIG. 1) that performs some or all of the processes of FIGS. 2, 3, 5, 6, and 8.

[0099] The servers 910(1-N) can access the databases 915(1-N) to generate search results (e.g., search results 150) for the query 105. The databases 915(1-N) can include a database of structured clinical trial data (e.g., structured clinical trial data 125 in FIG. 1) and a database of predefined query parameters that maps key words to pipelines. Additionally, the databases 915(1-N) may include a database of subject data (e.g., a database that stores electronic health record (EHR) data from clinical trials, a care provider, or one or more provider networks) or the like. The search engine 905 can provide the search results 150 for display at the computing device 920.

V. Additional Considerations

[00100] Some embodiments of the present disclosure include a system including one or more data processors. In some embodiments, the system includes a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein. Some embodiments of the present disclosure include a computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein.

[00101] The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention as claimed has been specifically disclosed by embodiments and optional features, modification, and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.

[00102] The description provides preferred exemplary embodiments only, and is not intended to limit the scope, applicability or configuration of the disclosure. Rather, the description of the preferred exemplary embodiments will provide those skilled in the art with an enabling description for implementing various embodiments. It is understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope as set forth in the appended claims.

[00103] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.

Claims

What is claimed is:

1. A computer-implemented method comprising: receiving, by a query processing system, a query, the query being at least in part a natural language query that is received based on user input at a computing device; identifying, by the query processing system, scope data within the query; identifying, using a full text search, protocol documents within structured clinical trial data that match the scope data; identifying, by the query processing system, key words within the query; determining, by the query processing system using the key words, a pipeline from a plurality of pipelines for processing the query, wherein the plurality of pipelines are alternative sets of rules and/or model associations that are used to process the query; executing, by the query processing system, the pipeline, wherein executing the pipeline comprises: inputting the query and the protocol documents into the pipeline, executing a set of rules and/or model associations of the pipeline on the protocol documents using data from the query, and obtaining search results based on executing the sets of rules and/or the model associations; and providing the search results as an answer to the query for presentation at the computing device.

2. The computer-implemented method of claim 1, wherein the query is further at least in part defined by predefined query parameters.

3. The computer-implemented method of claim 2, wherein identifying the scope data comprises: (i) identifying drug names, protocol names, or a combination thereof within the natural language query, (ii) receiving a selection of the predefined query parameters from one or more drop down boxes comprising a plurality of predefined query parameters, or (iii) a combination thereof.

4. The computer-implemented method of claim 3, wherein identifying the scope data further comprises: identifying additional predefined query parameters associated with the predefined query parameters based on the selection of the predefined query parameters from the one or more drop down boxes.

5. The computer-implemented method of claim 3, wherein identifying the key words comprises: (i) removing the scope data from the natural language query to generate a modified natural language query, and (ii) identifying, using a full text search, words or roots of the words from the modified natural language query in a predefined data structure comprising the key words.

6. The computer-implemented method of claim 5, wherein determining the pipeline comprises determining the pipeline from the predefined data structure based on a mapping between the key words and the pipeline.

7. The computer-implemented method of claim 1 or 6, wherein the plurality of pipelines include an inclusion or exclusion pipeline, a tabular pipeline, a linguistics pipeline, and a substring pipeline.

8. The computer-implemented method of claim 7, wherein the key words include inclusion or exclusion characteristics, and the pipeline is determined to be the inclusion or exclusion pipeline based on the inclusion or exclusion characteristics.

9. The computer-implemented method of claim 8, wherein executing the set of rules and/or model associations comprises: determining, using a named entity recognition model, one or more entity elements from the natural language query; retrieving, using a full text search, a first subset of the search results comprising the one or more entity elements that occur in content of the protocol documents based on the inclusion or exclusion characteristics; and retrieving, using a knowledge graph comprising entities and associated medical data, a second subset of the search results comprising medical data, associated with the one or more entity elements, that occur in the content of the protocol documents based on the inclusion or exclusion characteristics.

10. The computer-implemented method of claim 9, wherein the knowledge graph comprises the entities and the associated medical data as a hierarchical data structure, and the knowledge graph is used to identify protocol documents with medical data of a lower or higher level than the one or more entity elements.

11. The computer-implemented method of claim 10, wherein the providing the search results comprises displaying, by the query processing system on the computing device, sub portions of each protocol document from the first subset of search results and the second subset of the search results.

12. The computer-implemented method of claim 11, wherein each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further comprises receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context for the inclusion or exclusion characteristics.

13. The computer-implemented method of claim 11, further comprising receiving, by the query processing system, user feedback on the search results, and retraining the named entity recognition model based on the user feedback.

14. The computer-implemented method of claim 7, wherein the key words include drug-related information, and the pipeline is determined to be the linguistics pipeline based on the drug-related information.

15. The computer-implemented method of claim 14, wherein executing the set of rules and/or model associations comprises: determining, using a neural network-based model for natural language processing, vector representations for each word in the query based on context; calculating, using the neural network-based model, an embedding for the query based on the vector representations for each word in the query; retrieving, using a semantic search, the search results from the protocol documents based on the embedding for the query; and generating a natural language answer to the query based on the search results and the embedding for the query, wherein the generating the natural language answer to the query comprises including the scope data, from the query, in the natural language answer and including one or more relevant terms associated with the scope data, that occur in the search results or are derived from the search results, in the natural language answer.

16. The computer-implemented method of claim 15, wherein the providing the search results comprises displaying, by the query processing system on the computing device, the natural language answer and sub portions of each protocol document within the search results that support the natural language answer.

17. The computer-implemented method of claim 16, wherein each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further comprises receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context and support for the natural language answer.

18. The computer-implemented method of claim 16, further comprising receiving, by the query processing system, user feedback on the search results, and retraining the neural network-based model based on the user feedback.

19. The computer-implemented method of claim 7, wherein the key words include trial visit characteristics, and the pipeline is determined to be the tabular pipeline based on the trial visit characteristics.

20. The computer-implemented method of claim 19, wherein executing the set of rules and/or model associations comprises: retrieving, using a full text search, the search results from tables within the protocol documents based on the query; and generating a natural language answer to the query based on the search results and the query, wherein the generating the natural language answer to the query comprises including the scope data, from the query, in the natural language answer and including one or more relevant terms associated with the scope data, that occur in the search results or are derived from the search results, in the natural language answer.

21. The computer-implemented method of claim 20, wherein the providing the search results to the query comprises displaying, by the query processing system on the computing device, the natural language answer and sub portions of the tables in each protocol document within the search results that support the natural language answer.

22. The computer-implemented method of claim 21, wherein the displaying the sub portions of the tables comprises highlighting rows and columns that support the natural language answer and embedding rows and/or columns of the tables with additional discoverable information.

23. The computer-implemented method of claim 21, wherein each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further comprises receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context and support for the natural language answer.

24. The computer-implemented method of claim 7, wherein the key words include words associated with a particular character, and the pipeline is determined to be the substring pipeline based on the words associated with the particular character.

25. The computer-implemented method of claim 24, wherein executing the set of rules and/or model associations comprises parsing the query into substrings and retrieving, using a substring search, the search results from the protocol documents based on the substrings in the query.

26. The computer-implemented method of claim 25, wherein the providing the search results to the query comprises displaying, by the query processing system on the computing device, sub portions of each protocol document within the search results that include the substring.

27. The computer-implemented method of claim 26, wherein each search result includes a hyperlink having a uniform resource identifier to each protocol document, and the method further comprises receiving, by the query processing system, input from the user regarding selection of a hyperlink for a search result, and displaying, by the query processing system on the computing device, an entire protocol document associated with the search result to provide context for the substring.

28. The computer-implemented method of any one of claims 1-27, further comprising generating a new protocol document based on the search results, wherein the generating comprises authoring a portion of the new protocol document using information obtained through the search results or text obtained within the search results.

29. A system comprising: one or more data processors; and a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform the computer-implemented method of any one of claims 1-28.

30. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform the computer-implemented method of any one of claims 1-28.
