CN117480573A

CN117480573A - Computer architecture for generating an integrated data store

Info

Publication number: CN117480573A
Application number: CN202280040161.6A
Authority: CN
Inventors: 纳文·库马尔; 张静文; 尼莎·苏布拉马尼安; 高塔姆·纳亚克; 凯瑟琳·朗; 拉杰什·库查拉帕蒂; 陆舜昕
Original assignee: Guardant Health Inc
Current assignee: Guardant Health Inc
Priority date: 2021-06-03
Filing date: 2022-06-03
Publication date: 2024-01-30

Abstract

An integrated data store may be generated that includes genomic information and health insurance claim information for a common set of individuals. A data processing pipeline may be implemented for the information stored by the integrated data store. The data processing pipeline may include a plurality of sets of data processing instructions executable to analyze specified information stored by the integrated data store and to generate different data sets. The data sets may be analyzed to determine the impact of characteristics of individuals and/or the magnitude of the impact of treatments provided to individuals in whom a biological condition is present.

Description

Computer architecture for generating an integrated data store

Priority claims and incorporation by reference

The present application claims priority to U.S. provisional patent application Ser. No. 63/196,609, titled "Computer Architecture for Generating an Integrated Data Repository"; U.S. provisional patent application Ser. No. 63/227,860, titled "Computer Architecture for Identifying Lines of Therapy"; U.S. provisional patent application Ser. No. 63/238,851, titled "Data Repository System, and Method for Cohort Selection"; and U.S. provisional patent application Ser. No. 63/250,912, titled "Computer Architecture for Generating a Reference Data Table". These applications were filed in 2021, the earliest on June 3, 2021, and all of them are incorporated herein by reference in their entirety.

Technical Field

Implementations of the present disclosure relate generally to the field of computer architecture and, more particularly, to a computer architecture for generating a data repository that integrates multiple healthcare data sources, including health insurance claim data and genomic data.

Background

Various types of documents may be generated when an individual visits a healthcare provider to treat one or more biological conditions. For example, medical records may be generated by a healthcare provider that include clinical observations recorded by the healthcare provider, laboratory test results, diagnostic test information, imaging information, dental health information, one or more combinations thereof, and the like. Further, a billing record may be generated that indicates payment information regarding at least one of the products or services provided to the individual by the healthcare provider. In addition, health insurance claim information may be generated that indicates information obtained by a health insurance company related to the individual's treatment for one or more biological conditions.

Brief Description of Drawings

FIG. 1 illustrates an example architecture for generating an integrated data store that includes multiple types of healthcare data according to one or more implementations.

FIG. 2 illustrates an example framework corresponding to a data table arrangement in an integrated data store in accordance with one or more implementations.

FIG. 3 illustrates an architecture for generating one or more data sets from information retrieved from a data store that integrates health-related data from multiple sources, according to one or more implementations.

FIG. 4 illustrates an architecture for generating an integrated data store that includes de-identified health insurance claim data and de-identified genomic data in accordance with one or more implementations.

FIG. 5 illustrates a framework for generating a data set based on data stored by an integrated data store via a data pipeline system, in accordance with one or more implementations.

FIG. 6 is a schematic diagram of an architecture for incorporating medical record data into an integrated data store.

FIG. 7 is a data flow diagram of an example process of generating an integrated data store storing health insurance claim data and genomic data in accordance with one or more implementations.

FIG. 8 is a data flow diagram of an example process of generating a number of data sets for analyzing information stored by an integrated data store storing health insurance claim data and genomic data, according to one or more implementations.

FIG. 9 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed according to one or more implementations.

FIG. 10 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving first-line (1L) therapy to treat non-small cell lung cancer, prior to receiving the therapy.

FIG. 11 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving first-line (1L) therapy to treat non-small cell lung cancer, during treatment.

FIG. 12 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving osimertinib to treat non-small cell lung cancer, prior to receiving the treatment.

FIG. 13 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving osimertinib to treat non-small cell lung cancer, during treatment.

FIG. 14 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving chemotherapy to treat non-small cell lung cancer, during treatment.

FIG. 15 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving chemotherapy to treat non-small cell lung cancer, after treatment.

FIG. 16 is a graph showing Kaplan-Meier curves representing real-world overall survival according to high ctDNA count, low ctDNA count, and undetectable ctDNA for patients receiving chemotherapy to treat non-small cell lung cancer, prior to receiving the treatment.

FIG. 17 is a graph showing the frequency of selected alterations in a cohort of patients (n=637) diagnosed with advanced non-small cell lung cancer (NSCLC) who received a liquid biopsy after initiation of first-line osimertinib treatment.

FIG. 18 is a graph showing the frequency of selected mutations in the ligand binding domain in a cohort of patients diagnosed with breast cancer (n=4448) who received a liquid biopsy after recorded treatment with an aromatase inhibitor (AI).

FIG. 19 is a graph showing alterations associated with osimertinib resistance detected by liquid biopsy after treatment was provided to women diagnosed with NSCLC.

FIG. 20 is a graph showing ESR1 resistance mutations detected after a second course of treatment in women diagnosed with metastatic breast cancer and treated with an aromatase inhibitor.

Detailed Description

The following description and the drawings sufficiently illustrate specific implementations to enable those skilled in the art to practice them. Other implementations may incorporate structural, logical, electrical, process, and other changes. Portions and features of some implementations may be included in or substituted for those of others. Implementations set forth in the claims include all available equivalents of those claims.

More data is needed to understand tumor behavior and the performance of treatments and guidelines outside the highly selective confines of randomized controlled trials, which are typically designed and implemented by entities with a commercial interest in their success. The use of real-world evidence (RWE), particularly databases featuring integrated clinical and molecular data, plays an increasingly important role in precision oncology research. However, most of these databases feature tumor genomic information that is limited to a single point in time (typically at diagnosis), in part because of the practical challenges of genomic profiling of serial tumor specimens in real-world clinical practice. Although there is evidence that treatment can significantly alter the tumor genomic landscape and lead to drug resistance, genomic data for tumors is generally limited to patients who have not yet received systemic treatment. Combining data from liquid biopsy assays with rich clinical information can overcome these challenges and help to improve understanding of tumor evolution and of the emergence of biomarkers that confer resistance, in order to guide the development of new therapies for areas of unmet need.

Analysis of healthcare data using existing systems and techniques is typically performed on medical records generated by healthcare providers. As used herein, a healthcare provider may refer to an entity, individual, or group of individuals involved in providing care to an individual in connection with at least one of treatment or prevention of one or more biological conditions. Furthermore, as used herein, a biological condition may refer to an abnormality in function and/or structure in an individual to an extent that produces or threatens to produce a detectable characteristic of the abnormality. Biological conditions may be characterized by external and/or internal features, signs, and/or symptoms that indicate deviations from a biological norm in one or more populations. In various examples, the biological condition may include one or more molecular phenotypes. For example, the biological condition may correspond to genetic or epigenetic damage. In one or more additional examples, the biological condition may include at least one of one or more diseases, one or more disorders, one or more injuries, one or more syndromes, one or more disabilities, one or more infections, one or more isolated symptoms, or other atypical variations in a biological structure and/or function of the individual. Further, treatment, as used herein, may refer to substances, procedures, routines, devices, and/or other interventions that may be administered or performed in order to treat one or more effects of a biological condition in an individual. In one or more examples, the treatment may include a substance metabolized by the individual. The substance may comprise a composition of matter, such as a pharmaceutical composition. The substance may be delivered to the individual by a variety of methods, such as ingestion, injection, absorption, or inhalation. Treatment may also include physical intervention, such as one or more surgeries. In at least some examples, the treatment may include a therapeutically significant intervention.

The healthcare data typically analyzed by existing systems includes unstructured data. Unstructured data may include data that is not organized in a predefined or standardized format. For example, unstructured data may include notes made by a healthcare provider that are composed of free text. That is, the manner in which the notes are captured does not include predefined inputs that may be selected by the healthcare provider (e.g., through a drop-down menu or a list). Rather, the notes include text entered by the healthcare provider, which may include sentences, sentence fragments, words, letters, symbols, abbreviations, one or more combinations thereof, and the like. In some cases, unstructured data may be partially structured. For example, the provider may select an insurance billing code from a predefined list of insurance billing codes and add unstructured notes to the data associated with the billing code.

Existing systems typically devote significant computational resources to analyzing unstructured data in order to extract information that may be relevant to the analysis being performed. In some cases, existing systems may analyze unstructured data and convert it to a structured format in order to analyze the previously unstructured data. Existing systems may be inefficient and inaccurate in analyzing unstructured data. When the unstructured data is obtained from healthcare data, the importance of accurately analyzing the information is high, as the analysis may be relevant to at least one of treatment or diagnosis of a plurality of individuals with respect to one or more biological conditions. Thus, inaccurate analysis of healthcare data may result in suboptimal treatment of an individual.

Implementations of the techniques, architectures, frameworks, systems, processes, and computer-readable instructions described herein are directed to analyzing health insurance claim data to derive information about at least one of the health or the treatment of an individual. In contrast to the unstructured data analyzed by existing systems, health insurance claim data is structured according to one or more formats and stored in a plurality of data tables. The data tables may include codes or other alphanumeric information indicating the treatment received by the individual, the date of treatment, dosage information, a diagnosis of the individual regarding one or more biological conditions, information related to a visit to a healthcare provider, the date of the visit to the healthcare provider, billing information, and the like. Implementations described herein can be used to accurately analyze health insurance claim data for hundreds, thousands, tens of thousands, or more individuals who have one or more biological conditions. In various examples, tens of thousands, hundreds of thousands, or up to millions of rows and/or columns of health insurance claim data may be analyzed to determine health-related information for individuals having one or more biological conditions.

In various examples, implementations described herein can integrate the molecular data with the health insurance claim data. The molecular data may include information derived from tissue samples extracted from a plurality of individuals. The molecular data may also include information derived from blood samples extracted from a plurality of individuals. In one or more illustrative examples, the molecular data may include genomic data. Further, in one or more examples, the health insurance claim data can be integrated with germline genetic information of a plurality of individuals.

An integrated data store may be created that combines individual health insurance claim data with individual molecular data. In one or more examples, an identifier associated with both the individual's health insurance claim data and the individual's molecular data can be generated for the individual. Both the molecular data and the health insurance claim data stored by the integrated data store can be accessed using a single identifier of the individual. In one or more illustrative examples, the identifier of the individual may include an encrypted security key. In various examples, the integrated data store may include a plurality of data tables corresponding to different aspects of the data stored within the data store. For example, a first data table may be generated that includes aggregated data, such as personal information, for individuals included in the integrated data store, and a second data table may be generated that includes data corresponding to visits to the healthcare provider. Further, a third data table may be generated that indicates medical procedures provided to the individual, and a fourth data table may be generated that indicates information related to prescriptions obtained by the individual. In addition, a fifth data table may be generated that includes multiomics profiling for the individual. The multiomics profiling may include at least one of a genomic profile, a transcriptomic profile, an epigenetic profile, or a proteomic profile.
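
The table arrangement described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration only: the table names, column names, the `tok-001` identifier, and the demo rows are all hypothetical and are not taken from the disclosure. It shows five tables keyed by one identifier, and a query that uses that identifier to retrieve claim-derived and molecular data together.

```python
import sqlite3


def build_demo_store() -> sqlite3.Connection:
    """Create an in-memory store with five tables keyed by one token (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE individuals (token TEXT PRIMARY KEY, birth_year INTEGER, sex TEXT);
        CREATE TABLE visits (token TEXT REFERENCES individuals(token),
                             visit_date TEXT, diagnosis_code TEXT);
        CREATE TABLE procedures (token TEXT REFERENCES individuals(token),
                                 procedure_date TEXT, procedure_code TEXT);
        CREATE TABLE prescriptions (token TEXT REFERENCES individuals(token),
                                    fill_date TEXT, drug_name TEXT);
        CREATE TABLE multiomics (token TEXT REFERENCES individuals(token),
                                 sample_date TEXT, profile_type TEXT,
                                 gene TEXT, alteration TEXT);
        """
    )
    # Made-up demo rows, for illustration only.
    conn.execute("INSERT INTO individuals VALUES ('tok-001', 1958, 'F')")
    conn.execute("INSERT INTO visits VALUES ('tok-001', '2021-01-05', 'C34.90')")
    conn.execute("INSERT INTO prescriptions VALUES ('tok-001', '2021-01-12', 'osimertinib')")
    conn.execute(
        "INSERT INTO multiomics VALUES ('tok-001', '2021-01-03', 'genomic', 'EGFR', 'L858R')"
    )
    conn.commit()
    return conn


def records_for(conn: sqlite3.Connection, token: str) -> list:
    """Return prescription rows joined to molecular rows for a single identifier."""
    query = """
        SELECT p.drug_name, p.fill_date, m.gene, m.alteration
        FROM prescriptions AS p
        JOIN multiomics AS m ON m.token = p.token
        WHERE p.token = ?
    """
    return conn.execute(query, (token,)).fetchall()
```

Calling `records_for(build_demo_store(), "tok-001")` returns the prescription and the genomic alteration for the same individual in one row, which is the practical benefit of keying every table by a single identifier.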

The data tables included in the integrated data store may be linked by logical links. In this way, a query to retrieve information from one data table may also retrieve information from one or more additional data tables. The information stored by the linked data tables may be accessed to generate a plurality of different data sets that may be used to analyze the information stored by the integrated data store. For example, information stored by the integrated data store may be analyzed by one or more algorithms to generate a data set organized according to one or more schemas. A data set may indicate the treatments for a biological condition that an individual has received over a period of time. A data set may also indicate a group of individuals, included in the integrated data store, who share a plurality of common features. In various examples, the data sets may combine and rank information from a plurality of different data sources, including the integrated data store. The data sets may be analyzed with respect to a plurality of queries to surface information that may be of interest to at least one of a healthcare provider, a patient, or a provider of treatments for a biological condition. For example, one or more data sets may be analyzed to more accurately determine the survival rate of individuals who have a biological condition and a particular genomic profile in response to receiving a particular treatment.

Implementations described herein can provide a platform that integrates individual health insurance claim data and molecular data, a capability not found in existing systems, which typically rely on electronic medical records that include large amounts of unstructured data. By generating and analyzing structured health insurance claim data that has been integrated with molecular data, implementations described herein can provide more accurate characterization of the integrated data relative to existing systems that rely on relatively inaccurate unstructured electronic medical record data. Further, implementations described herein generate analytics-ready data sets that enable analysis of health information about individuals in a confidential and anonymous manner.

FIG. 1 illustrates an example architecture 100 for generating an integrated data store that includes multiple types of healthcare data according to one or more implementations. Architecture 100 may include a data integration and analysis system 102. The data integration and analysis system 102 can obtain data from a plurality of data sources and integrate the data from the data sources into an integrated data store 104. For example, the data integration and analysis system 102 can obtain data from the health insurance claim data store 106. In various examples, the data integration and analysis system 102 and the health insurance claim data store 106 can be created and maintained by different entities. In one or more additional examples, the data integration and analysis system 102 and the health insurance claim data store 106 can be created and maintained by the same entity.

The data integration and analysis system 102 can be implemented by one or more computing devices. The one or more computing devices may include one or more server computing devices, one or more desktop computing devices, one or more laptop computing devices, one or more tablet computing devices, one or more mobile computing devices, or a combination thereof. In some implementations, at least a portion of one or more computing devices may be implemented in a distributed computing environment. For example, at least a portion of one or more computing devices may be implemented in a cloud computing architecture. In a scenario where a computing system for implementing the data integration and analysis system 102 is configured in a distributed computing architecture, processing operations may be performed concurrently by multiple virtual machines. In various examples, the data integration and analysis system 102 may implement multi-threading techniques. The implementation of distributed computing architecture and multi-threading techniques enables the data integration and analysis system 102 to utilize fewer computing resources relative to computing architectures that do not implement these techniques.

The health insurance claim data store 106 can store information obtained from one or more health insurance companies corresponding to insurance claims submitted by subscribers of the one or more health insurance companies. The health insurance claim data store 106 can be arranged (e.g., ordered) by patient identifier. The patient identifier may be based on the patient's first name, last name, date of birth, social security number, address, employer, etc. The data stored by the health insurance claim data store 106 can include structured data arranged in one or more data tables. The one or more data tables storing structured data can include a plurality of rows and a plurality of columns that indicate information regarding health insurance claims submitted by subscribers of the one or more health insurance companies related to procedures and/or treatments received by the subscribers from healthcare providers. At least a portion of the rows and columns of the data tables stored by the health insurance claim data store 106 can include health insurance codes that can indicate diagnoses, treatments, and/or procedures obtained by subscribers of the one or more health insurance companies with respect to biological conditions. In various examples, the health insurance codes may also indicate a diagnostic procedure obtained by the individual that is related to one or more biological conditions that may be present in the individual. In one or more examples, the diagnostic procedure may provide information for detecting the presence of a biological condition. The diagnostic procedure may also provide information for determining the progression of the biological condition. In one or more illustrative examples, the diagnostic procedure may include one or more imaging procedures, one or more assays, one or more laboratory procedures, one or more combinations thereof, and the like.

The data integration and analysis system 102 may also obtain information from the molecular data store 108. The molecular data store 108 may store data relating to genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, and/or proteomic information for a plurality of individuals. In one or more examples, the data integration and analysis system 102 and the molecular data store 108 may be created and maintained by different entities. In one or more additional examples, the data integration and analysis system 102 and the molecular data store 108 may be created and maintained by the same entity.

The genomic information may indicate one or more mutations corresponding to the genes of the individual. Mutations in the genes of an individual may correspond to differences between the nucleic acid sequences of the individual and one or more reference genomes. The reference genome may comprise a known reference genome, such as hg19. In various examples, a mutation in an individual's genes may correspond to a difference in the individual's germline genes relative to a reference genome. In one or more additional examples, the reference genome may include a germline genome of the individual. In one or more further examples, the mutations in the genes of the individual may include a somatic mutation. Mutations in an individual's genes may be associated with insertions, deletions, single nucleotide variations, loss of heterozygosity, duplications, amplifications, translocations, gene fusions, or one or more combinations thereof.
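
As a simplified illustration of a mutation being a difference between an individual's sequence and a reference sequence, the sketch below reports single nucleotide differences between two sequences. It assumes the sequences are already aligned and of equal length, and it handles substitutions only; insertions, deletions, and the other alteration types listed above would need a real alignment and variant-calling workflow. The function name and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def single_nucleotide_differences(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Compare two pre-aligned, equal-length sequences position by position.

    Returns (1-based position, reference base, sample base) for each mismatch.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(pairs, start=1)
        if ref_base != alt_base
    ]
```

For example, comparing the toy reference `ACGTAC` with the toy sample `ACCTAC` reports one substitution at position 3, from `G` to `C`.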

In one or more illustrative examples, the genomic information stored by the molecular data store 108 can include a genomic profile of tumor cells present in an individual. In these cases, the genomic information may be derived from analysis of genetic material, such as deoxyribonucleic acid (DNA) and/or ribonucleic acid (RNA), from a sample including, but not limited to, a tissue sample or tumor biopsy, circulating tumor cells (CTCs), exosomes, or efferosomes, or from circulating nucleic acids found in a blood sample of the individual (e.g., cell-free DNA) that are present due to degradation of tumor cells in the individual. In one or more examples, genomic information of tumor cells of an individual may correspond to one or more target regions. The presence of one or more mutations in the one or more target regions may be indicative of the presence of tumor cells in the individual. Genomic information stored by the molecular data store 108 may be generated in connection with assays or other diagnostic tests that may determine one or more mutations with respect to one or more target regions of a reference genome.

"Cell-free DNA," "cfDNA molecule," or simply "cfDNA" includes DNA molecules that are present in a subject in extracellular form (e.g., in blood, serum, plasma, or other bodily fluids such as lymph, cerebrospinal fluid, urine, or sputum), and includes DNA that is not contained within or otherwise bound to cells when isolated from the subject. Although the DNA was originally present in one or more cells of a large complex biological organism (e.g., a mammal) or in other cells colonizing the organism (e.g., bacteria), the DNA has been released from the cells into fluids found in the organism. cfDNA includes, but is not limited to, cell-free genomic DNA of the subject (e.g., genomic DNA of a human subject) and cell-free DNA of microorganisms (e.g., bacteria) residing in the subject (whether pathogenic or bacteria commonly found at typical colonization sites, such as the gut or skin, of healthy controls), but excludes cell-free DNA of microorganisms that have merely contaminated a sample of bodily fluid. Typically, cfDNA can be obtained by obtaining a sample of the fluid without performing an in vitro cell lysis step, and obtaining it may further include removing cells present in the fluid (e.g., centrifuging blood to remove cells).

In one or more additional examples, the data integration and analysis system 102 can obtain information from one or more additional data stores 110. The one or more additional data stores 110 can store data related to electronic medical records of individuals whose data resides in at least one of the health insurance claim data store 106 or the molecular data store 108. Further, one or more additional data stores 110 can store data related to pathology reports for individuals whose data resides in at least one of the health insurance claim data store 106 or the molecular data store 108. In various examples, one or more additional data stores 110 can store data related to a biological condition and/or treatment of a biological condition. In one or more examples, at least a portion of the data integration and analysis system 102 and one or more additional data stores 110 can be created and maintained by different entities. In one or more further examples, the data integration and analysis system 102 and at least a portion of the one or more additional data stores 110 can be created and maintained by the same entity.

In one or more further implementations, the data integration and analysis system 102 can obtain information from one or more reference information data stores 112. The one or more reference information data stores 112 may store information including definitions, standards, protocols, vocabularies, one or more combinations thereof, and the like. In various examples, the information stored by the one or more reference information data stores 112 may correspond to a biological condition and/or treatment of a biological condition. In one or more illustrative examples, the one or more reference information data stores 112 may include RxNorm. (RxNorm provides normalized names for clinical drugs and links its names to a number of drug vocabularies used in pharmacy management and drug interaction software.) In one or more examples, the data integration and analysis system 102 and at least a portion of the one or more reference information data stores 112 may be created and maintained by different entities. In one or more further examples, the data integration and analysis system 102 and at least a portion of the one or more reference information data stores 112 can be created and maintained by the same entity.

The data integration and analysis system 102 can obtain data from at least one of the health insurance claim data store 106, the molecular data store 108, the one or more additional data stores 110, or the reference information data store 112 via one or more communication networks that are accessible to the data integration and analysis system 102 and that are accessible to at least one of the health insurance claim data store 106, the molecular data store 108, the one or more additional data stores 110, or the reference information data store 112. The data integration and analysis system 102 can also obtain data from at least one of the health insurance claim data store 106, the molecular data store 108, one or more additional data stores 110, or the reference information data store 112 via one or more secure communication channels. Further, the data integration and analysis system 102 can obtain data from at least one of the health insurance claim data store 106, the molecular data store 108, one or more additional data stores 110, or the reference information data store 112 via one or more calls to an Application Programming Interface (API).

The data integration and analysis system 102 may include a data integration system 114. The data integration system 114 can obtain data from the health insurance claim data store 106 and the molecular data store 108 to generate the integrated data store 104. The data integration system 114 may also obtain data from one or more additional data stores 110 to generate the integrated data store 104. In various examples, the data integration system 114 may implement one or more natural language processing techniques to integrate data from one or more additional data stores 110 into the integrated data store 104.

In one or more examples, the data integration system 114 can generate one or more tokens to identify individuals having data stored in the health insurance claim data store 106 and having data stored in the molecular data store 108. In various examples, the data integration system 114 may generate one or more tokens by implementing one or more hash functions. The data integration system 114 can implement one or more hash functions to generate one or more tokens based on information stored by at least one of the health insurance claim data store 106 or the molecular data store 108. For example, the information used by the data integration system 114 to generate the respective tokens by implementing the hash function may include at least one of an identifier of the respective individual, a birth date of the respective individual, a zip code of the respective individual, or a gender of the respective individual. In one or more illustrative examples, the identifier of the respective individual may include a combination of at least a portion of the first name of the respective individual and at least a portion of the surname of the respective individual. Tokens generated using data from different data stores may correspond to the same or similar information, or the same or similar types of information, stored by the different data stores. To illustrate, a token can be generated using a portion of the name, the date of birth, at least a portion of the zip code, and the gender of the individual as obtained from the health insurance claim data store 106 and from the molecular data store 108.
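
The token generation described above might look roughly like the following sketch. The disclosure says only that one or more hash functions are applied to fields such as partial name, birth date, partial zip code, and gender; the use of a keyed hash (HMAC-SHA256), the normalization rules, the field lengths, and the function name `make_token` are all assumptions made for illustration.

```python
import hashlib
import hmac


def make_token(first_name: str, last_name: str, birth_date: str,
               zip_code: str, gender: str, secret_key: bytes) -> str:
    """Derive a de-identified token from normalized demographic fields.

    Assumption: both data sources normalize fields identically, so the same
    person yields the same token regardless of casing or stray whitespace.
    """
    fields = [
        first_name.strip().lower()[:3],   # part of the first name
        last_name.strip().lower()[:5],    # part of the surname
        birth_date.strip(),               # e.g. an ISO date string
        zip_code.strip()[:3],             # part of the zip code
        gender.strip().lower()[:1],       # first letter only
    ]
    message = "|".join(fields).encode("utf-8")
    # Keyed hash, so tokens cannot be recomputed without the shared key (an assumption).
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

Because each source computes the token from the same kinds of fields, records for one person can be linked by comparing tokens without exchanging the underlying names or birth dates.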

The data integration system 114 may integrate data from a plurality of different data sources by analyzing tokens generated by implementing one or more hash functions using data obtained from the plurality of different data sources. For example, the data integration system 114 can obtain one or more first tokens generated from data stored by the health insurance claim data store 106 and one or more second tokens generated from data stored by the molecular data store 108. The data integration system 114 may analyze the one or more first tokens relative to the one or more second tokens to determine individual first tokens that correspond to individual second tokens. In one or more illustrative examples, the data integration system 114 may identify an individual first token that matches an individual second token. The first token may match the second token when the data of the first token has at least a threshold amount of similarity with respect to the data of the second token. In one or more examples, the first token may match the second token when the data of the first token is the same as the data of the second token. To illustrate, a first token may match a second token when the alphanumeric string of the first token is the same as the alphanumeric string of the second token.

By determining a first token generated using data stored by the health insurance claim data store 106 that corresponds to a second token generated using data stored by the molecular data store 108, the data integration system 114 can identify individuals having data stored in both the health insurance claim data store 106 and the molecular data store 108. In this way, the data integration system 114 can obtain data from multiple individuals from the health insurance claim data store 106 and data from the same multiple individuals from the molecular data store 108 and store the health insurance claim data and the molecular data for the multiple individuals in the integrated data store 104.
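
As a rough sketch of the matching step, the example below treats a match as exact equality of token strings and keeps only individuals whose tokens appear in both sources. The function name `integrate`, the dictionary layout, and the placeholder token and record values are assumptions for illustration; a threshold-similarity match would need a different comparison.

```python
def integrate(claims_by_token, molecular_by_token):
    """Combine records for individuals whose tokens appear in both sources."""
    # Exact-match case: a first token matches a second token when the
    # alphanumeric strings are identical, so a set intersection suffices.
    shared_tokens = claims_by_token.keys() & molecular_by_token.keys()
    return {
        token: {
            "claims": claims_by_token[token],
            "molecular": molecular_by_token[token],
        }
        for token in sorted(shared_tokens)
    }


# Placeholder tokens and records (hypothetical values).
claims = {"tok-a": ["claim 1", "claim 2"], "tok-b": ["claim 3"]}
molecular = {"tok-b": ["test result 1"], "tok-c": ["test result 2"]}

integrated = integrate(claims, molecular)
```

Only `tok-b` is present in both inputs, so it is the only individual carried into the integrated result.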

The data integration system 114 can also integrate data stored by one or more additional data stores 110 with data from the health insurance claim data store 106 and the molecular data store 108 to generate the integrated data store 104. To illustrate, the data integration system 114 may obtain one or more third tokens generated from data stored by the additional data store 110, such as a data store storing data corresponding to pathology reports. The data integration system 114 can analyze one or more third tokens relative to a first token generated using information stored by the health insurance claim data store 106 and a second token generated using information stored by the molecular data store 108 to determine respective third tokens corresponding to the individual first tokens and the individual second tokens. In one or more illustrative examples, the data integration system 114 can identify a third token generated using one or more hash functions and a common set of information obtained from the health insurance claim data store 106, the molecular data store 108, and the additional data store 110.

By determining a third token generated using data stored by the additional data store 110 that corresponds to a first token generated using data stored by the health insurance claim data store 106 and a second token generated using data stored by the molecular data store 108, the data integration system 114 can identify individuals having data stored in the health insurance claim data store 106, the molecular data store 108, and the additional data store 110. In this manner, the data integration system 114 can obtain data from multiple individuals from the health insurance claim data store 106, and from the molecular data store 108 and the additional data store 110, and store the health insurance claim data, molecular data, and additional data for the multiple individuals in the integrated data store 104.

The data of the plurality of individuals stored by the integrated data store 104 may be accessed using the respective identifiers of the individuals. The data integration system 114 may implement a variety of techniques as part of a de-identification process with respect to storing and retrieving individual information in the integrated data store 104. The identifier of the individual may correspond to a key generated using at least one hash function. The identifier of the individual may also be generated by implementing one or more salting processes on the key generated using the at least one hash function. The tokens may be generated using one or more hash functions and a common set of information obtained from the health insurance claim data store 106, the molecular data store 108, and/or the additional data store 110. In one or more illustrative examples, the identifier generated by the data integration system 114 for accessing the information of the respective individual stored by the integrated data store 104 may be unique to each individual. In one or more examples, an identifier of an individual may be generated using at least a portion of the information used to generate a token associated with the individual. In one or more additional examples, the identifier of the individual may be generated using information that is different from the information used to generate the token associated with the individual.
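
One way to picture a salted identifier is shown below. It is a sketch under stated assumptions: the function name `make_identifier`, the use of SHA-256, the 16-byte random salt, and prepending the salt to the key are illustrative choices, not details given in the description.

```python
import hashlib
import secrets


def make_identifier(key: str, salt: bytes) -> str:
    """Derive an access identifier by hashing a salt together with a key."""
    # Prepending a secret salt means the identifier cannot be reproduced
    # from the key alone (assumed salting scheme for illustration).
    return hashlib.sha256(salt + key.encode("utf-8")).hexdigest()


# A hypothetical per-build salt; a fresh salt gives fresh identifiers.
salt = secrets.token_bytes(16)
identifier = make_identifier("tok-b", salt)
```

Using a different salt each time the data store is rebuilt would change every identifier, which fits the periodic regeneration discussed later in this description.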

The data integration system 114 may also generate the integrated data store 104 from a variety of different combinations of data stores in a similar manner. For example, the data integration system 114 can obtain tokens generated from information stored by the health insurance claim data store 106, as well as additional tokens generated from information stored by one or more additional data stores 110. The data integration system 114 can determine individual tokens generated from information stored by the health insurance claim data store 106 corresponding to individual additional tokens generated from information stored by one or more additional data stores 110. By determining tokens generated using data stored by the health insurance claim data store 106 that correspond to additional tokens generated using data stored by the additional data store 110, the data integration system 114 can identify individuals having data stored in both the health insurance claim data store 106 and the additional data store 110. In this manner, the data integration system 114 can obtain data from multiple individuals from the health insurance claim data store 106 and data from the same multiple individuals from the additional data store 110 and store the health insurance claim data and additional data for the multiple individuals in the integrated data store 104. The individual's respective identifiers may be used to access the health insurance claim data and additional data for a plurality of individuals stored by the integrated data store 104.

In one or more further examples, the data integration system 114 can obtain tokens generated from information stored by the molecular data store 108 and tokens generated from information stored by the one or more additional data stores 110. The data integration system 114 may determine individual tokens generated from information stored by the molecular data store 108 corresponding to individual additional tokens generated from information stored by one or more additional data stores 110. By determining tokens generated using data stored by molecular data store 108 that correspond to additional tokens generated using data stored by additional data store 110, data integration system 114 can identify individuals having data stored in both molecular data store 108 and additional data store 110. In this manner, data integration system 114 may obtain data from multiple individuals from molecular data store 108 and data from the same multiple individuals from additional data store 110 and store the multiple individuals' molecular data and additional data in integrated data store 104. The molecular data and additional data for a plurality of individuals stored by the integrated data store 104 may be accessed using the respective identifiers of the individuals.

The data stored by the integrated data store 104 can be stored in accordance with one or more regulatory frameworks that protect privacy and ensure the security of individual medical records, health information, and insurance information. For example, the data may be stored by the integrated data store 104 in accordance with one or more government regulatory frameworks that aim to protect personal information, such as the Health Insurance Portability and Accountability Act (HIPAA) and/or the General Data Protection Regulation (GDPR). The integrated data store 104 also stores data in an anonymized and de-identified manner to ensure privacy of individuals having data stored by the integrated data store 104. To further ensure privacy of individuals having data stored by the integrated data store 104, the data integration system 114 may periodically regenerate the integrated data store 104. For example, the data integration system 114 may regenerate the integrated data store 104 once a quarter. In one or more additional examples, the data integration system 114 may regenerate the integrated data store 104 once a month, weekly, or bi-weekly. By periodically regenerating the integrated data store 104, rather than simply refreshing the integrated data store 104 when new data is available, the data integration system 114 enhances privacy protection with respect to the data stored by the integrated data store 104. That is, where a data store is simply refreshed with new data, it may be easier to track individuals associated with data newly added to the data store, as the number of new individuals added at a given time is typically less than the number of existing individuals already having data stored by the data store.

In various examples, the data stored by the integrated data store 104 may be accessed via a database management system. Further, the integrated data store 104 can store data in accordance with one or more database models. In one or more examples, the integrated data store 104 can store data in accordance with one or more relational database techniques. For example, the integrated data store 104 may store data according to a relational database model. In one or more additional examples, integrated data store 104 can store data according to an object-oriented database model. In one or more further examples, the integrated data store 104 can store data in accordance with an extensible markup language (XML) database model. In yet additional examples, integrated data store 104 may store data according to a Structured Query Language (SQL) database model. In still further examples, the integrated data store may store data according to an image database model.

The data integration system 114 may generate the integrated data store 104 by generating a plurality of data tables and creating links between the data tables. The links may indicate logical couplings between the data tables. The data integration system 114 may generate data tables by extracting specified data sets from information obtained from the data stores 106, 108, 110, 112 and storing the data in rows and columns of the corresponding data tables. In various examples, the logical coupling between the data tables may include at least one of a one-to-one link (where one row of information in one data table corresponds to one row of information in another data table), a one-to-many link (where one row of information in one data table corresponds to multiple rows of information in another data table), or a many-to-many link (where multiple rows of information in one data table corresponds to multiple rows of information in another data table).
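
A one-to-many link between two data tables can be sketched with an in-memory relational database. The table names `visit` and `service`, the column names, and the placeholder values are hypothetical; the description does not prescribe a particular database engine or schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        person_id  TEXT,
        visit_date TEXT
    );
    -- Each service row points at exactly one visit row (one-to-many link).
    CREATE TABLE service (
        service_id INTEGER PRIMARY KEY,
        visit_id   INTEGER REFERENCES visit(visit_id),
        code       TEXT
    );
    """
)
conn.execute("INSERT INTO visit VALUES (1, 'id-001', '2021-03-01')")
conn.executemany(
    "INSERT INTO service VALUES (?, ?, ?)",
    [(10, 1, "CODE-A"), (11, 1, "CODE-B")],
)

# Following the link: one visit row is retrieved together with its service rows.
rows = conn.execute(
    "SELECT v.visit_date, s.code FROM visit v "
    "JOIN service s ON s.visit_id = v.visit_id ORDER BY s.service_id"
).fetchall()
```

Here one row in `visit` corresponds to two rows in `service`; a one-to-one or many-to-many link would use a unique foreign key or a separate junction table, respectively.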

Multiple data tables may be arranged according to the data store schema 116. In the illustrative example of FIG. 1, the data store schema 116 includes a first data table 118, a second data table 120, a third data table 122, a fourth data table 124, and a fifth data table 126. Although the illustrative example of FIG. 1 includes five data tables, in additional implementations, the data store schema 116 may include more data tables or fewer data tables. The data store schema 116 can also include links between the data tables 118, 120, 122, 124, 126. Links between the data tables 118, 120, 122, 124, 126 may indicate that retrieving information from one of the data tables 118, 120, 122, 124, 126 results in additional information stored by one or more additional data tables 118, 120, 122, 124, 126 being retrieved. Furthermore, not all of the data tables 118, 120, 122, 124, 126 may be linked to each of the other data tables 118, 120, 122, 124, 126. In the illustrative example of FIG. 1, the first data table 118 is logically coupled to the second data table 120 via a first link 128 and the first data table 118 is logically coupled to the fourth data table 124 via a second link 130. Further, the second data table 120 is logically coupled to the third data table 122 via a third link 132 and the fourth data table 124 is logically coupled to the fifth data table 126 via a fourth link 134. Further, the third data table 122 is logically coupled to the fifth data table 126 via a fifth link 136.
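
The five links described for FIG. 1 can be modeled as a small graph to see which data tables are coupled directly and which are reachable only through intermediate tables. The sketch below uses the reference numerals as node labels and assumes, for illustration only, that each link can be followed in both directions; the function names are hypothetical.

```python
from collections import deque

# Link reference numeral -> the pair of data tables it couples (per FIG. 1).
LINKS = {
    128: (118, 120),
    130: (118, 124),
    132: (120, 122),
    134: (124, 126),
    136: (122, 126),
}


def neighbors(table):
    """Return the data tables directly coupled to the given table."""
    linked = set()
    for a, b in LINKS.values():
        if a == table:
            linked.add(b)
        elif b == table:
            linked.add(a)
    return linked


def reachable(start):
    """Breadth-first search over links to find every table reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current) - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen
```

In this model the first data table 118 is directly coupled only to tables 120 and 124, yet every table in the schema can be reached from it by following links.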

In various examples, links between data tables may be added to the data store schema 116 or removed from the data store schema 116 as data tables are added to and/or removed from the data store schema 116. In one or more illustrative examples, the integrated data store 104 may store data tables according to the data store schema 116 for at least a portion of the individuals for whom the data integration system 114 obtained information from a combination of at least two of the health insurance claim data store 106, the molecular data store 108, the one or more additional data stores 110, and the one or more reference information data stores 112. As a result, the integrated data store 104 may store respective instances of the data tables 118, 120, 122, 124, 126 according to the data store schema 116 for thousands, tens of thousands, up to hundreds of thousands, or more individuals.

The data integration and analysis system 102 may also include a data pipeline system 138. The data pipeline system 138 may include a number of algorithms, software code, scripts, macros, or other computer-executable instruction packages that process information stored by the integrated data store 104 to generate additional data sets. The additional data sets may include information obtained from one or more of the data tables 118, 120, 122, 124, 126. The additional data sets may also include information derived from data obtained from one or more of the data tables 118, 120, 122, 124, 126. The components of the data pipeline system 138 implemented to generate the first additional data set may be different from the components of the data pipeline system 138 used to generate the second additional data set.

In one or more examples, the data pipeline system 138 may generate a data set indicative of medication therapies received by a plurality of individuals. In one or more illustrative examples, the data pipeline system 138 may analyze the information stored in at least one of the data tables 118, 120, 122, 124, 126 to determine health insurance codes corresponding to the medication therapies received by the plurality of individuals. The data pipeline system 138 may analyze the health insurance codes corresponding to the medication therapies with respect to a library indicating specified medication therapies corresponding to one or more health insurance codes to determine the names of the medication therapies that the individuals have received. In one or more additional examples, the data pipeline system 138 may analyze the information stored by the integrated data store 104 to determine medical procedures received by a plurality of individuals. To illustrate, the data pipeline system 138 may analyze the information stored by one of the data tables 118, 120, 122, 124, 126 to determine the treatments that individuals received via at least one of injection or intravenous administration. In one or more further examples, the data pipeline system 138 may analyze the information stored by the integrated data store 104 to determine an individual's episodes of care, a line of therapy received by the individual, a progression of a biological condition, or a time to next treatment. In various examples, the data sets generated by the data pipeline system 138 may be different for different biological conditions. For example, the data pipeline system 138 may generate a first number of data sets for a first type of cancer (e.g., lung cancer) and a second number of data sets for a second type of cancer (e.g., colorectal cancer).
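
The lookup of health insurance codes against a library of medication therapies can be sketched as a simple mapping. The codes (`X0001`, `X0002`, `Z9999`), therapy names, identifiers, and the function name `therapies_by_individual` are placeholders invented for the example and do not correspond to any real code set.

```python
# Hypothetical library: placeholder codes mapped to placeholder therapy names.
CODE_LIBRARY = {"X0001": "therapy A", "X0002": "therapy B"}


def therapies_by_individual(claim_rows, library=CODE_LIBRARY):
    """Map each individual to the named therapies found in their claim codes."""
    result = {}
    for person_id, code in claim_rows:
        name = library.get(code)
        if name is not None:  # codes absent from the library are skipped
            result.setdefault(person_id, set()).add(name)
    return {pid: sorted(names) for pid, names in result.items()}


claims = [
    ("id-001", "X0001"),
    ("id-001", "X0002"),
    ("id-002", "X0001"),
    ("id-002", "Z9999"),  # not a medication therapy code in this library
]
received = therapies_by_individual(claims)
```

The output of such a step is one of the additional data sets: a per-individual list of therapy names derived from, rather than copied out of, the underlying data tables.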

The data pipeline system 138 may also determine one or more confidence levels to assign to information associated with individuals having data stored by the integrated data store 104. The respective confidence levels may correspond to different accuracy metrics of the information associated with individuals having data stored by the integrated data store 104. The information associated with the respective confidence levels may correspond to one or more characteristics of the individuals derived from the data stored by the integrated data store 104. The data pipeline system 138 may generate confidence level values for one or more characteristics in connection with generating one or more data sets from the integrated data store 104. In one or more examples, a first confidence level may correspond to a first range of accuracy metrics, a second confidence level may correspond to a second range of accuracy metrics, and a third confidence level may correspond to a third range of accuracy metrics. In one or more additional examples, the second range of accuracy metrics may include smaller values than the first range of accuracy metrics, and the third range of accuracy metrics may include smaller values than the second range of accuracy metrics. In one or more illustrative examples, the information corresponding to the first confidence level may be referred to as gold standard information, the information corresponding to the second confidence level may be referred to as silver standard information, and the information corresponding to the third confidence level may be referred to as copper standard information.

The data pipeline system 138 may determine the value of the confidence level of a characteristic of an individual based on a number of factors. For example, a corresponding set of information may be used to determine the characteristic of the individual. The data pipeline system 138 may determine the confidence level of the characteristic based on the completeness of the corresponding set of information used to determine the characteristic. In the event that one or more pieces of information are absent from the set of information associated with a first number of individuals, the confidence level of the characteristic for the first number of individuals may be lower than the confidence level for a second number of individuals for whom no information is absent from the set of information. In one or more examples, the data pipeline system 138 may use the amount of missing information to determine the confidence level of the characteristic of the individual. To illustrate, a greater amount of missing information among the information used to determine a characteristic may result in a lower confidence level for the characteristic than a smaller amount of missing information. Further, different types of information may correspond to different confidence levels of the characteristic. In one or more examples, the presence of a first piece of information used to determine the characteristic may result in a higher confidence level for the characteristic than the presence of a second piece of information used to determine the characteristic.
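
A completeness-based confidence level can be sketched as follows. The field names, the cutoffs of 0.9 and 0.6, and the function name `confidence_level` are assumptions chosen for illustration; the description does not state numeric ranges for the gold, silver, and copper levels.

```python
def confidence_level(record, required_fields, gold=0.9, silver=0.6):
    """Assign a confidence level from the share of required fields present.

    The cutoffs are hypothetical; less missing information gives a higher level.
    """
    present = sum(1 for f in required_fields if record.get(f) not in (None, ""))
    completeness = present / len(required_fields)
    if completeness >= gold:
        return "gold"
    if completeness >= silver:
        return "silver"
    return "copper"


FIELDS = ["diagnosis_code", "diagnosis_date", "treatment_code", "treatment_date"]

complete = {f: "x" for f in FIELDS}
one_missing = dict(complete, treatment_date=None)
mostly_missing = {"diagnosis_code": "x"}
```

Weighting some fields more heavily than others would be one way to reflect the statement that different types of information can contribute differently to the confidence level.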

In one or more illustrative examples, the data pipeline system 138 may determine a plurality of individuals included in a group with a preliminary diagnosis of lung cancer (or another biological condition). The data pipeline system 138 may determine a confidence level for classifying the respective individuals as having a preliminary diagnosis of lung cancer. The data pipeline system 138 may use information from a plurality of columns included in the data tables 118, 120, 122, 124, 126 to determine a confidence level that an individual is included within the lung cancer group. The plurality of columns may include health insurance codes associated with diagnosis of the biological condition and/or treatment of the biological condition. Furthermore, the plurality of columns may correspond to dates of diagnosis and/or treatment of the biological condition. The data pipeline system 138 may determine that, where information for each of the plurality of columns, or for at least a threshold number of columns, is available, the confidence level that an individual is characterized as part of the lung cancer group is higher than if information for fewer than the threshold number of columns is available. Further, the data pipeline system 138 may determine a confidence level for individuals included in the lung cancer group based on the type of information and the availability of information associated with one or more columns. To illustrate, in the event that one or more diagnosis codes are present and one or more treatment codes are not present for one or more time periods for a group of individuals, the data pipeline system 138 may determine that the confidence level of including the group of individuals in the lung cancer group is greater than the confidence level in the event that at least one of the diagnosis codes used to determine whether the individuals are included in the lung cancer group is not present.

The data integration and analysis system 102 may include a data analysis system 140. The data analysis system 140 may receive integrated data store requests 142 from one or more computing devices (e.g., the example computing device 144). One or more integrated data store requests 142 may result in retrieval of data from the integrated data store 104. In various examples, one or more integrated data store requests 142 may result in retrieving data from one or more data sets generated by the data pipeline system 138. An integrated data store request 142 may specify data to be retrieved from the integrated data store 104 and/or one or more data sets generated by the data pipeline system 138. In one or more additional examples, the integrated data store request 142 can include one or more pre-built queries corresponding to computer-executable instructions for retrieving a specified data set from the integrated data store 104 and/or one or more data sets generated by the data pipeline system 138.

In response to the one or more integrated data store requests 142, the data analysis system 140 can analyze data retrieved from at least one of the integrated data store 104 or one or more data sets generated by the data pipeline system 138 to generate data analysis results 146. The data analysis results 146 may be transmitted to one or more computing devices, such as example computing device 148. Although the illustrative example of fig. 1 shows one or more integrated data store requests 142 from one computing device 144 and data analysis results 146 being sent to another computing device 148, in one or more additional implementations, data analysis results 146 may be received by the same computing device that sent one or more integrated data store requests 142. The data analysis results 146 may be displayed through one or more user interfaces presented by the computing device 144 or the computing device 148.

In one or more examples, the data analysis system 140 can implement at least one of one or more machine learning techniques or one or more statistical techniques to analyze data retrieved in response to the one or more integrated data store requests 142. In one or more examples, the data analysis system 140 can implement one or more artificial neural networks to analyze data retrieved in response to one or more integrated data store requests 142. To illustrate, the data analysis system 140 may implement at least one of one or more convolutional neural networks or one or more residual neural networks to analyze data retrieved from the integrated data store 104 in response to one or more integrated data store requests 142. In at least some examples, the data analysis system 140 can implement one or more random forest techniques, one or more support vector machines, or one or more hidden Markov models to analyze data retrieved in response to one or more integrated data store requests 142. One or more statistical models may also be implemented to analyze data retrieved in response to one or more integrated data store requests 142 to identify at least one of a correlation or a significance measure between characteristics of individuals. For example, a log-rank test may be applied to data retrieved in response to one or more integrated data store requests 142. Further, a Cox proportional hazards model can be implemented with respect to data retrieved in response to one or more integrated data store requests 142. Further, a Wilcoxon signed-rank test may be applied to data retrieved in response to one or more integrated data store requests 142. In other examples, z-score analysis may be performed with respect to data retrieved in response to one or more integrated data store requests 142. In further examples, Kaplan-Meier analysis may be performed with respect to data retrieved in response to one or more integrated data store requests 142. In at least some examples, one or more machine learning techniques may be implemented in conjunction with one or more statistical techniques to analyze data retrieved in response to one or more integrated data store requests 142.
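
Of the statistical techniques listed above, Kaplan-Meier analysis is compact enough to sketch directly. The example below is a plain implementation of the standard product-limit estimator; the function name `kaplan_meier` and the durations and event flags are made-up illustration values, not data from the integrated data store.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per individual.
    events: 1 if the event (e.g., death) was observed, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        removed = 0
        # Group all individuals sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Both observed events and censored individuals leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event flags.
curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
```

Censored individuals lower the number at risk without producing a step in the curve, which is why the estimate can be computed even when follow-up is incomplete for some individuals.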

In one or more illustrative examples, the data analysis system 140 may determine a survival rate of individuals with lung cancer in response to one or more treatments. In one or more additional illustrative examples, the data analysis system 140 can determine, in response to one or more treatments, the survival rate of individuals in which lung cancer is present and who have one or more mutations of genomic regions. In various examples, the data analysis system 140 can generate the data analysis results 146 if data retrieved from at least one of the integrated data store 104 or one or more data sets generated by the data pipeline system 138 meets one or more criteria. For example, the data analysis system 140 can determine whether at least a portion of the data retrieved in response to the one or more integrated data store requests 142 meets a threshold confidence level. In the event that the confidence level of at least a portion of the data retrieved in response to the one or more integrated data store requests 142 is less than the threshold confidence level, the data analysis system 140 can refrain from generating at least a portion of the data analysis results 146. In the event that the confidence level of at least a portion of the data retrieved in response to the one or more integrated data store requests 142 is at least the threshold confidence level, the data analysis system 140 can generate at least a portion of the data analysis results 146. In various examples, the threshold confidence level may be related to the type of data analysis results 146 generated by the data analysis system 140.

In one or more illustrative examples, data analysis system 140 may receive integrated data store request 142 to generate data analysis results 146 indicative of the survival of one or more individuals. In these cases, the data analysis system 140 may determine whether the data stored by the integrated data store 104 and/or one or more data sets generated by the data pipeline system 138 meets a threshold confidence level, such as a gold standard confidence level. In one or more additional examples, the data analysis system 140 can receive the integrated data store request 142 to generate data analysis results 146 indicative of the treatment received by the one or more individuals. In these implementations, the data analysis system 140 can determine whether the data stored by the integrated data store 104 and/or one or more data sets generated by the data pipeline system 138 meets a lower threshold confidence level, such as a copper standard confidence level.
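
The dependence of the threshold confidence level on the type of analysis can be sketched as a lookup followed by a filter. The numeric ranking, the analysis type names, the record layout, and the function name `rows_for_analysis` are assumptions for illustration.

```python
# Hypothetical ordering of confidence levels, lowest to highest.
RANK = {"copper": 1, "silver": 2, "gold": 3}

# Hypothetical threshold per analysis type, following the examples above:
# survival analysis requires gold, treatments received accepts copper.
REQUIRED_LEVEL = {"survival": "gold", "treatments_received": "copper"}


def rows_for_analysis(rows, analysis_type):
    """Return rows meeting the threshold, or None to signal that no
    analysis result should be generated for this request."""
    needed = RANK[REQUIRED_LEVEL[analysis_type]]
    eligible = [r for r in rows if RANK[r["confidence"]] >= needed]
    return eligible or None


rows = [
    {"id": "id-001", "confidence": "silver"},
    {"id": "id-002", "confidence": "copper"},
]
```

With these inputs a survival request yields no eligible rows, so the system would refrain from producing that result, while a request for treatments received can use both rows.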

In one or more additional illustrative examples, the data analysis system 140 can receive an integrated data store request 142 to identify individuals having one or more genomic mutations and having received one or more treatments for a biological condition. Continuing with this example, the data analysis system 140 can determine a survival rate of the individuals having the one or more genomic mutations with respect to the one or more treatments that the individuals received. The data analysis system 140 can then identify, based on the survival rate, the effectiveness of a treatment for individuals in relation to genomic mutations that may be present in those individuals. In this way, the health outcome of an individual may be improved by identifying, for a population of individuals having one or more genomic mutations, a prospective treatment that is more effective than the current treatment provided to the individual.

FIG. 2 illustrates an example framework 200 corresponding to an arrangement of data tables in an integrated data store in accordance with one or more implementations. In the illustrative example of FIG. 2, the framework 200 includes a data store schema 202, the data store schema 202 including a first data table 204, a second data table 206, a third data table 208, a fourth data table 210, a fifth data table 212, a sixth data table 214, and a seventh data table 216. Although the illustrative example of FIG. 2 includes seven data tables, in additional implementations, the data store schema 202 may include more data tables or fewer data tables. The data store schema 202 can also include links between the data tables 204, 206, 208, 210, 212, 214, 216. The links between the data tables 204, 206, 208, 210, 212, 214, 216 may indicate that retrieving information from one of the data tables 204, 206, 208, 210, 212, 214, 216 results in additional information stored by one or more additional data tables 204, 206, 208, 210, 212, 214, 216 being retrieved. Moreover, not all of the data tables 204, 206, 208, 210, 212, 214, 216 may be linked to each of the other data tables 204, 206, 208, 210, 212, 214, 216. In the illustrative example of FIG. 2, the first data table 204 is logically coupled to the second data table 206 via a first link 218 and the third data table 208 is logically coupled to the second data table 206 via a second link 220. The second data table 206 is also logically coupled to the fourth data table 210 via a third link 222, the second data table 206 is logically coupled to the fifth data table 212 via a fourth link 224, and the second data table 206 is logically coupled to the sixth data table 214 via a fifth link 226. Further, the fifth data table 212 is logically coupled to the sixth data table 214 via a sixth link 228 and the sixth data table 214 is logically coupled to the seventh data table 216 via a seventh link 230. In addition, the seventh data table 216 is logically coupled to the fourth data table 210 via an eighth link 232. In various examples, links between data tables may be added to the data store schema 202 or removed from the data store schema 202 as data tables are added to and/or removed from the data store schema 202. In one or more illustrative examples, the integrated data store 104 may store data tables according to the data store schema 202 for at least a portion of the individuals for whom the data integration system 114 obtained information from a combination of at least two of the health insurance claim data store 106, the molecular data store 108, and the one or more additional data stores 110. As a result, the integrated data store 104 may store respective instances of the data tables 204, 206, 208, 210, 212, 214, 216 according to the data store schema 202 for thousands, tens of thousands, up to hundreds of thousands, or more individuals.

In one or more examples, the first data table 204 may store data corresponding to genomics and genomic testing of individuals. For example, the first data table 204 may include columns including information corresponding to panels used to generate genomic data, mutations of genomic regions, mutation types, copy numbers of genomic regions, coverage data indicating the number of nucleic acid molecules identified in a sample having one or more mutations, date of detection, and patient information. The first data table 204 may also include one or more columns that include health insurance data codes that may correspond to one or more diagnostic codes. Further, the information in the first data table 204 may include at least one identifier of an individual associated with the instance of the first data table 204.

The second data table 206 may store data related to one or more patient visits of an individual to one or more healthcare providers. The third data table 208 may store information corresponding to respective services provided to the individual during the one or more patient visits indicated by the second data table 206. To illustrate, an individual may visit a healthcare provider, and multiple services may be performed for the individual during the visit. The second data table 206 may include columns of information indicating each of the services performed during a patient visit. A plurality of third data tables 208 relating to the patient visit may be generated, each including columns of information that describe a corresponding service at a finer level of granularity than the information about the patient visit stored by the second data table 206. For example, the second data table 206 may include multiple columns of health insurance codes indicating different services provided to the individual during a patient visit, and the third data table 208 associated with one of the services may include multiple columns of additional health insurance codes corresponding to additional information associated with the respective service. The second data table 206 and the third data table 208 for patient visits may indicate one or more service dates corresponding to the patient visits.

The fourth data table 210 may include columns that indicate information about individuals for which the integrated data store 104 stores information. For example, the fourth data table 210 may include a column indicating information related to at least one of a location of the individual, a gender of the individual, a birth date of the individual, a death date of the individual (if applicable), or one or more keys associated with the individual. In one or more examples, fourth data table 210 may include one or more columns related to whether erroneous data has been identified for an individual. In various examples, a single fourth data table 210 may be generated for a respective individual. Thus, the data store schema 202 may include multiple instances of the fourth data table 210, e.g., thousands, tens of thousands, up to hundreds of thousands, or more.

The fifth data table 212 may include a column indicating information related to a health insurance company or government entity that pays for one or more services provided to the respective individual. For example, the fifth data table 212 may include one or more payer identifiers. The sixth data table 214 may include columns containing health insurance coverage information of the respective individual. In one or more examples, the sixth data table 214 may include columns that indicate whether the individual has medical coverage, whether the individual has pharmacy coverage, and the type of health insurance plan associated with the individual (e.g., Health Maintenance Organization (HMO), Preferred Provider Organization (PPO), etc.).

The seventh data table 216 may include columns indicating information related to medication therapies obtained by the respective individuals. In one or more examples, the seventh data table 216 may include one or more columns indicating health insurance codes corresponding to medication therapies available through a pharmacy. A health insurance code may correspond to an individual medication therapy. Furthermore, the health insurance code may indicate a diagnosis regarding a biological condition of the individual. The seventh data table 216 may also include additional information such as at least one of a dose, a number of days of supply, an amount dispensed, a number of authorized refills, a date of service, or information related to the individual receiving the medication.

In various examples, the data store schema 202 may provide analysis results of information stored by the data tables 204, 206, 208, 210, 212, 214, 216 in a more efficient manner than typical data store schemas. For example, the logical connections between the data tables 204, 206, 208, 210, 212, 214, 216 are arranged so that related data can be retrieved efficiently across the different data tables. Where the data tables 204, 206, 208, 210, 212, 214, 216 are instead arranged in a serial fashion, and/or where a greater number of the data tables are logically connected, retrieving data from one or more of the data tables in response to a request for information from the integrated data store 104 will be less efficient than if the data store schema 202 were implemented.

FIG. 3 illustrates an architecture 300 for generating one or more data sets from information retrieved from a data store that integrates health-related data from multiple sources, according to one or more implementations. Architecture 300 may include data integration and analysis system 102 and integrated data store 104. Further, the data integration and analysis system 102 may include at least a data pipeline system 138 and a data analysis system 140. The data pipeline system 138 may include a plurality of sets of data processing instructions executable to generate corresponding data sets that the data analysis system 140 may analyze in response to the integrated data store request 142 to generate data analysis results 146.

The data pipeline system 138 may include first data processing instructions 302, second data processing instructions 304, and up to Nth data processing instructions 306. The data processing instructions 302, 304, 306 may be executed by one or more processing units to perform a number of operations to generate corresponding data sets using information obtained from the integrated data store 104. In one or more illustrative examples, the data processing instructions 302, 304, 306 may include at least one of software code, scripts, API calls, macros, and the like. The first data processing instructions 302 may be executed to generate a first data set 308. Further, the second data processing instructions 304 may be executed to generate a second data set 310. In addition, the Nth data processing instructions 306 may be executed to generate an Nth data set 312. In various examples, after the data integration and analysis system 102 generates the integrated data store 104, the data pipeline system 138 may cause the data processing instructions 302, 304, 306 to be executed to generate the data sets 308, 310, 312. In one or more examples, the data sets 308, 310, 312 may be stored by the integrated data store 104 or by additional data stores accessible by the data integration and analysis system 102. At least a portion of the data processing instructions 302, 304, 306 may analyze health insurance codes to generate at least a portion of the data sets 308, 310, 312. Further, at least a portion of the data processing instructions 302, 304, 306 may analyze genomic data to generate at least a portion of the data sets 308, 310, 312.

In one or more examples, the first data processing instructions 302 may be executed to retrieve data from one or more first data tables stored by the integrated data store 104. The first data processing instructions 302 may also be executed to retrieve data from one or more specified columns of one or more first data tables. In various examples, the first data processing instructions 302 may be executed to identify an individual having a health insurance code stored in one or more column and row combinations corresponding to one or more diagnostic codes. The first data processing instructions 302 may then be executed to analyze one or more diagnostic codes to determine a biological condition of the individual that has been diagnosed. In one or more illustrative examples, first data processing instructions 302 may be executed to analyze one or more diagnostic codes with respect to a diagnostic code library that indicates one or more biological conditions (corresponding to respective diagnostic codes). The diagnostic code library may include hundreds to thousands of diagnostic codes. The first data processing instructions 302 may also be executed to determine an individual diagnosed with a biological condition by analyzing timing information of the individual (e.g., date of treatment, date of diagnosis, date of death, one or more combinations thereof, etc.).
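
A minimal sketch of how the first data processing instructions 302 might derive a first data set from diagnostic codes is shown here. The row layout, the two-entry `DIAGNOSTIC_CODE_LIBRARY`, and the rule of keeping the earliest service date per individual are assumptions made for illustration; the description does not specify them, and a real diagnostic code library may hold hundreds to thousands of codes.

```python
from datetime import date

# Hypothetical two-entry diagnostic code library: code -> biological condition.
# The ICD-10-CM style codes are used here only as illustrative keys.
DIAGNOSTIC_CODE_LIBRARY = {
    "C34.90": "lung cancer",
    "C18.9": "colon cancer",
}


def build_cohort(claim_rows, condition):
    """Return {individual_id: earliest service date} for one condition.

    Each row is a dict with the assumed keys 'individual_id',
    'diagnosis_code', and 'service_date'.
    """
    cohort = {}
    for row in claim_rows:
        if DIAGNOSTIC_CODE_LIBRARY.get(row["diagnosis_code"]) != condition:
            continue  # code is unknown or maps to a different condition
        individual = row["individual_id"]
        service_date = row["service_date"]
        if individual not in cohort or service_date < cohort[individual]:
            cohort[individual] = service_date
    return cohort


rows = [
    {"individual_id": "id-1", "diagnosis_code": "C34.90", "service_date": date(2021, 3, 1)},
    {"individual_id": "id-1", "diagnosis_code": "C34.90", "service_date": date(2021, 1, 15)},
    {"individual_id": "id-2", "diagnosis_code": "C18.9", "service_date": date(2021, 2, 1)},
]
first_data_set = build_cohort(rows, "lung cancer")
```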

The second data processing instructions 304 may be executed to retrieve data from one or more second data tables stored by the integrated data store 104. The second data processing instructions 304 may also be executed to retrieve data from one or more designated columns of the one or more second data tables. In various examples, the second data processing instructions 304 may be executed to identify individuals having health insurance codes stored in one or more column and row combinations corresponding to one or more treatment codes. The one or more treatment codes may correspond to a treatment obtained from a pharmacy. In one or more additional examples, the one or more treatment codes may correspond to a treatment received by a medical procedure (e.g., injection or intravenous). The second data processing instructions 304 may be executable to determine one or more treatments corresponding to respective health insurance codes included in the one or more second data tables by analyzing the health insurance codes in relation to a predetermined set of information. The predetermined set of information may include a database indicating the one or more treatments that correspond to each of hundreds to thousands of health insurance codes. The second data processing instructions 304 may generate a second data set 310 to indicate the respective treatments received by a group of individuals. In one or more illustrative examples, the group of individuals may correspond to individuals included in the first data set 308. The second data set 310 may be arranged in rows and columns, wherein one or more rows correspond to a single individual and one or more columns indicate the treatment received by the respective individual.
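
The row-and-column arrangement of the second data set 310 could look like the following sketch, which uses one row per individual and one boolean column per treatment. The code values in `TREATMENT_CODE_MAP`, the treatment names, and the row keys are placeholders invented for illustration and are not taken from the description.

```python
# Hypothetical predetermined mapping: health insurance code -> treatment name.
TREATMENT_CODE_MAP = {
    "RX-001": "treatment_a",    # e.g., a therapy dispensed by a pharmacy
    "PROC-010": "treatment_b",  # e.g., a therapy given by injection or infusion
}


def build_treatment_data_set(claim_rows, cohort_ids):
    """Return {individual_id: {treatment_name: bool}} for the given cohort.

    Each row is a dict with the assumed keys 'individual_id' and 'code'.
    Individuals outside `cohort_ids` and unmapped codes are ignored.
    """
    treatments = sorted(set(TREATMENT_CODE_MAP.values()))
    table = {pid: {name: False for name in treatments} for pid in cohort_ids}
    for row in claim_rows:
        treatment = TREATMENT_CODE_MAP.get(row["code"])
        if treatment is not None and row["individual_id"] in table:
            table[row["individual_id"]][treatment] = True
    return table


claims = [
    {"individual_id": "id-1", "code": "RX-001"},
    {"individual_id": "id-2", "code": "PROC-010"},  # not in the cohort below
    {"individual_id": "id-1", "code": "UNKNOWN"},
]
second_data_set = build_treatment_data_set(claims, cohort_ids=["id-1", "id-3"])
```

Restricting the rows to `cohort_ids` mirrors the example in which the group of individuals corresponds to the individuals included in the first data set 308.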

The Nth data processing instructions 306 (where N may be any positive integer) may be executed to generate the Nth data set 312 by combining information from a plurality of previously generated data sets (e.g., the first data set 308 and the second data set 310). Further, the Nth data processing instructions 306 may be executed to retrieve additional information from one or more additional columns of the integrated data store 104 and to combine the additional information with information obtained from the first data set 308 and the second data set 310 to generate the Nth data set 312. For example, the Nth data processing instructions 306 may be executed to identify an individual included in the first data set 308 diagnosed with a biological condition and to analyze a designated column of one or more additional data tables of the integrated data store 104 to determine a date of a treatment indicated in the second data set 310 for that individual. In one or more further examples, the Nth data processing instructions 306 may be executed to analyze columns of one or more additional data tables of the integrated data store 104 to determine a dose of a treatment, indicated in the second data set 310, that was received by the individual included in the first data set 308. In this manner, the Nth data processing instructions 306 may be executed to generate a care event data set based on information included in the group data set and the treatment data set.
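
One way to picture the combination step is the sketch below, which joins a group data set and a treatment data set with additional rows that carry service dates and doses. The input shapes, the key names, and the function `build_care_events` are assumptions chosen for illustration only.

```python
from datetime import date


def build_care_events(cohort, treatments_by_individual, detail_rows):
    """Combine previously generated data sets into care event records.

    cohort: {individual_id: diagnosis_date} (assumed shape of a group data set)
    treatments_by_individual: {individual_id: set of treatment names}
        (assumed shape of a treatment data set)
    detail_rows: dicts with the assumed keys 'individual_id', 'treatment',
        'service_date', and 'dose', read from additional columns.
    """
    events = []
    for row in detail_rows:
        individual = row["individual_id"]
        if individual not in cohort:
            continue  # not diagnosed with the condition of interest
        if row["treatment"] not in treatments_by_individual.get(individual, set()):
            continue  # treatment is not recorded for this individual
        events.append({
            "individual_id": individual,
            "diagnosis_date": cohort[individual],
            "treatment": row["treatment"],
            "service_date": row["service_date"],
            "dose": row["dose"],
        })
    # Order events per individual by service date.
    return sorted(events, key=lambda e: (e["individual_id"], e["service_date"]))


cohort = {"id-1": date(2021, 1, 15)}
treatments = {"id-1": {"treatment_a"}}
details = [
    {"individual_id": "id-1", "treatment": "treatment_a", "service_date": date(2021, 3, 1), "dose": "10 mg"},
    {"individual_id": "id-1", "treatment": "treatment_a", "service_date": date(2021, 2, 1), "dose": "10 mg"},
    {"individual_id": "id-9", "treatment": "treatment_a", "service_date": date(2021, 2, 1), "dose": "5 mg"},
]
care_events = build_care_events(cohort, treatments, details)
```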

In one or more illustrative examples, in response to receiving integrated data store request 142, data analysis system 140 can determine one or more data sets corresponding to characteristics of a query regarding integrated data store request 142. For example, the data analysis system 140 may determine that the information included in the first data set 308 and the second data set 310 is suitable for responding to the integrated data store request 142. In these scenarios, the data analysis system 140 may analyze at least a portion of the data included in the first data set 308 and the second data set 310 to generate the data analysis results 146. In one or more additional examples, the data analysis system 140 can determine different data sets to respond to different queries included in the integrated data store request 142 to generate the data analysis results 146.

The use of specific sets of data processing instructions to generate corresponding data sets may reduce the number of inputs from users of the data integration and analysis system 102, as well as reduce the computational load, such as the amount of processing resources and memory, for processing the integrated data store requests 142. For example, without the particular architecture of the data pipeline system 138, data for responding to an integrated data store request 142 would be aggregated from the integrated data store 104 each time an integrated data store request 142 is received. Instead, by implementing the data pipeline system 138 to execute the data processing instructions 302, 304, 306 to generate the data sets 308, 310, 312, the data required to respond to the various integrated data store requests 142 has already been aggregated and is accessible by the data analysis system 140. Thus, the computing resources used to respond to the integrated data store requests 142 by generating the data sets 308, 310, 312 with the data pipeline system 138 are less than those used by typical systems that perform information parsing and collecting processes for each integrated data store request 142. Furthermore, in situations where the data pipeline system 138 has not been implemented, a user of the data integration and analysis system 102 may need to submit multiple integrated data store requests 142 in order to analyze the information that the user intends to analyze, either because the data collected in response to an integrated data store request 142 in a typical system is inaccurate, or because the data analysis system 140 must be invoked multiple times in a typical system to perform an analysis that can be performed using a single integrated data store request 142 when the data pipeline system 138 is implemented.

FIG. 4 illustrates an architecture 400 that generates an integrated data store that includes de-identified health insurance claim data and de-identified genomic data in accordance with one or more implementations. Architecture 400 may include data integration and analysis system 102, health insurance claim data store 106, and molecular data store 108. The data integration and analysis system 102 can obtain patient information 402 from the molecular data store 108. Patient information 402 may include genomic data 404 of an individual having data stored by molecular data store 108. The genomic data 404 may indicate the results of one or more nucleic acid sequencing operations that analyze the sequence of nucleic acid molecules included in a sample obtained from an individual with respect to one or more target genomic regions. In one or more examples, the sample may be obtained from tissue of one or more individuals. In one or more additional examples, the sample may be obtained from a fluid (e.g., blood or plasma) of one or more individuals. The one or more target genomic regions may correspond to genomic regions corresponding to the presence of one or more biological conditions. For example, the target region may correspond to a genomic region of a reference genome having a mutation that is present in an individual in the presence of a biological condition. In one or more illustrative examples, the target region may correspond to a genomic region of a reference human genome in which one or more mutations are present in an individual in the presence of one or more forms of cancer. Patient information 402 may also include information indicative of personal information about individuals having data stored by molecular data store 108, as well as information corresponding to tests and analyses performed on samples provided by individuals.

The data integration and analysis system 102 can perform a de-identification process 406 that anonymizes personal information obtained from the molecular data store 108. The data integration and analysis system 102 can implement one or more computing techniques as part of the de-identification process 406 to anonymize data related to the individuals stored by the molecular data store 108 such that the de-identified data protects the privacy of the individuals and conforms to one or more privacy regulatory frameworks. The de-identification process 406 may include accessing a token at 408. In various examples, the token may include an alphanumeric string of characters. In one or more examples, the token may be generated by the data integration and analysis system 102. In one or more additional examples, the token may be generated by a third party and obtained by the data integration and analysis system 102.

The token may be generated using one or more hash functions applied to a subset 410 of the patient information 402. To illustrate, for an individual having information stored by the molecular data store 108, a token may be generated using a combination of at least a portion of the first name of the respective individual, at least a portion of the last name of the respective individual, at least a portion of the birth date of the respective individual, the gender of the respective individual, and at least a portion of the location identifier of the respective individual. The de-identification process 406 may also include, at 412, generating an identifier for each individual having data stored by the molecular data store 108. The identifier may be generated by the data integration and analysis system 102 using one or more hash functions that are different from the one or more hash functions used to generate the token. In one or more illustrative examples, the data integration and analysis system 102 can generate intermediate versions of respective identifiers using one or more hash functions and then apply one or more salting techniques to the intermediate versions of the identifiers to generate final versions of the identifiers. A salt function includes a function configured to add at least one random bit to each intermediate identifier to generate a corresponding final identifier. In various examples, the data integration and analysis system 102 can generate an identifier at 412 using at least a portion of the information of the respective individual stored by the molecular data store 108. In one or more illustrative examples, the identifier may be generated based on the patient identifier included in the patient information 402. The identifiers generated by the data integration and analysis system 102 may be unique to the respective individuals having data stored by the molecular data store 108.
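
The token and identifier generation described above might be sketched as follows. The description does not name specific hash functions, so SHA-256 for the token and BLAKE2b for the intermediate identifier are assumptions, as are the three-character prefixes, the `|` separator, and the function names. The sketch is not a vetted de-identification scheme.

```python
import hashlib
import secrets


def _norm(value, length=None):
    """Trim, lowercase, and optionally keep only a leading portion."""
    value = value.strip().lower()
    return value if length is None else value[:length]


def make_token(first_name, last_name, birth_date, gender, location_id):
    """Deterministic token from a subset of patient information (assumed SHA-256)."""
    parts = [
        _norm(first_name, 3),   # at least a portion of the first name
        _norm(last_name, 3),    # at least a portion of the last name
        _norm(birth_date),
        _norm(gender),
        _norm(location_id, 3),  # at least a portion of the location identifier
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def make_identifier(patient_id, random_bytes=4):
    """Hash the patient identifier with a different hash function, then salt it.

    The intermediate version uses BLAKE2b (an assumption); the salt step
    appends random bits to the intermediate version to form the final version.
    """
    intermediate = hashlib.blake2b(patient_id.encode("utf-8"), digest_size=16).hexdigest()
    return intermediate + secrets.token_hex(random_bytes)


token = make_token("Ada ", "LOVELACE", "1815-12-10", "F", "90210")
identifier = make_identifier("patient-0001")
```

Because the token depends only on normalized patient attributes, two systems that apply the same function to the same attributes produce the same token, whereas the salted identifier cannot be reproduced by another party from the patient identifier alone.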

At operation 414, the data integration and analysis system 102 can generate modified patient information 416 based on the identifier. The modified patient information 416 may include genomic data 404 about the individual associated with the molecular data store 108 and an identifier of the corresponding individual. The modified patient information 416 may have a data structure 418. The data structure 418 may include columns containing respective identifiers of individuals associated with the molecular data store 108 as well as columns containing genomic data 404 related to those individuals, such as identifiers of one or more genes, changes in one or more genes, types of changes in genes, and so forth.

The data integration and analysis system 102 can generate a token file 420. The token file 420 may include a first token 422 that is accessed at operation 408 for a respective individual having data stored by the molecular data store 108. The token file 420 may have a data structure 424, the data structure 424 comprising a plurality of columns including information of respective individuals. The data structure 424 may include columns indicating the respective identifiers generated by the data integration and analysis system 102, and columns indicating one or more first tokens 422 associated with the respective identifiers. The data integration and analysis system 102 can send the token file 420 to a health insurance claim data management system 426 coupled with the health insurance claim data store 106. The health insurance claim data management system 426 can analyze the first token 422 relative to the corresponding second token 428. The second token 428 can be accessed or generated by the health insurance claim data management system 426. The second token 428 may be generated using a subset of information that is the same as or similar to the subset 410 of patient information 402 for individuals having data stored in the health insurance claim data store 106. For example, the second token 428 may be generated using a combination of at least a portion of the first name of the respective individual, at least a portion of the last name of the respective individual, at least a portion of the birth date of the respective individual, the gender of the individual, and at least a portion of the location identifier of the respective individual.

In various examples, the health insurance claim data management system 426 can retrieve health insurance claim data from the health insurance claim data store 106 for individuals associated with the respective second tokens 428 that match the respective first tokens 422. The first token 422 may match the second token 428 when the data of the first token 422 has at least a threshold amount of similarity with respect to the data of the second token 428. In one or more examples, the first token 422 may match the second token 428 when the data of the first token 422 is the same as the data of the second token 428.
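
The description leaves the similarity measure unspecified. A minimal sketch, assuming each individual carries a set of tokens and that similarity is measured as the Jaccard overlap of the two sets, is given below; setting the threshold to 1.0 corresponds to requiring identical token data. The function names are hypothetical.

```python
def token_similarity(first_tokens, second_tokens):
    """Jaccard similarity between two collections of tokens (assumed measure)."""
    first, second = set(first_tokens), set(second_tokens)
    if not first and not second:
        return 0.0  # no tokens on either side: treat as no evidence of a match
    return len(first & second) / len(first | second)


def tokens_match(first_tokens, second_tokens, threshold=1.0):
    """True when the similarity meets or exceeds the threshold."""
    return token_similarity(first_tokens, second_tokens) >= threshold


# Identical token data always matches, whatever the threshold.
exact = tokens_match({"t1", "t2"}, {"t1", "t2"})
# One shared token out of three distinct tokens matches only a loose threshold.
loose = tokens_match({"t1", "t2"}, {"t1", "t3"}, threshold=0.3)
strict = tokens_match({"t1", "t2"}, {"t1", "t3"}, threshold=0.5)
```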

In response to identifying the health insurance claim data for the individual having the respective second token 428 corresponding to the respective first token 422, the health insurance claim data management system 426 can generate modified health insurance claim data 430. The health insurance claim data management system 426 can send the modified health insurance claim data 430 to the data integration and analysis system 102. In one or more examples, the modified health insurance claim data 430 can be formatted according to the data structure 432. The data structure 432 can include a column containing a subset of the second tokens 428 corresponding to the first tokens 422 and a plurality of columns containing health insurance claim data.

At operation 434, the data integration and analysis system 102 may integrate the genomic data and health insurance claim data of individuals that are common to both the molecular data store 108 and the health insurance claim data store 106. The data integration and analysis system 102 can determine individuals that are common to both the molecular data store 108 and the health insurance claim data store 106 by determining genomic data and health insurance claim data corresponding to a common token. The data integration and analysis system 102 can determine that a first token 422 corresponds to a second token 428 by determining a similarity measure between the first token 422 associated with a portion of the genomic data 404 and the second token 428 associated with a portion of the health insurance claim data. In the event that the first token 422 has at least a threshold amount of similarity relative to the second token 428, the data integration and analysis system 102 can store the respective portion of the genomic data 404 and the respective portion of the health insurance claim data in an integrated data store, such as the integrated data store 104 of FIGS. 1, 2, and 3, in association with the identifier of the individual.
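
The integration at operation 434 can be pictured as a join across three structures that loosely mirror the data structures 418, 424, and 432. The dictionary shapes, the exact-match comparison of tokens, and the function name `integrate` are simplifying assumptions for illustration.

```python
def integrate(modified_patient_info, token_file, modified_claims):
    """Join genomic data and claim data for individuals common to both stores.

    modified_patient_info: {identifier: genomic data}      (cf. data structure 418)
    token_file:            {identifier: first token}       (cf. data structure 424)
    modified_claims:       {second token: [claim rows]}    (cf. data structure 432)
    Only identifiers whose token also appears in the claim data are kept.
    """
    integrated = {}
    for identifier, token in token_file.items():
        if identifier not in modified_patient_info:
            continue  # no genomic data for this identifier
        if token not in modified_claims:
            continue  # no matching second token in the claim data
        integrated[identifier] = {
            "genomic_data": modified_patient_info[identifier],
            "claims": modified_claims[token],
        }
    return integrated


genomic = {"id-1": {"gene": "GENE_X", "alteration": "example"}, "id-2": {"gene": "GENE_Y"}}
tokens = {"id-1": "tok-a", "id-2": "tok-b"}
claims = {"tok-a": [{"code": "C34.90"}], "tok-z": [{"code": "C18.9"}]}
integrated_store = integrate(genomic, tokens, claims)
```

In this sketch the integrated records are keyed by the de-identified identifier rather than by the token, which follows the statement that the combined data is stored with respect to the identifier of the individual.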

Architecture 400 may implement an encryption protocol that enables de-identified information from different data stores to be integrated into a single data store. In this way, the security of the data stored by the integrated data store 104 is increased. Furthermore, the encryption protocol implemented by architecture 400 may enable more efficient retrieval and more accurate analysis of information stored by the integrated data store 104 than if the encryption protocol of architecture 400 were not used. For example, the data integration and analysis system 102 can match information stored by different data stores corresponding to the same individual by generating a token file 420 including first tokens 422 using encryption techniques based on a specified set of information stored by the molecular data store 108 and by utilizing second tokens 428 generated using the same or similar encryption techniques with respect to a similar or the same set of information stored by the health insurance claim data store 106. Without the encryption protocol of architecture 400, the probability that information from one data store is falsely attributed to one or more individuals increases, which reduces the accuracy of the results provided by the data integration and analysis system 102 in response to integrated data store requests 142 sent to the data integration and analysis system 102.

FIG. 5 illustrates a framework 500 for generating a data set based on data stored by the integrated data store 104 via the data pipeline system 138 in accordance with one or more implementations. The integrated data store 104 can store health insurance claim data and genomic data for a set of individuals 502. For example, the integrated data store 104 can store information obtained from health insurance claim records 504 for a group of individuals 502. For each individual included in the set of individuals 502, the integrated data store 104 can store information obtained from a plurality of health insurance claim records 504. In various examples, the information stored by the integrated data store 104 can include thousands, tens of thousands, hundreds of thousands, up to millions of health insurance claim records 504 for a plurality of individuals and/or be derived from such health insurance claim records 504. In addition, each health insurance claim record can include multiple columns. As a result, the integrated data store 104 may be generated by analysis of millions of columns of health insurance claim data.

Furthermore, while health insurance claim data may be organized according to a structured data format, the health insurance claim data is generally arranged to be viewed by health insurance providers, patients, and healthcare providers in order to display financial information and insurance code information related to services provided by the healthcare provider to individuals. Thus, raw health insurance claim data is not easily analyzed to obtain insights about the characteristics of individuals in whom a biological condition is present or insights that can help treat those individuals for the biological condition. The integrated data store 104 can be generated and organized by analyzing and modifying the raw health insurance claim data in a manner that enables the data stored by the integrated data store 104 to be further analyzed to determine trends, characteristics, features, and/or insights about individuals in whom one or more biological conditions may be present. For example, the health insurance codes may be stored in the integrated data store 104 in such a way that at least one of a medical procedure, biological condition, treatment, dosage, pharmaceutical manufacturer, pharmaceutical distributor, or diagnosis may be determined for a given individual based on the individual's health insurance claim data. In various examples, the data integration and analysis system 102 can generate and implement one or more tables that indicate correlations between health insurance claim data and various treatments, symptoms, or biological conditions corresponding to the health insurance claim data. In addition, the integrated data store 104 may be generated using the genomic data records 506 of the set of individuals 502. In various examples, a large amount of health insurance claim data can be matched with the genomic data of the set of individuals 502 to generate the integrated data store 104.

By integrating the genomic data records 506 of the set of individuals 502 with the health insurance claim records 504, the data integration and analysis system 102 can determine correlations between one or more biomarkers present in the genomic data records 506 and other characteristics of the individuals indicated by the health insurance claim records 504, correlations that are not typically determinable by existing systems. For example, the data integration and analysis system 102 can determine one or more genomic characteristics of an individual that correspond to a treatment received by the individual, a timing of the treatment, a dosage of the treatment, a diagnosis of the individual, a smoking status, a presence of one or more biological conditions, a presence of one or more symptoms of a biological condition, one or more combinations thereof, and the like. Based on the correlations determined by the data integration and analysis system 102 using the integrated data store 104, groups of individuals that may benefit from one or more treatments, and that are not identified by existing systems, may be identified. In one or more examples, the processes and techniques implemented for integrating the health insurance claim records 504 and the genomic data records 506 to generate the integrated data store 104 can be complex, and efficiency-enhancing techniques, systems, and processes are implemented to minimize the amount of computing resources used to generate the integrated data store 104.

In one or more illustrative examples, the data pipeline system 138 may access information stored by the integrated data store 104 to generate a data set comprising a plurality of additional data records 508, the additional data records 508 comprising information related to at least a portion of the set of individuals 502. In the illustrative example of FIG. 5, an additional data record 508 includes information indicating whether an individual is included in a group of individuals in whom lung cancer is present. The data pipeline system 138 may execute multiple sets of different data processing instructions to determine a group of individuals 502 in whom lung cancer is present. In various examples, the additional data record 508 may indicate information used to determine the status of the individual 502 with respect to lung cancer, such as one or more transaction insurance identifiers, one or more International Classification of Diseases (ICD) codes, and one or more health insurance transaction dates. In addition to including a column indicating whether the individual 502 is included in the lung cancer group, the additional data record 508 may also include a column indicating a confidence level of the status of the individual 502 regarding the presence of lung cancer.

FIG. 6 is a schematic diagram of a computing architecture 600 that incorporates medical record data into the integrated data store 104. In various examples, at least a portion of the operations of the computing architecture 600 may be performed by the data integration and analysis system 102 of fig. 1, 3, and 4. In one or more examples, at least a portion of the operations of the computing architecture 600 may be performed by one or more additional computing systems that are at least one of controlled, maintained, or implemented by a service provider that also at least one of controls, maintains, or implements the data integration and analysis system 102. In one or more additional examples, at least a portion of the operations of the computing architecture 600 may be performed by multiple servers in a distributed computing environment.

The computing architecture 600 can include a medical record data store 602. The medical record data store 602 can store medical record data from a plurality of individuals. The medical record data can include imaging information, laboratory test results, diagnostic test information, clinical observations, dental health information, healthcare practitioner records, medical history forms, diagnostic request forms, medical procedure order forms, medical information charts, one or more combinations thereof, and the like. In various examples, for a given individual, the medical record data store 602 can store information obtained from one or more healthcare practitioners related to the individual.

The computing architecture 600 can perform operation 604, which includes retrieving data packets from the medical records data store 602. In one or more examples, the data packets can be obtained in response to one or more requests, sent to the medical records data store 602, for medical records corresponding to one or more individuals. In one or more additional examples, the computing architecture 600 can use one or more Application Programming Interface (API) calls to obtain the data packets. In one or more illustrative examples, the computing architecture 600 may obtain a first data packet 606, a second data packet 608, and up to an nth data packet 610. The individual data packets 606, 608, 610 can correspond to medical records of respective individuals. For example, the first data packet 606 can include medical records of a first individual, the second data packet 608 can include medical records of a second individual, and the nth data packet 610 can include medical records of an nth individual.

The individual data packets 606, 608, 610 may comprise a plurality of components. In one or more examples, the individual data packets 606, 608, 610 can include various components corresponding to medical records from different healthcare providers. In one or more additional examples, the individual data packets 606, 608, 610 can include various components corresponding to different portions of medical records corresponding to one or more healthcare providers. In the illustrative example of fig. 6, the second data packet 608 may include a first component 612, a second component 614, and up to an nth component 616. In one or more illustrative examples, the first component 612 can include a first portion of an individual medical record, the second component 614 can include a second portion of the individual medical record, and the nth component 616 can include an nth portion of the individual medical record. In various examples, the first component 612 can correspond to an individual medical record of a first healthcare provider, the second component 614 can correspond to an individual medical record of a second healthcare provider, and the nth component 616 can correspond to an individual medical record of an nth healthcare provider. In one or more additional illustrative examples, the first component 612 can include a first segment (section) of an individual medical record, such as one or more tables related to diagnostic tests or procedures, and the second component 614 can include a second segment of the individual medical record, such as a pathology report of the individual.

At operation 618, the computing architecture 600 may preprocess the individual data packets to identify the information corpus 620 to be analyzed. In one or more examples, preprocessing of the data packets obtained from the medical records data store 602 can include converting the data included in the data packets. For example, preprocessing the data packets can include converting at least a portion of the data obtained from the medical records data store 602 into machine-encoded information. To illustrate, preprocessing the data packets can include performing one or more Optical Character Recognition (OCR) operations on at least a portion of the data packets obtained from the medical records data store 602. Once at least a portion of the data packets has been converted into machine-encoded information, the data packets can be subjected to a number of operations that could not be performed on the data as originally obtained from the medical records data store 602, such as one or more parsing operations for identifying one or more characters or strings, or one or more editing operations.

In one or more examples, preprocessing of individual data packets may include determining information included in the individual data packets that is to be excluded from further analysis by the computing architecture 600. In various examples, one or more components of the individual data packets may be excluded from the information corpus 620 to be analyzed. For example, with respect to the second data packet 608, the computing architecture 600 may determine that the first component 612 is to be excluded from further analysis by the computing architecture 600. In one or more examples, the computing architecture 600 may analyze components 612, 614, and/or 616 with respect to one or more keywords to identify at least one of components 612, 614, and/or 616 for exclusion from further analysis by the computing architecture 600. In one or more illustrative examples, the computing architecture 600 may parse components 612, 614, and/or 616 to identify one or more keywords, and in response to identifying one or more keywords in components 612, 614, and/or 616, the computing architecture 600 may determine to exclude the respective components 612, 614, and/or 616 from further analysis by the computing architecture 600. For example, the computing architecture 600 may determine that the first component 612 of the second data packet 608 is a test requisition form for one or more diagnostic procedures or tests. In these scenarios, the computing architecture 600 may determine that the first component 612 is to be excluded from further analysis by the computing architecture 600. Further, the computing architecture 600 may determine that at least one of the second component 614 or the nth component 616 corresponds to one or more pathology reports of the individual based on one or more keywords included in at least one of the second component 614 or the nth component 616. In these cases, the computing architecture 600 may determine that at least a portion of the second component 614 and/or at least a portion of the nth component 616 are to be included in the information corpus 620 to be further analyzed by the computing architecture 600.
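The keyword-based selection of components described above can be sketched as follows. This is a minimal illustration only: the keyword lists, the function name `select_components`, and the sample text are assumptions, since the description does not specify particular keywords or an implementation. In this sketch, a component that matches neither list is also left out of the corpus, which is one possible policy among several.

```python
# Hypothetical keyword lists; the description does not name specific keywords.
EXCLUDE_KEYWORDS = ("test requisition", "order form")
INCLUDE_KEYWORDS = ("pathology report",)


def select_components(components):
    """Return the components of one data packet to keep in the corpus."""
    corpus = []
    for text in components:
        lowered = text.lower()
        if any(keyword in lowered for keyword in EXCLUDE_KEYWORDS):
            continue  # e.g. a requisition form: excluded from further analysis
        if any(keyword in lowered for keyword in INCLUDE_KEYWORDS):
            corpus.append(text)
    return corpus


# Illustrative stand-in for the components of a single data packet.
packet = [
    "Test Requisition: please run panel for patient",
    "Pathology Report: specimen shows adenocarcinoma",
    "Pathology report (addendum): margins clear",
]
corpus = select_components(packet)
```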

In addition, a subset of the constituent parts of the individual data packages obtained from the medical records data store 602 can be included in the information corpus 620. In various examples, one or more additional operations may be performed to narrow the information corpus 620. For example, one or more queries can be applied to a subset of information obtained from the medical records data store 602. The one or more queries may extract information from one or more data packets that satisfy the one or more queries. In at least some examples, the one or more queries may be a set of queries applied to respective components of the data packet. In one or more illustrative examples, the set of queries may determine information to be included in the information corpus 620 and additional information to be excluded from the information corpus 620. In one or more additional examples, one or more segments of at least one component of the data packet may be excluded from the information corpus 620.

In one or more additional illustrative examples, after determining that first component 612 is to be excluded from further analysis of computing architecture 600, computing architecture 600 may then cause one or more queries to be implemented with respect to at least one of second component 614 or nth component 616. In these scenarios, one or more queries may determine that segments of the second component 614 (e.g., segments indicative of family history of one or more biological conditions) are to be excluded from the information corpus 620. In various examples, the one or more queries may involve identifying a plurality of keywords and/or combinations of keywords included in at least one of the second component 614 or the nth component 616. In these cases, the computing architecture 600 may exclude one or more portions of the respective components of the data packet that include one or more keywords or keyword combinations from the information corpus 620. In one or more additional examples, the computing architecture 600 can exclude words, characters, and/or symbols included in one or more portions of the respective constituent portions of the data package that follow one or more keywords from the information corpus 620.
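As a rough sketch of the two exclusion behaviors described above, the following code removes whole segments that contain a keyword and, separately, removes the text that follows a keyword. The blank-line segment boundary, the keyword, and both function names are assumptions made for illustration; the second function also drops the keyword itself, which the description does not require.

```python
import re

# Hypothetical keyword, chosen to mirror the family-history example in the text.
SEGMENT_KEYWORDS = ("family history",)


def drop_flagged_segments(component, keywords=SEGMENT_KEYWORDS):
    """Remove blank-line separated segments that contain any keyword."""
    segments = re.split(r"\n\s*\n", component)
    kept = [s for s in segments if not any(k in s.lower() for k in keywords)]
    return "\n\n".join(kept)


def drop_text_after(component, keyword):
    """Remove the keyword and everything that follows it in the component."""
    index = component.lower().find(keyword.lower())
    return component if index < 0 else component[:index].rstrip()


report = (
    "Diagnosis: adenocarcinoma of the lung.\n\n"
    "Family history: mother had breast cancer.\n\n"
    "Plan: begin therapy."
)
cleaned = drop_flagged_segments(report)
```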

Further, at operation 622, the computing architecture 600 may analyze the information corpus 620 to determine features of individuals. In one or more examples, the computing architecture 600 can analyze the information corpus 620 to determine individuals having one or more phenotypes. In various examples, the computing architecture 600 can analyze the information corpus 620 to determine one or more biomarkers indicative of a biological condition. For example, the computing architecture 600 may analyze the information corpus 620 to determine individuals having one or more genetic characteristics. The one or more genetic characteristics may include at least one of one or more variants of a genomic region corresponding to the biological condition. In one or more illustrative examples, the one or more genetic characteristics may correspond to one or more variants of a genomic region corresponding to a type of cancer. In one or more additional illustrative examples, the one or more biomarkers may correspond to analyte levels outside of a specified range. To illustrate, the computing architecture 600 can analyze the information corpus 620 to determine individuals in which levels of one or more proteins and/or levels of one or more small molecules indicative of a biological condition are present. In these scenarios, the computing architecture 600 may analyze the results of laboratory tests to determine the analyte levels of an individual. In one or more additional examples, the computing architecture 600 can analyze the information corpus 620 to determine individuals in which one or more symptoms indicative of a biological condition are present. In one or more further examples, the computing architecture 600 can analyze the imaging information included in the information corpus 620 to determine individuals in which one or more biomarkers are present.

In one or more examples, the computing architecture 600 can implement one or more machine learning techniques to analyze the information corpus 620. For example, the computing architecture 600 may implement one or more artificial neural networks, such as at least one of one or more convolutional neural networks or one or more residual neural networks, to analyze the information corpus 620. The computing architecture 600 may also implement at least one of one or more random forest techniques, one or more hidden markov models, or one or more support vector machines to analyze the information corpus 620.

In at least some implementations, the computing architecture 600 can analyze the information corpus 620 by executing one or more queries about the information corpus 620. The one or more queries may correspond to one or more keywords and/or combinations of keywords. The one or more keywords and/or combinations of keywords may correspond to at least one of characters or symbols corresponding to one or more biological conditions. To illustrate, the keywords may correspond to characters associated with mutations in genomic regions, such as HER2. In one or more additional illustrative examples, one or more criteria may be associated with a combination of keywords. To illustrate, criteria corresponding to a combination of keywords may include a plurality of words that are no more than a specified distance from each other in a portion of the information corpus 620 of individuals, such as words "fatigue," "blood pressure," and "swelling" that occur no more than 100 characters from each other. In these cases, the computing architecture 600 may parse the information corpus 620 for one or more keywords and/or combinations of keywords. In various examples, the computing architecture 600 may determine that a biological condition exists with respect to a given individual in response to determining that one or more keywords and/or combinations of keywords are present according to one or more criteria.
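The proximity criterion described above (several keywords occurring no more than a specified number of characters from each other) might be checked along the following lines. The function name `keywords_within` is hypothetical, and measuring distance between the start offsets of the matches is an assumption; the description does not say how the distance is measured.

```python
import re
from itertools import product


def keywords_within(text, keywords, max_distance=100):
    """True if every keyword occurs and some set of occurrences, one per
    keyword, has start offsets spanning at most max_distance characters."""
    lowered = text.lower()
    positions = []
    for keyword in keywords:
        pattern = re.escape(keyword.lower())
        starts = [m.start() for m in re.finditer(pattern, lowered)]
        if not starts:
            return False  # a required keyword is missing entirely
        positions.append(starts)
    return any(max(combo) - min(combo) <= max_distance
               for combo in product(*positions))


KEYWORDS = ["fatigue", "blood pressure", "swelling"]
note = "Patient reports fatigue and ankle swelling; blood pressure 150/95."
spread_out = "fatigue" + " " * 200 + "blood pressure and swelling"

close_match = keywords_within(note, KEYWORDS)
far_match = keywords_within(spread_out, KEYWORDS)
```

Trying every combination of occurrences is simple but grows quickly with the number of matches, so a sliding-window search would be a more reasonable choice for long documents.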

In one or more additional examples, the one or more queries can be image-based and the computing architecture 600 can analyze images included in the information corpus 620 with respect to the template images. The template image may be generated based on analyzing a plurality of images in which the biological condition exists and aggregating the plurality of images into the template image. In these scenarios, computing architecture 600 may analyze images included in information corpus 620 with respect to one or more template images to determine a similarity measure between images included in information corpus 620 and the template images. In the event that the similarity measure of the individual is at least a threshold, the computing architecture 600 may determine that a characteristic of the biological condition is present in the individual.
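One simple way to picture the template comparison described above is a pixel-wise mean as the aggregation step and cosine similarity as the similarity measure. Both choices, the threshold value, and the tiny four-pixel "images" are assumptions made for illustration; the description does not specify how images are aggregated or compared.

```python
import math


def make_template(images):
    """Aggregate equally sized images (flat pixel lists) by pixel-wise mean."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def cosine_similarity(a, b):
    """Cosine similarity of two flat pixel lists; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


THRESHOLD = 0.9  # hypothetical similarity threshold

# Toy images in which the biological condition is assumed to be present.
positive_images = [[0, 9, 9, 0], [0, 8, 10, 0]]
template = make_template(positive_images)

similar = cosine_similarity([0, 9, 10, 0], template)
dissimilar = cosine_similarity([9, 0, 0, 9], template)
feature_present = similar >= THRESHOLD
```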

After determining an individual having one or more characteristics, the computing architecture 600 may generate a data structure storing data of the individual having the one or more characteristics at operation 624. In one or more examples, the computing architecture 600 can generate a data table that indicates individuals with individual features and/or individuals with a set of features. For example, the computing architecture 600 may generate a first data table 626 and a second data table 628. The first data table 626 may indicate individuals having one or more first characteristics, and the second data table 628 may indicate individuals having one or more second characteristics. In one or more illustrative examples, the first data table 626 may indicate an individual having one or more first biomarkers for a biological condition, and the second data table 628 may indicate an individual having one or more second biomarkers for the biological condition. The one or more first biomarkers may correspond to one or more first genomic variants associated with the biological condition, and the one or more second biomarkers may correspond to one or more second genomic variants associated with the biological condition. In various examples, the data tables 626, 628 may indicate whether one or more features associated with the individual data tables 626, 628 are present relative to the individual individuals. To illustrate, the first data table 626 may include a first indication for individuals in which one or more first genomic variants are present and a second indication for individuals in which one or more first genomic variants are not present. In one or more additional examples, the first data table 626 may indicate a smoking status of the individual, and the second data table 628 may indicate whether the individual has received one or more treatments for the biological condition.

In one or more illustrative examples, first data table 626 and second data table 628 may have rows corresponding to separate individuals. In at least some examples, the individual identifiers may appear in separate rows. The individual identifier may include at least one of an alphanumeric character or symbol corresponding to the individual. In various examples, the individual identifier may be present in a data packet corresponding to the individual. The columns of the first and second data tables 626 and 628 may indicate the status of individual individuals with respect to one or more features. For example, a column of the data tables 626, 628 may include an identifier that includes at least one of an alphanumeric character or symbol that indicates the presence or absence of one or more features of a given individual. Further, although the illustrative example of FIG. 6 includes a first data table 626 and a second data table 628, the computing architecture 600 may generate more data tables or fewer data tables.
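The table layout described above, with one row per individual and a column marking the presence or absence of a feature, can be sketched with plain Python structures. The identifiers, feature names, and the use of 1 and 0 as the presence indicators are illustrative assumptions.

```python
def build_feature_table(findings, feature):
    """One row per individual; the feature column is 1 if present, else 0."""
    return [
        {"individual_id": individual_id, feature: int(feature in features)}
        for individual_id, features in sorted(findings.items())
    ]


# Hypothetical per-individual findings derived from the information corpus.
findings = {
    "P001": {"HER2_variant", "smoker"},
    "P002": {"smoker"},
    "P003": set(),
}

first_table = build_feature_table(findings, "HER2_variant")
second_table = build_feature_table(findings, "smoker")
```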

At operation 630, the computing architecture 600 may store the data structure in an additional data store. For example, the computing architecture 600 may store at least one of the first data table 626 or the second data table 628 in the intermediate data store 632. In various examples, the first data table 626 and the second data table 628 may be temporarily stored in the intermediate data store 632. In one or more illustrative examples, the first data table 626 and the second data table 628 may be stored in the intermediate data store 632 prior to being added to the integrated data store 104. In one or more examples, the integrated data store 104 can be periodically generated and/or updated. In these scenarios, the data structures generated by the computing architecture 600 based on analyzing the information corpus 620 may be stored in the intermediate data store 632 until the integrated data store 104 is to be at least one of generated or updated.

Prior to adding the data structures stored by the intermediate data store 632 to the integrated data store 104, the computing architecture 600 may perform one or more de-identification processes at operation 634. The data structures stored by the intermediate data store 632 may be de-identified in order to protect the privacy of the individual. The one or more de-identification processes may include applying one or more electronically-implemented encryption techniques to the individual information included in the data structures stored by the intermediate data store 632. In one or more examples, the computing architecture 600 can generate tokens corresponding to individual individuals having information stored in the data structure of the intermediate data store 632. The token may be generated by applying one or more hash functions to information related to the individual individuals. In one or more examples, the one or more de-identification processes may include applying a salt function to information corresponding to the individual to generate a token for the individual. In various examples, the one or more encryption techniques applied to de-identify the data structures stored by the intermediate data store 632 can be the same as or similar to the encryption techniques applied to the information obtained from the health insurance claim data store 106 of fig. 1 and 4.
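A token of the kind described above could be produced by hashing a salted, normalized combination of fields. The sketch below uses SHA-256 from Python's standard library; the choice of hash function, the salt value, the field set, and the normalization rules are all assumptions, since the description does not fix any of them.

```python
import hashlib

# Hypothetical salt; in practice it would be managed as a secret.
SALT = b"example-salt"


def make_token(first, last, dob, gender, location, salt=SALT):
    """Derive a de-identified token from a subset of personal fields."""
    # Normalize so the same person yields the same token in every data store.
    parts = [first[:3].lower(), last.lower(), dob, gender.lower(), location[:3]]
    material = "|".join(parts).encode("utf-8")
    return hashlib.sha256(salt + material).hexdigest()


token_a = make_token("Janet", "Doe", "1970-01-31", "F", "94063")
token_b = make_token("janet", "DOE", "1970-01-31", "f", "94063")
token_c = make_token("John", "Doe", "1970-01-31", "M", "94063")
```

Because the token depends only on the normalized fields and the salt, two systems that apply the same procedure to the same individual obtain the same token without exchanging the underlying personal information.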

At operation 636, the computing architecture 600 may store the de-identified data structures in the integrated data store 104. For example, information for a given individual stored in the intermediate data store 632 may be stored in the integrated data store 104 along with additional information about the given individual. To illustrate, the integrated data store 104 can store information for a given individual obtained from at least two of the molecular data store 108, the health insurance claim data store 106, and the intermediate data store 632. In this manner, information about a given individual obtained from a plurality of different data stores may be stored in the integrated data store 104. As a result, information about individuals obtained from different data stores may be analyzed together, rather than separately as in many existing systems.

In various examples, the information stored by the intermediate data store 632 may be used to verify one or more determinations made by the data integration and analysis system 102. For example, the data integration and analysis system 102 can analyze information obtained from the health insurance claim data store 106 and the molecular data store 108 to determine characteristics of an individual. The data integration and analysis system 102 can then analyze the information obtained from the intermediate data store 632 to determine whether the predicted features identified from the information obtained from the health insurance claim data store 106 and from the molecular data store 108 correspond to features of the same individual related to the information stored by the intermediate data store 632.

The one or more encryption techniques applied to de-identify the data structures stored by the intermediate data store 632 may utilize the same or similar information as that used to generate at least one of the first token 422 or the second token 428 of fig. 4. For example, operation 634 may implement one or more encryption techniques to de-identify the data structures of the intermediate data store 632 using a combination of at least a portion of the first name of the respective individual, at least a portion of the last name of the respective individual, at least a portion of the birth date of the respective individual, the gender of the respective individual, and at least a portion of the location identifier of the respective individual. By de-identifying the data structures stored by the intermediate data store 632 using the same or similar encryption technique and the same or similar subset of information as used to generate at least one of the first token 422 or the second token 428, the information stored by the intermediate data store 632 may be synchronized with information of the same individuals stored in the integrated data store 104. Both the integrated data store 104 and the intermediate data store 632 may store information for thousands, tens of thousands, up to millions of individuals. Thus, if the records of an individual stored by the integrated data store 104 and by the intermediate data store 632 cannot be synchronized using a specified encryption protocol as described herein, the data structures of the integrated data store 104 and of the intermediate data store 632 associated with the same individual may not be stored in a manner that allows the information to be retrieved together for the given individual, which may result in the data integration and analysis system 102 providing inaccurate information. The lack of a specified encryption protocol as described herein may also result in the use of more computing resources to determine which information stored in the integrated data store 104 from other data sources and which information stored by the intermediate data store 632 correspond to a given individual.

FIGS. 7 and 8 illustrate example processes of generating an integrated data store and generating a data set for analyzing information stored by the integrated data store. The example processes are illustrated as collections of blocks in logic flow diagrams, which represent sequences of operations that can be implemented in hardware, software, or a combination thereof. These blocks are referenced by numerals. In the context of software, the blocks represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processing units (e.g., a hardware microprocessor), perform the operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, etc. that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order and/or in parallel to implement the process.

FIG. 7 is a data flow diagram of an example process 700 of generating an integrated data store storing health insurance claim data and genomic data, according to one or more implementations. At operation 702, the process 700 may include generating a data file including a token generated using a first hash function. The individual tokens may correspond to respective ones of a set of individuals having data stored by the molecular data store. In one or more examples, an individual having data stored by the molecular data store may be associated with one or more tokens. The token may be generated by applying one or more first hash functions to a subset of information stored by the genomics data store that corresponds to a set of individuals. In various examples, the individual tokens may be generated by applying one or more first hash functions to one or more combinations of at least a portion of a first name of a respective individual of the set of individuals, at least a portion of a last name of a respective individual of the set of individuals, a location identifier of a respective individual of the set of individuals, a gender of a respective individual of the set of individuals, and a date of birth of a respective individual of the set of individuals. In one or more illustrative examples, a token may be generated by a data integration and analysis system coupled to a genomic data store. In one or more additional illustrative examples, the token may be generated by a third party system and accessed by a data integration and analysis system coupled to the molecular data store. At operation 704, the process 700 can further include sending the data file to a health insurance claim data management system. The health insurance claim data management system can match the tokens included in the data file with a second token accessed by the health insurance data management system and generated based on information stored in the health insurance claim data store.

Further, at operation 706, the process 700 can include obtaining first data corresponding to a group of individuals from a health insurance claim data management system in response to the data file, wherein the first data includes health insurance claim data. In some implementations, affirmative consent is obtained from members of the group of individuals to transmit their data from the health insurance claim data management system. In one or more examples, the data is transmitted in an anonymous format such that the data cannot be traced back to an individual member. The health insurance claim data management system can be coupled to a health insurance claim data store storing health insurance claim information for a plurality of individuals. In one or more examples, the health insurance claim data management system can analyze the tokens of the data file relative to additional tokens generated by the health insurance claim data management system. The additional tokens may be generated based on the same set of information used to generate the tokens included in the data file. However, the identity of an individual cannot be determined based on a token. In various examples, the health insurance claim data management system can match tokens included in the data file with additional tokens generated based on information stored by the health insurance claim data store to determine individuals having information stored by the health insurance claim data store and also having information stored by the genomics data store. The technology disclosed herein meets legal and best practice privacy standards such as HIPAA and GDPR.
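The token matching described above amounts to finding the tokens that appear on both sides. A minimal sketch, assuming the claims side holds its records in a mapping keyed by token (the function name, the token strings, and the claim identifiers are hypothetical):

```python
def match_tokens(data_file_tokens, claims_by_token):
    """Return claim records whose token also appears in the data file."""
    shared = set(data_file_tokens) & claims_by_token.keys()
    return {token: claims_by_token[token] for token in sorted(shared)}


# Tokens received in the data file (genomics side).
data_file_tokens = ["tok-a1", "tok-b2", "tok-c3"]

# Tokens generated on the claims side, each with its claim records.
claims_by_token = {
    "tok-b2": ["claim-1001", "claim-1002"],
    "tok-c3": ["claim-2001"],
    "tok-z9": ["claim-9001"],
}

first_data = match_tokens(data_file_tokens, claims_by_token)
```

Only the tokens are compared, so neither side needs the other's personal information to establish which individuals are present in both data stores.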

At operation 708, the process 700 may include generating a plurality of identifiers using a second hash function that is different from the first hash function. In one or more examples, the individual identifiers may correspond to one or more tokens associated with respective individuals of the set of individuals. An identifier may be unique to a given individual in the set of individuals and de-identified. Further, the identifiers may be generated using information stored by the genomic data store for the group of individuals that is different from the information stored by the genomic data store that is used to generate the tokens. In various examples, an intermediate identifier may be generated by applying the second hash function to the information of a respective individual of the group, and the final version of the identifier may be generated by applying one or more salting techniques to the intermediate identifier. The information stored by the genomic data store for the respective individual may be stored in association with the identifier such that at least a portion of the information stored by the genomic data store about the given individual may be accessed using the respective identifier of the given individual.
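The two-step identifier generation described above (an intermediate identifier from a second hash function, followed by salting) could look like the following sketch. SHA-512 as the second hash function, re-hashing the salted intermediate value with SHA-256, the salt value, and the input fields are all assumptions; the description names none of them.

```python
import hashlib

# Hypothetical salt; its value and handling are illustrative only.
SALT = b"hypothetical-salt"


def make_identifier(record_fields, salt=SALT):
    """Hash the fields, then salt and re-hash the intermediate identifier."""
    material = "|".join(record_fields).encode("utf-8")
    intermediate = hashlib.sha512(material).hexdigest()
    final = hashlib.sha256(salt + intermediate.encode("ascii")).hexdigest()
    return intermediate, final


# Hypothetical fields that differ from those used to generate the tokens.
intermediate_id, identifier = make_identifier(["sample-0001", "2021-03-01"])
```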

Further, at operation 710, process 700 may include obtaining second data for a set of individuals from a molecular data store using the plurality of identifiers, and at operation 712, process 700 may include determining respective portions of the first data corresponding to respective portions of the second data for the set of individuals. For example, for a given individual, in addition to second data corresponding to molecular data (e.g., genomic data) of the given individual, first data corresponding to health insurance claim data of the given individual may be identified. In this way, health insurance claim data and molecular data can be identified for a given individual.

At operation 714, the process 700 may include generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers. For example, the integrated data store may store health insurance claim data and genomic data for a given individual in association with an identifier that may be used to access the health insurance claim data and genomic data for the given individual. The information stored by the integrated data store may be organized according to a data store schema. For example, the integrated data store may store the health insurance claim data and genomic data of a set of individuals in a plurality of data tables. In one or more examples, information stored by multiple data tables may be linked. To illustrate, information related to a given individual stored by a first data table of the data store schema may be linked to additional information related to the given individual stored by a second data table of the data store schema. In this way, information accessed in one data table of the data store schema may result in access to additional information stored in another data table of the data store schema.

In one or more illustrative examples, the data store schema may include a first data table storing genomic data for a set of individuals. For example, the first data table may store information corresponding to panels used to generate genomic data, mutations in genomic regions, mutation types, copy numbers of genomic regions, coverage data indicating the number of nucleic acid molecules identified in a sample having one or more mutations, test dates, and patient information. The data store schema may also include a second data table storing data related to one or more patient visits by an individual to one or more healthcare providers and a third data table storing information corresponding to respective services provided to the individual by the one or more healthcare providers during the one or more patient visits indicated by the second data table. In addition, the data store schema may include a fourth data table that stores personal information for a group of individuals and a fifth data table that stores information about health insurance companies or government entities that pay for services provided to a group of individuals. In addition, the data store schema may include a sixth data table that stores information corresponding to health insurance coverage information for a group of individuals, such as the type of health insurance plan associated with the group of individuals. The data store schema may also include a seventh data table that stores information related to medication therapies obtained by a group of individuals.
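To make the idea of linked data tables concrete, the sketch below builds three of the tables in an in-memory SQLite database and retrieves genomic and claim information for one individual through a shared identifier. The table names, column names, and sample rows are assumptions made for illustration and are not taken from the schema of the integrated data store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genomic (
    individual_id TEXT, gene TEXT, mutation_type TEXT, test_date TEXT
);
CREATE TABLE visit (
    visit_id INTEGER PRIMARY KEY, individual_id TEXT, visit_date TEXT
);
CREATE TABLE service (
    service_id INTEGER PRIMARY KEY,
    visit_id INTEGER REFERENCES visit(visit_id),
    icd_code TEXT
);
""")

# Illustrative rows for a single de-identified individual.
conn.execute("INSERT INTO genomic VALUES ('id-1', 'HER2', 'amplification', '2021-03-01')")
conn.execute("INSERT INTO visit VALUES (1, 'id-1', '2021-02-15')")
conn.execute("INSERT INTO service VALUES (1, 1, 'C34.90')")

# Follow the links: identifier -> visits -> services provided at each visit.
rows = conn.execute(
    """
    SELECT g.gene, v.visit_date, s.icd_code
    FROM genomic AS g
    JOIN visit AS v ON v.individual_id = g.individual_id
    JOIN service AS s ON s.visit_id = v.visit_id
    WHERE g.individual_id = ?
    """,
    ("id-1",),
).fetchall()
```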

In one or more examples, the integrated data store can also store medical records corresponding to at least a portion of a group of individuals. In these examples, medical records can be obtained from one or more data stores that store the medical records. One or more Optical Character Recognition (OCR) operations can be performed with respect to the medical record. In addition, the medical records can be analyzed to determine one or more portions of additional information to remove, thereby producing a corpus of information. In various examples, the corpus of information may be analyzed to determine a portion of a subset of an additional set of individuals corresponding to one or more biomarkers.

One or more data structures may be generated from the information corpus, the data structures storing identifiers of portions of the subset of the additional set of individuals and storing indications of the portions of the subset of the additional set of individuals corresponding to the one or more biomarkers. The one or more data structures may be stored by an intermediate data store. One or more de-identification operations can be performed with respect to the identifiers of the portions of the subset of the additional set of individuals prior to modifying the integrated data store to store at least a portion of the additional information of the medical records of the portion of the subset of the additional set of individuals in association with the plurality of identifiers. After de-identifying the information stored by the one or more data structures, the de-identified information may be added to the integrated data store. In at least some examples, the de-identified medical record information can be added to the integrated data store in addition to or in lieu of the health insurance claim data. In various examples, one or more data structures storing de-identified medical record information about biomarker data can have one or more logical connections with other data structures stored in the integrated data store.
To illustrate, one or more data structures storing de-identified medical record information about biomarker data can have one or more logical connections with at least one of the following data tables: a first data table storing information corresponding to panels used to generate genomic data, mutations in genomic regions, mutation types, copy numbers of genomic regions, coverage data indicating the number of nucleic acid molecules identified in a sample having one or more mutations, detection dates, and patient information; a second data table storing data related to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to the individual, by the one or more healthcare providers, during the one or more patient visits indicated by the second data table; a fourth data table storing personal information for a group of individuals; a fifth data table storing information related to a health insurance company or government entity paying for services provided to a group of individuals; a sixth data table storing health insurance coverage information for a group of individuals (e.g., a type of health insurance plan related to the group of individuals); or a seventh data table storing information related to medication therapies obtained by a group of individuals.

In various examples, medical record data can be added to the integrated data store by generating a data file that includes a first token generated using a first hash function. The individual first tokens may correspond to respective ones of a set of individuals having data stored by the molecular data store. Further, the data file can be transmitted to a medical record data management system, and medical record data corresponding to the set of individuals can be obtained from the medical record data management system in response to the data file. Further, a second hash function different from the first hash function may be used to generate the plurality of identifiers. Each identifier may correspond to one or more tokens associated with each individual in the set of individuals. Using the plurality of identifiers, second data for the set of individuals may be obtained from a molecular data store. In various examples, a respective portion of the first data may be determined that corresponds to a respective portion of the second data for the set of individuals. In this way, an integrated data store may be generated that stores respective portions of the first data and respective portions of the second data in association with respective identifiers of the plurality of identifiers.
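
The two-stage hashing described above can be sketched as follows. The disclosure does not name the hash functions, keys, or input fields, so everything concrete here is an assumption: HMAC-SHA-256 stands in for the first hash function, keyed BLAKE2b for the second, and the function names and personal fields are hypothetical.

```python
import hashlib
import hmac


def first_token(name, birth_date, key):
    """First hash function (assumed: HMAC-SHA-256) over normalized personal fields."""
    message = f"{name.strip().lower()}|{birth_date}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def identifier_from_tokens(tokens, key):
    """Second, different hash function (assumed: keyed BLAKE2b) over one or more tokens."""
    digest = hashlib.blake2b(key=key, digest_size=16)
    # Sort so the identifier does not depend on the order in which tokens arrive.
    for token in sorted(tokens):
        digest.update(token.encode("utf-8"))
        digest.update(b"\x00")  # separator so token boundaries stay unambiguous
    return digest.hexdigest()
```

In this sketch the first tokens would populate the data file sent to the external data management system, while the identifiers derived by the second function would key the rows of the integrated data store, so that a token seen by the external system is never itself the storage key.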

After generating the integrated data store storing medical record data, a request can be received to determine data regarding a plurality of individuals having data stored in the integrated data store. The request includes one or more search criteria. In one or more examples, a subset of the plurality of individuals having one or more features corresponding to one or more search criteria may be determined, and information of the subset of the plurality of individuals may be analyzed to determine a measure of significance of a feature of the one or more features with respect to a biological condition.

In one or more illustrative examples, one or more genomic mutations may be determined to be present in a subset of the plurality of individuals, and a plurality of treatments provided to the subset of the plurality of individuals may also be determined. In various examples, respective survival rates, such as real world survival rates, for a subset of the plurality of individuals may be determined. In at least some examples, the significance measure can correspond to a survival rate of individuals having a genomic mutation of the one or more genomic mutations with respect to a treatment of the plurality of treatments. Based on the significance measure, the effectiveness of the treatment for the subset of the plurality of individuals may be determined. In one or more examples, an individual of the subset of the plurality of individuals that is not receiving the treatment may be determined. One or more therapeutically effective amounts of the treatment may be administered to untreated individuals in the subset of the plurality of individuals.

FIG. 8 is a data flow diagram of an example process 800 of generating multiple data sets for analyzing information stored by an integrated data store storing health insurance claim data and genomics data, according to one or more implementations. At operation 802, process 800 may include determining a first set of data processing instructions executable with respect to first data stored by an integrated data store. The integrated data store may store health insurance claim data and molecular data for a common set of individuals. In one or more examples, the first set of data processing instructions may be included in a plurality of sets of data processing instructions that are part of a data processing pipeline. Each of the data processing instruction sets of the data processing pipeline may be executed to generate a respective analysis-ready data set. For example, separate sets of data processing instructions of the data processing pipeline may be executed to generate a data set comprising a specified portion of information and/or a combination of information stored by the integrated data store. In one or more additional examples, separate sets of data processing instructions of the data processing pipeline may be executed to analyze and modify portions of information stored by the integrated data store to generate corresponding data sets. Furthermore, separate sets of data processing instructions may be executed with respect to separate subsets of the information stored by the integrated data store.
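
A data processing pipeline made of separately executable instruction sets can be sketched as a registry of functions, each producing its own analysis-ready data set from the same store. The registry, decorator, record layout, and instruction-set names below are hypothetical choices made for illustration, not structures named in the disclosure.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]

# Hypothetical registry mapping an instruction-set name to its function.
PIPELINE: Dict[str, Step] = {}


def instruction_set(name: str):
    """Register a function as one set of data processing instructions."""
    def register(func: Step) -> Step:
        PIPELINE[name] = func
        return func
    return register


@instruction_set("claims_only")
def claims_only(store):
    # Selects a specified portion of the stored information.
    return [{"id": r["id"], "claim_codes": r["claim_codes"]} for r in store]


@instruction_set("mutated_genes")
def mutated_genes(store):
    # Analyzes and modifies stored information: keeps only individuals with mutations.
    return [{"id": r["id"], "genes": sorted(r["mutations"])} for r in store if r["mutations"]]


def run_pipeline(store):
    """Execute every registered instruction set and return one data set per set."""
    return {name: step(store) for name, step in PIPELINE.items()}
```

Each registered function reads the store independently, so instruction sets can be added or run separately without affecting the others.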

At operation 804, the process 800 may further include causing the first set of data processing instructions to be executed to generate a first data set. The first data set may indicate a first subset of a group of individuals in which a biological condition is present. The first set of data processing instructions may be executed to analyze data stored by the integrated data store to identify a group of individuals having the biological condition. In one or more illustrative examples, the biological condition may include cancer. To illustrate, the first set of data processing instructions may be executed to analyze data stored by the integrated data store to identify a group of individuals having lung cancer. In various examples, the data processing pipeline may include multiple sets of data processing instructions to identify groups of individuals in which different biological conditions are present.

In one or more examples, a first set of data processing instructions can be executed to analyze at least one of health insurance claim data or molecular data to determine a group of individuals having a biological condition. For example, a first set of data processing instructions can be executed to identify individuals having one or more health insurance codes present in the health insurance claim data to determine a group of individuals having a biological condition. Furthermore, the first set of data processing instructions may be executed to identify individuals having one or more mutations in genomic regions of nucleic acid molecules derived from a sample obtained from the individual to determine a group of individuals having a biological condition.
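
One way to picture the first set of data processing instructions is a filter that flags an individual when a claim code or a mutation matches. The record layout, function names, and the choice to combine the two sources with a logical OR are assumptions for this sketch; the disclosure only says that at least one of the two data types may be analyzed.

```python
def has_condition(record, condition_codes, marker_genes):
    """True if any claim code or any mutated gene matches the given sets."""
    code_hit = any(code in condition_codes for code in record.get("claim_codes", []))
    gene_hit = any(gene in marker_genes for gene in record.get("mutations", []))
    # Assumption: either source alone is enough to place the individual in the cohort.
    return code_hit or gene_hit


def condition_cohort(store, condition_codes, marker_genes):
    """Return the identifiers of individuals in whom the condition appears to be present."""
    return {r["id"] for r in store if has_condition(r, condition_codes, marker_genes)}
```

A stricter cohort definition could require both a matching code and a matching mutation; replacing `or` with `and` in `has_condition` would express that variant.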

Further, at operation 806, process 800 may include determining a second set of data processing instructions executable with respect to second data stored by the integrated data store. The second data stored by the integrated data store may be different from the first data stored by the integrated data store and analyzed using the first set of data processing instructions. For example, the first data may correspond to a first column of one or more first data tables stored by the integrated data store, and the second data may correspond to a second column of one or more second data tables stored by the integrated data store.

At operation 808, the process 800 may include causing a second set of data processing instructions to be executed to generate a second data set indicative of one or more treatments provided to a second subset of the group of individuals. The second data set may indicate a subset of the group of individuals that have received one or more treatments. The one or more treatments may be provided to an individual in the presence of one or more biological conditions. In one or more examples, the second set of data processing instructions may be executed to analyze data stored by the integrated data store to identify a group of individuals who received one or more treatments. To illustrate, the second set of data processing instructions can be executed to analyze at least one of health insurance claim data or genomic data to determine a group of individuals who received one or more treatments. In one or more illustrative examples, the second set of data processing instructions can be executed to identify individuals having one or more health insurance codes present in the health insurance claim data to determine a group of individuals who received one or more treatments.

Further, at operation 810, the process 800 may include determining a third subset of the group of individuals, the third subset including a portion of the first subset of the group of individuals overlapping a portion of the second subset of the group of individuals. As a result, the third subset of the group of individuals corresponds to individuals in which the biological condition is present and to which one or more treatments are provided. At operation 812, the process 800 may include analyzing the first data set and the second data set with respect to the third subset of the group of individuals to determine a significance measure of a feature of the third subset of the group of individuals. In one or more examples, one or more machine learning techniques or statistical techniques may be applied to the information included in at least one of the first data set or the second data set with respect to the third subset of the group of individuals. The significance measure may correspond to a statistical measure of significance with respect to the feature. In one or more additional examples, the significance measure may correspond to a probability that the feature is present in an individual having the biological condition.
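
The overlap between the first and second subsets, and the simplest of the measures mentioned above (the probability that a feature is present), can be sketched with set operations. The function names are hypothetical, and an observed frequency is used here as a stand-in for the probability; the disclosure does not prescribe a particular estimator.

```python
def third_subset(condition_ids, treated_ids):
    """Individuals who both have the condition and received a treatment."""
    return set(condition_ids) & set(treated_ids)


def feature_probability(subset_ids, feature_ids):
    """Observed fraction of the subset in which the feature is present.

    Returns None for an empty subset, where the fraction is undefined.
    """
    subset = set(subset_ids)
    if not subset:
        return None
    return len(subset & set(feature_ids)) / len(subset)
```

For example, with condition identifiers `{"a", "b", "c"}` and treated identifiers `{"b", "c", "d"}`, the third subset is `{"b", "c"}`; if only `"c"` carries the feature, the observed fraction is 0.5.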

In one or more illustrative examples, the features may include one or more treatments provided to an individual in the presence of a biological condition. In one or more additional illustrative examples, the feature can include the presence of a mutation in a genomic region of a nucleic acid molecule derived from a sample obtained from an individual in which the biological condition is present. In various examples, information included in at least one of the first data set or the second data set may be analyzed to determine an impact of the feature on the one or more metrics. In one or more examples, information included in at least one of the first data set or the second data set may be analyzed to determine an amount of impact of treatment on survival of an individual having a biological condition. In one or more further examples, information included in at least one of the first data set or the second data set may be analyzed to determine an amount of impact of mutation of the genomic region on survival of an individual having the biological condition. In addition, the information included in the first data set and the second data set may be analyzed to determine an amount of impact of the one or more treatments on the individual in the presence of the biological condition and also in the presence of the one or more genomic mutations.
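
One common statistical technique for describing survival in a cohort is the Kaplan-Meier estimator. The disclosure does not say which technique is used, so the choice of estimator, the function name, and the input layout below are all assumptions made for illustration.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve.

    durations: follow-up time for each individual.
    events: True if the event (e.g., death) was observed at that time,
            False if the individual was censored.
    Returns a list of (time, survival_probability) at each time with an event.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)  # events plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Computing one curve for treated individuals and another for untreated individuals within the same cohort, or one curve per mutation status, gives a rough view of how survival differs between the groups; a formal comparison would need an additional test that is outside this sketch.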

FIG. 9 shows a diagrammatic representation of machine 900 in the form of a computer system within which a set of instructions may be executed to cause machine 900 to perform any one or more of the methodologies discussed herein, in accordance with an example implementation. In particular, FIG. 9 shows a diagrammatic representation of machine 900 in the example form of a computer system within which instructions 902 (e.g., software, programs, applications, applets, apps, or other executable code) for causing the machine 900 to perform any one or more of the methods discussed herein may be executed. For example, the instructions 902 may cause the machine 900 to implement the architectures and frameworks 100, 200, 300, 400, 500, 600 described with respect to FIGS. 1, 2, 3, 4, 5, and 6, respectively, and perform the methods 700, 800 described with respect to FIGS. 7 and 8, respectively.

The instructions 902 transform a generic, un-programmed machine 900 into a specific machine 900 that is programmed to perform the described and illustrated functions in the manner described. In alternative implementations, the machine 900 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 900 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. Machine 900 may include, but is not limited to, a server computer, a client computer, a Personal Computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a Personal Digital Assistant (PDA), an entertainment media system, a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a bridge, or any machine capable of executing, sequentially or otherwise, the instructions 902 that specify actions to be taken by machine 900. Furthermore, while only a single machine 900 is illustrated, the term "machine" shall also be taken to include a collection of machines 900 that individually or jointly execute instructions 902 to perform any one or more of the methodologies discussed herein.

Examples of computing device 900 may include logic, one or more components, circuitry (e.g., modules), or mechanisms. Circuitry is a tangible entity configured to perform certain operations. In an example, the circuits may be arranged in a specified manner (e.g., internally or with respect to external entities such as other circuits). In an example, one or more computer systems (e.g., a standalone client or server computer system) or one or more hardware processors (processors) may be configured by software (e.g., instructions, application portions, or applications) into circuitry that operates to perform certain operations described herein. In an example, the software may reside (1) on a non-transitory machine-readable medium, or (2) in a transmission signal. In an example, software, when executed by the underlying hardware of the circuit, causes the circuit to perform certain operations.

In an example, the circuitry may be implemented mechanically or electronically. For example, a circuit may comprise dedicated circuitry or logic that is specifically configured to perform one or more techniques described above, including, for example, a dedicated processor, a Field Programmable Gate Array (FPGA), or an application-specific integrated circuit (ASIC). In an example, the circuitry may include programmable logic (e.g., circuitry contained within a general-purpose processor or other programmable processor) that may be temporarily configured (e.g., by software) to perform certain operations. It will be appreciated that decisions to implement the circuitry, either mechanically (e.g., in dedicated and permanently configured circuits) or in temporarily configured circuits (e.g., through software configuration), may be driven by cost and time considerations.

Thus, the term "circuitry" is understood to encompass a tangible entity, whether that entity is physically constructed, permanently configured (e.g., hardwired), or temporarily (e.g., transitorily) configured (e.g., programmed) to operate in a specified manner or to perform a specified operation. In an example, given a plurality of temporarily configured circuits, each circuit need not be configured or instantiated at any one instance in time. For example, where the circuitry includes a general-purpose processor configured by software, the general-purpose processor may be configured as corresponding different circuitry at different times. The software may configure the processor accordingly, for example, to form a particular circuit at one instance of time and to form different circuits at different instances of time.

In an example, a circuit may provide information to and receive information from other circuits. In this example, a circuit may be considered to be communicatively coupled to one or more other circuits. When a plurality of such circuits exist at the same time, communication may be achieved by signal transmission (e.g., through appropriate circuits and buses) connecting these circuits. In implementations in which multiple circuits are configured or instantiated at different times, communication between the circuits may be implemented, for example, by storing and retrieving information in a memory structure accessible to the multiple circuits. For example, a circuit may perform an operation and store the output of the operation in a memory device communicatively coupled thereto. Another circuit may then access the memory device at a later time to retrieve and process the stored output. In an example, the circuitry may be configured to initiate or receive communications with an input or output device and may operate on a resource (e.g., a collection of information).

Various operations of the method examples described herein may be performed, at least in part, by one or more processors that are temporarily configured (e.g., via software) or permanently configured to perform the relevant operations. Such a processor, whether temporarily configured or permanently configured, may constitute processor-implemented circuitry that operates to perform one or more operations or functions. In an example, the circuitry referred to herein may comprise processor-implemented circuitry.

Similarly, the methods described herein may be at least partially processor-implemented. For example, at least some or all of the operations of the method may be performed by one or more processors or processor-implemented circuits. The performance of certain operations may be distributed among one or more processors, which may reside not only within a single machine, but also be deployed on multiple machines. In examples, one or more processors may be located in a single location (e.g., within a home environment, an office environment, or as a server farm), while in other examples, processors may be distributed across multiple locations.

The one or more processors may also be operative to support performance of related operations in a "cloud computing" environment or as "software as a service" (SaaS).

For example, at least some of the operations may be performed by a set of computers (as examples of machines including processors), which may be accessed via a network (e.g., the internet) and via one or more suitable interfaces (e.g., Application Program Interfaces (APIs)).

Example implementations (e.g., apparatus, system, or method) may be implemented in digital electronic circuitry, computer hardware, firmware, software, or in any combination thereof. Example implementations may be implemented using a computer program product (e.g., a computer program tangibly embodied in an information carrier or in a machine-readable medium for execution by, or to control the operation of, data processing apparatus, such as a programmable processor, a computer, or multiple computers).

A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a software module, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

In an example, operations may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Examples of method operations may also be performed by, and example apparatus may be implemented as, special purpose logic circuitry, e.g., a Field Programmable Gate Array (FPGA) or an application-specific integrated circuit (ASIC).

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In deploying a programmable computing system implementation, it should be appreciated that both hardware and software architectures need to be considered. In particular, it should be appreciated that the selection of whether to implement certain functions in permanently configured hardware (e.g., an ASIC), temporarily configured hardware (e.g., a combination of software and a programmable processor), or a combination of permanently configured hardware and temporarily configured hardware may be a design choice. The following sets forth hardware (e.g., computing device 900) and software architecture that may be deployed in an example implementation.

In an example, computing device 900 may operate as a standalone device or computing device 900 may be connected (e.g., networked) to other machines.

In a networked deployment, the computing device 900 may operate in the capacity of a server or a client machine in the server-client network environment. In an example, computing device 900 may act as a peer machine in a peer-to-peer (or other distributed) network environment. Computing device 900 may be a Personal Computer (PC), tablet PC, set-top box (STB), mobile telephone, web appliance, network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken (e.g., performed) by computing device 900. Furthermore, while only a single computing device 900 is illustrated, the term "computing device" should also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computing device 900 may include a processor 904 (e.g., a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or both), a main memory 906, and a static memory 908, some or all of which may communicate with each other over a bus 910. Computing device 900 may also include a display unit 912, an alphanumeric input device 914 (e.g., a keyboard), and a User Interface (UI) navigation device 916 (e.g., a mouse). In an example, the display unit 912, the input device 914, and the UI navigation device 916 may be a touch screen display. Computing device 900 can also include a storage device (e.g., a drive unit) 918, a signal generation device 920 (e.g., a speaker), a network interface device 922, and one or more sensors 924, such as a Global Positioning System (GPS) sensor, compass, accelerometer, or another sensor.

The storage device 918 may include a machine-readable medium 926 on which one or more sets of data structures or instructions 902 (e.g., software) are stored, the data structures or instructions 902 embodying or being utilized by any one or more of the methods or functions described herein. The instructions 902 may also reside, completely or at least partially, within the main memory 906, within the static memory 908, or within the processor 904 during execution thereof by the computing device 900. In an example, one or any combination of the processor 904, the main memory 906, the static memory 908, or the storage device 918 may constitute machine-readable media.

While the machine-readable medium 926 is shown to be a single medium, the term "machine-readable medium" may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) configured to store the one or more instructions 902. The term "machine-readable medium" shall also be taken to include any tangible medium that is capable of storing, encoding or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure, or that is capable of storing, encoding or carrying data structures used by or associated with such instructions. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media. Specific examples of machine-readable media may include non-volatile memory, including, for example, semiconductor memory devices (e.g., electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

The instructions 902 may also be transmitted or received over a communication network 928 using a transmission medium via the network interface device 922 using any of a variety of transmission protocols (e.g., frame relay, IP, TCP, UDP, HTTP, etc.). Example communication networks may include a Local Area Network (LAN), a Wide Area Network (WAN), a packet data network (e.g., the internet), a mobile telephone network (e.g., a cellular network), a Plain Old Telephone Service (POTS) network, a wireless data network (e.g., the IEEE 802.16 family of standards), peer-to-peer (P2P) networks, etc. The term "transmission medium" shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.

As used herein, a component may refer to a device, physical entity, or logic having boundaries defined by function or subroutine calls, branch points, APIs, or other techniques that provide partitioning or modularization of specific processing or control functions. Components may be combined with other components through their interfaces to perform machine processes. A component may be a packaged functional hardware unit designed for use with other components or part of a program that typically performs a particular one of the relevant functions. The components may constitute software components (e.g., code embodied on a machine-readable medium) or hardware components. A "hardware component" is a tangible unit capable of performing certain operations and may be configured or arranged in some physical manner. In various example implementations, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware components of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as hardware components that operate to perform certain operations described herein.

The following presents a numbered non-limiting list of aspects of the subject matter.

Aspect 1. A method comprising: generating, by a computing system comprising processing circuitry and memory, a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store; transmitting, by the computing system, the data file to a health insurance claim data management system; obtaining, by the computing system and from the health insurance claim data management system, health data corresponding to the set of individuals in response to the data file; generating, by the computing system, a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals; obtaining, by the computing system and using the plurality of identifiers, second data about the set of individuals from the molecular data store; determining, by the computing system, a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; and generating, by the computing system, an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers.

Aspect 2. The method of aspect 1, comprising: determining, by the computing system, a first set of data processing instructions executable with respect to first data stored by the integrated data store; causing, by the computing system, the first set of data processing instructions to be executed to analyze a first health insurance claim code included in the first data to determine a first subset of a group of individuals in which a biological condition is present; and generating, by the computing system, a first data set indicating the first subset of the group of individuals in which the biological condition is present.

Aspect 3. The method according to aspect 2, comprising: determining, by the computing system, a second set of data processing instructions executable with respect to second data stored by the integrated data store; causing, by the computing system, the second set of data processing instructions to be executed to analyze a second health insurance claim code included in the second data to determine one or more treatments provided to a second subset of the group of individuals; and generating, by the computing system, a second data set indicative of one or more treatments provided to the second subset of the group of individuals.

Aspect 4. The method of aspect 3, comprising: determining, by the computing system, a third subset of the set of individuals, the third subset including a portion of the first subset of the set of individuals that overlaps a portion of the second subset of the set of individuals; receiving, by the computing system, a request to perform an analysis of the first data set and the second data set with respect to the third subset of the set of individuals; and analyzing, by the computing system and in response to the request, the first data set and the second data set with respect to the third subset of the set of individuals to determine a measure of significance of features of the third subset of the set of individuals with respect to the biological condition.
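
The subsets recited in aspects 2 to 4 can be sketched as simple set operations. This is a hedged illustration that assumes claims are available as (identifier, claim code) pairs. The function names and the placeholder codes used with them are hypothetical and do not correspond to any real code system.

```python
def individuals_with_codes(claims, codes):
    # claims: iterable of (identifier, claim_code) pairs.
    # Returns the identifiers having at least one claim code in `codes`.
    return {pid for pid, code in claims if code in codes}


def third_subset(claims, condition_codes, treatment_codes):
    # First subset: individuals whose claim codes indicate the condition.
    first = individuals_with_codes(claims, condition_codes)
    # Second subset: individuals whose claim codes indicate a treatment.
    second = individuals_with_codes(claims, treatment_codes)
    # Third subset: the overlap of the first and second subsets.
    return first & second
```

For example, `third_subset(claims, {"COND-1"}, {"TRT-1"})` would return the identifiers that carry both a placeholder condition code and a placeholder treatment code.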

Aspect 5. The method of aspect 4, comprising: determining, by the computing system, one or more genomic mutations present in the third subset of the set of individuals; determining, by the computing system, a plurality of treatments to provide to the third subset of the group of individuals; and determining, by the computing system, a respective survival rate of the third subset of the set of individuals.

Aspect 6. The method of aspect 5, wherein the measure of significance corresponds to a survival rate of the genomic mutation relative to the treatment of the plurality of treatments and the one or more genomic mutations.
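
As a rough sketch of a survival rate computed relative to a treatment and a genomic mutation, the following assumes each individual is represented by a record with hypothetical keys `mutations`, `treatment`, and `survived`. The simple proportion used here is an assumption for illustration; the aspects do not specify how the rate is calculated.

```python
def survival_rate(records, mutation, treatment):
    # Select individuals who have the mutation and received the treatment.
    group = [
        r for r in records
        if mutation in r["mutations"] and r["treatment"] == treatment
    ]
    if not group:
        return None  # no individuals match; the rate is undefined
    # Fraction of the selected individuals recorded as surviving.
    return sum(1 for r in group if r["survived"]) / len(group)
```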

Aspect 7. The method of aspect 6, comprising determining, by the computing system and based on the measure of significance, the effectiveness of the treatment for the third subset of the group of individuals.

Aspect 8. The method of aspect 7, comprising determining, by the computing system, individuals of the third subset of the group of individuals who did not receive the treatment.

Aspect 9. The method of aspect 8, comprising administering one or more therapeutically effective amounts of the treatment to individuals in the third subset who did not receive the treatment.

Aspect 10. The method according to any one of aspects 1 to 9, wherein: the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; individual ones of the plurality of logical links indicate that one or more rows of a data table of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

Aspect 11. The method of aspect 10, wherein the plurality of data tables comprises: a first data table storing genomic data of the set of individuals; a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table storing personal information of the group of individuals; a fifth data table storing information related to a health insurance company or government entity paying for the service provided to the group of individuals; a sixth data table storing information corresponding to health insurance coverage information of the group of individuals; and a seventh data table storing information related to medication therapy obtained by the group of individuals.
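
A data store schema of the kind described in aspects 10 and 11 might be sketched with an in-memory SQLite database, where foreign-key columns stand in for the logical links between data tables. All table names, column names, and the example query are hypothetical and chosen only to illustrate the seven tables.

```python
import sqlite3

# Seven illustrative tables; REFERENCES clauses model the logical links.
SCHEMA = """
CREATE TABLE person (person_id TEXT PRIMARY KEY, birth_year INTEGER);
CREATE TABLE genomic (person_id TEXT REFERENCES person(person_id),
                      gene TEXT, variant TEXT);
CREATE TABLE visit (visit_id INTEGER PRIMARY KEY,
                    person_id TEXT REFERENCES person(person_id),
                    visit_date TEXT);
CREATE TABLE service (visit_id INTEGER REFERENCES visit(visit_id),
                      claim_code TEXT);
CREATE TABLE payer (payer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE coverage (person_id TEXT REFERENCES person(person_id),
                       payer_id INTEGER REFERENCES payer(payer_id));
CREATE TABLE medication (person_id TEXT REFERENCES person(person_id),
                         drug TEXT);
"""


def create_store():
    # Build an empty in-memory store with the illustrative schema.
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def services_for_gene(conn, gene):
    # Follow the links genomic -> visit -> service for one gene.
    query = """
        SELECT g.person_id, s.claim_code
        FROM genomic g
        JOIN visit v ON v.person_id = g.person_id
        JOIN service s ON s.visit_id = v.visit_id
        WHERE g.gene = ?
        ORDER BY g.person_id, s.claim_code
    """
    return conn.execute(query, (gene,)).fetchall()
```

The join shows how rows of one data table can be resolved to rows of additional data tables through a shared identifier.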

Aspect 12. The method according to any one of aspects 1-11, wherein the plurality of identifiers generated using the second hash function comprises intermediate identifiers; and the method comprises: applying, by the computing system, a salt function to the intermediate identifiers to generate a final set of identifiers.
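
One possible reading of the salt function is sketched below: a secret random salt is prepended to each intermediate identifier and the result is hashed again. The use of SHA-256, the salt length, and the function names are assumptions for illustration and are not stated in the aspects.

```python
import hashlib
import secrets


def make_salt(n_bytes=16):
    # Generate a random salt; in practice it would be kept secret.
    return secrets.token_bytes(n_bytes)


def salt_identifier(intermediate_id, salt):
    # Hash the salt together with the intermediate identifier.
    return hashlib.sha256(salt + intermediate_id.encode("utf-8")).hexdigest()


def finalize(intermediate_ids, salt):
    # Map each intermediate identifier to its final, salted identifier.
    return {i: salt_identifier(i, salt) for i in intermediate_ids}
```

Reusing the same salt keeps the final identifiers stable across runs, while a party that does not hold the salt cannot recompute them from the intermediate identifiers alone.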

Aspect 13. The method according to any one of aspects 1 to 12, comprising: obtaining, by the computing system, information from an additional data store, the additional data store comprising electronic medical records of an additional set of individuals; determining, by the computing system, a subset of the additional set of individuals, the subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomic data store; and modifying, by the computing system, the integrated data store to store at least a portion of the information of the medical records of the subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 14. The method of aspect 13, comprising: performing, by the computing system, one or more optical character recognition operations on the additional information; and analyzing, by the computing system, the additional information obtained from the additional data store to determine one or more portions of the additional information to remove, thereby generating a corpus of information.

Aspect 15. The method of aspect 14, comprising: analyzing, by the computing system, the corpus of information to determine a portion of the subset of the additional set of individuals corresponding to one or more biomarkers; and generating, by the computing system, one or more data structures storing identifiers of the portions of the subset of the additional set of individuals and storing an indication that the portions of the subset of the additional set of individuals correspond to the one or more biomarkers.

Aspect 16. The method of aspect 15, comprising: storing, by the computing system, the one or more data structures in an intermediate data store; and performing, by the computing system, one or more de-identification operations on the identifiers of the portion of the subset of the additional set of individuals prior to modifying the integrated data store to store at least a portion of the additional information of the medical records of the portion of the subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 17. The method of any one of aspects 1-16, wherein the molecular data store stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragment group information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

Aspect 18. A system, comprising: one or more hardware processing units; one or more computer-readable storage media storing computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform operations comprising: generating a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store; transmitting the data file to a health insurance claim data management system; obtaining health insurance claim data corresponding to the set of individuals from the health insurance claim data management system in response to the data file; generating a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals; obtaining second data about the set of individuals from the molecular data store using the plurality of identifiers; determining a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; and generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers.

Aspect 19. The system of aspect 18, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining a first set of data processing instructions executable with respect to first data stored by the integrated data store; causing the first set of data processing instructions to be executed to analyze a first health insurance claim code included in the first data to determine a first subset of biological conditions present in the group of individuals; and generating a first data set indicative of the presence of the subset of biological conditions in the group of individuals.

Aspect 20. The system of aspect 19, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, by the computing system, a second set of data processing instructions executable with respect to second data stored by the integrated data store; causing, by the computing system, the second set of data processing instructions to be executed to analyze a second health insurance claim code included in the second data to determine one or more treatments provided to a second subset of the group of individuals; and generating, by the computing system, a second data set indicative of one or more treatments provided to the second subset of the group of individuals.

Aspect 21. The system of aspect 20, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, by the computing system, a third subset of the set of individuals, the third subset including a portion of the first subset of the set of individuals that overlaps a portion of the second subset of the set of individuals; receiving, by the computing system, a request to perform an analysis of the first data set and the second data set with respect to the third subset of the set of individuals; and analyzing, by the computing system and in response to the request, the first data set and the second data set with respect to the third subset of the set of individuals to determine a measure of significance of features of the third subset of the set of individuals with respect to the biological condition.

Aspect 22. The system of aspect 21, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining one or more genomic mutations present in the third subset of the set of individuals; determining a plurality of treatments to provide to the third subset of the group of individuals; and determining a respective survival rate of the third subset of the set of individuals.

Aspect 23. The system of aspect 22, wherein the measure of significance corresponds to a survival rate of the genomic mutation relative to a treatment of the plurality of treatments and the one or more genomic mutations.

Aspect 24. The system of aspect 23, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the measure of significance, the effectiveness of the treatment for the third subset of the group of individuals.

Aspect 25. The system of aspect 24, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in the third subset of the group of individuals who did not receive the treatment.

Aspect 26. The system of any one of aspects 18-25, wherein: the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; and individual ones of the plurality of logical links indicate that one or more rows of a data table of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

Aspect 27. The system of aspect 26, wherein the plurality of data tables comprises: a first data table storing genomic data of the set of individuals; a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table storing personal information of the group of individuals; a fifth data table storing information related to a health insurance company or government entity paying for the service provided to the group of individuals; a sixth data table storing information corresponding to health insurance coverage information of the group of individuals; and a seventh data table storing information related to medication therapy obtained by the group of individuals.

Aspect 28. The system of any one of aspects 18-27, wherein: the plurality of identifiers generated using the second hash function includes intermediate identifiers; and the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

Aspect 29. The system of any one of aspects 18-28, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: obtaining information from an additional data store, the additional data store comprising electronic medical records of an additional set of individuals; determining a subset of the additional set of individuals, the subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomic data store; and modifying the integrated data store to store at least a portion of the information of medical records of the subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 30. The system of aspect 29, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: performing one or more optical character recognition operations on the additional information; and analyzing the additional information obtained from the additional data store to determine one or more portions of the additional information to remove, thereby producing a corpus of information.

Aspect 31. The system of aspect 30, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: analyzing the corpus of information to determine a portion of the subset of the additional set of individuals corresponding to one or more biomarkers; and generating one or more data structures that store identifiers of the portions of the subset of the additional set of individuals and store indications that the portions of the subset of the additional set of individuals correspond to the one or more biomarkers.

Aspect 32. The system of aspect 31, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: storing the one or more data structures in an intermediate data store; and performing one or more de-identification operations on the identifiers of the portion of the subset of the additional set of individuals prior to modifying the integrated data store to store at least a portion of the additional information of the medical records of the portion of the subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 33. The system of any one of aspects 18-32, wherein the molecular data store stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragment group information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

Aspect 34. One or more non-transitory computer-readable storage media storing computer-executable instructions that, when executed by one or more hardware processing units, cause a system to perform operations comprising: generating a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store; transmitting the data file to a health insurance claim data management system; obtaining health insurance claim data corresponding to the set of individuals from the health insurance claim data management system in response to the data file; generating a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals; obtaining second data about the set of individuals from the molecular data store using the plurality of identifiers; determining a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; and generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers.

Aspect 35. The one or more non-transitory computer-readable media of aspect 34, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining a first set of data processing instructions executable with respect to first data stored by the integrated data store; causing the first set of data processing instructions to be executed to analyze a first health insurance claim code included in the first data to determine a first subset of biological conditions present in the group of individuals; and generating a first data set indicative of the presence of the subset of biological conditions in the group of individuals.

Aspect 36. The one or more non-transitory computer-readable media of aspect 35, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, by the computing system, a second set of data processing instructions executable with respect to second data stored by the integrated data store; causing, by the computing system, the second set of data processing instructions to be executed to analyze a second health insurance claim code included in the second data to determine one or more treatments provided to a second subset of the group of individuals; and generating, by the computing system, a second data set indicative of one or more treatments provided to the second subset of the group of individuals.

Aspect 37. The one or more non-transitory computer-readable media of aspect 36, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, by the computing system, a third subset of the set of individuals, the third subset including a portion of the first subset of the set of individuals that overlaps a portion of the second subset of the set of individuals; receiving, by the computing system, a request to perform an analysis of the first data set and the second data set with respect to the third subset of the set of individuals; and analyzing, by the computing system and in response to the request, the first data set and the second data set with respect to the third subset of the set of individuals to determine a measure of significance of features of the third subset of the set of individuals with respect to the biological condition.

Aspect 38. The one or more non-transitory computer-readable media of aspect 37, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining one or more genomic mutations present in the third subset of the set of individuals; determining a plurality of treatments to provide to the third subset of the group of individuals; and determining a respective survival rate of the third subset of the set of individuals.

Aspect 39. The one or more non-transitory computer-readable media of aspect 38, wherein the measure of significance corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

Aspect 40. The one or more non-transitory computer-readable media of aspect 39, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the measure of significance, the effectiveness of the treatment for the third subset of the group of individuals.

Aspect 41. The one or more non-transitory computer-readable media of aspect 40, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in the third subset of the group of individuals who did not receive the treatment.

Aspect 42. The one or more non-transitory computer-readable media of aspect 34, wherein: the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; and individual ones of the plurality of logical links indicate that one or more rows of a data table of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

Aspect 43. The one or more non-transitory computer-readable media of aspect 42, wherein the plurality of data tables comprises: a first data table storing genomic data of the set of individuals; a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table storing personal information of the group of individuals; a fifth data table storing information related to a health insurance company or government entity paying for the service provided to the group of individuals; a sixth data table storing information corresponding to health insurance coverage information of the group of individuals; and a seventh data table storing information related to medication therapy obtained by the group of individuals.

Aspect 44. The one or more non-transitory computer-readable media according to any one of aspects 34-43, wherein: the plurality of identifiers generated using the second hash function includes intermediate identifiers; and wherein the one or more non-transitory computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

Aspect 45. The one or more non-transitory computer-readable media of aspect 44, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: obtaining information from an additional data store, the additional data store comprising electronic medical records of an additional set of individuals; determining a subset of the additional set of individuals, the subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomic data store; and modifying the integrated data store to store at least a portion of the information of medical records of the subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 46. The one or more non-transitory computer-readable media of aspect 45, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: performing one or more optical character recognition operations on the additional information; and analyzing the additional information obtained from the additional data store to determine one or more portions of the additional information to remove, thereby producing a corpus of information.

Aspect 47. The one or more non-transitory computer-readable media of aspect 46, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: analyzing the corpus of information to determine a portion of the subset of the additional set of individuals corresponding to one or more biomarkers; and generating one or more data structures that store identifiers of the portions of the subset of the additional set of individuals and store indications that the portions of the subset of the additional set of individuals correspond to the one or more biomarkers.

Aspect 48. The one or more non-transitory computer-readable media of aspect 47, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: storing the one or more data structures in an intermediate data store; and performing one or more de-identification operations on the identifiers of the portion of the subset of the additional set of individuals prior to modifying the integrated data store to store at least a portion of the additional information of the medical records of the portion of the subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 49. The one or more non-transitory computer-readable media of any one of aspects 34-48, wherein the molecular data store stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragment group information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

Aspect 50. A method comprising: generating, by a computing system comprising processing circuitry and memory, a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store; transmitting, by the computing system, the data file to a medical record data management system; obtaining, by the computing system and from the medical record data management system, medical record data corresponding to the set of individuals in response to the data file; generating, by the computing system, a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals; obtaining, by the computing system and using the plurality of identifiers, second data about the set of individuals from the molecular data store; determining, by the computing system, a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; generating, by the computing system, an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective identifiers of the plurality of identifiers; receiving, by the computing system, a request to determine data related to a plurality of individuals having data stored in the integrated data store, wherein the request includes one or more search criteria; determining, by the computing system, a subset of the plurality of individuals having one or more features corresponding to the one or more search criteria; and analyzing, by the computing system, information of the subset of the plurality of individuals to determine a measure of significance of a feature of the one or more features relative to a biological condition.

Aspect 51. The method of aspect 50, comprising: determining, by the computing system, one or more genomic mutations present in the subset of the plurality of individuals; determining, by the computing system, a plurality of treatments to provide to the subset of the plurality of individuals; and determining, by the computing system, respective survival rates for the subset of the plurality of individuals.

Aspect 52. The method of aspect 51, wherein the measure of significance corresponds to a survival rate of the genomic mutation relative to the treatment of the plurality of treatments and the one or more genomic mutations.

Aspect 53. The method of aspect 52, comprising determining, by the computing system and based on a significance measure, the effectiveness of the treatment of the subset of the plurality of individuals.

Aspect 54. The method of aspect 53, comprising determining, by the computing system, individuals in the subset of the plurality of individuals who did not receive the treatment.

Aspect 55. The method of aspect 54, comprising administering one or more therapeutically effective amounts of the treatment to an individual of the subset of the plurality of individuals who did not receive the treatment.

Aspect 56. The method according to any one of aspects 50-55, wherein: the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; individual ones of the plurality of logical links indicate that one or more rows of a data table of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

Aspect 57. The method of aspect 56, wherein the plurality of data tables comprises: a first data table storing genomic data of the set of individuals; a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table storing personal information of the group of individuals; a fifth data table storing information related to a health insurance company or government entity paying for the service provided to the group of individuals; a sixth data table storing information corresponding to health insurance coverage information of the group of individuals; and a seventh data table storing information related to medication therapy obtained by the group of individuals.

Aspect 58. The method according to any one of aspects 50-57, wherein the plurality of identifiers generated using the second hash function includes intermediate identifiers; and the method comprises applying, by the computing system, a salt function to the intermediate identifiers to generate a final set of identifiers.
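As a non-limiting illustration, the sequence of a first hash function, a different second hash function, and a salt function could be sketched as follows in Python. The choice of SHA-256, BLAKE2b, and a keyed HMAC for the salting step are assumptions made for this example only, as are the function names; the description does not name particular algorithms.

```python
import hashlib
import hmac


def first_token(record: str) -> str:
    """First hash function (SHA-256 assumed): token placed in the data file."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def intermediate_identifier(token: str) -> str:
    """Second, different hash function (BLAKE2b assumed) applied to a token."""
    return hashlib.blake2b(token.encode("utf-8"), digest_size=32).hexdigest()


def final_identifier(intermediate: str, salt: bytes) -> str:
    """Salt function, modelled here as HMAC-SHA-256 keyed with a secret salt."""
    return hmac.new(salt, intermediate.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because each step is deterministic for a fixed salt, the same individual maps to the same final identifier in every data source, while the salt prevents a party that knows only the hash functions from recomputing the identifiers.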

Aspect 59. The method according to any one of aspects 50-58, comprising: obtaining, by the computing system, additional information from an additional data store, the additional data store comprising additional health insurance claim data for an additional set of individuals; determining, by the computing system, at least a subset of the additional set of individuals, the at least a subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomics data store; and modifying, by the computing system, the integrated data store to store at least a portion of the additional health insurance claim data for the at least a subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 60. The method according to any one of aspects 50-59, comprising: performing, by the computing system, one or more optical character recognition operations on the medical record data; and analyzing, by the computing system, the medical record data to determine one or more portions of the medical record data to remove to generate a corpus of information.

Aspect 61. The method of aspect 60, comprising: analyzing, by the computing system, the corpus of information to determine a portion of the subset of the set of individuals corresponding to one or more biomarkers; and generating, by the computing system, one or more data structures storing identifiers of the portions of the subset of the set of individuals and storing an indication that the portions of the subset of the set of individuals correspond to the one or more biomarkers.
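As a non-limiting illustration, the analysis of a corpus of information for biomarkers and the generation of data structures that pair identifiers with biomarker indications could be sketched as follows. The regular-expression patterns, the `BiomarkerRecord` class, and the assumption that the corpus is a mapping from identifier to text are all hypothetical; a practical system might use a more capable text-analysis technique.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; a real deployment would define its own biomarker set.
BIOMARKER_PATTERNS = {
    "EGFR T790M": re.compile(r"\bEGFR\b.*?\bT790M\b", re.IGNORECASE),
    "ESR1 D538G": re.compile(r"\bESR1\b.*?\bD538G\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class BiomarkerRecord:
    """Pairs an individual's identifier with an indicated biomarker."""
    individual_id: str
    biomarker: str


def find_biomarkers(corpus: dict) -> list:
    """Scan each individual's text and record every biomarker pattern found."""
    records = []
    for individual_id, text in sorted(corpus.items()):
        for name, pattern in BIOMARKER_PATTERNS.items():
            if pattern.search(text):
                records.append(BiomarkerRecord(individual_id, name))
    return records
```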

Aspect 62. The method of aspect 61, comprising: storing, by the computing system, the one or more data structures in an intermediate data store; and performing, by the computing system, one or more de-identification operations on the identifiers of the portion of the subset of the set of individuals prior to modifying the integrated data store to store at least a portion of the medical record data of the portion of the subset of the set of individuals in association with the plurality of identifiers.

Aspect 63. The method of any one of aspects 50-62, wherein the molecular data store stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragment group information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

Aspect 64. A system comprising: one or more hardware processing units; and one or more computer-readable storage media storing computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform operations comprising: generating a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store; transmitting the data file to a medical record data management system; obtaining medical record data corresponding to the set of individuals from the medical record data management system in response to the data file; generating a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals; obtaining second data about the set of individuals from the molecular data store using the plurality of identifiers; determining a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers; receiving a request to determine data related to a plurality of individuals having data stored in the integrated data store, wherein the request includes one or more search criteria; determining a subset of the plurality of individuals having one or more features corresponding to the one or more search criteria; and analyzing information of the subset of the plurality of individuals to determine a measure of significance of a feature of the one or more features relative to a biological condition.

Aspect 65. The system of aspect 64, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining one or more genomic mutations present in the subset of the plurality of individuals; determining a plurality of treatments to provide to the subset of the plurality of individuals; and determining respective survival rates of the subset of the plurality of individuals.

Aspect 66. The system of aspect 65, wherein the measure of significance corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

Aspect 67. The system of aspect 66, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the measure of significance, the effectiveness of the treatment of the subset of the plurality of individuals.

Aspect 68. The system of aspect 67, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in the subset of the plurality of individuals who did not receive the treatment.

Aspect 69. The system of any one of aspects 64-68, wherein: the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; and individual ones of the plurality of logical links indicate that one or more rows of a data table of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

Aspect 70. The system of aspect 69, wherein the plurality of data tables comprises: a first data table storing genomic data of the set of individuals; a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table storing personal information of the set of individuals; a fifth data table storing information related to a health insurance company or government entity paying for the services provided to the set of individuals; a sixth data table storing information corresponding to health insurance coverage information of the set of individuals; and a seventh data table storing information related to medication therapy obtained by the set of individuals.

Aspect 71. The system of any one of aspects 64-70, wherein the plurality of identifiers generated using the second hash function includes intermediate identifiers; and wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

Aspect 72. The system of any one of aspects 64-71, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: obtaining additional information from an additional data store, the additional data store comprising additional health insurance claim data for an additional set of individuals; determining at least a subset of the additional set of individuals, the at least a subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomics data store; and modifying the integrated data store to store at least a portion of the additional health insurance claim data for the at least a subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 73. The system of any one of aspects 64-72, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: performing one or more optical character recognition operations on the medical record data; and analyzing the medical record data to determine one or more portions of the medical record data to remove, thereby generating a corpus of information.

Aspect 74. The system of aspect 73, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: analyzing the corpus of information to determine a portion of the subset of the set of individuals corresponding to one or more biomarkers; and generating one or more data structures that store identifiers of the portion of the subset of the set of individuals and store indications that the portion of the subset of the set of individuals corresponds to the one or more biomarkers.

Aspect 75. The system of aspect 74, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: storing the one or more data structures in an intermediate data store; and performing one or more de-identification operations on the identifiers of the portion of the subset of the set of individuals prior to modifying the integrated data store to store at least a portion of the medical record data of the portion of the subset of the set of individuals in association with the plurality of identifiers.

Aspect 76. The system of any one of aspects 64-75, wherein the molecular data store stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragment group information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

Aspect 77. One or more non-transitory computer-readable storage media storing computer-executable instructions that, when executed by one or more hardware processing units, cause a system to perform operations comprising: generating a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store; transmitting the data file to a medical record data management system; obtaining medical record data corresponding to the set of individuals from the medical record data management system in response to the data file; generating a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals; obtaining second data about the set of individuals from the molecular data store using the plurality of identifiers; determining a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers; receiving a request to determine data related to a plurality of individuals having data stored in the integrated data store, wherein the request includes one or more search criteria; determining a subset of the plurality of individuals having one or more features corresponding to the one or more search criteria; and analyzing information of the subset of the plurality of individuals to determine a measure of significance of a feature of the one or more features relative to a biological condition.

Aspect 78. The one or more non-transitory computer-readable media of aspect 77, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining one or more genomic mutations present in the subset of the plurality of individuals; determining a plurality of treatments to provide to the subset of the plurality of individuals; and determining respective survival rates of the subset of the plurality of individuals.

Aspect 79. The one or more non-transitory computer-readable media of aspect 78, wherein the measure of significance corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

Aspect 80. The one or more non-transitory computer-readable media of aspect 79, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the measure of significance, the effectiveness of the treatment of the subset of the plurality of individuals.

Aspect 81. The one or more non-transitory computer-readable media of aspect 80, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in the subset of the plurality of individuals who did not receive the treatment.

Aspect 82. One or more non-transitory computer-readable media according to any one of aspects 77-81, wherein: the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; and individual ones of the plurality of logical links indicate that one or more rows of data tables of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

Aspect 83. The one or more non-transitory computer-readable media of aspect 82, wherein the plurality of data tables comprises: a first data table storing genomic data of the set of individuals; a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers; a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table storing personal information of the set of individuals; a fifth data table storing information related to a health insurance company or government entity paying for the services provided to the set of individuals; a sixth data table storing information corresponding to health insurance coverage information of the set of individuals; and a seventh data table storing information related to medication therapy obtained by the set of individuals.

Aspect 84. The one or more non-transitory computer-readable media of any one of aspects 77-83, wherein the plurality of identifiers generated using the second hash function comprises intermediate identifiers; and the one or more non-transitory computer-readable media include additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

Aspect 85. The one or more non-transitory computer-readable media of any one of aspects 77-84, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: obtaining additional information from an additional data store, the additional data store comprising additional health insurance claim data for an additional set of individuals; determining at least a subset of the additional set of individuals, the at least a subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomics data store; and modifying the integrated data store to store at least a portion of the additional health insurance claim data for the at least a subset of the additional set of individuals in association with the plurality of identifiers.

Aspect 86. One or more non-transitory computer-readable media according to any one of aspects 77-85 comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: performing one or more optical character recognition operations on the medical record data; and analyzing the medical record data to determine one or more portions of the medical record data to remove, thereby generating a corpus of information.

Aspect 87. The one or more non-transitory computer-readable media of aspect 86, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: analyzing the corpus of information to determine a portion of the subset of the set of individuals corresponding to one or more biomarkers; and generating one or more data structures that store identifiers of the portion of the subset of the set of individuals and store indications that the portion of the subset of the set of individuals corresponds to the one or more biomarkers.

Aspect 88. The one or more non-transitory computer-readable media of aspect 87, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: storing the one or more data structures in an intermediate data store; and performing one or more de-identification operations on the identifiers of the portion of the subset of the set of individuals prior to modifying the integrated data store to store at least a portion of the medical record data of the portion of the subset of the set of individuals in association with the plurality of identifiers.

Aspect 89. The one or more non-transitory computer-readable media of any one of aspects 77-88, wherein the molecular data store stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragment group information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

Example

Example 1

Liquid biopsies provide a less invasive alternative to tissue biopsies for comprehensive genomic profiling (CGP) and also contain additional information in the form of circulating tumor DNA (ctDNA) levels. Qualitative and quantitative ctDNA levels have been shown to be indicative of tumor volume. Little is known about how ctDNA levels estimated from a single blood draw correlate with the outcomes of patients with advanced metastatic non-small cell lung cancer (NSCLC) undergoing different treatment regimens.

Patients (pts) with NSCLC were identified in an integrated database and grouped according to whether they underwent liquid biopsy within 190 days before initiation of metastatic first-line (1L) therapy ("1L pre"), within 90 days after initiation of 1L therapy ("1L early"), or 90 to 190 days after initiation of 1L therapy ("1L late"). Kaplan-Meier analysis and Cox proportional hazards (CPH) models were used to evaluate differences in real-world overall survival (rwOS). Gender and age were included as covariates in the CPH models. When used as a quantitative measure, the ctDNA level was defined as the highest variant allele fraction; when used as a categorical variable in NSCLC, a threshold of 4% was used to define the ctDNA high and low groups.
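As a non-limiting illustration, the grouping rules described above could be sketched as follows in Python. The function names are hypothetical, and the handling of boundary values is an assumption: the text does not state whether day 0, day 90, or a ctDNA level of exactly 4% falls in one group or the other. The label "TND" is borrowed from Tables 1-3, where it appears to denote samples with no tumor-derived alterations detected.

```python
from typing import Optional, Sequence


def biopsy_window(days_from_1l_start: int) -> Optional[str]:
    """Classify a blood draw by its offset from 1L initiation.

    Negative offsets are draws before initiation. Assigning day 0 and
    day 90 to "1L early" is an assumption of this sketch.
    """
    if -190 <= days_from_1l_start < 0:
        return "1L pre"
    if 0 <= days_from_1l_start <= 90:
        return "1L early"
    if 90 < days_from_1l_start <= 190:
        return "1L late"
    return None  # outside all three windows


def ctdna_group(vafs_percent: Sequence[float], threshold: float = 4.0) -> str:
    """Group a sample by its highest variant allele fraction (in percent).

    An empty list means no tumor-derived alterations were detected.
    Treating a level equal to the threshold as "high" is an assumption.
    """
    if not vafs_percent:
        return "TND"
    return "high" if max(vafs_percent) >= threshold else "low"
```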

Regardless of therapy or of the timing of the blood draw relative to the start of 1L treatment, pts with higher ctDNA levels had worse rwOS; however, the comparisons within 90 days after 1L initiation in the osimertinib group and the chemotherapy group did not meet our significance cut-off (p < 0.05), probably because the number of pts in these groups was very small. Pts in whom no tumor-derived alterations were detected had the longest rwOS and the lowest hazard ratios relative to the ctDNA high group (range: 0.16 to 0.46).

FIG. 12 is a graph showing Kaplan-Meier curves of real-world overall survival for patients with high ctDNA levels, low ctDNA levels, and undetectable ctDNA measured prior to treatment with osimertinib for non-small cell lung cancer.
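As a non-limiting illustration, a Kaplan-Meier curve of the kind described for FIG. 12 can be computed with the product-limit estimator, in which the survival probability is multiplied by (1 - deaths / number at risk) at each event time. The following Python sketch is a minimal stand-alone version; the function name and input format are assumptions, and an actual analysis would more likely rely on an established statistics package.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time for each patient.
    events: 1 (or True) if death was observed, 0 (or False) if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve
```

Computing one such curve per ctDNA group (high, low, and undetectable) and plotting them together would give a comparison of the type the figure describes.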

In addition to providing a less invasive alternative to tissue biopsy for CGP, the highest variant allele fraction reported in liquid biopsy assays, and in particular the absence of detected ctDNA, provides prognostic information for patients and may help identify high-risk patients who would benefit from a more aggressive treatment regimen.

Table 1. Results for 1L osimertinib patients, divided by blood sampling time.

G360 time	Log-rank p value	CPH p value	CPH HR
1L pre (ctDNA low vs. high)	<0.005	<0.005	ctDNA low HR = 0.66 (1.52)
1L pre (TND vs. high)	<0.005	<0.005	TND HR = 0.37 (2.7)
1L early (ctDNA low vs. high)	0.13	0.12	ctDNA low HR = 0.62 (1.61)
1L early (TND vs. high)	<0.005	<0.005	TND HR = 0.16 (6.3)
1L late (ctDNA low vs. high)	0.01	0.02	ctDNA low HR = 0.54 (1.85)
1L late (TND vs. high)	0.02	0.03	TND HR = 0.35 (2.86)

Table 2. Results for 1L ICI patients, divided by blood sampling time.

G360 time	Log-rank p value	CPH p value	CPH HR
1L pre (ctDNA low vs. high)	<0.005	<0.005	ctDNA low HR = 0.73 (1.37)
1L pre (TND vs. high)	<0.005	<0.005	TND HR = 0.46 (2.17)
1L early (ctDNA low vs. high)	0.13	<0.005	ctDNA low HR = 0.41 (2.44)
1L early (TND vs. high)	<0.005	<0.01	TND HR = 0.34 (2.94)
1L late (ctDNA low vs. high)	0.01	0.01	ctDNA low HR = 0.55 (1.82)
1L late (TND vs. high)	0.02	0.06	TND HR = 0.43 (2.33)

Table 3. Results for 1L chemotherapy patients, divided by blood sampling time.

G360 time	Log-rank p value	CPH p value	CPH HR
1L pre (ctDNA low vs. high)	<0.005	<0.005	ctDNA low HR = 0.64 (1.56)
1L pre (TND vs. high)	<0.005	<0.005	TND HR = 0.32 (3.13)
1L early (ctDNA low vs. high)	<0.005	0.06	ctDNA low HR = 0.73 (1.37)
1L early (TND vs. high)	<0.005	<0.005	TND HR = 0.42 (2.38)
1L late (ctDNA low vs. high)	<0.005	<0.005	ctDNA low HR = 0.60 (1.67)
1L late (TND vs. high)	<0.005	<0.005	TND HR = 0.38 (2.63)
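
The parenthetical value in each CPH HR cell of Tables 1-3 is not explained in the text. It appears to be the reciprocal of the hazard ratio, that is, the hazard ratio with the comparison inverted (ctDNA high relative to the listed group); this reading is an inference from the numbers and not a statement of the source. A minimal Python sketch, with a hypothetical function name, that reproduces several of the tabulated pairs:

```python
def inverted_hazard_ratio(hr: float) -> float:
    """Return the hazard ratio with reference and comparison groups swapped."""
    if hr <= 0:
        raise ValueError("a hazard ratio must be positive")
    return 1.0 / hr


# Selected (HR, parenthetical) pairs copied from Table 1.
table_1_pairs = [(0.66, 1.52), (0.54, 1.85), (0.35, 2.86)]
reproduced = [round(inverted_hazard_ratio(hr), 2) for hr, _ in table_1_pairs]
```

Under this reading, a hazard ratio of 0.66 for the ctDNA low group corresponds to a hazard about 1.52 times higher for the ctDNA high group.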

Example 2

Results from a CLIA-certified, CAP-accredited, NYSDOH-approved circulating tumor DNA (ctDNA) assay for about 103,000 patients with advanced solid tumors were anonymized and tokenized using irreversible one-way hashing. These results were linked, using a secure, HIPAA-compliant, and authenticated method, to a de-identified patient encounter database containing medical and pharmacy claims, providing a longitudinal view of each patient's course that integrates diagnostic, therapeutic, and real-world time-to-event data points. These de-identified integrated data can then be used to explore biomarker-specific and treatment-specific patterns of disease, tumor evolution, and resistance.
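As a non-limiting illustration, linking tokenized assay results to de-identified claims could be sketched as follows in Python. The fields used to build a token, the normalization, the use of SHA-256, and the function names are all assumptions of this example; the text states only that irreversible one-way hashing was used.

```python
import hashlib


def tokenize(first_name: str, last_name: str, date_of_birth: str) -> str:
    """One-way hash of normalized identifying fields (SHA-256 assumed)."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, date_of_birth)
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def link_records(assays: dict, claims: list) -> dict:
    """Join claims to assay results on a shared token.

    assays: token -> assay result.
    claims: list of (token, claim) pairs from the claims database.
    Claims whose token has no assay result are left out.
    """
    linked = {}
    for token, claim in claims:
        if token in assays:
            entry = linked.setdefault(
                token, {"assay": assays[token], "claims": []}
            )
            entry["claims"].append(claim)
    return linked
```

Because both parties derive the token from the same normalized fields, records can be matched without either side exchanging the identifying fields themselves.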

FIG. 17 is a graph showing the frequency of selected alterations in a group of patients diagnosed with advanced non-small cell lung cancer (NSCLC) (n=637) who received liquid biopsy after initiation of first-line osimertinib therapy. About 12% exhibited secondary EGFR mutations. About 12% exhibited amplification of genes including HER2 and MET. About 10% exhibited mutations in MAPK/PIK3CA genes. About 17% showed alterations in cell cycle genes. These results are directionally consistent with those reported by Ramalingam et al. The average duration of treatment for these patients was about 8 months, consistent with published osimertinib studies.

FIG. 18 is a graph showing the frequency of selected mutations in the ESR1 ligand binding domain for a group of patients diagnosed with breast cancer (n=4448) who received liquid biopsy following recorded aromatase inhibitor (AI) treatment. For metastatic breast cancer, we examined data from 4448 patients diagnosed with metastatic breast cancer who were prescribed an aromatase inhibitor and subsequently received liquid biopsy. Mutations in the ESR1 ligand binding domain are a resistance mechanism commonly observed in connection with progression on aromatase inhibitors, and the data indicate that these mutations are highly heterogeneous. We observed this heterogeneity in the liquid biopsy findings, with the D538G and Y537S mutations observed most often, as would be expected based on Toy et al.

To further illustrate the utility of the database in exploring genomic changes in the therapeutic setting, we examined two patient cases. In the first case, as shown in FIG. 19, a T790M mutation was detected in the patient by liquid biopsy, the patient was treated with osimertinib, and a secondary C797S mutation and MET amplification followed. In the second case, as shown in FIG. 20, the patient had a record of treatment with the aromatase inhibitors letrozole and exemestane and subsequently showed the D538G mutation in the ESR1 gene. FIG. 19 is a graph showing alterations associated with osimertinib resistance detected by liquid biopsy after treatment was provided to a woman diagnosed with NSCLC. FIG. 20 is a graph showing the detection of an ESR1 resistance mutation after a second course of treatment in a woman diagnosed with metastatic breast cancer and treated with aromatase inhibitors.

The integrated database contains integrated and de-identified clinical and genomic information from over 103,000 advanced cancer patients, making it one of the largest databases of its kind. With continued use of liquid biopsy, and owing to its unique and comprehensive capture of integrated clinical data (which avoids loss to follow-up caused by patient mobility), the database will continue to grow and mature.

The integrated database may be used to identify and study clinical outcomes based on genomic tumor characteristics using liquid biopsy data. Specific drugs and treatment categories (TKIs, CDK4/6 inhibitors, etc.) can be reliably identified, placed into appropriate groups, and studied. This unique resource can be used to explore the biological mechanisms of drug response and resistance associated with advanced cancer treatment in real-world settings. Among other applications, researchers can accelerate the development of new therapies by identifying and characterizing unmet medical needs, optimizing trial design, conducting outcomes studies in post-market settings, and identifying new potential combinations or treatment strategies (sequencing). Future directions include further validating the data and adding supplemental source data to support deeper analysis.

It should be understood that the various steps used in the methods of the present teachings may be performed in any order and/or simultaneously so long as the teachings remain operable. Further, it should be understood that the apparatus and methods of the present teachings may include any number or all of the described implementations as long as the present teachings remain operable.

Various implementations of systems, devices, and methods have been described herein. These implementations are given by way of example only and are not intended to limit the scope of the claimed invention. Furthermore, it should be appreciated that the various features of the implementations that have been described can be combined in various ways to produce numerous additional implementations. Further, although various materials, sizes, shapes, configurations, locations, etc. have been described for use with the disclosed implementations, other materials, sizes, shapes, configurations, locations, etc. than those disclosed may be utilized without departing from the scope of the claimed invention.

One of ordinary skill in the relevant art will recognize that an implementation may include fewer features than are shown in any of the individual implementations described above. The implementations described herein are not meant to be an exhaustive presentation of the ways in which the various features may be combined. Accordingly, the implementations are not mutually exclusive combinations of features; rather, as will be appreciated by those of ordinary skill in the art, an implementation may include a combination of different individual features selected from different individual implementations. Furthermore, unless otherwise indicated, elements described with respect to one implementation may be implemented in other implementations, even if not described in those implementations. Although a dependent claim may refer in the claims to a particular combination with one or more other claims, other implementations may also include a combination of the dependent claim with the subject matter of each other dependent claim, or a combination of one or more features with other dependent or independent claims. Such combinations are proposed herein unless it is stated that a specific combination is not intended. Furthermore, the features of a claim are intended to be includable in any other independent claim even if that claim is not directly dependent on that independent claim.

Furthermore, references in the specification to "one implementation," "an implementation," or "some implementations" mean that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation of the teachings. The appearances of the phrase "in one implementation" in various places in the specification are not necessarily all referring to the same implementation.

The incorporation by reference of any of the above documents is limited such that no subject matter is incorporated that is contrary to the explicit disclosure herein. The incorporation by reference of any of the above-identified documents is further limited such that no claim included in the document is incorporated by reference herein. The incorporation by reference of any of the above documents is further limited such that no definition provided in the document is incorporated by reference herein unless expressly included herein.

While implementations have been described with reference to specific example implementations, it will be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof show by way of illustration, and not of limitation, specific implementations in which the subject matter may be practiced. The implementations shown are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other implementations may be used and derived therefrom such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This detailed description, therefore, is not to be taken in a limiting sense, and the scope of various implementations is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

Although specific implementations have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific implementations shown. This disclosure is intended to cover any and all adaptations or variations of various implementations. Combinations of the above implementations, and other implementations not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

In this document, the terms "a" or "an" are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of "at least one" or "one or more". In this document, the term "or" is used to refer to a nonexclusive or, such that "A or B" includes "A but not B", "B but not A", and "A and B" unless otherwise indicated. In this document, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein". Furthermore, in the appended claims, the terms "including" and "comprising" are open-ended; that is, a system, user equipment (UE), article, composition, formulation, or process that includes elements in addition to those listed after such a term in a claim is still deemed to fall within the scope of that claim. Furthermore, in the appended claims, the terms "first," "second," and "third," etc. are used merely as labels and are not intended to impose numerical requirements on their objects.

Claims

1. A method, comprising:

generating, by a computing system comprising processing circuitry and memory, a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store;

transmitting, by the computing system, the data file to a health insurance claim data management system;

obtaining, by the computing system and from the health insurance claim data management system, health data corresponding to the set of individuals in response to the data file;

generating, by the computing system, a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals;

obtaining, by the computing system and using the plurality of identifiers, second data about the set of individuals from the molecular data store;

determining, by the computing system, a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; and

generating, by the computing system, an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective identifiers of the plurality of identifiers.
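The flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the choice of SHA-256 and SHA-512 as the first and second hash functions, the record fields, and the helper names `first_token`, `identifier_from_token`, and `build_integrated_store` are all assumptions made for illustration, and the exchange with the health insurance claim data management system is simulated with an in-memory dictionary.

```python
import hashlib


def first_token(name: str, dob: str) -> str:
    """First hash function (assumed here to be SHA-256) over normalized fields."""
    normalized = f"{name.strip().lower()}|{dob}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def identifier_from_token(token: str) -> str:
    """Second, different hash function (assumed here to be truncated SHA-512)."""
    return hashlib.sha512(token.encode("utf-8")).hexdigest()[:16]


def build_integrated_store(molecular_rows, claims_by_token):
    """Store claim data (first data) and molecular data (second data) per identifier."""
    store = {}
    for row in molecular_rows:
        token = first_token(row["name"], row["dob"])
        if token not in claims_by_token:
            continue  # the claims system returned nothing for this token
        ident = identifier_from_token(token)
        store[ident] = {
            "first_data": claims_by_token[token],
            "second_data": row["molecular"],
        }
    return store


# Hypothetical individuals held by the molecular data store.
molecular_rows = [
    {"name": "Ada Example", "dob": "1970-01-01", "molecular": {"gene": "EGFR"}},
    {"name": "Bo Sample", "dob": "1980-02-02", "molecular": {"gene": "KRAS"}},
]

# Simulated response from the claims system, keyed by the tokens in the data file.
claims_by_token = {
    first_token("Ada Example", "1970-01-01"): [{"code": "C34.90"}],
}

store = build_integrated_store(molecular_rows, claims_by_token)
```

In this sketch the integrated store is keyed only by the second-hash identifier, so neither the names nor the first tokens appear in the stored records.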

2. The method according to claim 1, comprising:

determining, by the computing system, a first set of data processing instructions executable with respect to first data stored by the integrated data store;

causing, by the computing system, the first set of data processing instructions to be executed to analyze a first health insurance claim code included in the first data to determine a first subset of biological conditions present in the group of individuals; and

generating, by the computing system, a first data set that indicates the presence of the subset of the biological conditions in the group of individuals.
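As a rough sketch of the kind of processing claim 2 describes, the following maps health insurance claim codes to biological conditions. The lookup table, the use of three-character ICD-10 category prefixes, and the function name `conditions_dataset` are assumptions chosen for illustration; the claim does not specify a coding system or a mapping.

```python
# Assumed mapping from ICD-10 category prefixes to condition labels.
CONDITION_BY_PREFIX = {
    "C18": "colon cancer",
    "C34": "lung cancer",
    "C50": "breast cancer",
}


def conditions_dataset(first_data):
    """Return {identifier: sorted condition labels} for individuals with a mapped code."""
    result = {}
    for ident, codes in first_data.items():
        found = {
            CONDITION_BY_PREFIX[code[:3]]
            for code in codes
            if code[:3] in CONDITION_BY_PREFIX
        }
        if found:
            result[ident] = sorted(found)
    return result


# Hypothetical claim codes stored per identifier in the integrated data store.
first_data = {
    "id-1": ["C34.90", "Z00.00"],
    "id-2": ["Z00.00"],
    "id-3": ["C50.911", "C34.11"],
}

first_data_set = conditions_dataset(first_data)
```

Individuals whose codes match no mapped prefix are simply left out of the resulting data set.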

3. The method according to claim 2, comprising:

determining, by the computing system, a second set of data processing instructions executable with respect to second data stored by the integrated data store;

causing, by the computing system, the second set of data processing instructions to be executed to analyze a second health insurance claim code included in the second data to determine one or more treatments provided to a second subset of the group of individuals; and

generating, by the computing system, a second data set that indicates the one or more treatments provided to the second subset of the group of individuals.

4. The method according to claim 3, comprising:

determining, by the computing system, a third subset of the set of individuals, the third subset including a portion of the first subset of the set of individuals that overlaps a portion of the second subset of the set of individuals;

receiving, by the computing system, a request to perform an analysis of the first data set and the second data set with respect to the third subset of the set of individuals; and

analyzing, by the computing system and in response to the request, the first data set and the second data set with respect to the third subset of the set of individuals to determine a measure of significance of features of the third subset of the set of individuals with respect to the biological condition.
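The third subset recited in claim 4 is the overlap of the individuals in the first and second data sets. A minimal sketch, assuming both data sets are dictionaries keyed by identifier (the data, the treatment labels, and the function name `overlapping_cohort` are hypothetical):

```python
def overlapping_cohort(first_data_set, second_data_set):
    """Keep only identifiers present in both data sets and merge their entries."""
    shared = sorted(first_data_set.keys() & second_data_set.keys())
    return {
        ident: {
            "conditions": first_data_set[ident],
            "treatments": second_data_set[ident],
        }
        for ident in shared
    }


# Hypothetical outputs of the first and second sets of data processing instructions.
first_data_set = {
    "id-1": ["lung cancer"],
    "id-2": ["lung cancer"],
    "id-3": ["breast cancer"],
}
second_data_set = {
    "id-2": ["therapy-A"],
    "id-3": ["therapy-B"],
    "id-4": ["therapy-A"],
}

third_subset = overlapping_cohort(first_data_set, second_data_set)
```

Any downstream significance analysis would then run only over the identifiers in `third_subset`.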

5. The method of claim 4, comprising:

determining, by the computing system, one or more genomic mutations present in the third subset of the set of individuals;

determining, by the computing system, a plurality of treatments to provide to the third subset of the group of individuals; and

determining, by the computing system, a respective survival rate of the third subset of the set of individuals.

6. The method of claim 5, wherein the significance measure corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.
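One simple way to relate survival to a treatment and a genomic mutation, as claims 5 and 6 describe, is to group individuals by (treatment, mutation) pair and compute the fraction surviving in each group. The sketch below does only that; the record fields, the fixed 12-month horizon, and the function name `survival_rates` are assumptions, and a crude proportion like this is not a substitute for a time-to-event method such as Kaplan-Meier estimation.

```python
from collections import defaultdict

# Hypothetical per-individual records drawn from the third subset.
records = [
    {"id": "id-1", "mutation": "EGFR", "treatment": "therapy-A", "alive_at_12_months": True},
    {"id": "id-2", "mutation": "EGFR", "treatment": "therapy-A", "alive_at_12_months": False},
    {"id": "id-3", "mutation": "EGFR", "treatment": "therapy-B", "alive_at_12_months": False},
    {"id": "id-4", "mutation": "KRAS", "treatment": "therapy-A", "alive_at_12_months": True},
]


def survival_rates(rows):
    """Return {(treatment, mutation): fraction of individuals alive at the horizon}."""
    counts = defaultdict(lambda: [0, 0])  # [alive, total] per group
    for row in rows:
        key = (row["treatment"], row["mutation"])
        counts[key][0] += int(row["alive_at_12_months"])
        counts[key][1] += 1
    return {key: alive / total for key, (alive, total) in counts.items()}


rates = survival_rates(records)
```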

7. The method of claim 6, comprising determining, by the computing system and based on a significance measure, effectiveness of the treatment of the third subset of the set of individuals.

8. The method of claim 7, comprising determining, by the computing system, individuals in the third subset of the group of individuals who did not receive the treatment.

9. The method of claim 8, comprising administering one or more therapeutically effective amounts of the treatment to individuals in the third subset who did not receive the treatment.

10. The method of any one of claims 1-9, wherein:

the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables;

individual ones of the plurality of logical links indicate that one or more rows of a data table of the plurality of data tables correspond to additional one or more rows of additional data tables of the plurality of data tables.

11. The method of claim 10, wherein the plurality of data tables comprises:

a first data table storing genomic data of the set of individuals;

a second data table storing data relating to one or more patient visits by an individual to one or more healthcare providers;

a third data table storing information corresponding to respective services provided to individuals regarding one or more patient visits to one or more healthcare providers indicated by the second data table;

a fourth data table storing personal information of the group of individuals;

a fifth data table storing information related to a health insurance company or government entity paying for the service provided to the group of individuals;

a sixth data table storing information corresponding to health insurance coverage information of the group of individuals; and

a seventh data table storing information relating to medication therapy obtained by the group of individuals.
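The seven data tables and the logical links between them, as recited in claims 10 and 11, could be laid out as a relational schema. The sketch below uses an in-memory SQLite database; every table name, column name, and foreign key is an assumption made for illustration, since the claims name the contents of each table but not its layout.

```python
import sqlite3

# Hypothetical layout: foreign keys play the role of the "logical links".
SCHEMA = """
CREATE TABLE person   (person_id TEXT PRIMARY KEY, birth_year INTEGER, sex TEXT);
CREATE TABLE genomics (genomics_id INTEGER PRIMARY KEY,
                       person_id TEXT REFERENCES person(person_id),
                       gene TEXT, alteration TEXT);
CREATE TABLE payer    (payer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visit    (visit_id INTEGER PRIMARY KEY,
                       person_id TEXT REFERENCES person(person_id),
                       visit_date TEXT);
CREATE TABLE service  (service_id INTEGER PRIMARY KEY,
                       visit_id INTEGER REFERENCES visit(visit_id),
                       payer_id INTEGER REFERENCES payer(payer_id),
                       claim_code TEXT);
CREATE TABLE coverage (coverage_id INTEGER PRIMARY KEY,
                       person_id TEXT REFERENCES person(person_id),
                       payer_id INTEGER REFERENCES payer(payer_id),
                       start_date TEXT, end_date TEXT);
CREATE TABLE pharmacy (pharmacy_id INTEGER PRIMARY KEY,
                       person_id TEXT REFERENCES person(person_id),
                       drug_code TEXT, fill_date TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical rows for one de-identified individual.
conn.execute("INSERT INTO person VALUES ('id-1', 1970, 'F')")
conn.execute(
    "INSERT INTO genomics (person_id, gene, alteration) VALUES ('id-1', 'EGFR', 'L858R')"
)
conn.execute("INSERT INTO visit (person_id, visit_date) VALUES ('id-1', '2021-03-01')")
conn.execute("INSERT INTO service (visit_id, claim_code) VALUES (1, 'C34.90')")

# Follow the links from genomic rows to the services billed for the same person.
rows = conn.execute(
    """
    SELECT g.gene, s.claim_code
    FROM genomics AS g
    JOIN visit AS v ON v.person_id = g.person_id
    JOIN service AS s ON s.visit_id = v.visit_id
    """
).fetchall()

table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
```

The join at the end shows how a link between tables lets genomic rows and claim rows for the same identifier be read together.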

12. The method of any of claims 1-11, wherein the plurality of identifiers generated using the second hash function includes an intermediate identifier; and the method comprises:

applying, by the computing system, a salt function to the intermediate identifiers to generate a final set of identifiers.
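Claim 12 does not say what the salt function is. One plausible reading, sketched below under that assumption, is to prepend a secret salt to each intermediate identifier and hash the result again. The use of SHA-256, the salt handling, and the function name `final_identifier` are illustrative choices only.

```python
import hashlib
import secrets


def final_identifier(intermediate: str, salt: bytes) -> str:
    """Hash the salt together with an intermediate identifier (assumed construction)."""
    return hashlib.sha256(salt + intermediate.encode("utf-8")).hexdigest()


# Hypothetical intermediate identifiers produced by the second hash function.
intermediate_identifiers = ["a1b2c3", "d4e5f6"]

# In practice the salt would be kept secret and reused across runs so that
# identifiers stay stable; here a random one is generated for the demonstration.
salt = secrets.token_bytes(16)

final_identifiers = [final_identifier(i, salt) for i in intermediate_identifiers]
```

Reusing the same salt yields the same final identifier for the same individual, while a party that does not know the salt cannot recompute final identifiers from intermediate ones.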

13. The method according to any one of claims 1-12, comprising:

obtaining, by the computing system, information from an additional data store, the additional data store comprising electronic medical records of an additional set of individuals;

determining, by the computing system, a subset of the additional set of individuals, the subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomic data store; and

modifying, by the computing system, the integrated data store to store at least a portion of the information of the medical records of the subset of the additional set of individuals in association with the plurality of identifiers.

14. The method of claim 13, comprising:

performing, by the computing system, one or more optical character recognition operations on the additional information;

analyzing, by the computing system, additional information obtained from the additional data store to determine one or more portions of the additional information to remove, thereby generating a corpus of information.

15. The method of claim 14, comprising:

analyzing, by the computing system, the corpus of information to determine a portion of the subset of the additional set of individuals corresponding to one or more biomarkers; and

generating, by the computing system, one or more data structures that store identifiers of the portion of the subset of the additional set of individuals and store indications that the portion of the subset of the additional set of individuals corresponds to the one or more biomarkers.
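The corpus-building and biomarker steps of claims 14 and 15 can be sketched as follows. The sketch assumes optical character recognition has already produced plain text per identifier; the boilerplate patterns, the biomarker patterns, and the function names `build_corpus` and `biomarker_table` are hypothetical, and a real system would likely need far more robust text analysis.

```python
import re

# Assumed patterns for lines to remove from OCR output.
BOILERPLATE = re.compile(r"^(page \d+ of \d+|fax:.*|confidential.*)$", re.IGNORECASE)

# Assumed biomarker mentions to look for in the cleaned corpus.
BIOMARKERS = {
    "EGFR": re.compile(r"\bEGFR\b"),
    "PD-L1": re.compile(r"\bPD-?L1\b", re.IGNORECASE),
}


def build_corpus(ocr_text: str) -> str:
    """Drop blank and boilerplate lines, keeping the remaining text."""
    kept = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if line and not BOILERPLATE.match(line):
            kept.append(line)
    return "\n".join(kept)


def biomarker_table(corpus_by_id):
    """Return one row per (identifier, biomarker) mention found in the corpus."""
    rows = []
    for ident, corpus in sorted(corpus_by_id.items()):
        for name, pattern in BIOMARKERS.items():
            if pattern.search(corpus):
                rows.append({"identifier": ident, "biomarker": name})
    return rows


# Hypothetical OCR output keyed by identifier.
ocr_output = {
    "id-1": "Page 1 of 2\nPathology: EGFR exon 19 deletion detected\nFax: 555-0100",
    "id-2": "CONFIDENTIAL\nPD-L1 expression 60%\n",
    "id-3": "Page 2 of 2\nNo actionable findings",
}

corpora = {ident: build_corpus(text) for ident, text in ocr_output.items()}
table = biomarker_table(corpora)
```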

16. The method of claim 15, comprising:

storing, by the computing system, the one or more data structures in an intermediate data store;

performing, by the computing system, one or more de-identification operations on the identifiers of the portion of the subset of the additional set of individuals prior to modifying the integrated data store to store at least a portion of the additional information of the medical records of the portion of the subset of the additional set of individuals in association with the plurality of identifiers.

17. The method of any one of claims 1-16, wherein the molecular database stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragmentome information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

18. A system, comprising:

one or more hardware processing units;

one or more computer-readable storage media storing computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform operations comprising:

generating a data file comprising first tokens generated using a first hash function, individual first tokens corresponding to respective ones of a set of individuals having data stored by a molecular data store;

transmitting the data file to a health insurance claim data management system;

obtaining health insurance claim data corresponding to the set of individuals from the health insurance claim data management system in response to the data file;

generating a plurality of identifiers using a second hash function different from the first hash function, each identifier corresponding to one or more tokens associated with each individual in the set of individuals;

obtaining second data about the set of individuals from the molecular data store using the plurality of identifiers;

determining a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals; and

generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective identifiers of the plurality of identifiers.

19. The system of claim 18, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

determining a first set of data processing instructions executable with respect to first data stored by the integrated data store;

causing the first set of data processing instructions to be executed to analyze a first health insurance claim code included in the first data to determine a first subset of biological conditions present in the group of individuals; and

generating a first data set that indicates the presence of the subset of biological conditions in the group of individuals.

20. The system of claim 19, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

generating, by the computing system, a second data set that indicates one or more treatments provided to the second subset of the group of individuals.

21. The system of claim 20, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

22. The system of claim 21, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

determining one or more genomic mutations present in the third subset of the set of individuals;

determining a plurality of treatments to provide to the third subset of the group of individuals; and

determining a respective survival rate of the third subset of the set of individuals.

23. The system of claim 22, wherein the significance measure corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

24. The system of claim 23, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the significance measure, effectiveness of the treatment of the third subset of the group of individuals.

25. The system of claim 24, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in the third subset of the group of individuals who did not receive the treatment.

26. The system of any of claims 18-25, wherein:

the integrated data store is arranged according to a data store schema, the data store schema comprising a plurality of data tables and a plurality of logical links between the plurality of data tables; and

27. The system of claim 26, wherein the plurality of data tables comprises:

a first data table storing genomic data of the set of individuals;

a fourth data table storing personal information of the group of individuals;

28. The system of any of claims 18-27, wherein:

the plurality of identifiers generated using the second hash function includes an intermediate identifier; and

the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

29. The system of any of claims 18-28, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

obtaining information from an additional data store, the additional data store comprising electronic medical records of an additional set of individuals;

determining a subset of the additional set of individuals, the subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomic data store; and

modifying the integrated data store to store at least a portion of the information of medical records of the subset of the additional set of individuals in association with the plurality of identifiers.

30. The system of claim 29, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

performing one or more optical character recognition operations on the additional information;

analyzing additional information obtained from the additional data store to determine one or more portions of the additional information to remove, thereby producing a corpus of information.

31. The system of claim 30, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

analyzing the corpus of information to determine a portion of the subset of the additional set of individuals corresponding to one or more biomarkers; and

generating one or more data structures that store identifiers of the portion of the subset of the additional set of individuals and store indications that the portion of the subset of the additional set of individuals corresponds to the one or more biomarkers.

32. The system of claim 31, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

storing the one or more data structures in an intermediate data store;

performing one or more de-identification operations on the identifiers of the portion of the subset of the additional set of individuals prior to modifying the integrated data store to store at least a portion of the additional information of the medical records of the portion of the subset of the additional set of individuals in association with the plurality of identifiers.

33. The system of any one of claims 18-32, wherein the molecular database stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragmentome information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

34. One or more non-transitory computer-readable storage media storing computer-executable instructions that, when executed by one or more hardware processing units, cause a system to perform operations comprising:

transmitting the data file to a health insurance claim data management system;

35. The one or more non-transitory computer-readable media of claim 34, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

36. The one or more non-transitory computer-readable media of claim 35, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

37. The one or more non-transitory computer-readable media of claim 36, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

38. The one or more non-transitory computer-readable media of claim 37, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

39. The one or more non-transitory computer-readable media of claim 38, wherein the significance measure corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

40. The one or more non-transitory computer-readable media of claim 39, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the significance measure, effectiveness of the treatment of the third subset of the group of individuals.

41. The one or more non-transitory computer-readable media of claim 40, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in the third subset of the group of individuals who did not receive the treatment.

42. The one or more non-transitory computer-readable media of claim 34, wherein:

43. The one or more non-transitory computer-readable media of claim 42, wherein the plurality of data tables comprises:

a first data table storing genomic data of the set of individuals;

a fourth data table storing personal information of the group of individuals;

44. The one or more non-transitory computer-readable media of any one of claims 34-43, wherein:

wherein the one or more non-transitory computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

45. The one or more non-transitory computer-readable media of claim 44, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

46. The one or more non-transitory computer-readable media of claim 45, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

47. The one or more non-transitory computer-readable media of claim 46, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

48. The one or more non-transitory computer-readable media of claim 47, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

storing the one or more data structures in an intermediate data store;

49. The one or more non-transitory computer-readable media of any one of claims 34-48, wherein the molecular database stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragmentome information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

50. A method, comprising:

transmitting, by the computing system, the data file to a medical record data management system;

obtaining, by the computing system and from the medical record data management system, medical record data corresponding to the set of individuals in response to the data file;

determining, by the computing system, a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals;

generating, by the computing system, an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective identifiers of the plurality of identifiers;

receiving, by the computing system, a request to determine data related to a plurality of individuals having data stored in the integrated data store, wherein the request includes one or more search criteria;

determining, by the computing system, a subset of the plurality of individuals having one or more features corresponding to the one or more search criteria; and

analyzing, by the computing system, information of the subset of the plurality of individuals to determine a measure of significance of a feature of the one or more features relative to a biological condition.

51. The method of claim 50, comprising:

determining, by the computing system, one or more genomic mutations present in the subset of the plurality of individuals;

determining, by the computing system, a plurality of treatments to provide to the subset of the plurality of individuals; and

determining, by the computing system, respective survival rates of the subset of the plurality of individuals.

52. The method of claim 51, wherein the measure of significance corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

53. The method of claim 52, comprising determining, by the computing system and based on a significance measure, effectiveness of the treatment of the subset of the plurality of individuals.

54. The method of claim 53, comprising determining, by the computing system, individuals in the subset of the plurality of individuals who did not receive the treatment.

55. The method of claim 54, comprising administering one or more therapeutically effective amounts of the treatment to individuals of the subset of the plurality of individuals who did not receive the treatment.

56. The method of any one of claims 50-55, wherein:

57. The method of claim 56, wherein said plurality of data tables comprises:

a first data table storing genomic data of the set of individuals;

a fourth data table storing personal information of the group of individuals;

58. The method of any of claims 50-57, wherein the plurality of identifiers generated using the second hash function comprises an intermediate identifier; and the method comprises:

59. The method of any one of claims 50-58, comprising:

obtaining, by the computing system, additional information from an additional data store, the additional data store comprising additional health insurance claim data for a group of individuals;

determining, by the computing system, at least a subset of the additional set of individuals, the at least a subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomics data store; and

modifying, by the computing system, the integrated data store to store at least a portion of additional information of health insurance claim data for the at least a subset of the additional set of individuals in association with the plurality of identifiers.

60. The method of any one of claims 50-59, comprising:

performing, by the computing system, one or more optical character recognition operations on the medical record data; and

analyzing, by the computing system, the medical record data to determine one or more portions of the medical record data to remove, thereby generating a corpus of information.

61. The method of claim 60, comprising:

analyzing, by the computing system, the corpus of information to determine a portion of the subset of the set of individuals corresponding to one or more biomarkers; and

generating, by the computing system, one or more data structures that store identifiers of the portion of the subset of the set of individuals and store an indication that the portion of the subset of the set of individuals corresponds to the one or more biomarkers.

62. The method of claim 61, comprising:

performing, by the computing system, one or more de-identification operations on the identifiers of the portion of the subset of the set of individuals prior to modifying the integrated data store to store at least a portion of the medical record data of the portion of the subset of the set of individuals in association with the plurality of identifiers.

63. The method of any one of claims 50-62, wherein the molecular database stores at least one or more of genomic information, genetic information, metabolome information, transcriptome information, fragmentome information, immunoreceptor information, methylation information, epigenomic information, or proteomic information.

64. A system, comprising:

one or more hardware processing units;

transmitting the data file to a medical record data management system;

obtaining medical record data corresponding to the set of individuals from the medical record data management system in response to the data file;

determining a respective portion of the first data corresponding to a respective portion of the second data about the set of individuals;

generating an integrated data store that stores respective portions of the first data and respective portions of the second data in association with respective ones of the plurality of identifiers;

receiving a request to determine data related to a plurality of individuals having data stored in the integrated data store, wherein the request includes one or more search criteria;

determining a subset of the plurality of individuals having one or more features corresponding to the one or more search criteria; and

analyzing information of the subset of the plurality of individuals to determine a measure of significance of a feature of the one or more features relative to a biological condition.

65. The system of claim 64, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

determining one or more genomic mutations present in the subset of the plurality of individuals;

determining a plurality of treatments to provide to the subset of the plurality of individuals; and

determining respective survival rates of the subset of the plurality of individuals.

66. The system of claim 65, wherein the significance measure corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

67. The system of claim 66, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining, based on the significance measure, effectiveness of the treatment of the subset of the plurality of individuals.

68. The system of claim 67, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in a subset of the plurality of individuals who did not receive the treatment.

69. The system of any one of claims 64-68, wherein:

70. The system of claim 69, wherein the plurality of data tables includes:

a first data table storing genomic data of the set of individuals;

a fourth data table storing personal information of the group of individuals;

71. The system of any of claims 64-70, wherein the plurality of identifiers generated using the second hash function comprises an intermediate identifier; and

wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

72. The system of any of claims 64-71, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

obtaining additional information from an additional data store, the additional data store comprising additional health insurance claim data for an additional set of individuals;

determining at least a subset of the additional set of individuals, the at least a subset of the additional set of individuals corresponding to the set of individuals having data stored by the genomics data store; and

modifying the integrated data store to store at least a portion of the additional information for medical records of the at least a subset of the additional set of individuals in association with the plurality of identifiers.
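The matching and update steps of this claim can be sketched with in-memory dictionaries. This assumes both stores are keyed by the same de-identified identifiers; the dictionary layout, the `additional_claims` field, and the function name are hypothetical and stand in for whatever database the system actually uses.

```python
from typing import Dict, List


def integrate_additional_claims(
    integrated: Dict[str, dict],
    additional: Dict[str, List[dict]],
) -> List[str]:
    """Attach additional claim rows to individuals already in the integrated store.

    Returns the identifiers that were matched, in sorted order.
    """
    # Only individuals present in both stores form the matched subset.
    matched = sorted(set(integrated) & set(additional))
    for ident in matched:
        integrated[ident].setdefault("additional_claims", []).extend(additional[ident])
    return matched
```

Individuals who appear only in the additional data store are left out, which mirrors the claim's restriction to the subset that already has genomic data.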

73. The system of any of claims 64-72, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

performing one or more optical character recognition operations on the medical record data; and

analyzing the medical record data to determine one or more portions of the medical record data to remove, thereby generating a corpus of information.
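The removal step can be sketched as a line filter over text that has already been produced by optical character recognition. The patterns below (page footers and fax headers) are assumptions chosen for the example; the claims do not say which portions are removed or how they are detected.

```python
import re
from typing import Iterable, List

# Hypothetical patterns for portions that carry no clinical content.
REMOVE_PATTERNS = [
    re.compile(r"^\s*page \d+ of \d+\s*$", re.IGNORECASE),
    re.compile(r"^\s*fax(ed)?\b.*$", re.IGNORECASE),
]


def build_corpus(ocr_pages: Iterable[str]) -> List[str]:
    """Drop blank lines and lines matching a removal pattern; keep the rest."""
    kept = []
    for page in ocr_pages:
        for line in page.splitlines():
            if not line.strip():
                continue
            if any(p.match(line) for p in REMOVE_PATTERNS):
                continue
            kept.append(line.strip())
    return kept
```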

74. The system of claim 73, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

analyzing the corpus of information to determine a portion of the subset of the set of individuals corresponding to one or more biomarkers; and

generating one or more data structures that store identifiers of the portion of the subset of the set of individuals and store indications that the portion of the subset of the set of individuals corresponds to the one or more biomarkers.
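One simple form such a data structure could take is a mapping from each individual's identifier to the biomarkers found in that individual's corpus text. The sketch below assumes case-insensitive substring matching and a caller-supplied biomarker list; both are illustrative choices, not details taken from the claims.

```python
from typing import Dict, Iterable, List


def biomarker_index(
    corpus_by_id: Dict[str, List[str]],
    biomarkers: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each identifier to the sorted biomarkers mentioned in its corpus lines."""
    names = list(biomarkers)
    index = {}
    for ident, lines in corpus_by_id.items():
        text = " ".join(lines).lower()
        found = sorted(b for b in names if b.lower() in text)
        if found:  # individuals with no matching biomarker are omitted
            index[ident] = found
    return index
```

Substring matching will also match a biomarker name embedded in a longer word or in a negated statement, so a production system would need stricter text analysis.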

75. The system of claim 74, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

storing the one or more data structures in an intermediate data store; and

performing one or more de-identification operations on the identifiers of the portion of the subset of the set of individuals prior to modifying the integrated data store to store at least a portion of the medical record data of the portion of the subset of the set of individuals in association with the plurality of identifiers.

76. The system of any one of claims 64-75, wherein the molecular database stores at least one of genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, or proteomic information.

77. One or more non-transitory computer-readable storage media storing computer-executable instructions that, when executed by one or more hardware processing units, cause a system to perform operations comprising:

transmitting the data file to a medical record data management system;

78. The one or more non-transitory computer-readable media of claim 77, wherein the one or more computer-readable storage media store additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

79. The one or more non-transitory computer-readable media of claim 78, wherein the significance measure corresponds to a survival rate relative to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.

80. The one or more non-transitory computer-readable media of claim 79, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining an effectiveness of the treatment for the subset of the plurality of individuals based on the significance measure.

81. The one or more non-transitory computer-readable media of claim 80, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: determining individuals in a subset of the plurality of individuals who did not receive the treatment.

82. The one or more non-transitory computer-readable media of any one of claims 77-81, wherein:

83. The one or more non-transitory computer-readable media of claim 82, wherein the plurality of data tables comprises:

a first data table storing genomic data of the set of individuals;

a fourth data table storing personal information of the set of individuals;

84. The one or more non-transitory computer-readable media of any one of claims 77-83, wherein the plurality of identifiers generated using the second hash function includes intermediate identifiers; and

wherein the one or more non-transitory computer-readable media include additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising: applying a salt function to the intermediate identifiers to generate a final set of identifiers.

85. The one or more non-transitory computer-readable media of any one of claims 77-84, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

modifying the integrated data store to store at least a portion of the additional information of health insurance claim data for the at least a subset of the additional set of individuals in association with the plurality of identifiers.

86. The one or more non-transitory computer-readable media of any one of claims 77-85, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

performing one or more optical character recognition operations on the medical record data;

87. The one or more non-transitory computer-readable media of claim 86, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

88. The one or more non-transitory computer-readable media of claim 87, comprising additional computer-executable instructions that, when executed by the one or more hardware processing units, cause the system to perform additional operations comprising:

storing the one or more data structures in an intermediate data store;

89. The one or more non-transitory computer-readable media of any one of claims 77-88, wherein the molecular database stores at least one of genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, or proteomic information.