CN114868204A

CN114868204A - Data processing system and method for reusing medicine

Info

Publication number: CN114868204A
Application number: CN202080085251.8A
Authority: CN
Inventors: A-G·拉德纳克; P·布莱斯; E·德雷纳尔迪斯; R·A·赫尔南德斯-维西诺; F·J·J·希门尼斯; C·M·莫洛尼
Original assignee: Sanofi SA
Current assignee: Sanofi SA
Priority date: 2019-12-09
Filing date: 2020-12-09
Publication date: 2022-08-05
Also published as: US20210174912A1; EP4073818A1; WO2021119188A1; JP2023504203A

Abstract

A data processing system for reusing a medication may include: a computer-readable memory comprising computer-executable instructions; and at least one processor configured to execute executable logic comprising the computer-executable instructions and at least one machine learning model to perform one or more operations. The one or more operations may include: receiving data representing medical records for a plurality of patients; selecting a set of patients; determining a plurality of patient characteristics for the set of patients; grouping the set of patients according to the plurality of patient characteristics to generate a plurality of unique groups, each of the unique groups including at least one patient of the set of patients; selecting a set of unique groups based on one or more group selection criteria; and identifying one or more relevant patient characteristics by analyzing each unique group of the set of unique groups.

Description

Data processing system and method for reusing medicine

Claim of Priority

This application claims the benefit of European Patent Application No. 20315299.6, filed on June 9, 2020, and US Provisional Patent Application Serial No. 62/945,814, filed on December 9, 2019. The entire contents of the aforementioned patent applications are hereby incorporated by reference.

Technical Field

The present disclosure relates generally to data processing systems and methods for drug reuse.

Background

Clinical drug reuse (more commonly called drug repurposing, and sometimes referred to as repositioning) is a drug discovery strategy that may be relatively low cost and highly efficient. Drug reuse generally involves analyzing whether a drug that has been approved for the treatment of one type of condition (e.g., a disease) can be used to treat other types of conditions (e.g., common and/or rare diseases). Several therapeutic areas present high potential for drug reuse, including oncology, immunology, infectious diseases and rare diseases.

Disclosure of Invention

In at least one aspect of the present disclosure, a data processing system is provided. The data processing system includes: a computer-readable memory comprising computer-executable instructions; and at least one processor configured to execute executable logic comprising the computer-executable instructions and at least one machine learning model. When the at least one processor is executing the computer-executable instructions, the at least one processor is configured to perform one or more operations. The one or more operations include receiving data representing medical records for a plurality of patients. The one or more operations include selecting a set of patients based on the medical records by: determining at least one target signaling pathway associated with a drug; and determining one or more indicators based on one or more factors corresponding to a diagnosis associated with the target signaling pathway. The one or more operations include determining a plurality of patient characteristics for the set of patients, each patient in the set of patients exhibiting at least one patient characteristic of the plurality of patient characteristics. The one or more operations include grouping the set of patients using the machine learning model and according to the plurality of patient characteristics to generate a plurality of unique groups, each of the unique groups including at least one patient of the set of patients. The one or more operations include selecting a set of unique groups of the plurality of unique groups based on one or more group selection criteria. The one or more operations include identifying one or more relevant patient characteristics by analyzing each unique group of the set of unique groups.

The machine learning model may be trained to group the set of patients using one or more unsupervised clustering techniques. The one or more unsupervised clustering techniques may include a bisecting k-means clustering technique (sometimes called binary k-means clustering). Grouping the set of patients may include performing a multiple correspondence analysis to reduce a dimensionality of the plurality of patient characteristics.

Selecting the set of unique groups may include: for each unique group of the plurality of unique groups, determining a feature score for each patient characteristic exhibited by the unique group. Selecting the set of unique groups may include comparing the feature score of each unique group of the plurality of unique groups to a feature score threshold. Selecting a set of unique groups may include at least one of: determining a stability metric for each unique group of the plurality of unique groups; determining a purity metric for each unique group of the plurality of unique groups; and determining a number of patients of the group of patients included in each of the plurality of unique groups.

Identifying the one or more relevant patient characteristics may include: for each unique group of the set of unique groups, generating a plurality of potentially relevant patient characteristics by selecting patient characteristics of the plurality of patient characteristics that are exhibited by the unique group and that correspond to a theme of the group. Identifying the one or more relevant patient characteristics may include ranking each of the potentially relevant patient characteristics of the plurality of potentially relevant patient characteristics. Ranking each of the potentially relevant patient characteristics may include: for each potentially relevant patient characteristic, assigning a rank value based on a co-occurrence frequency of the potentially relevant patient characteristic and at least one reference indication associated with the drug. The co-occurrence may be measured by: for each of the potentially relevant patient characteristics, determining a proportion of the set of unique groups that includes both the potentially relevant patient characteristic and the at least one reference indication. Identifying the one or more relevant patient characteristics may include: determining at least one of a clinical feasibility and a commercial feasibility for each of the potentially relevant patient characteristics.

The one or more operations may include identifying at least one of the one or more relevant patient characteristics as a target indication for reuse of the drug.

These and other aspects, features and implementations may be expressed as methods, apparatus, systems, components, program products, means, or steps, among other ways, for performing the functions.

Implementations of the disclosure may provide one or more of the following advantages. The systems and methods described in this specification can improve computational efficiency when compared to conventional techniques by, for example, reducing the computational time for processing large amounts of data with varying levels of complexity to identify potentially relevant indications (sometimes referred to herein as patient characteristics). A particular machine learning technique may be used to discover new indications that may not be identifiable by conventional techniques. The systems and methods described in this specification can reduce reliance on human input and skills when compared to conventional techniques.

These and other aspects, features and implementations will become apparent from the following description, including the claims.

Drawings

Fig. 1 is a diagram illustrating an example of a data processing system for drug reuse.

Fig. 2 is a flow chart illustrating an example method for reusing a medication.

Fig. 3 is a diagram illustrating experiments performed using the systems and methods described in this specification.

Fig. 4 is a graph illustrating experimental results produced by using the systems and methods described in this specification.

FIG. 5 is a block diagram of an example computer system for providing computing functionality associated with the algorithms, methods, functions, processes, flows and programs described in this disclosure.

Detailed Description

Drug reuse can be used to discover new clinical indications (e.g., reasons for using a drug) for clinically approved drugs. As hundreds of new indications are explored, Key Opinion Leaders (KOLs) may not be able to address the high level of complexity corresponding to these new indications. Handling large amounts of data and using advanced analytics can support KOL-centric approaches. Guided by the relevant clinical questions, advanced analytics techniques can surface clinically relevant information hidden in large amounts of data, which can then aid clinical decision making.

Computational drug reuse methods can use similarity measures (chemical similarity, molecular activity similarity, gene expression similarity, or side effect similarity), molecular docking, or shared molecular pathology to detect new drug-disease relationships. Drug reuse methods can be classified as network-based methods, text mining (literature search) methods, and semantic methods.

Network-based approaches may involve creating an integrated network by combining multiple data sources, such as drugs, proteins, genes, and diseases. For example, the Connectivity Map (CMap) method may use transcriptomic data to connect biological, chemical, and clinical conditions, applying gene expression profiling to facilitate the discovery of new disease-gene-drug links. Network-based clustering techniques (clustering) can be used to discover biological modules. This approach is motivated by the fact that biological entities (diseases, drugs, proteins, etc.) in the same module of a biological network often share similar properties. Clustering may involve using network topologies to discover drug-disease relationships, drug-drug relationships, disease-disease relationships, or drug-target relationships. Implementing a clustering method may involve some difficulties. For example, the edges connecting drugs and diseases may depend on collected drug-disease associations that are incomplete, which may require integration of multiple databases to improve the accuracy of predictions. Additionally, incorporating different data sources that provide information about drug side effects may allow for the collection of potential safety signals. Furthermore, it may be difficult to distinguish between negative and positive associations, and there is no ready gold standard method to test associations between biological modules.

Text mining-based methods can use keyword co-occurrence and semantic inference to find new drug-disease associations. Some methods may be based on Swanson's ABC model of reasoning by analogy, which assumes that if "B" is one of the characteristics of disease "C" and substance "A" affects "B", then an implicit "A-C" association can be deduced through the shared link "B". However, due to the ambiguous nature of language, limited coverage of biomedical relationships, and limited accuracy of text mining techniques, text mining-based methods alone may be limited.

Implementations of the data processing systems and methods described in this specification can mitigate the previously mentioned shortcomings by using approaches to drug reuse that are driven by real-world data in order to identify an indication. In some implementations, the data processing systems and methods described in this specification combine analytics and real-world data with KOL clinical input. In some implementations, the data processing systems and methods described in this specification use real-world data and analytics in an unsupervised manner. In some implementations, patients are clustered using machine learning techniques, and groups of pathologies occurring across multiple clusters are identified. In some implementations, the identified group of pathologies potentially corresponds to a common biological pathway. The data processing systems and methods described in this specification can be used to identify potential indications that have not previously been identified using conventional techniques.

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It may be evident, however, that the present disclosure may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present disclosure.

In the drawings, for purposes of explanation, specific arrangements or orderings of the illustrative elements are shown, such as to represent arrangements or orderings of devices, modules, instruction blocks, and data elements. However, it will be understood by those skilled in the art that the particular order or arrangement of the illustrative elements in the figures does not imply that a particular order of processing or process separation is required. Further, the inclusion of schematic elements in the figures does not imply that such elements are required in all implementations, or that features represented by such elements may not be included or combined with other elements in some implementations.

Further, in the drawings, a connecting element, such as a solid or dashed line or arrow, is used to indicate a connection, relationship, or association between two or more other exemplary elements, and the absence of any such connecting element does not imply that a connection, relationship, or association is not present. In other words, some connections, relationships, or associations between elements are not shown in the drawings so as not to obscure the disclosure. Further, for ease of illustration, a single connecting element is used to represent multiple connections, relationships, or associations between elements. For example, where a connecting element represents a communication of signals, data, or instructions, those skilled in the art will appreciate that such element represents one or more signal paths (e.g., buses) as may be required to effect the communication.

Reference will now be made in detail to implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of various described implementations. It will be apparent, however, to one skilled in the art that the various implementations described may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the implementations.

Several features are described below, each of which can be used independently of the other or with any combination of the other features. However, any single feature may not solve any of the problems discussed above, or may only solve one of the problems discussed above. Any feature described in this specification may not fully address some of the issues discussed above. Although headings are provided, data related to a particular heading that is not found in the section having that heading may also be found elsewhere in this specification.

Example data processing System and method

FIG. 1 shows an example of a data processing system 100. In some implementations, the data processing system 100 is configured to process data that may represent medical records of multiple patients to identify new indications for medications (that is, medications to be reused). The system 100 includes a computer processor 110. The computer processor 110 includes computer-readable memory 111 and computer-executable instructions 112. The system 100 also includes a machine learning system 150. The machine learning system 150 includes a machine learning model 120. The machine learning model 120 may be separate from or integrated with the computer processor 110.

The computer-readable medium 111 (or computer-readable memory) may comprise any type of data storage technology suitable to the local technical environment, including, but not limited to, semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, removable memory, disk memory, flash memory, Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), and the like. In some implementations, computer-readable medium 111 includes code segments having executable instructions.

In some implementations, the computer processor 110 includes a general purpose processor. In some implementations, the computer processor 110 includes a Central Processing Unit (CPU). In some implementations, the computer processor 110 includes at least one Application Specific Integrated Circuit (ASIC). The computer processor 110 may also include a general purpose programmable microprocessor, a graphics processing unit, a special purpose programmable microprocessor, a Digital Signal Processor (DSP), a Programmable Logic Array (PLA), a Field Programmable Gate Array (FPGA), special purpose electronic circuitry, or the like, or a combination thereof. The computer processor 110 is configured to execute program code, such as computer-executable instructions 112, and is configured to execute executable logic including a machine learning model 120.

The computer processor 110 is configured to receive data representing medical records for a plurality of patients. For example, the computer processor 110 may receive data from a database including Electronic Medical Record (EMR) data for approximately 94 million patients (or more), each of whom may be identified by a key identifier (ID) that allows matching of patients across different data tables. In some implementations, the data indicates diagnoses, laboratory tests, procedures, medications, patient events, insurance, biomarkers, measurements, clinical status, lifestyle parameters, microbiology, prescriptions, and the like. In some implementations, the data includes data derived through natural language processing. Data may be received via any of a variety of techniques, such as wireless communication, fiber optic communication, USB, CD-ROM, and so forth.

The machine learning system 150 can apply machine learning techniques to train the machine learning model 120. As part of the training of the machine learning model 120, the machine learning system 150 may form a training set of input data by identifying a positive training set of input data items that have been determined to have an attribute of interest, and in some implementations, may form a negative training set of input data items that lack the attribute.

The machine learning system 150 extracts feature values from the input data of the training set, these features being variables that are considered potentially relevant to whether the input data items have relevant attributes. The ordered list of features of the input data is referred to herein as a feature vector of the input data. In some implementations, the machine learning system 150 applies dimensionality reduction to reduce the amount of data in the feature vectors of the input data to a smaller, more representative dataset. For example, the machine learning system 150 may apply Multiple Correspondence Analysis (MCA), Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and the like.

In some implementations, the machine learning system 150 trains the machine learning model 120 using unsupervised machine learning. Typically, unsupervised machine learning draws inferences from a data set of input vectors without reference to known or labeled outcomes. In some implementations, the machine learning system 150 can perform clustering to divide the data points into groups such that data points in the same group are more similar to one another than to data points in other groups. In some implementations, clustering includes performing k-means clustering, in which a one-level (non-nested) partitioning of the data points is created by iteratively partitioning the data set. That is, if k is the desired number of clusters, then in each iteration the data set is partitioned into k disjoint clusters. Processing may continue until the value of a specified clustering criterion function is optimized. In some implementations, the machine learning system 150 is configured to perform bisecting k-means clustering (sometimes called binary k-means clustering). Bisecting k-means clustering typically involves dividing one cluster into two sub-clusters at each bisecting step (e.g., by using k-means with k = 2) until k clusters are obtained. Bisecting k-means clustering may be preferable to standard k-means clustering because it may reduce computation time when k is relatively large, may produce clusters of similar size, and may produce clusters of lower entropy.
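The splitting procedure described above (commonly called bisecting k-means) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used by the machine learning system 150: the rule for choosing which cluster to split (largest within-cluster sum of squared errors), the farthest-point seeding, the fixed iteration cap, and all function names are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, rng, iters=20):
    """Split one cluster into two with ordinary k-means (k = 2)."""
    first = rng.choice(points)
    # Assumption: seed the second center with the point farthest from the first.
    centers = [first, max(points, key=lambda p: sq_dist(p, first))]
    halves = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break
        centers = [centroid(halves[0]), centroid(halves[1])]
    return halves

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)
        if sse(target) == 0:
            break  # nothing left to split: only single or duplicate points remain
        left, right = two_means(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters
```

Each bisecting step runs k-means on a single cluster only, which is why the cost can grow more slowly with k than re-partitioning the whole data set into k clusters at every iteration.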

The computer processor 110 is configured to execute the computer-executable instructions 112 to perform one or more operations. In some implementations, the one or more operations include receiving data representing medical records for a plurality of patients. For example, the computer processor 110 may receive data from a database of Electronic Medical Records (EMRs) including approximately 94 million patients (or more), each of whom may be identified by a key identifier (ID) that allows matching of patients across different data tables. In some implementations, the data indicates diagnoses, laboratory tests, procedures, medications, patient events, insurance, biomarkers, measurements, clinical status, lifestyle parameters, microbiology, prescriptions, and the like. In some implementations, the data includes data derived through natural language processing. Data may be received via any of a variety of techniques, such as wireless communication, fiber optic communication, USB, CD-ROM, and so forth.

In some implementations, the one or more operations include selecting a set of patients based on the medical records. Selecting a set of patients includes determining at least one target signaling pathway associated with a drug for reuse. For example, if the drug for reuse is dupilumab, then the computer processor 110 may determine, based on the known function of the drug, that the drug modulates the interleukin 4 (IL-4) and interleukin 13 (IL-13) signaling pathways. Selecting a set of patients further includes determining one or more indicators based on one or more factors corresponding to a diagnosis associated with the target signaling pathway. For example, sources including medical databases and medical evidence software may be searched using factors such as pathway mechanisms, related clinical conditions, therapeutic analogs, data and epidemiology, and consistency with drug lifecycle management to identify diseases associated with the determined signaling pathways. These diseases can be classified based on the strength of their link to the determined signaling pathway. Categories may include a focused group, an intermediate group, and a broad group. For example, returning to the IL-4/IL-13 example, the focused group of diseases may include diseases that have a direct relationship to the mechanism of action of IL-4/IL-13 on the Th2 pathway, the intermediate group of diseases may include diseases that have an indirect relationship to the mechanism of action of IL-4/IL-13 on the Th2 pathway, and the broad group of diseases may include diseases associated with a broader inflammatory response. Moving from the focused group to the broad group may increase the number of indicators to be considered in selecting a set of patients and may reduce the likelihood that the drug molecule has an effect. Thus, in some implementations, only the focused group, or both the focused group and the intermediate group, are used to select a set of patients. In some implementations, only patients having at least one diagnosis, medication, laboratory test, and/or procedure associated with the determined signaling pathway are selected for inclusion in the set of patients. Detailed examples of the factors and indicators are provided later with reference to Table 1.
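As a minimal sketch of the selection rule above, the following Python keeps only patients who have at least one record whose code belongs to the chosen disease groups. The record layout, the group names used as dictionary keys, and the placeholder codes (`DX_F1`, `DX_I1`, and so on) are hypothetical and are not taken from this specification or from any coding system.

```python
# Hypothetical indicator codes, grouped by strength of link to the target pathway.
INDICATOR_CODES = {
    "focused": {"DX_F1", "LAB_F1"},
    "intermediate": {"DX_I1"},
    "broad": {"DX_B1"},
}

def select_patients(records, indicator_codes, groups=("focused",)):
    """Return the IDs of patients with at least one code in the chosen groups.

    records maps a patient ID to the set of diagnosis, medication,
    laboratory test, and procedure codes found in that patient's records.
    """
    allowed = set().union(*(indicator_codes[g] for g in groups))
    return {pid for pid, codes in records.items() if codes & allowed}

# Toy data: p3 has no pathway-associated record and is never selected.
records = {
    "p1": {"DX_F1", "DX_OTHER"},
    "p2": {"DX_I1"},
    "p3": {"DX_OTHER"},
}
focused_only = select_patients(records, INDICATOR_CODES)
focused_and_intermediate = select_patients(
    records, INDICATOR_CODES, groups=("focused", "intermediate")
)
```

Widening `groups` enlarges the selected set, which mirrors the trade-off described above between the number of indicators considered and the strength of the link to the pathway.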

In some implementations, the one or more operations include determining a plurality of patient characteristics (sometimes referred to herein as features) for a set of patients, wherein each patient in the set of patients exhibits at least one patient characteristic of the plurality of patient characteristics. Determining the plurality of patient characteristics may include analyzing the initially received data to identify a wide range of patient characteristics that captures all or a majority of the received data. For example, a wide range of patient characteristics may correspond to diagnoses (e.g., immune conditions, diabetes), prescriptions (e.g., immune drugs, other drug classes), procedures (e.g., human leukocyte antigen typing), and laboratory results (e.g., abnormally high or low IgE). In some implementations, determining the plurality of patient characteristics includes receiving user input (e.g., through a user interface in communication with the computer processor 110). For example, a user may enter patient characteristics based on clinical input, demographics, drug treatment, complications, procedures, and laboratory test data specific to immunology. Custom characteristic categories may also be added to improve data completeness and representativeness and to gather more information about disease and drug response. In some implementations, determining the plurality of patient characteristics includes validating the plurality of patient characteristics. Validation may include determining whether the patient characteristics of the initially received data are correctly mapped to the selected set of patients by calculating the percentage of selected patients having at least one characteristic in each characteristic family (e.g., the percentage of patients having prescription records) and comparing this percentage to the percentage of patients having at least one characteristic in each characteristic family in the initially received data. The closer the two percentages are, the more likely it is that the mapping has been completed correctly. Validation may also include determining whether the patient characteristics have been mapped to the correct patients by identifying a plurality of patients included in both the initially received data and the selected set of patients and verifying that each such patient has the same patient characteristics in both.
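The two validation checks described above can be illustrated with a short Python sketch. The data layout (a mapping from patient ID to a set of characteristic codes), the family definitions, and the function names are assumptions made for the example; the specification does not prescribe them.

```python
def family_coverage(patients, families):
    """Share of patients with at least one characteristic in each family."""
    n = len(patients)
    return {
        family: sum(1 for chars in patients.values() if chars & members) / n
        for family, members in families.items()
    }

def coverage_gaps(source, selected, families):
    """Absolute difference in per-family coverage between the two data sets."""
    src = family_coverage(source, families)
    sel = family_coverage(selected, families)
    return {family: abs(src[family] - sel[family]) for family in families}

def mismatched_patients(source, selected):
    """Patients present in both data sets whose characteristics differ."""
    return sorted(
        pid for pid in selected
        if pid in source and source[pid] != selected[pid]
    )

# Toy data with hypothetical characteristic codes.
families = {"prescription": {"rx_a", "rx_b"}, "diagnosis": {"dx_a", "dx_b"}}
source = {
    "p1": {"rx_a"},
    "p2": {"dx_a"},
    "p3": {"rx_b", "dx_a"},
    "p4": {"dx_b"},
}
selected = {"p1": {"rx_a"}, "p3": {"rx_b", "dx_a"}}
gaps = coverage_gaps(source, selected, families)
```

Small gaps suggest that the mapping preserved each characteristic family, and an empty list from `mismatched_patients` suggests that characteristics were attached to the correct patients. What counts as "small" is a judgment the specification leaves open.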

In some implementations, the one or more operations include grouping a set of patients according to a plurality of patient characteristics (e.g., as defined by features related to the determined signaling pathways) to generate a plurality of unique groups, wherein each of the unique groups includes at least one patient of the set of patients. For example, one or more computer processors 110 may execute the machine learning model 120 to perform a clustering technique, such as the bisecting k-means clustering technique described above. Clustering can result in a plurality of patient clusters (e.g., a plurality of unique patient groups), wherein patients in one cluster are more similar to each other than to patients in other clusters in terms of their corresponding patient characteristics. In some implementations, the generated clusters may show correlations between patient characteristics even if these patient characteristics do not occur in the same patient. Clinical inputs may be received and used at various stages of the clustering process to ensure clinical relevance of the resulting clusters. For example, clinical input by a disease expert may facilitate the creation of clinically relevant cohorts, the inclusion and grouping of clinically relevant features, and the validation and evaluation of clusters. A patient characteristic may be identified as unique in a cluster if it occurs more frequently in that cluster than in the general population (e.g., the selected set of patients as a whole).

In some implementations, Multiple Correspondence Analysis (MCA) is used to reduce the dimensionality of patient characteristics. Bisecting k-means may facilitate proper and efficient separation of patients into sufficiently "compact" but stable clusters, and may allow a large number of clusters exhibiting immunological relevance to be used to score patient characteristics, as will be explained in more detail later. The generated clusters may be presented (e.g., via a user interface) to a user (e.g., a clinical expert) for verification and evaluation. This may reduce the risk of clusters whose properties cannot be interpreted and help ensure that different clusters do not share overlapping features.

In some implementations, the one or more operations include selecting a set of unique groups of the plurality of unique groups based on one or more group selection criteria. In some implementations, selecting a set of unique groups includes ranking the groups and selecting a number of the highest ranked groups (e.g., the top 60 ranked groups). These groups can be ranked based on immunological enrichment, stability, purity and size. In some implementations, one or more metrics (sometimes referred to as feature scores in this specification) for each patient characteristic are calculated to rank the clusters. The one or more metrics may include, for example, uniqueness (sometimes referred to herein as a "lift score"), the number of patients within a cluster exhibiting the patient characteristic, and an immunological score. The uniqueness score measures how unique a patient characteristic is within a certain cluster compared to the population (e.g., if males account for 50% of the population and 75% of the cluster, then the lift score is 75% / 50% = 1.5). In some implementations, only patient characteristics that exceed a threshold lift score (e.g., 1) and occur in a percentage of the cluster's patients that exceeds a threshold percentage (e.g., 10%) are considered for defining a cluster and correspond to the theme of the cluster. Patient characteristics considered for defining a cluster may be referred to in this specification as potentially relevant patient characteristics. An immunological score may then be given to a patient characteristic (e.g., a patient characteristic considered to define a cluster, or all patient characteristics), which scores the patient characteristic according to its type (e.g., disease, drug, laboratory test, procedure, etc.) and immunological relevance. The patient characteristic scores within each cluster may then be aggregated (e.g., summed) and normalized. Clusters that meet a threshold cluster score (e.g., 50%) can then be considered immunologically specific.
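The uniqueness score described above is the ratio usually known as lift. As a hedged illustration, the sketch below computes it and applies the two example thresholds (a score above 1 and presence in more than 10% of the cluster's patients). It follows the worked example in the text by comparing the cluster with the whole selected population; the data layout and function names are assumptions.

```python
def share(patients, characteristic):
    """Fraction of patients (each a set of characteristics) showing one characteristic."""
    return sum(characteristic in p for p in patients) / len(patients)

def lift(cluster, population, characteristic):
    """Prevalence in the cluster divided by prevalence in the population."""
    return share(cluster, characteristic) / share(population, characteristic)

def defining_characteristics(cluster, population, min_lift=1.0, min_share=0.10):
    """Characteristics that pass both thresholds, mapped to their lift."""
    result = {}
    for characteristic in set().union(*cluster):
        score = lift(cluster, population, characteristic)
        if score > min_lift and share(cluster, characteristic) > min_share:
            result[characteristic] = score
    return result

# Reproduces the example in the text: males are 50% of the population
# and 75% of the cluster, giving a lift of 1.5.
population = [
    {"male"}, {"male"}, {"male"}, {"female"},
    {"male"}, {"female"}, {"female"}, {"female"},
]
cluster = population[:4]
male_lift = lift(cluster, population, "male")  # 0.75 / 0.5 = 1.5
```

Here "female" has a lift of 0.5 in the cluster and is therefore excluded from the cluster's defining characteristics.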

Selecting a set of unique groups may include evaluating one or more of stability, purity, and number of patients within each cluster. Stability can be assessed using one or more of the following methods: (1) reproducing clusters on data sets of different sizes; (2) changing the initialization seed of the clustering; (3) changing the number of clusters generated; and (4) applying a train-test method. For each cluster in the training set, stability may be defined as the maximum proportion of its patients that are also grouped together in the test set. Purity can be measured by the intra-cluster variance of the MCA scores of the patients within a cluster, where low variance indicates a homogeneous and dense cluster. In some implementations, a cluster is selected if it exceeds a threshold stability percentage (e.g., 50%) and exceeds a threshold purity percentage (e.g., the purity of the cluster is in the highest 20% of all clusters).
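The stability and purity measures can be sketched as follows. This is one possible reading of the definitions above, stated as an assumption: stability is the largest fraction of a training cluster's patients that land in a single test cluster, and purity is assessed through the mean squared distance of the patients' MCA coordinates to the cluster centroid (lower variance meaning higher purity). Function names and data layout are hypothetical.

```python
def stability(train_cluster, test_clusters):
    """Largest share of a training cluster that stays together in one test cluster.

    train_cluster is a set of patient IDs; test_clusters is a list of such sets.
    """
    best_overlap = max(len(train_cluster & t) for t in test_clusters)
    return best_overlap / len(train_cluster)

def intra_cluster_variance(scores):
    """Mean squared distance of per-patient MCA coordinates to their centroid."""
    n = len(scores)
    center = [sum(col) / n for col in zip(*scores)]
    return sum(
        sum((x - c) ** 2 for x, c in zip(row, center)) for row in scores
    ) / n

# Toy example: three of four patients stay together, so stability is 0.75.
train = {"p1", "p2", "p3", "p4"}
test = [{"p1", "p2", "p3", "p9"}, {"p4", "p8"}]
cluster_stability = stability(train, test)

# Two patients at MCA coordinates (0, 0) and (2, 0) give a variance of 1.0.
cluster_variance = intra_cluster_variance([[0.0, 0.0], [2.0, 0.0]])
```

With the example thresholds from the text, this toy cluster would pass the stability test (0.75 exceeds 50%), while the purity test would additionally require comparing its variance against the variances of all other clusters.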

In some implementations, the one or more operations include identifying one or more relevant patient characteristics (e.g., indications) by analyzing each unique group of the set of unique groups. Identifying one or more relevant patient characteristics may include ranking the patient characteristics presented by each selected cluster (e.g., all patient characteristics or the patient characteristics considered to define the cluster). The ranking may be based on the frequency of co-occurrence with each of a plurality of established characteristics of the drug for reuse, which are used as references (e.g., if the drug is dupilumab, the reference characteristics may include asthma, atopic dermatitis, IgE allergy, and a comprehensive immunological score). Co-occurrence can be measured by calculating the patient-weighted proportion of clusters that contain both the patient characteristic and a reference. In some implementations, one or more patient characteristics determined by a subject-matter expert to be relevant to the core cluster topic (e.g., as indicated by user input received through the user interface) may also be considered for evaluation, regardless of the number of patients presenting these characteristics (which may be < 10%).
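One possible reading of the patient-weighted co-occurrence measure is sketched below: each cluster is weighted by its number of patients, and the score is the weighted share of clusters that contain both the candidate characteristic and the reference. This interpretation, the data layout, and the candidate names are assumptions made for illustration only.

```python
from typing import List, Sequence, Set, Tuple

Cluster = Tuple[int, Set[str]]  # (number of patients, characteristics defining the cluster)


def co_occurrence(clusters: Sequence[Cluster], characteristic: str, reference: str) -> float:
    """Patient-weighted share of clusters containing both the characteristic and the reference."""
    total = sum(size for size, _ in clusters)
    both = sum(
        size for size, feats in clusters if characteristic in feats and reference in feats
    )
    return both / total if total else 0.0


def rank_by_co_occurrence(
    clusters: Sequence[Cluster], candidates: Sequence[str], reference: str
) -> List[Tuple[str, float]]:
    """Sort candidate characteristics by descending co-occurrence with one reference."""
    scores = [(c, co_occurrence(clusters, c, reference)) for c in candidates]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up clusters; "candidate_a" and "candidate_b" are placeholder characteristics.
clusters = [
    (100, {"asthma", "candidate_a"}),
    (300, {"asthma", "candidate_a", "candidate_b"}),
    (600, {"candidate_b"}),
]
print(rank_by_co_occurrence(clusters, ["candidate_b", "candidate_a"], "asthma"))
# [('candidate_a', 0.4), ('candidate_b', 0.3)]
```

With several references, the same ranking could be repeated once per reference and the resulting ranks compared or combined; how they are combined is not specified here.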

Identifying one or more patient characteristics may include assessing the clinical and commercial viability of the patient characteristics. For example, patient characteristics that show a unique clinical diagnosis can be identified. The commercial assessment may be based on the availability of data indicative of predicted sales and competitor assets, a determined association with the targeted signaling pathways (whether or not found in a publication), the worldwide prevalence of the patient characteristic, and the disability-adjusted life years (DALYs) of the patient characteristic (e.g., per 100,000 life years). In some implementations, at least one of the one or more patient characteristics may be identified as a target indication for reusing the drug. Thus, in some implementations, the one or more operations output one or more new indications for the drug for reuse.

Although this specification generally describes a patient as a human patient, implementations are not so limited. For example, a patient may refer to a non-human animal, plant, or human replication system.

While this specification generally describes receiving data corresponding to approximately 94 million patients for purposes of illustration, it should be understood that the data may correspond to fewer or more patients.

Although the drug for reuse is described herein as dupilumab for illustrative purposes, it is to be understood that the drug for reuse can be any drug.

Fig. 2 is a flow chart illustrating an example method 200 for reusing a medication. Method 200 may be performed by data processing system 100 previously described with reference to FIG. 1. The method 200 comprises the following steps: receiving data representing a medical record (block 210); selecting a group of patients (block 220); determining a plurality of patient characteristics (block 230); grouping the set of patients to generate a plurality of unique groups (block 240); selecting a set of unique groups (block 250); and identifying one or more relevant patient characteristics (block 260).

At block 210, data representing medical records for a plurality of patients is received. For example, data may be received from a database comprising EMRs for approximately 94 million patients (or more) who can be identified by key IDs that allow patients to be matched across different data tables. In some implementations, the data indicates diagnoses, laboratory tests, surgery, medications, patient events, insurance, biomarkers, measurements, clinical status, lifestyle parameters, microbiology, prescriptions, medical images, and the like. In some implementations, the data includes natural language processing driven data. Data may be received via any of a variety of techniques, such as wireless communication, fiber optic communication, USB, CD-ROM, and so forth.

At block 220, a group of patients is selected. In some implementations, selecting the group of patients includes determining at least one target signaling pathway associated with the drug for reuse. For example, if the drug is dupilumab, the drug can be determined to modulate the interleukin 4 (IL-4) and interleukin 13 (IL-13) signaling pathways based on the known function of the drug. In some implementations, one or more indications are determined based on one or more factors corresponding to diagnoses associated with the target signaling pathway. For example, sources including medical databases and medical evidence software may be searched using factors such as pathway mechanisms, related clinical conditions, treatment analogs, data and epidemiology, and drug lifecycle management consistency to identify diseases associated with the determined signaling pathways. These diseases can be classified based on the strength of their link to the determined signaling pathway. Categories may include a focused group, a medium group, and a broad group. For example, returning to the IL-4/IL-13 example, the focused disease group may include diseases that have a direct relationship to the mechanism of action of IL-4/IL-13 on the Th2 pathway, the medium disease group may include diseases that have an indirect relationship to the mechanism of action of IL-4/IL-13 on the Th2 pathway, and the broad disease group may include diseases associated with a broader inflammatory response. Moving from the focused group to the broad group may increase the number of indications to be considered in selecting a group of patients and may reduce the likelihood of molecular collisions. Thus, in some implementations, only the focused group, or both the focused group and the medium group, are used to select a group of patients. In some implementations, only patients having at least one diagnosis, medication, laboratory test, and/or procedure associated with the determined signaling pathway are selected for inclusion in the group of patients.
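The category-based selection described for block 220 can be sketched as a simple filter. The category labels, disease names, and data layout below are placeholders chosen for illustration and are not part of this specification; the sketch covers only diagnoses, although medications, laboratory tests, and procedures could be handled the same way.

```python
from typing import Dict, List, Sequence, Set


def select_patients(
    diagnoses_by_patient: Dict[str, Set[str]],
    category_by_disease: Dict[str, str],
    allowed_categories: Sequence[str] = ("focused", "medium"),
) -> List[str]:
    """Return patients with at least one diagnosis whose category is allowed."""
    allowed = set(allowed_categories)
    return sorted(
        patient
        for patient, diagnoses in diagnoses_by_patient.items()
        if any(category_by_disease.get(d) in allowed for d in diagnoses)
    )


# Placeholder classification of diseases by strength of link to the pathway.
category_by_disease = {
    "disease_direct": "focused",
    "disease_indirect": "medium",
    "disease_inflammatory": "broad",
}
diagnoses_by_patient = {
    "p1": {"disease_direct"},
    "p2": {"disease_inflammatory"},
    "p3": {"disease_indirect", "disease_inflammatory"},
    "p4": {"unrelated_disease"},
}

print(select_patients(diagnoses_by_patient, category_by_disease))                # ['p1', 'p3']
print(select_patients(diagnoses_by_patient, category_by_disease, ("focused",)))  # ['p1']
```

Restricting the allowed categories narrows the cohort, which mirrors the trade-off described above between the number of indications considered and the specificity of the link to the pathway.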

At block 230, a plurality of patient characteristics for the set of patients are determined, wherein each patient in the set of patients exhibits at least one patient characteristic of the plurality of patient characteristics. Determining the plurality of patient characteristics may include analyzing the initially received data to identify a wide range of patient characteristics that capture all or a majority of the received data. For example, a wide range of patient characteristics may correspond to diagnoses (e.g., immune conditions, diabetes), prescriptions (e.g., immune drugs, other drug classifications), surgery (e.g., human leukocyte antigen typing), and laboratory results (e.g., abnormally high or low IgE). In some implementations, determining the plurality of patient characteristics includes receiving a user input (e.g., through a user interface). For example, a user may enter patient characteristics based on clinical input and demographic data, drug treatment, complications, surgery, and laboratory test data specific to immunology. Customized characteristic classifications may also be added to improve data integrity and representativeness and to gather more information about disease and drug response. In some implementations, determining the plurality of patient characteristics includes validating the plurality of patient characteristics. Validation may include determining whether the patient characteristics of the initially received data are correctly mapped to the selected set of patients by calculating, for each characteristic family, the percentage of selected patients having at least one characteristic in that family (e.g., the percentage of patients having prescription records) and comparing this percentage to the percentage of patients having at least one characteristic in the same family in the initially received data. The closer the two percentages are, the more likely it is that the mapping has been completed correctly. Validation may also include determining whether the patient characteristics have been mapped to the correct patients by identifying a plurality of patients included in both the initially received data and the selected set of patients and verifying that the patient characteristics mapped to each of these patients are the same in both.
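The two validation checks can be sketched as follows, assuming for illustration that both datasets are held as nested dictionaries keyed by patient and by characteristic family. The names and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Set

Records = Dict[str, Dict[str, Set[str]]]  # patient id -> family -> characteristics


def family_coverage(records: Records, family: str) -> float:
    """Share of patients with at least one characteristic in the given family."""
    if not records:
        return 0.0
    hits = sum(1 for families in records.values() if families.get(family))
    return hits / len(records)


def mismatched_patients(source: Records, mapped: Records, sample: Iterable[str]) -> List[str]:
    """Sampled patients whose characteristics differ between the two datasets."""
    return [pid for pid in sample if source.get(pid) != mapped.get(pid)]


# Made-up records: the mapped table has lost one laboratory result for p2.
source = {
    "p1": {"prescription": {"rx_a"}, "lab": {"lab_x"}},
    "p2": {"prescription": set(), "lab": {"lab_y"}},
}
mapped = {
    "p1": {"prescription": {"rx_a"}, "lab": {"lab_x"}},
    "p2": {"prescription": set(), "lab": set()},
}

for family in ("prescription", "lab"):
    print(family, family_coverage(source, family), family_coverage(mapped, family))
# prescription 0.5 0.5
# lab 1.0 0.5

print(mismatched_patients(source, mapped, ["p1", "p2"]))  # ['p2']
```

The coverage comparison flags the "lab" family because the two percentages differ, and the spot check on sampled patients then points to the affected patient.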

At block 240, the set of patients is grouped according to the plurality of patient characteristics (e.g., as defined by features related to the determined signaling pathways) to generate a plurality of unique groups, wherein each of the unique groups includes at least one patient of the set of patients. For example, a clustering technique, such as the bisecting k-means clustering technique described above, may be performed on the set of patients using the plurality of patient characteristics. Clustering can result in a plurality of patient clusters (e.g., a plurality of unique patient groups), wherein patients in one cluster are more similar to each other, in terms of their corresponding patient characteristics, than to patients in other clusters. In some implementations, the generated clusters may show correlations between patient characteristics even if these patient characteristics do not exist in the same patient. Clinical inputs may be received and used at various stages of the clustering process to ensure the clinical relevance of the resulting clusters. For example, clinical input by a disease expert may facilitate the creation of clinically relevant cohorts, the inclusion and grouping of clinically relevant features, and the validation and evaluation of clusters. A patient characteristic may be identified as unique in a cluster if it occurs more frequently in the cluster than in the general population (e.g., the selected set of patients as a whole).
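For readers unfamiliar with the bisecting variant of k-means, the following is a minimal pure-Python sketch: it starts from a single cluster and repeatedly splits the cluster with the largest within-cluster sum of squares using 2-means. The deterministic farthest-point seeding, the choice of which cluster to split, and the toy points are assumptions made for this illustration and do not describe the implementation used in this specification.

```python
from typing import List, Sequence

Point = Sequence[float]


def _centroid(points: List[Point]) -> List[float]:
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def _dist2(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _sse(points: List[Point]) -> float:
    center = _centroid(points)
    return sum(_dist2(p, center) for p in points)


def _two_means(points: List[Point], max_iter: int = 50) -> List[List[Point]]:
    """Split points into two groups with 2-means; the second group may be empty."""
    center = _centroid(points)
    first = max(points, key=lambda p: _dist2(p, center))   # farthest from the centroid
    second = max(points, key=lambda p: _dist2(p, first))   # farthest from the first seed
    centers = [list(first), list(second)]
    groups: List[List[Point]] = [list(points), []]
    for _ in range(max_iter):
        new_groups: List[List[Point]] = [[], []]
        for p in points:
            nearest = 0 if _dist2(p, centers[0]) <= _dist2(p, centers[1]) else 1
            new_groups[nearest].append(p)
        if not new_groups[0] or not new_groups[1] or new_groups == groups:
            break  # degenerate split or converged
        groups = new_groups
        centers = [_centroid(g) for g in groups]
    return groups


def bisecting_kmeans(points: List[Point], k: int) -> List[List[Point]]:
    """Repeatedly bisect the cluster with the largest sum of squared errors until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: _sse(clusters[i]))
        left, right = _two_means(clusters.pop(worst))
        if not right:  # the cluster could not be split any further
            clusters.append(left)
            break
        clusters.extend([left, right])
    return clusters


# Toy two-dimensional points standing in for reduced patient features.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
for cluster in sorted(sorted(c) for c in bisecting_kmeans(points, 3)):
    print(cluster)
# [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
# [(10.0, 10.0), (10.0, 11.0)]
# [(20.0, 0.0), (21.0, 0.0)]
```

At the scale discussed in this specification, a library implementation would normally be preferred; for example, scikit-learn provides a `BisectingKMeans` estimator.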

At block 250, a group of unique groups of the plurality of unique groups is selected based on one or more group selection criteria. In some implementations, selecting a set of unique groups includes ranking the groups and selecting a number of the highest ranked groups (e.g., top 60 ranked groups). These groups can be ranked based on immunological enrichment, stability, purity and size. In some implementations, one or more metrics for each patient characteristic are calculated to rank the clusters. The one or more metrics may include, for example, uniqueness (sometimes referred to herein as a "boost score"), number of patients within a cluster exhibiting patient characteristics, and an immunological score. The uniqueness score measures how unique a patient characteristic is within a certain cluster compared to the rest of the population (e.g., if a male accounts for 50% of the population and 75% of the cluster, then the "boost score" may be equal to 1.5). In some implementations, only patient characteristics that exceed a threshold boost score (e.g., 1) and occur in a percentage of patients that exceed a threshold percentage of patients (e.g., 10%) are considered for defining a cluster and correspond to the subject matter of the cluster. Patient characteristics considered for defining a cluster may be referred to in this specification as potentially relevant patient characteristics. An immunological score may then be given to a patient characteristic (e.g., a patient characteristic considered to define a cluster or all patient characteristics), which scores the patient characteristic according to the type of the patient characteristic (e.g., disease, drug, laboratory test, surgery, etc.) and immunological relevance. The patient characteristic scores within each cluster may then be aggregated (e.g., summed) and normalized. Clusters that meet a threshold cluster score (e.g., 50%) can then be considered immunologically specific.
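The aggregation, normalization, and ranking of clusters described for block 250 might look like the sketch below. The normalization used here (dividing the summed scores by the maximum achievable sum) and the 0 to 3 scoring scale are assumptions, since the text does not specify them; the names and sample clusters are hypothetical.

```python
from typing import Dict, List, Sequence


def cluster_score(feature_scores: Sequence[float], max_score_per_feature: float) -> float:
    """Sum the per-characteristic scores and normalize by the maximum achievable sum."""
    if not feature_scores:
        return 0.0
    return sum(feature_scores) / (len(feature_scores) * max_score_per_feature)


def select_top_clusters(
    clusters: Dict[str, dict],
    max_score_per_feature: float = 3.0,  # hypothetical scoring scale (0 to 3)
    min_score: float = 0.5,              # example threshold cluster score from the text
    top_n: int = 60,                     # example number of clusters from the text
) -> List[str]:
    """Keep immunologically specific clusters and return the largest top_n of them."""
    specific = [
        name
        for name, info in clusters.items()
        if cluster_score(info["feature_scores"], max_score_per_feature) >= min_score
    ]
    return sorted(specific, key=lambda name: clusters[name]["size"], reverse=True)[:top_n]


# Made-up clusters with a patient count and per-characteristic immunological scores.
clusters = {
    "c1": {"size": 500, "feature_scores": [3, 2, 3]},
    "c2": {"size": 900, "feature_scores": [0, 1, 0]},
    "c3": {"size": 700, "feature_scores": [2, 2]},
}
print(select_top_clusters(clusters, top_n=2))  # ['c3', 'c1']
```

Cluster "c2" is the largest but falls below the 50% score threshold, so it is not ranked. Stability and purity filters could be applied before the final ranking by size.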

At block 260, one or more relevant patient characteristics are identified by analyzing each unique group of the set of unique groups. Identifying one or more relevant patient characteristics may include ranking the patient characteristics presented by each selected cluster (e.g., all patient characteristics or the patient characteristics considered to define the cluster). The ranking may be based on the frequency of co-occurrence with each of a plurality of established characteristics of the drug for reuse, which are used as references (e.g., if the drug is dupilumab, the reference characteristics may include asthma, atopic dermatitis, IgE allergy, and a comprehensive immunological score). Co-occurrence can be measured by calculating the patient-weighted proportion of clusters that contain both the patient characteristic and a reference. In some implementations, one or more patient characteristics determined by a subject-matter expert to be relevant to the core cluster topic (e.g., as indicated by user input received through the user interface) may also be considered for evaluation, regardless of the number of patients in whom these characteristics occur (which may be < 10%).

Identifying one or more patient characteristics may include assessing the clinical and commercial viability of the patient characteristics. For example, patient characteristics that show a unique clinical diagnosis can be identified. The commercial assessment may be based on the availability of data indicative of predicted sales and competitor assets, a determined association with the targeted signaling pathways (whether or not found in a publication), the worldwide prevalence of the patient characteristic, and the disability-adjusted life years (DALYs) of the patient characteristic (e.g., per 100,000 life years). In some implementations, at least one of the one or more patient characteristics may be identified as a target indication for reusing the drug.

Results of the experiment

Fig. 3 is a diagram illustrating experiments performed using the systems and methods described in this specification. The experiments were performed to validate a real-world-data (RWD) driven approach to drug reuse for dupilumab (an anti-IL-4/IL-13 drug) in order to identify novel indications for the drug. One goal of the experiments was to reduce drug development costs and time to market while minimizing losses and risks. A hybrid approach was employed that combined scientific and clinical capabilities, obtained through key opinion leader (KOL) expertise, with business evaluation and analytics on real-world data.

Data source: the Optum Humedica dataset from 2014 to 2018 was used. The database contains electronic medical records (EMRs) for 94 million patients, identifiable by key identifiers that allow patients to be matched across different data tables. The database collects EMR data such as diagnoses, laboratory tests, surgery, medications, patient events, insurance, biomarkers, measurements, clinical status and lifestyle parameters, microbiology, and prescriptions. Natural language processing (NLP) driven tables were not included due to limited data coverage and clinical relevance. In addition, data tables that were incomplete or contained irrelevant information were excluded. A total of 5 data tables were included, reducing the data source to 40 million patients.

Patient selection: the indications used for patient selection were based on the clinical framework associated with the underlying immunological pathway, as shown in Table 1.

Table 1: Patient selection: five factors considered for identification of indications

Only adult patients (age ≥ 18 years) who had at least one diagnosis, medication, laboratory test, and surgery, and who had a diagnosis related to the IL-4/IL-13 pathway (i.e., the signaling pathway), were selected. Using these criteria for immunological pathology and data integrity, the resulting cohort consisted of 17 million unique patients.

Identification of indications took five factors into account (as shown in Table 2). These factors were searched in sources including the vector event engine database (from a medical evidence software and services company whose platforms include DOC Library, DOC Data, vector event, DOC Label, and DOC Search), PubMed, and clinical trials. The information within these factors was then classified into three shot datasets based on the Th2 response: focused, medium, and broad. For example, a disease was assigned to the focused shot dataset or the medium shot dataset based on whether it has a direct or an indirect relationship, respectively, to the mechanism of action of IL-4/IL-13 on the Th2 pathway, and to the broad shot dataset if the disease is associated with a broader inflammatory response. Moving from the focused shot dataset to the broad shot dataset may increase the range of indications to be considered and may reduce the likelihood of molecular collisions. Indications in the broad shot dataset were ultimately excluded from the analysis because their mechanistic link was not specific enough to meet the criteria for identifying patient populations with similar characteristics. Therefore, only the indications of the focused shot dataset and the medium shot dataset were included in the experiment. The final list included 208 indications for analysis across 17 broad systems.

Feature selection: a broad range of features (patient characteristics) was selected to capture the available information in the Optum dataset; the features selected by the clinical experts were then prioritized and validated to ensure that all basic variables were included and that the variable values had clinical significance. Some features were retained and others (some demographics) were recreated when appropriate (as shown at V1, described later with reference to Fig. 4). New features were added based on clinical input and demographic data, drug treatment, complications, surgery, and laboratory test data specific to immunology. Customized feature classifications were created and iteratively added to the analysis (as shown by V2 and V3, described later with reference to Fig. 4) to improve data integrity and representativeness and to gather more information about the severity of disease and drug responses.

Robust methods were used to ensure the integrity of the features in the final database. Two verification steps (based on patient and feature mapping across the Optum database and the generated tables) were performed to verify that the features were generated correctly. First, to verify that the patient features in Optum were correctly mapped to the data table, the percentage of patients with at least one feature in each feature family was calculated, and it was determined whether this number was the same in Optum Humedica and in the data table. Second, to verify that the features were mapped to the correct patients, ten random patients were tracked from the original Optum Humedica data to the generated dataset to ensure that the mapping of the features in the two datasets was the same. After feature verification had confirmed the correct mapping, the algorithm was run on 17 million patients with 2700 features included.

Clustering: patients sharing similar characteristics, as defined by the features associated with the IL-4/IL-13 pathway, were grouped together using clustering techniques. Clustering looks for similarities between patients based on their characteristics. The clusters generated allow for the discovery of correlations between pathologies, even if these pathologies are not present in the same patient. Clinical input was incorporated at various stages of the procedure to ensure the clinical relevance of the results. Thus, clinical input by the disease specialist facilitated the creation of clinically relevant cohorts, the inclusion and grouping of clinically relevant features, and, ultimately, cluster validation and evaluation. A feature was identified as unique in a cluster if it occurred more frequently in the cluster than in the general population.

Multiple correspondence analysis (MCA) was used to reduce the dimensionality of the features. The data was then divided into 500 clusters using bisecting K-means to provide adequate and efficient separation of patients into sufficiently "compact" but stable clusters and to allow a large number of clusters exhibiting immune relevance to be used for indication scoring. The clusters identified by the process were validated and evaluated by a clinical expert. This step helps reduce the risk of uninterpretable clusters and ensures that there are no overlapping features between different clusters. The clustering method was run on data representing 2700 features (e.g., MCA components) and 17 million patients. The number of clusters generated at the end of the algorithm was 500.

Identification of new indications (i.e., relevant patient characteristics): further evaluation, together with clinical and commercial judgment, was applied to the cluster output to obtain a shorter list of priority signals and to identify the most clinically relevant indications across clusters. Four method steps were used. The first step selected the top 60 ranked clusters among the 500 clusters based on immunological enrichment, stability, purity, and size. Three metrics for the features included in each cluster were computed to determine the selection: uniqueness, the number of patients presenting the feature within each cluster, and the immunological score. Uniqueness (also referred to as the "boost score") measures how unique a feature is within a certain cluster compared to the rest of the population (e.g., if males account for 50% of the population and 75% of the cluster, then the boost score equals 1.5). Only features with a boost score > 1 (which means that the feature appears in the cluster at a higher rate than expected from the broader population) and that appear in ≥ 10% of patients were considered for defining (and naming) the cluster and establishing the subject matter of the cluster. In addition, each selected feature was given a score according to its type (disease, drug, laboratory test, or surgery) and immunological relevance. The feature scores within each cluster were summed and normalized. A cluster was considered immunologically specific if it met a predefined threshold of 50% of the score. In the second step, clusters were selected based on stability, purity, and number of patients. Stability can be assessed using the following four methods: (1) reproducing the clusters on data of different sizes; (2) changing the initialization seed of the clustering; (3) changing the number of clusters generated; and (4) applying the training-testing method. For each cluster in the training set, stability is defined as the maximum proportion of its patients that are also grouped together in the test set. Purity can be measured by the intra-cluster variance of the MCA components of the patients within a cluster, resulting in homogeneous and dense clusters. If the stability of a cluster was greater than 50% and its purity was in the highest 20%, the cluster was included in the analysis. In addition, all indications judged by the subject-matter experts to be relevant to the core cluster topic were considered for evaluation regardless of the number of patients presenting these features (which may be < 10%). Then, in the third step, these new indications were ranked based on the frequency of co-occurrence with each of the four established indications of dupilumab used as references (asthma, atopic dermatitis, IgE allergy, and the comprehensive immunological score). Co-occurrence was measured by calculating the patient-weighted proportion of clusters containing both the indication and the reference. In the last step, the final list of indications was further characterized by clinical and commercial viability. The clinical evaluation retained indications that showed a unique clinical diagnosis. Based on the ranking with IL-4/IL-13 as a reference, pathologies not ranked in the top 30 were removed, as these pathologies appeared to be less involved in IL-4/IL-13 regulation. A commercial evaluation was possible only for the subset of clinically reasonable indications for which data on predicted sales and competitor assets were available. In addition, a number of other factors were considered in the commercial evaluation: the link to the IL-4/IL-13 pathway (whether or not found in the literature), the worldwide prevalence of the indication, and the disability-adjusted life years (DALYs) of the indication (per 100,000 life years).

Results: Fig. 4 is a graph illustrating experimental results produced by using the systems and methods described in this specification. A final cohort of 17 million patients extracted from Optum Humedica was analyzed to assess the integrity and representativeness of the data across the selected features. In the medium shot dataset population, 59% of the patients in the cohort were female, most were Caucasian (77%), the mean age at last activity was 53 years (SD = 7 years), and the mean follow-up period was 7.1 years. The ICD-10 codes most commonly presented by patients as immunological conditions were acute sinusitis (25.2%), allergic rhinitis, unspecified (20.6%), and other and unspecified asthma (19.4%). The most common immunologically relevant drug therapies were prednisone (28.0%), fluticasone furoate (22.2%), and methylprednisolone (13.3%), and 0.4% of patients received allergen immunotherapy injections and β2 glycoprotein antibody assays. The majority of patients were tested for leukocyte count (70.8%), absolute neutrophil count (ANC) (64.3%), and absolute lymphocyte count (ALC) (63.8%). The clustering program created 500 clusters, of which 125 were classified as both enriched for immune pathologies and stable. Of these 125 clusters, 110 were also classified as pure. Of these, the 60 clusters containing the largest number of patients per cluster were retained and analyzed for clinically relevant signals. After the validation process, using the training-testing method, 84% of the clusters were considered highly stable, and 90% and 99% of the top 20 clusters were reproduced regardless of seed position and number of patients in the data table, respectively. Six cluster topics were identified based on the features included in the clusters, and were also partially reproduced in the V2 and V3 iterations: multiple organ immune effects, neoplasia, asthma and other hypersensitivity reactions, musculoskeletal dysfunction, cardiometabolic lineage, and gynecological conditions. A total of 250 indications were selected by cluster evaluation and ranked by co-occurrence with each reference. A list of about 85 indications was further characterized by clinical and commercial viability: of these, about 20 indications did not present a unique clinical diagnosis or had an inadequate clinical rationale for IL-4/IL-13 modulation, and other indications had no readily available commercial assessment information. The final list of indications from the hybrid approach identifies about 90% of the indications already in life cycle management, and about 60% of the additional potential new indications.

Fig. 5 is a block diagram of an example computer system 600 for providing computing functionality associated with algorithms, methods, functions, processes, flows and programs described in this disclosure (e.g., the method 200 previously described with reference to fig. 2), in accordance with some implementations of the present disclosure. The illustrated computer 602 is intended to include any computing device, such as a server, desktop computer, laptop/notebook computer, wireless data port, smart phone, Personal Data Assistant (PDA), tablet computing device, or one or more processors in these devices, including physical instances, virtual instances, or both. The computer 602 may include input devices capable of accepting user information, such as a keypad, keyboard, and touch screen. Further, the computer 602 may include an output device that can communicate information associated with the operation of the computer 602. The information may include digital data, visual data, audio information, or a combination of information. The information may be presented in a graphical User Interface (UI) (or GUI).

The computer 602 may act as a client, a network component, a server, a database, a persistence, or a component of a computer system for executing the subject matter described in this disclosure. The illustrated computer 602 is communicatively coupled with a network 630. In some implementations, one or more components of the computer 602 may be configured to operate in different environments, including cloud computing-based environments, local environments, global environments, and combinations of environments.

At a high level, computer 602 is an electronic computing device operable to receive, transmit, process, store, and manage data and information associated with the described subject matter. According to some implementations, the computer 602 may also include or be communicatively coupled with an application server, an email server, a web server, a cache server, a streaming data server, or a combination of servers.

The computer 602 may receive a request from a client application (e.g., executing on another computer 602) over the network 630. The computer 602 may respond to the received request by processing the received request using a software application. Requests can also be sent to the computer 602 from internal users (e.g., from a command console), external (or third) parties, automated applications, entities, individuals, systems, and computers.

Each component of the computer 602 may communicate using a system bus 603. In some implementations, any or all of the components of the computer 602, including hardware or software components, can interface with each other or with the interface 604 (or a combination of both) via the system bus 603. The interface may use an Application Programming Interface (API)612, a service layer 613, or a combination of API 612 and service layer 613. The API 612 may include specifications for routines, data structures, and object classes. API 612 may be independent of or dependent on the computer language. API 612 may refer to a complete interface, a single function, or a set of APIs.

Service layer 613 can provide software services to computer 602 and other components communicatively coupled to computer 602 (whether shown or not). All service consumers using the service layer may access the functionality of the computer 602. Software services, such as those provided by the service layer 613, may provide reusable, defined functionality through defined interfaces. For example, the interface may be software written in JAVA, C++, or a language that provides data in extensible markup language (XML) format. Although shown as an integrated component of computer 602, in alternative implementations, API 612 or service layer 613 can be a stand-alone component in relation to other components of computer 602 and other components communicatively coupled to computer 602. Further, any or all portions of the API 612 or service layer 613 may be implemented as child modules or sub-modules of another software module, enterprise application, or hardware module without departing from the scope of this disclosure.

The computer 602 includes an interface 604. Although shown in fig. 5 as a single interface 604, two or more interfaces 604 may be used depending on the particular needs, desires, or particular implementations of the computer 602 and the described functionality. Computer 602 may use interface 604 to communicate with other systems connected to network 630 (whether shown or not) in a distributed environment. In general, the interface 604 may include or be implemented using logic encoded in software or hardware (or a combination of software and hardware) that is operable to communicate with the network 630. More specifically, interface 604 may include software that supports one or more communication protocols associated with communications. Thus, the network 630 or the interface's hardware may be used to transmit physical signals both inside and outside the illustrated computer 602.

The computer 602 includes a processor 605. Although illustrated in fig. 5 as a single processor 605, two or more processors 605 may be used depending on the particular needs, desires, or particular implementations of the computer 602 and the described functionality. In general, the processor 605 can execute instructions and can manipulate data to perform operations of the computer 602, including operations using algorithms, methods, functions, procedures, flows, and programs as described in this disclosure.

Computer 602 also includes a database 606 that can hold data for computer 602 and other components (whether shown or not) connected to network 630. For example, database 606 may be an in-memory database, a conventional database, or another type of database storing data consistent with the present disclosure. In some implementations, database 606 can be a combination of two or more different database types (e.g., a hybrid in-memory database and a traditional database) depending on the particular needs, desires, or particular implementations of computer 602 and the described functionality. Although shown as a single database 606 in fig. 5, two or more databases (of the same type, different types, or combinations of types) may be used depending on the particular needs, desires, or particular implementations of the computer 602 and the described functionality. Although database 606 is shown as an internal component of computer 602, in alternative implementations, database 606 may be external to computer 602.

Computer 602 also includes a memory 607 that can hold data for the computer 602 or a combination of components connected to the network 630 (whether shown or not). Memory 607 may store any data consistent with the present disclosure. In some implementations, the memory 607 can be a combination of two or more different types of memory (e.g., a combination of semiconductor and magnetic memory) depending on the particular needs, desires, or particular implementations of the computer 602 and the described functionality. Although illustrated in fig. 5 as a single memory 607, two or more memories 607 (of the same, different, or combination of types) may be used depending on the particular needs, desires, or particular implementations of the computer 602 and the described functionality. While the memory 607 is shown as an internal component of the computer 602, in alternative implementations, the memory 607 may be external to the computer 602.

Application 608 may be an algorithmic software engine that provides functionality according to particular needs, desires, or particular implementations of computer 602 and described functionality. For example, application 608 may act as one or more components, modules, or applications. Further, although shown as a single application 608, the application 608 may be implemented as multiple applications 608 on the computer 602. Additionally, although illustrated as being internal to the computer 602, in alternative implementations, the application 608 can be external to the computer 602.

The computer 602 may also include a power supply 614. The power supply 614 may include a rechargeable or non-rechargeable battery that may be configured to be user replaceable or non-user replaceable. In some implementations, the power supply 614 may include power conversion and management circuitry, including recharging, standby, and power management functions. In some implementations, the power supply 614 can include a power plug that allows the computer 602 to be plugged into a wall outlet or other power source, for example, to power the computer 602 or to recharge a rechargeable battery.

There may be any number of computers 602 associated with or external to the computer system containing the computers 602, each computer 602 communicating over the network 630. Moreover, the terms "client," "user," and other suitable terms may be used interchangeably, as appropriate, without departing from the scope of the present disclosure. Further, the present disclosure contemplates that many users may use one computer 602, and that one user may use multiple computers 602.

Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Software implementations of the described subject matter can be implemented as one or more computer programs. Each computer program may include one or more modules of computer program instructions encoded on a tangible, non-transitory, computer-readable computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded in/on an artificially generated propagated signal. For example, the signals may be machine-generated electrical, optical, or electromagnetic signals generated to encode information for transmission to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access storage device, or a combination of computer storage media.

The terms "data processing apparatus", "computer", and "electronic computer apparatus" (or equivalents thereof as understood by those of ordinary skill in the art) refer to data processing hardware. For example, a data processing apparatus may include all kinds of apparatus, devices, and machines for processing data, including for example a programmable processor, a computer, or multiple processors or computers. The apparatus can also include special purpose logic circuitry, including, for example, a Central Processing Unit (CPU), a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC). In some implementations, the data processing apparatus or special purpose logic circuitry (or a combination of the two) may be hardware based or software based (or a combination of both). The apparatus can optionally include code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of execution environments. The present disclosure contemplates the use of data processing apparatus with or without conventional operating systems such as LINUX, UNIX, WINDOWS, MAC OS, ANDROID, or IOS.

A computer program can also be referred to as or described as a program, software application, module, software module, script, or code, and can be written in any form of programming language. The programming language may include, for example, a compiled, interpreted, declarative, or procedural language. A program can be deployed in any form, including as a stand-alone program, module, component, subroutine, or unit for use in a computing environment. The computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. While the portions of programs shown in the various figures may be illustrated as separate modules implementing various features and functions through various objects, methods, or processes, the programs may alternatively include multiple sub-modules, third party services, components, and libraries. Conversely, the features and functionality of the various components may be combined into a single component as appropriate. Thresholds used to make computational determinations may be determined statically, dynamically, or both statically and dynamically.

The methods, processes, or logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The methods, processes, or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., a CPU, FPGA, or ASIC.

A computer suitable for executing a computer program may be based on one or more general and special purpose microprocessors as well as other types of CPUs. The elements of a computer include a CPU for executing or performing instructions and one or more memory devices for storing instructions and data. Generally, a CPU can receive instructions and data from (and write data to) a memory. A computer may also include, or be operatively coupled to, one or more mass storage devices for storing data. In some implementations, a computer can receive data from and transfer data to a mass storage device, including, for example, magnetic, magneto-optical, or optical disks. Further, the computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a Universal Serial Bus (USB) flash drive).

Computer-readable media (transitory or non-transitory, as appropriate) suitable for storing computer program instructions and data can include all forms of persistent/non-persistent and volatile/non-volatile memory, media and storage. The computer-readable medium may include, for example, semiconductor memory devices such as Random Access Memory (RAM), Read Only Memory (ROM), phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and flash memory devices. The computer readable medium may also include, for example, magnetic devices such as magnetic tapes, cassettes, cartridges, and internal/removable disks. The computer readable medium may also include magneto-optical disks and optical storage devices and technologies including, for example, Digital Video Disks (DVDs), CD-ROMs, DVD+/-R, DVD-RAM, DVD-ROM, HD-DVD, and BLU-RAY discs. The memory may store various objects or data, including caches, classes, frames, applications, modules, backup data, jobs, web pages, web page templates, data structures, database tables, repositories, and dynamic information. The types of objects and data stored in memory may include parameters, variables, algorithms, instructions, rules, constraints, and references. In addition, memory may include logs, policies, security or access data, and reporting files. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Implementations of the subject matter described in this disclosure can be implemented on a computer having a display device for providing interaction with a user, including displaying information to the user (and receiving input from the user). Types of display devices may include, for example, Cathode Ray Tubes (CRTs), Liquid Crystal Displays (LCDs), Light Emitting Diodes (LEDs), and plasma displays. The display device may include a keyboard and pointing device, including, for example, a mouse, trackball, or trackpad. User input may also be provided to the computer through the use of a touch screen, such as a tablet computer surface with pressure sensitivity or a multi-touch screen using capacitive or electrical sensing. Other kinds of devices may be used to provide interaction with the user, including receiving user feedback, including for example sensory feedback including visual feedback, auditory feedback, or tactile feedback. Input from the user may be received in the form of sound, speech or tactile input. In addition, the computer may interact with the user by sending and receiving documents to and from the device used by the user. For example, a computer may send a web page to a web browser on a user's client device in response to a request received from the web browser.

The term "graphical user interface," or "GUI," may be used in the singular or in the plural to describe one or more graphical user interfaces and each display of a particular graphical user interface. Thus, the GUI may represent any graphical user interface, including but not limited to a web browser, touch screen, or Command Line Interface (CLI), that processes information and efficiently presents the results to the user. In general, a GUI may include a plurality of User Interface (UI) elements, some or all of which are associated with a web browser, such as interactive fields, drop-down lists, and buttons. These and other UI elements may be related to or represent functionality of a web browser.

Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server) or that includes a middleware component (e.g., an application server). In addition, the computing system can include a front-end component, e.g., a client computer having one or both of a graphical user interface and a web browser through which a user can interact with the computer. The components of the system can be interconnected by any form or medium of wired or wireless digital data communication (or combination of data communications) in a communication network. Examples of communication networks include a Local Area Network (LAN), a Radio Access Network (RAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a Wireless Local Area Network (WLAN) (e.g., using 802.11a/b/g/n or 802.20 or a combination of protocols), all or a portion of the internet, or any other communication system or systems (or combination of communication networks) at one or more locations. The network may communicate, for example, Internet Protocol (IP) packets, frame relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, or a combination of communication types between network addresses.

The computing system may include clients and servers. A client and server are generally remote from each other and typically can interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship.

The cluster file system may be any type of file system that is accessible for reading and updating from multiple servers. Locking or consistency tracking may not be necessary because locking of the swap file system may be done at the application layer. Further, the Unicode data files may be different than non-Unicode data files.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the claims that issue from this application, in the specific form in which they issue, including any subsequent correction. Any express definitions herein for terms contained in the claims shall govern the meaning of the terms used in the claims. In addition, when we use the term "further comprising" or "further including" in the preceding description or in the following claims, this phrase may be followed by additional steps or entities, or sub-steps/sub-entities of the previously recited steps or entities.

Particular implementations of the subject matter have been described. Other implementations, modifications, and substitutions to the described implementations are apparent to those skilled in the art and are within the scope of the following claims. Although operations are depicted in the drawings or claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order or sequence shown, or that all illustrated operations be performed (some operations may be considered optional) to achieve desirable results. In some cases, multitasking or parallel processing (or a combination of multitasking and parallel processing) may be advantageous and considered appropriate.

Moreover, the separation or integration of various system modules and components in the implementations described above should not be understood as requiring such separation or integration in all implementations, and it should be understood that the described program components and systems can generally be integrated in a single software product or packaged into multiple software products.

Accordingly, the example implementations described previously do not define or limit the present disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.

Moreover, any claimed implementation is considered to apply at least to: a computer-implemented method; a non-transitory computer readable medium storing computer readable instructions to perform a computer-implemented method; and a computer system comprising computer memory operatively interconnected with a hardware processor configured to perform a computer-implemented method or instructions stored on a non-transitory computer-readable medium.

Various embodiments of these systems and methods have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure.

Claims

1. A computer-implemented method for reusing medications, the computer-implemented method comprising:

receiving, by a computer system, data representing medical records for a plurality of patients;

selecting a set of patients based on the medical records by:

determining at least one target signaling pathway associated with the drug; and

determining one or more indicators based on one or more factors corresponding to a diagnosis associated with the target signaling pathway;

determining a plurality of patient characteristics for the set of patients, each patient in the set of patients exhibiting at least one patient characteristic of the plurality of patient characteristics;

grouping, by the computer system, the set of patients according to the plurality of patient characteristics to generate a plurality of unique groups, each of the unique groups including at least one patient of the set of patients;

selecting a set of unique groups of the plurality of unique groups based on one or more group selection criteria; and

identifying one or more relevant patient characteristics by analyzing each unique group of the set of unique groups.

2. The method of claim 1, wherein grouping the set of patients comprises:

executing a machine learning system configured to perform one or more unsupervised clustering techniques.

3. The method of claim 2, wherein the one or more unsupervised clustering techniques comprise a binary k-means clustering technique.
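
As an illustration of the grouping recited in claims 2 and 3, the following is a minimal sketch and not the claimed implementation. The claims do not define "binary k-means"; the sketch assumes one possible reading, namely ordinary k-means applied to binary (0/1) vectors of patient characteristics. The function names, the farthest-first initialization, and the sample data are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(rows, k):
    """Deterministic farthest-first initialization (an assumption, not from the claims)."""
    centroids = [list(map(float, rows[0]))]
    while len(centroids) < k:
        # Pick the row whose nearest existing centroid is farthest away.
        far = max(range(len(rows)),
                  key=lambda i: min(sq_dist(rows[i], c) for c in centroids))
        centroids.append(list(map(float, rows[far])))
    return centroids


def kmeans_binary(rows, k, max_iter=50):
    """Cluster binary characteristic vectors; returns one group label per patient."""
    centroids = init_centroids(rows, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each patient joins the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: one row per patient, one 0/1 column per characteristic.
patients = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
print(kmeans_binary(patients, k=2))  # → [0, 0, 0, 1, 1, 1]
```

Every patient receives exactly one label, so each resulting group contains at least one patient whenever its centroid attracts a member, which matches the "unique groups" language only under the assumptions stated above.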

4. The method of any one of claims 1-3, wherein grouping the set of patients includes performing a multiple correspondence analysis to reduce a dimensionality of the plurality of patient characteristics.

5. The method of any one of claims 1-4, wherein selecting the set of unique groups comprises:

determining, for each unique group of the plurality of unique groups, a feature score for each patient characteristic exhibited by the unique group; and

comparing the feature score of each unique group of the plurality of unique groups to a feature score threshold.

6. The method of claim 1, wherein selecting the set of unique groups comprises at least one of: determining a stability metric for each unique group of the plurality of unique groups; determining a purity metric for each unique group of the plurality of unique groups; and determining a number of patients of the set of patients included in each of the plurality of unique groups.
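
Claims 5 and 6 name group selection criteria without fixing their formulas. The sketch below is one hedged reading: it assumes that the feature score of a characteristic is the fraction of patients in a group who exhibit it, and it keeps a group only if the group has at least a minimum number of patients and at least one characteristic whose score meets the threshold. The stability and purity metrics of claim 6 are not shown. All names, the selection rule, and the sample values are hypothetical.

```python
from collections import Counter


def feature_scores(patients):
    """Assumed feature score: share of the group's patients exhibiting each characteristic.

    `patients` is a list of sets, one set of characteristics per patient.
    """
    counts = Counter(ch for patient in patients for ch in patient)
    return {ch: n / len(patients) for ch, n in counts.items()}


def select_groups(groups, score_threshold, min_patients):
    """Keep groups that are large enough and have a characteristic at or above the threshold."""
    selected = {}
    for name, patients in groups.items():
        if len(patients) < min_patients:
            continue  # size criterion (number of patients in the group)
        strong = {ch: s for ch, s in feature_scores(patients).items()
                  if s >= score_threshold}
        if strong:
            selected[name] = strong
    return selected


# Hypothetical groups produced by an earlier clustering step.
groups = {
    "A": [{"asthma", "eczema"}, {"asthma"}, {"asthma", "rhinitis"}, {"asthma", "eczema"}],
    "B": [{"gout"}, {"psoriasis"}, {"asthma"}],
    "C": [{"eczema"}],
}
print(select_groups(groups, score_threshold=0.75, min_patients=3))
# → {'A': {'asthma': 1.0}}
```

Group B is dropped because no characteristic reaches the threshold, and group C is dropped because it is too small.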

7. The method of claim 1, wherein identifying the one or more relevant patient characteristics comprises:

generating, for each unique group of the set of unique groups, a plurality of potentially relevant patient characteristics by selecting patient characteristics of the plurality of patient characteristics that are exhibited by the unique group and that correspond to a subject matter of the group; and

ranking each of the potentially relevant patient characteristics of the plurality of potentially relevant patient characteristics.

8. The method of claim 7, wherein ranking each of the potentially relevant patient characteristics comprises: for each potentially relevant patient characteristic, assigning a rank value based on a co-occurrence frequency of the potentially relevant patient characteristic and at least one reference indication associated with the drug.

9. The method of claim 8, wherein the co-occurrence is measured by: for each of the potentially relevant patient characteristics, determining a proportion of the set of unique groups that includes both the potentially relevant patient characteristic and the at least one reference indication.
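
The co-occurrence measure of claims 8 and 9 can be sketched as follows. This is an illustrative reading, not the claimed implementation: it assumes each selected group is summarized as the set of characteristics it exhibits, and that ties in the ranking are broken alphabetically. The function names, the placeholder reference indication, and the sample data are hypothetical.

```python
def cooccurrence_proportions(group_characteristics, candidates, reference_indication):
    """For each candidate, the proportion of groups containing both it and the reference."""
    n = len(group_characteristics)
    proportions = {}
    for candidate in candidates:
        both = sum(1 for chars in group_characteristics
                   if candidate in chars and reference_indication in chars)
        proportions[candidate] = both / n if n else 0.0
    return proportions


def rank_candidates(proportions):
    """Highest co-occurrence first; alphabetical tie-break is an assumption."""
    return sorted(proportions, key=lambda c: (-proportions[c], c))


# Hypothetical selected groups, each summarized by the characteristics it exhibits.
selected_groups = [
    {"reference", "asthma", "eczema"},
    {"reference", "asthma"},
    {"asthma", "gout"},
    {"reference", "eczema"},
    {"reference", "asthma"},
]
props = cooccurrence_proportions(selected_groups, ["asthma", "eczema", "gout"], "reference")
print(rank_candidates(props))  # → ['asthma', 'eczema', 'gout']
```

Here "asthma" appears with the reference indication in three of five groups, "eczema" in two, and "gout" in none, which yields the ranking shown.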

10. The method of claim 7, wherein identifying the one or more relevant patient characteristics comprises: determining at least one of a clinical feasibility and a commercial feasibility for each of the potentially relevant patient characteristics.

11. A data processing system for reusing a medication, the data processing system comprising:

a computer-readable memory comprising computer-executable instructions; and

at least one processor configured to execute executable logic comprising the computer-executable instructions and at least one machine learning model to perform one or more operations comprising:

receiving data representing medical records for a plurality of patients;

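selecting a set of patients based on the medical records by:

determining at least one target signaling pathway associated with the drug; and

determining one or more indicators based on one or more factors corresponding to a diagnosis associated with the target signaling pathway;

determining a plurality of patient characteristics for the set of patients, each patient in the set of patients exhibiting at least one patient characteristic of the plurality of patient characteristics;

grouping the set of patients using the machine learning model and according to the plurality of patient characteristics to generate a plurality of unique groups, each of the unique groups including at least one patient of the set of patients;

selecting a set of unique groups of the plurality of unique groups based on one or more group selection criteria; and

identifying one or more relevant patient characteristics by analyzing each unique group of the set of unique groups.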

12. The data processing system of claim 11, wherein the machine learning model is trained to group the set of patients using one or more unsupervised clustering techniques.

13. The data processing system of claim 12, wherein the one or more unsupervised clustering techniques comprise a binary k-means clustering technique.

14. The data processing system of any one of claims 11-13, wherein grouping the set of patients includes performing a multiple correspondence analysis to reduce a dimensionality of the plurality of patient characteristics.

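15. The data processing system of any of claims 11-14, wherein selecting the set of unique groups comprises:

determining, for each unique group of the plurality of unique groups, a feature score for each patient characteristic exhibited by the unique group; and

comparing the feature score of each unique group of the plurality of unique groups to a feature score threshold.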

16. The data processing system of any of claims 11-15, wherein selecting the set of unique groups comprises at least one of: determining a stability metric for each unique group of the plurality of unique groups; determining a purity metric for each unique group of the plurality of unique groups; and determining a number of patients of the set of patients included in each of the plurality of unique groups.

17. The data processing system of claim 11, wherein identifying the one or more relevant patient characteristics comprises:

generating, for each unique group of the set of unique groups, a plurality of potentially relevant patient characteristics by selecting the patient characteristics of the plurality of patient characteristics that are exhibited by the unique group and that correspond to a subject matter of the group; and

ranking each of the potentially relevant patient characteristics of the plurality of potentially relevant patient characteristics.

18. The data processing system of claim 17, wherein ranking each of the potentially relevant patient characteristics comprises: for each potentially relevant patient characteristic, assigning a rank value based on a co-occurrence frequency of the potentially relevant patient characteristic and at least one reference indication associated with the drug.

19. The data processing system of claim 18, wherein the co-occurrence is measured by: for each of the potentially relevant patient characteristics, determining a proportion of the set of unique groups that includes both the potentially relevant patient characteristic and the at least one reference indication.

20. A computer-implemented method for reusing medications, the computer-implemented method comprising:

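receiving, by a computer system, data representing medical records for a plurality of patients;

selecting a set of patients based on the medical records by:

determining at least one target signaling pathway associated with the drug; and

determining one or more indicators based on one or more factors corresponding to a diagnosis associated with the target signaling pathway;

determining a plurality of patient characteristics for the set of patients, each patient in the set of patients exhibiting at least one patient characteristic of the plurality of patient characteristics;

grouping, by the computer system, the set of patients according to the plurality of patient characteristics to generate a plurality of unique groups, each of the unique groups including at least one patient of the set of patients;

selecting a set of unique groups of the plurality of unique groups based on one or more group selection criteria;

identifying one or more relevant patient characteristics by analyzing each unique group of the set of unique groups; and

identifying at least one of the one or more relevant patient characteristics as a target indication for reuse of the drug.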