WO2020184816A1 - Data processing method for deriving new drug candidate substances - Google Patents
- Publication number
- WO2020184816A1 (application PCT/KR2019/017793)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- knowledge network
- nodes
- drug
- correlation
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/20—Heterogeneous data integration
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/40—Searching chemical structures or physicochemical data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/10—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/40—ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
Definitions
- the present invention relates to a new drug development method, and more particularly, to a data processing method for deriving a new drug candidate substance from a human omics database (OMICS Database).
- the technical problem to be solved by the present invention is to provide a data processing method for discovering new drug candidates.
- Another technical problem to be solved by the present invention relates to a method of generating a multi-omics network having a hierarchical structure from a human body omics database (DB) and generating a refined knowledge network from the multi-omics network.
- FIG. 1 is a block diagram of a data processing apparatus for discovering a new drug candidate, according to an exemplary embodiment.
- FIG. 2 is a flowchart illustrating a data processing method for discovering a new drug candidate substance by a data processing device, according to an exemplary embodiment.
- FIG. 3 shows a predetermined search word input according to an embodiment.
- FIG. 4 shows a DB matrix generated in step S205 according to an embodiment.
- FIG. 5 shows a DB matrix generated in step S205 according to an embodiment.
- FIG. 6 is a first knowledge network according to an embodiment.
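- FIG. 7 shows an example of classifying the type of hub according to the participation coefficient (PC), according to an embodiment.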
- FIG. 8 is a second knowledge network generated from a search word "epilepsy syndrome" according to an embodiment.
- FIG. 9 shows an example in which an omics level (biological entity) is input, according to an embodiment.
- FIG. 10 illustrates an example in which a correlation degree type is input according to an embodiment.
- FIG. 11 is a block diagram of a data processing apparatus for discovering a new drug candidate, according to a further embodiment.
- FIG. 12 is a flowchart illustrating a data processing method for discovering a new drug candidate substance by a data processing apparatus, according to an additional embodiment.
- FIG. 13 is a flowchart of a method for a data processing device to search for a drug available path, according to an embodiment.
- a data processing method for discovering new drug candidate substances, performed in a data processing apparatus, comprises: generating, from an omics DB, a DB matrix composed of selected biological entities and selected correlation types; receiving a search word; extracting, from the DB matrix, biological entities related to the search word; extracting, from the DB matrix, a correlation between the search word and the biological entities; generating a first knowledge network in which the search word and the biological entities are each a node and a plurality of nodes are connected by connection lines according to the correlation between the search word and the biological entities or the correlation between the biological entities; computing a graph theory index for each of the plurality of nodes of the first knowledge network; and generating a second knowledge network using some nodes selected, by means of the graph theory index, from among the plurality of nodes of the first knowledge network.
- the search word includes at least one of a gene name, a protein name, a metabolite name, a symptom name, a disease name, a compound name, and a drug name; the graph theory index includes at least one of the shortest path between nodes, a clustering coefficient for each node, and a centrality coefficient for each node; and the weight of a connection line differs according to the type of correlation that the connection line indicates.
- the shortest path between nodes is calculated by reflecting the set weight
- the step of generating the second knowledge network includes calculating a standard score for at least one of the shortest path between nodes, the clustering coefficient for each node, and the centrality coefficient for each node, for each of the plurality of nodes constituting the first knowledge network, and deleting each node whose standard score is less than a threshold value together with the connection lines of that node.
- the standard score is the difference between the index value of a predetermined graph theory index for each node constituting the first knowledge network and the average index value of that graph theory index over a plurality of nodes constituting the first knowledge network, divided by the standard error.
- the DB matrix may be generated so that the selected biological entities are disposed on each of a horizontal axis and a vertical axis, and the correlation type is displayed at the point where the horizontal axis and the vertical axis intersect.
- the step of generating the second knowledge network may include randomly shuffling all connection lines constituting the first knowledge network and then calculating the standard score for each node of the first knowledge network, and the number of random shuffles may be 1000 or more.
- the generating of the second knowledge network may further include deleting a node having only one connection line among the nodes constituting the first knowledge network, and deleting a node having a clustering coefficient of 0 among the nodes constituting the first knowledge network.
- the types of correlation may further include at least one of interaction, cause, present, and localize.
- the method may further include extracting a drug-enabled path from the second knowledge network. The extracting may include: selecting, from among the drug-disease node pairs existing in the second knowledge network, drug-disease node pairs whose standard score of proximity is smaller than a reference value; extracting, from among the paths between the selected drug-disease node pairs, paths in which the number of intermediate nodes is greater than or equal to a reference number; and extracting, from among the extracted paths, a path in which the sum of the centrality coefficients of its intermediate nodes is equal to or greater than a reference value as the drug-enabled path.
- a computer-readable recording medium on which a program for executing the data processing method on a computer is recorded may be provided.
- unit used in the specification may refer to a hardware component or circuit such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC).
- FIG. 1 is a block diagram of a data processing apparatus for discovering a new drug candidate according to an exemplary embodiment
- FIG. 2 is a flowchart of a data processing method for discovering a new drug candidate by a data processing apparatus according to an exemplary embodiment.
- a data processing apparatus 100 for discovering a new drug candidate may include a DB matrix generation unit 105, a search word receiving unit 110, a data extraction unit 120, a data generation unit 130, a data processing unit 140, a data purification unit 150, an output unit 160, and a storage unit 170.
- the data processing device 100 may include at least one computing device.
- the data processing apparatus 100 may include at least one processor and at least one memory.
- the DB matrix generation unit 105 may generate, from the omics DB 200, a DB matrix composed of DBs for at least some of the omics levels (biological entities) and DBs for at least some of the correlation types (S205). The omics levels (biological entities) and correlation types used to generate the DB matrix can be selected by the user.
- to generate the DB matrix, the DB matrix generation unit 105 may receive, as input, at least some omics levels (biological entities) from among the plurality of levels constituting the omics, and at least some correlation types from among the plurality of correlation types constituting the omics.
- Omics is also called somatics. Examples include genomics, transcriptomics, proteomics, metabolomics, epigenomics, and lipidomics, and omics may further cover detailed anatomical structures, biological processes, conduction pathways, pharmacological classes, symptoms, diseases, compounds, drugs, side effects, and the like, but is not limited thereto.
- the multiple omics levels may include a gene level, transcription level, protein level, metabolite level, epigene level, lipid level, anatomical structure level, biological pathway level, conduction pathway level, pharmacological class level, symptom level, disease level, compound level, drug level, side effect level, and the like, but are not limited thereto.
- the anatomical structure may mean a tissue, an organ, etc.
- the biological pathway may be a series of events involving cellular components, such as location at the level of the intracellular structure, and molecular functions extracted from the Gene Ontology, and the pharmacological class may be a pharmacological effect or a mechanism of action.
- the multiple types of correlations may include "interact", "participate", "covariate", "regulate", "associate", "bind", "upregulate", "cause", "resemble", "treat", "downregulate", "palliate", "present", "localize", "include", "express", "decrease", "increase", and the like, and an identification number or identification symbol may be assigned arbitrarily to each type. The identification number or identification symbol for each type may be set by the user or may be set automatically.
- the omics DB 200 may be a big data DB, a DB outside the data processing apparatus 100 according to an embodiment of the present invention, and a global public DB that can be accessed by anyone or by an authorized person under predetermined conditions.
- the omics DB 200 may pre-store information about omics levels (biological entities) and information about the degree of correlation between biological entities within the omics levels.
- the omics DB may include a DB for each omics level and a DB for each type of correlation.
- the DB for each level of omics is, for example, gene DB, transcription DB, protein DB, metabolite DB, epigene DB, lipid DB, anatomical structure DB, biological pathway DB, conduction pathway DB, symptom DB, disease DB, It may include compound DB, drug DB, and side effect DB.
- the DB for each type of correlation may include an interact DB, participate DB, covariate DB, regulate DB, associate DB, bind DB, upregulate DB, cause DB, resemble DB, treat DB, downregulate DB, palliate DB, present DB, localize DB, include DB, express DB, decrease DB, and increase DB.
- These DBs can be managed and operated by integrating into one big data DB, or distributed and managed and operated.
- FIG. 9 shows an example in which an omics level (biological entity) is input to generate a DB matrix according to an embodiment.
- FIG. 10 shows an example in which a correlation type is input to generate a DB matrix according to an embodiment.
- a screen on which a plurality of omics levels can be selected may be displayed through the output unit 160, and at least some of the omics levels may be selected through a user interface.
- a screen on which a plurality of correlation types can be selected may be displayed through the output unit 160, and at least some of the correlation types may be selected through a user interface.
- FIGS. 4 and 5 show examples of the DB matrix. If the user selects all omics levels (biological entities) and all correlation types of the omics DB to generate the DB matrix, the DB matrix may be generated as shown in FIG. 4. Referring to FIG. 4, the selected omics levels (biological entities) are disposed on each of the horizontal and vertical axes, and the selected correlation types are displayed at the points where the horizontal and vertical axes intersect.
- if the user selects, as the omics levels of the DB matrix, the gene level (Gene), protein level (Protein), lipid level (Lipid), metabolite level (Metabolite), anatomical structure level (Anatomy), biological pathway level (Biological Process), cellular component (Cellular Component), molecular function (Molecular Function), drug level, side effect, disease level, pharmacological class, and symptom level (Symptom), and selects, as the correlation types between DBs, covariate (Co), regulate (Reg), upregulate (U), bind (B), downregulate (D), associate (A), resemble (R), treat (T), and palliate (Pa), the DB matrix may be generated as shown in FIG. 5.
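- The DB matrix layout described above (omics levels on both axes, correlation types in the cells) can be sketched as follows. This is only an illustrative sketch: the `RECORDS` sample data and the `build_db_matrix` helper are hypothetical and do not come from the patent.

```python
from collections import defaultdict

# Hypothetical sample records: (source level, correlation type, target level).
# The level and type names follow the description; the records are made up.
RECORDS = [
    ("Gene", "interact", "Gene"),
    ("Gene", "associate", "Disease"),
    ("Drug", "treat", "Disease"),
    ("Drug", "bind", "Gene"),
    ("Drug", "cause", "Side Effect"),
]


def build_db_matrix(records, levels, types):
    """Return {(row level, column level): sorted correlation types}."""
    cells = defaultdict(set)
    for src, rel, dst in records:
        # Keep only the levels and correlation types the user selected.
        if src in levels and dst in levels and rel in types:
            cells[(src, dst)].add(rel)
    return {cell: sorted(rels) for cell, rels in cells.items()}


matrix = build_db_matrix(
    RECORDS,
    levels={"Gene", "Disease", "Drug"},
    types={"interact", "associate", "treat", "bind"},
)
```

Each key of `matrix` corresponds to one intersection of the horizontal and vertical axes, and unselected levels or types simply leave no cell behind.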
- the search word receiving unit 110 may receive a predetermined search word (S200).
- the predetermined search word may be input through a user interface, and may include at least one of a gene name, a protein name, a metabolite name, a symptom name, a disease name, a compound name, and a drug name.
- the user may input a drug called Bupropion as a search word or a disease called epilepsy syndrome as a search word through the search word receiving unit 110.
- FIG. 3 shows an example in which a predetermined search word is input. Referring to FIG. 3, a screen for inputting a predetermined search word may be displayed through the output unit 160, and the predetermined search word may be input through a user interface. FIG. 3 shows an example of selecting a disease name as the category and inputting epilepsy syndrome as the predetermined search word.
- the data extraction unit 120 may extract, using the generated DB matrix, at least one biological entity related to the predetermined search word received in step S200 (S210), and may extract, using the generated DB matrix, the correlation between the predetermined search word and the extracted biological entity (S220).
- the biological entity may include at least one of genes, proteins, metabolites, symptoms, diseases, compounds, and drugs, and the omics level to which the predetermined search word belongs may be the same as or different from the omics level to which the biological entity belongs.
- for example, when the predetermined search word is epilepsy syndrome, the biological entity extracted in step S210 may include at least one of a gene associated with epilepsy syndrome, a protein associated with epilepsy syndrome, a metabolite associated with epilepsy syndrome, a symptom associated with epilepsy syndrome, a disease associated with epilepsy syndrome, a compound associated with epilepsy syndrome, and a drug associated with epilepsy syndrome.
- the biological entity extracted in step S210 may include a plurality of biological entities for each level. For example, when the predetermined search word is epilepsy syndrome, the biological entities extracted in step S210 may include a plurality of genes related to epilepsy syndrome, a plurality of proteins related to epilepsy syndrome, a plurality of metabolites related to epilepsy syndrome, a plurality of symptoms associated with epilepsy syndrome, a plurality of diseases associated with epilepsy syndrome, a plurality of compounds associated with epilepsy syndrome, and a plurality of drugs associated with epilepsy syndrome.
- the data generation unit 130 may generate a first knowledge network by using the results extracted in steps S210 and S220 (S230).
- FIG. 6 is an example of a first knowledge network created according to an embodiment.
- a circle can represent a node, and a line can represent a connection line (edge).
- the first knowledge network may be a graph that uses the predetermined search word received in step S200 and each of the biological entities extracted in step S210 as nodes, and that connects the plurality of nodes with connection lines according to the correlation, extracted in step S220, between the predetermined search word and a biological entity or between biological entities.
- nodes within the same omics level may be connected through a connection line, and nodes in different omics levels may also be connected through a connection line.
- Paths from node A, which is one of the nodes in the first knowledge network, to node B, which is the other, may vary, and all possible paths may be connected by connection lines.
- the knowledge network is a network consisting of interrelationships between biological entities, and may also be referred to as a biological network.
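- The construction of the first knowledge network (steps S210 to S230) can be sketched as a small graph-building routine. The sketch assumes the extracted correlations arrive as (entity, correlation type, entity) triples; the `TRIPLES` data, the placeholder entity names, and `build_first_network` are hypothetical.

```python
from collections import defaultdict

# Hypothetical entity-level triples: (entity, correlation type, entity).
TRIPLES = [
    ("epilepsy syndrome", "associate", "GENE_A"),
    ("epilepsy syndrome", "present", "SYMPTOM_X"),
    ("DRUG_1", "treat", "epilepsy syndrome"),
    ("DRUG_1", "bind", "GENE_A"),
    ("GENE_B", "interact", "GENE_C"),
]


def build_first_network(triples, search_word):
    """Undirected adjacency map over the search word and its related entities."""
    # S210: entities directly related to the search word.
    related = {search_word}
    for a, _, b in triples:
        if search_word in (a, b):
            related.update((a, b))
    # S220/S230: connect nodes by the correlation type between them,
    # including correlations between two extracted entities.
    graph = defaultdict(dict)
    for a, rel, b in triples:
        if a in related and b in related:
            graph[a][b] = rel
            graph[b][a] = rel
    return dict(graph)


network = build_first_network(TRIPLES, "epilepsy syndrome")
```

Entities with no relation to the search word (here `GENE_B` and `GENE_C`) never enter the network, while the edge between `DRUG_1` and `GENE_A` is kept because both endpoints were extracted.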
- the data processing unit 140 may calculate a graph theory index of the first knowledge network generated in step S230 (S240).
- the graph theory index may include at least one of a shortest path between nodes for a plurality of nodes constituting the first knowledge network, a clustering coefficient for each node, a centrality coefficient for each node, and a hub characteristic for each node.
- the shortest path between nodes may mean the shortest path among a number of paths from node A to node B in the first knowledge network.
- a method of calculating the shortest path between Node A as one of the biological entities and Node B as the other one of the biological entities will be described.
- node A and node B may be directly connected, or at least one intermediate node may exist on each path between node A and node B.
- the data processing unit 140 may obtain the shortest path between the node A and the node B by using the number of intermediate nodes for each path. For example, the data processing unit 140 may determine that the path is shorter as the number of intermediate nodes among various paths between node A and node B decreases.
- the data processing unit 140 obtains the shortest path between node A and node B by using the number of intermediate nodes for each path, and may reflect the type of interrelationship for each connection line. That is, the weight can be set differently for each category of correlation, and the weight can be applied to the correlation existing for each path.
- Equation 1 is an example of an equation for calculating the shortest path between nodes.
- w_st is an index of correlation between two nodes s and t
- f is a weight transformation function
- I the shortest path between two nodes i and j.
- the data processing unit 140 determines the value of Equation 1 for each path, and may select a path having the lowest value or the highest value as the shortest path.
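- Equation 1 itself is not reproduced in this text, so the following is only a sketch of a weighted shortest-path search consistent with the description: each connection line gets a cost that depends on its correlation type, and the path with the lowest total cost is selected. The `TYPE_COST` values are invented for illustration, and Dijkstra's algorithm is used here as a standard technique rather than as the patent's stated procedure.

```python
import heapq

# Hypothetical per-type edge costs; the source does not give actual weights.
TYPE_COST = {"treat": 1.0, "bind": 1.0, "associate": 1.5, "interact": 3.0}


def shortest_path(edges, start, goal):
    """Dijkstra over an undirected graph; edges are (node, type, node) triples."""
    adj = {}
    for a, rel, b in edges:
        adj.setdefault(a, []).append((b, TYPE_COST[rel]))
        adj.setdefault(b, []).append((a, TYPE_COST[rel]))
    best = {start: 0.0}
    queue = [(0.0, start, [start])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, step in adj.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return float("inf"), []


EDGES = [
    ("DRUG_1", "interact", "DISEASE_X"),
    ("DRUG_1", "bind", "GENE_A"),
    ("GENE_A", "associate", "DISEASE_X"),
]
cost, path = shortest_path(EDGES, "DRUG_1", "DISEASE_X")
```

With these assumed costs the two-hop route through `GENE_A` beats the direct but costly "interact" edge, which illustrates how type-dependent weights can override a pure count of intermediate nodes.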
- a clustering coefficient for each node may be calculated by Equation 2 and Equation 3.
- the clustering coefficient may be referred to as a grouping coefficient, and may mean a probability that a specific node and neighboring nodes are connected to each other or a connection density between a specific node and neighboring nodes.
- t_i^w is the number of triangles in the graph created around each node i of the knowledge network
- N is the total node set of the knowledge network
- w_ij is the correlation index between node i and node j
- w_ih is a correlation index between node i and node h
- w_jh is a correlation index between node j and node h.
- C^w is the clustering coefficient
- t_i^w is the number of triangles in the graph around each node i of the knowledge network
- k_i is the degree of node i, that is, the degree of connectivity of node i in the knowledge network.
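- Equations 2 and 3 are not reproduced in this text. The sketch below assumes a commonly used weighted form that matches the symbol definitions above: the triangle term around node i sums the cube root of the product w_ij * w_ih * w_jh over neighbour pairs, and the per-node coefficient is 2 * t / (k * (k - 1)). Both the formula choice and the `weighted_clustering` helper are assumptions, not the patent's confirmed equations.

```python
from itertools import combinations


def weighted_clustering(weights):
    """weights: {frozenset({a, b}): w} -> {node: clustering coefficient}."""
    neighbours = {}
    for pair in weights:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    result = {}
    for i, nbrs in neighbours.items():
        k = len(nbrs)  # degree of node i
        if k < 2:
            result[i] = 0.0  # no neighbour pair, so no triangle is possible
            continue
        t = 0.0
        # Each unordered neighbour pair (j, h) that is itself connected
        # closes one triangle around i.
        for j, h in combinations(sorted(nbrs), 2):
            w_jh = weights.get(frozenset((j, h)))
            if w_jh is not None:
                w_ij = weights[frozenset((i, j))]
                w_ih = weights[frozenset((i, h))]
                t += (w_ij * w_ih * w_jh) ** (1 / 3)
        result[i] = 2 * t / (k * (k - 1))
    return result


W = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 1.0,
    frozenset(("B", "C")): 1.0,
    frozenset(("C", "D")): 1.0,
}
coeff = weighted_clustering(W)
```

In this toy graph node `A` sits in a closed triangle and gets a coefficient of 1, node `C` has one closed pair out of three, and the leaf `D` gets 0.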
- the centrality index for each node is an index of whether a specific node functions as a hub, and can be expressed by the D_nodal (nodal degree), BC (betweenness centrality), and E_nodal (nodal efficiency) values.
- the D_nodal value is the connectivity level of each node in the knowledge network, that is, an index indicating how strongly or weakly node i is connected in the knowledge network
- the E_nodal value is the efficiency level of node i in the knowledge network, that is, a value expressed by the reciprocal of the shortest path of Equation 1: the shorter the path, the higher the efficiency
- the BC value is an index indicating the number of times node i serves as a shortcut in the paths between nodes in the knowledge network.
- w_ij is an index of correlation between node i and node j
- N is the total node set of the knowledge network.
- the E_nodal value may be calculated by Equation 5.
- N is a set of all nodes of the knowledge network
- d_ij is the value representing the shortest path calculated in Equation 1.
- g_hj means the shortest distance between nodes h and j
- g_hj(i) means the shortest distance between h and j passing through the node i.
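- The equations for the centrality values are not reproduced in this text. As a rough sketch under assumed forms, the nodal degree can be taken as the sum of the correlation indexes on a node's connection lines, and the nodal efficiency as the mean reciprocal shortest-path length to every other node. Unit edge lengths (hop counts) are used below for brevity instead of the weighted distance of Equation 1, betweenness centrality is omitted, and all function names are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first hop counts from source (unit edge lengths assumed)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def nodal_degree(adj, node):
    """Sum of the correlation indexes on the node's connection lines."""
    return sum(adj[node].values())


def nodal_efficiency(adj, node):
    """Mean reciprocal shortest-path length from node to every other node."""
    dist = hop_distances(adj, node)
    others = [n for n in adj if n != node]
    # Unreachable nodes contribute 0 to the sum.
    return sum(1 / dist[n] for n in others if n in dist) / len(others)


ADJ = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0},
    "C": {"A": 1.0, "D": 0.5},
    "D": {"C": 0.5},
}
```

For this toy graph, `nodal_efficiency(ADJ, "A")` is higher than `nodal_efficiency(ADJ, "B")` because `A` reaches the other nodes in fewer hops, which matches the statement that shorter paths mean higher efficiency.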
- the data processing unit 140 may classify the characteristics of the hub.
- the nature of the hub can be classified into a kinless hub, a connector hub, and a provincial hub.
- the kinless hub means the hub with the highest influence, that is, the hub connected to the nodes in many modules
- the connector hub means the hub that connects the modules in the knowledge network
- the provincial hub mainly has high influence within the module.
- the module may be a structural configuration group in which the entire knowledge network is subdivided.
- the module index (Modularity) in the knowledge network may be calculated as in Equation 7.
- the module index (modularity) refers to the number of module types that constitute the entire knowledge network.
- the participation coefficient (PC) of the knowledge network module may be calculated as shown in Equation 8.
- M means the set of modules, the per-module connection term denotes the number of connections between node i and all other nodes in module m, and module m denotes a structural group into which the entire knowledge network is subdivided.
- the z score (within-module degree) of the knowledge network module may be calculated as in Equation 9.
- m_i means node i in module m
- the within-module degree term means the degree of connection of node i within module m, and the remaining two terms denote the mean and standard deviation of the degree distribution in module m, respectively.
- whether each node is a hub in its module may be determined by calculating the index of Equation 9. For example, when the z score of the knowledge network module is 2.5 or higher, the node may be determined to be a hub.
- the type of hub can be classified as follows through the calculation of the index of Equation 8, and FIG. 7 shows an example of classifying the type of hub according to the PC.
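- Equations 8 and 9 and the cut-offs of FIG. 7 are not reproduced in this text. The sketch below assumes the widely used forms: a participation coefficient of 1 minus the sum over modules of (connections into the module / total connections) squared, and a within-module degree z score. The PC cut-offs in `classify_hub` are assumed illustrative values, not values taken from the patent; only the hub threshold of 2.5 comes from the description.

```python
from statistics import mean, pstdev

HUB_Z = 2.5  # hub threshold stated in the description


def participation_coefficient(adj, module_of, node):
    """1 - sum over modules of (links into module / total links)^2."""
    k = len(adj[node])
    if k == 0:
        return 0.0
    per_module = {}
    for nbr in adj[node]:
        m = module_of[nbr]
        per_module[m] = per_module.get(m, 0) + 1
    return 1.0 - sum((c / k) ** 2 for c in per_module.values())


def within_module_z(adj, module_of, node):
    """z score of the node's within-module degree against its module peers."""
    m = module_of[node]
    members = [n for n in adj if module_of[n] == m]
    inside = {n: sum(1 for x in adj[n] if module_of[x] == m) for n in members}
    values = list(inside.values())
    sd = pstdev(values)
    return 0.0 if sd == 0 else (inside[node] - mean(values)) / sd


def classify_hub(pc):
    # Assumed cut-offs for illustration only; the source defers to FIG. 7.
    if pc <= 0.30:
        return "provincial"
    return "connector" if pc <= 0.75 else "kinless"


ADJ = {
    "a": {"b", "c", "x"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "x": {"a", "y"},
    "y": {"x"},
}
MODULE = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2}
pc_a = participation_coefficient(ADJ, MODULE, "a")
```

A node would first be tested against `HUB_Z` with `within_module_z`, and only nodes that pass would then be typed with `classify_hub`; node `a` here spreads its links over two modules, so its PC falls in the assumed connector range.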
- the data refiner 150 may generate a refined second knowledge network from the first knowledge network by using the graph theory index (S250).
- the second knowledge network is a network that is more simplified than the first knowledge network, and may be composed of only some nodes having a high correlation in terms of graph theory among a plurality of nodes constituting the first knowledge network.
- the node constituting the second knowledge network may be composed of a node in which the graph theory index calculated in step S240 is greater than or equal to a reference value among a plurality of nodes constituting the first knowledge network. For example, among a plurality of nodes constituting the first knowledge network, at least some of the indicator value for the shortest path between nodes, the indicator value for the clustering coefficient for each node, and the indicator value for the centrality coefficient for each node are equal to or greater than the reference value. Some nodes may be included in the second knowledge network.
- the second knowledge network can be created by deleting, from among the plurality of nodes constituting the first knowledge network, each node for which at least some of the index values for the shortest path between nodes, the clustering coefficient for each node, and the centrality coefficient for each node are less than a threshold value, and by deleting the connection lines associated with the deleted node.
- the graph theory index compared with the reference value may be an index value for a shortest path between nodes, an index value for a clustering coefficient for each node, and an index value for a centrality coefficient for each node.
- the graph theory index compared with the reference value may be a value calculated by integrating at least two of an index value for the shortest path between nodes, an index value for a clustering coefficient for each node, and an index value for a centrality coefficient for each node.
- at least one of an index value for the shortest path between nodes, an index value for a clustering coefficient for each node, and an index value for a centrality coefficient for each node may be calculated as a standard score for each node, and the calculated standard score can be compared to a threshold value.
- the standard score may be a z score
- the threshold value may mean 95% significance.
- the z score can be calculated as in Equation 10.
- z is a z score
- X is an index value of a predetermined graph theory index for a specific node in the first knowledge network
- mean(x) is the average index value of the predetermined graph theory index for at least some nodes in the first knowledge network.
- SE(x) is the standard error of the index value of the graph theory index of at least some nodes in the first knowledge network, with $SE(x) = \sigma / \sqrt{n}$,
- $\sigma$ is the standard deviation
- n is the number of at least some nodes constituting the first knowledge network.
- the number of at least some nodes of the first knowledge network selected to determine the z score may be 1000.
- the z score can be the difference between the index value of a predetermined graph theory index for each node constituting the first knowledge network and the average index value of the predetermined graph theory index for a plurality of nodes constituting the first knowledge network, divided by the standard error.
- the z score may be calculated through a permutation test.
- the permutation test may be performed by randomly shuffling all connection lines constituting the first knowledge network and then calculating a z score for each node. The number of random shuffles may be 1000 or more.
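- A permutation-style z score along these lines can be sketched as follows. Several points are assumptions: node degree stands in for the graph theory index, "shuffling" is implemented by re-drawing both endpoints of every connection line at random (which can produce repeated edges), and n in the standard error is read as the number of shuffled samples. None of these choices is confirmed by the source.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def shuffled_degrees(nodes, n_edges, node, rounds=1000, seed=0):
    """Degree of `node` in each of `rounds` networks with re-drawn edges."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        degree = 0
        for _ in range(n_edges):
            a, b = rng.sample(nodes, 2)  # random endpoints, no self-loops
            degree += node in (a, b)
        samples.append(degree)
    return samples


def standard_score(x, samples):
    """(x - mean) / (sigma / sqrt(n)), the form described for Equation 10."""
    se = pstdev(samples) / sqrt(len(samples))
    return 0.0 if se == 0 else (x - mean(samples)) / se


NODES = ["hub", "n1", "n2", "n3", "n4"]
EDGES = [("hub", "n1"), ("hub", "n2"), ("hub", "n3"), ("hub", "n4")]
observed = {n: sum(n in e for e in EDGES) for n in NODES}
z = {
    n: standard_score(observed[n], shuffled_degrees(NODES, len(EDGES), n))
    for n in NODES
}
```

Nodes whose score falls below the chosen threshold would then be candidates for deletion; in this toy star graph the centre scores far above the shuffled baseline and the leaves score below it.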
- the nodes constituting the second knowledge network may be some nodes extracted by using an index value for the hub characteristic of each node among the graph theory indexes calculated in step S240 from among the plurality of nodes constituting the first knowledge network. That is, the node constituting the second knowledge network is a node determined to be a hub in the module through the calculation of the index of Equation 9, preferably a node classified as one of a kinless hub, a connector hub, and a provincial hub, more preferably kinless. A node classified as one of a hub and a connector hub, more preferably a node classified as a kinless hub.
- the data refiner 150 may additionally remove unnecessary nodes of the first knowledge network during the knowledge network analysis process.
- the data refiner 150 may remove the node having one connection line together with the connection line of the corresponding node. This is because a node with only one connection line can be interpreted as a network node that does not conform to the concept of a multiomics network.
- the data refiner 150 may remove a node having a clustering coefficient of 0 together with a connection line of the corresponding node. This is because a node with a clustering coefficient of 0 can be interpreted as a node that is unlikely to become a major hub node.
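- The two additional removal rules above can be sketched as a single pruning pass. The `prune` helper and the sample data are hypothetical, and the sketch applies the rules once rather than repeating them until no further node qualifies, since the text does not say whether the removal is iterated.

```python
def prune(adj, clustering):
    """Drop degree-1 nodes and zero-clustering nodes along with their edges."""
    drop = {
        n
        for n, nbrs in adj.items()
        if len(nbrs) == 1 or clustering.get(n, 0.0) == 0.0
    }
    # Rebuild the adjacency map without the dropped nodes or edges to them.
    return {
        n: {m for m in nbrs if m not in drop}
        for n, nbrs in adj.items()
        if n not in drop
    }


ADJ = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
CLUSTERING = {"A": 1.0, "B": 1.0, "C": 1 / 3, "D": 0.0}
pruned = prune(ADJ, CLUSTERING)
```

Here the leaf `D` is removed together with its single connection line, leaving only the triangle formed by `A`, `B`, and `C`.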
- the output unit 160 outputs the second knowledge network generated in step S250 (S260).
- the output unit 160 may be, for example, a display.
- FIG. 8 is an example of a second knowledge network generated by using "epilepsy syndrome" as a search word according to an embodiment of the present invention. Referring to FIG. 8, the second knowledge network is significantly simplified and refined compared to the first knowledge network of FIG. 6. In addition, FIG. 8 makes it possible to grasp intuitively the biological entities at different omics levels associated with "epilepsy syndrome" and the interrelationships between them.
- the data processing apparatus 100 may generate a second knowledge network composed of only nodes that have been refined in relation to a predetermined search word, and accordingly, can easily determine a new drug candidate substance or a target of a new drug candidate substance.
- FIG. 11 is a block diagram of a data processing apparatus for discovering a new drug candidate according to an additional embodiment
- FIG. 12 is a flowchart of a data processing method for discovering a new drug candidate by a data processing apparatus according to an additional embodiment.
- the data processing apparatus 100 may further include a path extraction unit 180 for extracting a drug-enabled path.
- the drug-enabled path means a path through which a drug reacts or a path through which a drug acts, and the term may be used interchangeably with drug reaction path or drug action path.
- the drug-enabled path may be displayed according to the degree of correlation between biological entities at different omics levels, and may refer to some of the connection paths in the second knowledge network generated in the present specification.
- the path extraction unit 180 may analyze the drug-disease node pairs existing in the second knowledge network to extract a drug-enabled path for determining a basic drug from which a new drug candidate is derived (S270).
- FIG. 13 is a flowchart of a method by which a data processing apparatus searches for a drug-enabled path, according to an embodiment.
- the flowchart of FIG. 13 may represent sub-steps of the step S270 of extracting a drug-enabled path.
- the path extraction unit 180 may select, from among the drug-disease node pairs existing in the second knowledge network, those whose standard score (z-score) of proximity is smaller than a reference value.
- the path extraction unit 180 may determine, from the second knowledge network, at least one drug-disease node pair that takes a specific drug node as the source node and a disease node connected to that drug node through connection lines as the target node.
- the path extraction unit 180 may extract all drug-disease pairs for a specific drug from the second knowledge network, and calculate a standard score of proximity for each of the extracted drug-disease pairs.
- the standard score of the proximity of the node pair (s, t) (s: source node (drug), t: target node (disease)) may be calculated using Equation 11 below.
- the path extraction unit 180 may select at least one drug-disease node pair whose standard score (z-score) of proximity is smaller than the reference value. For example, the reference value may be -1.645 when the confidence level is set to 90%, -1.960 when it is set to 95%, and -2.576 when it is set to 99%.
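The three reference values quoted above match the lower critical values of a two-sided test on the standard normal distribution. The sketch below reproduces them with the Python standard library; the function names and the dictionary layout of `proximity_z` are illustrative assumptions, and the proximity z scores themselves are taken as given because Equation 11 is not reproduced here.

```python
from statistics import NormalDist

def reference_value(confidence):
    """Lower critical value of the standard normal for a two-sided confidence level."""
    return NormalDist().inv_cdf((1.0 - confidence) / 2.0)

def select_pairs(proximity_z, confidence=0.95):
    """Keep drug-disease pairs whose proximity z score is below the reference value.

    proximity_z maps (drug, disease) -> z score of proximity.
    """
    ref = reference_value(confidence)
    return [pair for pair, z in proximity_z.items() if z < ref]
```

A strongly negative proximity z score means the drug and disease nodes are closer in the network than expected by chance, which is why the selection keeps values below the negative reference.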
- the path extraction unit 180 may extract, from among the paths of the drug-disease node pairs selected in step S13200 on the basis of their proximity, the paths in which the number of intermediate nodes (i.e., nodes located between the drug node and the disease node) is greater than or equal to a reference number. For example, the path extraction unit 180 may extract the paths of drug-disease node pairs in which two or more intermediate nodes exist, among the pairs selected in step S13200.
- the path extraction unit 180 may extract, as a drug-enabled path, a path in which the sum of the centrality coefficients of the intermediate nodes is greater than or equal to a reference value, from among the paths extracted in step S13400 whose number of intermediate nodes is greater than or equal to the reference number.
- the path extraction unit 180 may calculate, for each of the paths extracted in step S13400, the sum of the centrality coefficients of the intermediate nodes constituting the path, and may extract the paths with the highest sums (for example, those within the top 1% of the distribution of the sums of the centrality coefficients of the intermediate nodes of the paths extracted in step S13400) as drug-enabled paths.
- the path extraction unit 180 may thereby extract a drug-enabled path that increases the efficiency of the movement path by passing through nodes with high centrality in the second knowledge network.
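The two path filters described above (a minimum number of intermediate nodes, then a minimum sum of centrality coefficients) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the depth-first enumeration of simple paths, the `max_len` cut-off, and the fixed threshold `min_centrality_sum` are all assumptions. A percentile-based cut such as the top 1% could replace the fixed threshold.

```python
def simple_paths(adj, source, target, max_len=6):
    """Yield loop-free paths from source to target with at most max_len + 1 nodes."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        for nxt in adj[node]:
            if nxt in path:
                continue
            if nxt == target:
                yield path + [nxt]
            elif len(path) < max_len:
                stack.append((nxt, path + [nxt]))

def drug_enabled_paths(adj, pairs, centrality,
                       min_intermediate=2, min_centrality_sum=1.0):
    """Filter the paths of selected drug-disease pairs.

    adj        : node -> set of neighbours (second knowledge network)
    pairs      : iterable of (drug, disease) pairs already selected by proximity
    centrality : node -> centrality coefficient
    """
    results = []
    for drug, disease in pairs:
        for path in simple_paths(adj, drug, disease):
            intermediate = path[1:-1]
            if len(intermediate) < min_intermediate:
                continue
            total = sum(centrality[n] for n in intermediate)
            if total >= min_centrality_sum:
                results.append((path, total))
    # Highest centrality sum first.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

In a toy network where a drug reaches a disease directly, through one gene, or through a gene and then a protein, only the last path has two intermediate nodes and is kept.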
- the term '~unit' as used herein refers to a software component or a hardware component such as a field-programmable gate array (FPGA) or an ASIC, and a '~unit' performs certain roles. However, a '~unit' is not limited to software or hardware.
- a '~unit' may be configured to reside in an addressable storage medium, or may be configured to execute on one or more processors.
- a '~unit' may include components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
- components and functions provided in the '~units' may be combined into a smaller number of components and '~units', or may be further divided into additional components and '~units'.
- components and '~units' may be implemented to run on one or more CPUs in a device or a secure multimedia card.
- the above-described data processing method can be implemented as a computer-readable code on a computer-readable recording medium.
- the computer-readable recording medium includes all types of recording devices that store data that can be read by a computer system. Examples of computer-readable recording media may include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
- the computer-readable recording medium may be distributed over computer systems connected through a network, so that processor-readable code can be stored and executed in a distributed manner.
- the descriptions are intended to provide exemplary configurations and operations for implementing the present invention.
- the technical idea of the present invention will include not only the embodiments described above, but also implementations that can be obtained by simply changing or modifying the above embodiments.
- the technical idea of the present invention will also include implementations that can be achieved by easily changing or modifying the embodiments described above.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Chemical & Material Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Bioinformatics & Computational Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Computing Systems (AREA)
- Medicinal Chemistry (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Pharmacology & Pharmacy (AREA)
- Toxicology (AREA)
- Pathology (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Bioethics (AREA)
- Physiology (AREA)
- Molecular Biology (AREA)
- Analytical Chemistry (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Claims (6)
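- A data processing method for discovering a new drug candidate substance, performed in a data processing apparatus, the method comprising: generating, from an omics DB, a DB matrix composed of selected biological entities and selected types of correlation; receiving a search word; extracting, from the DB matrix, biological entities that belong to an omics level different from that of the search word and are related to the search word; extracting, from the DB matrix, the correlations between the search word and the biological entities; generating a first knowledge network in which the search word and each of the biological entities are nodes and a plurality of nodes are connected by connection lines according to the correlation between the search word and the biological entities or the correlation among the biological entities; calculating a graph theory index for each of the plurality of nodes of the first knowledge network; and generating a second knowledge network using some nodes selected, by using the graph theory index, from among the plurality of nodes of the first knowledge network, wherein the search word includes at least one of a gene name, a protein name, a metabolite name, a symptom name, a disease name, a compound name, and a drug name, wherein the biological entity includes at least one of a gene, a protein, a metabolite, a symptom, a disease, a compound, and a drug, wherein the categories of the correlation include participate, covariate, regulate, associate, bind, upregulate, resemble, treat, downregulate, palliate, include, and express, wherein the graph theory index includes a shortest path between nodes, a clustering coefficient per node, and a centrality coefficient per node for at least one of the plurality of nodes constituting the first knowledge network, wherein the weight of each connection line is set differently according to the category of correlation that the connection line represents, and the shortest path between nodes is calculated by reflecting the set weights, wherein the generating of the second knowledge network includes calculating, for each of the plurality of nodes constituting the first knowledge network, a standard score for at least one of the shortest path between nodes, the clustering coefficient per node, and the centrality coefficient per node, and generating the second knowledge network by deleting each node whose standard score is below a threshold value together with the connection lines of that node, wherein the standard score is the difference between the index value of a predetermined graph theory index for each node constituting the first knowledge network and the average index value of the graph theory index for the plurality of nodes constituting the first knowledge network, divided by the standard error, and wherein the DB matrix is generated such that the selected biological entities are arranged on each of the horizontal axis and the vertical axis and the type of correlation is displayed at the point where the horizontal axis and the vertical axis intersect.
- The method of claim 1, wherein the generating of the second knowledge network includes randomly shuffling all connection lines constituting the first knowledge network and then calculating the standard score for each of the nodes of the first knowledge network, and wherein the number of random shuffles is 1000 or more.
- The method of claim 1, wherein the generating of the second knowledge network further includes: deleting, from among the nodes constituting the first knowledge network, a node having one connection line; and deleting, from among the nodes constituting the first knowledge network, a node having a clustering coefficient of 0.
- The method of claim 1, wherein the categories of the correlation further include at least one of interact, cause, present, and localize.
- The method of claim 1, further comprising extracting a drug-enabled path from the second knowledge network, wherein the extracting of the drug-enabled path includes: selecting drug-disease node pairs for which the standard score of proximity, calculated for each of the drug-disease nodes existing in the second knowledge network, is smaller than a reference value; extracting, from among the paths for the selected drug-disease node pairs, paths in which the number of intermediate nodes present in each path is greater than or equal to a reference number; and extracting, from among the extracted paths, a path in which the sum of the centrality coefficients of the intermediate nodes is greater than or equal to a reference value as the drug-enabled path.
- A recording medium on which a program for causing a computer to execute the method of any one of claims 1 to 5 is recorded.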
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/428,619 US20220020454A1 (en) | 2019-03-13 | 2019-12-16 | Method for data processing to derive new drug candidate substance |
Applications Claiming Priority (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2019-0028788 | 2019-03-13 | ||
PCT/KR2019/002919 WO2020138589A1 (ko) | 2018-12-24 | 2019-03-13 | Multi-omics data processing apparatus and method for discovering new drug candidate substances |
KR1020190028789 | 2019-03-13 | ||
KR1020190028788 | 2019-03-13 | ||
KRPCT/KR2019/002919 | 2019-03-13 | ||
KRPCT/KR2019/002918 | 2019-03-13 | ||
PCT/KR2019/002918 WO2020138588A1 (ko) | 2018-12-24 | 2019-03-13 | Data processing apparatus and method for discovering new drug candidate substances |
KR10-2019-0028789 | 2019-03-13 | ||
KR10-2019-0163398 | 2019-12-10 | ||
KR1020190163398A KR102181058B1 (ko) | 2019-03-13 | 2019-12-10 | Data processing method for deriving new drug candidate substances |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020184816A1 (ko) | 2020-09-17 |
Family
ID=72426290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2019/017793 WO2020184816A1 (ko) | Data processing method for deriving new drug candidate substances | 2019-03-13 | 2019-12-16 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20220020454A1 (ko) |
KR (1) | KR102379214B1 (ko) |
WO (1) | WO2020184816A1 (ko) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11514334B2 (en) * | 2020-02-07 | 2022-11-29 | International Business Machines Corporation | Maintaining a knowledge database based on user interactions with a user interface |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
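KR101450784B1 (ko) * | 2013-07-02 | 2014-10-23 | 아주대학교산학협력단 | Method for predicting drug repositioning candidates based on electronic medical records and drug/disease network information |
JP2016099674A (ja) * | 2014-11-18 | 2016-05-30 | 国立研究開発法人産業技術総合研究所 | Drug search device, drug search method, and program |
KR20180109421A (ko) * | 2017-03-28 | 2018-10-08 | 가천대학교 산학협력단 | Apparatus, method, and computer-readable medium for determining drug similarity |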
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080228700A1 (en) * | 2007-03-16 | 2008-09-18 | Expanse Networks, Inc. | Attribute Combination Discovery |
US20140207385A1 (en) * | 2011-08-26 | 2014-07-24 | Philip Morris Products Sa | Systems and methods for characterizing topological network perturbations |
US11037684B2 (en) | 2014-11-14 | 2021-06-15 | International Business Machines Corporation | Generating drug repositioning hypotheses based on integrating multiple aspects of drug similarity and disease similarity |
US11574122B2 (en) * | 2018-08-23 | 2023-02-07 | Shenzhen Keya Medical Technology Corporation | Method and system for joint named entity recognition and relation extraction using convolutional neural network |
US11545242B2 (en) * | 2019-06-21 | 2023-01-03 | nference, inc. | Systems and methods for computing with private healthcare data |
US11487902B2 (en) * | 2019-06-21 | 2022-11-01 | nference, inc. | Systems and methods for computing with private healthcare data |
US11556579B1 (en) * | 2019-12-13 | 2023-01-17 | Amazon Technologies, Inc. | Service architecture for ontology linking of unstructured text |
CA3172707A1 (en) * | 2020-03-23 | 2021-09-30 | Adam Tomkins | Cross-context natural language model generation |
US11574128B2 (en) * | 2020-06-09 | 2023-02-07 | Optum Services (Ireland) Limited | Method, apparatus and computer program product for generating multi-paradigm feature representations |
-
2019
- 2019-12-16 US US17/428,619 patent/US20220020454A1/en active Pending
- 2019-12-16 WO PCT/KR2019/017793 patent/WO2020184816A1/ko active Application Filing
-
2020
- 2020-10-26 KR KR1020200139362A patent/KR102379214B1/ko active IP Right Grant
Non-Patent Citations (2)
Title |
---|
D. KENT ARRELL, A TERZIC: "Network Systems Biology for Drug Discovery", CLINICAL PHARMACOLOGY & THERAPEUTICS, vol. 88, no. 1, July 2010 (2010-07-01), pages 120 - 125, XP055740002 * |
YING YU: "PreMedKB: an integrated precision medicine knowledgebase for interpreting relationships between diseases, genes, variants and drugs", NUCLEIC ACIDS RESEARCH, vol. 47, 8 November 2018 (2018-11-08), pages D1090 - D1101, XP055723297, DOI: 10.1093/nar/gky1042 * |
Also Published As
Publication number | Publication date |
---|---|
KR20200123771A (ko) | 2020-10-30 |
KR102379214B1 (ko) | 2022-03-25 |
US20220020454A1 (en) | 2022-01-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
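KR102181058B1 (ko) | Data processing method for deriving new drug candidate substances | |
Dubchak et al. | Recognition of a protein fold in the context of the SCOP classification | |
Warnow | Mathematical approaches to comparative linguistics | |
CN110837550A (zh) | Knowledge graph-based question answering method and apparatus, electronic device, and storage medium | |
WO2020138590A1 (ko) | Data processing apparatus and method for predicting the efficacy and safety of new drug candidate substances | |
WO2022163996A1 (ko) | Apparatus and method for predicting drug-target interactions using a self-attention-based deep neural network model | |
WO2019164064A1 (ko) | Medical image reading system and method using generation of refined artificial intelligence reinforcement learning data | |
CN113140254B (zh) | Meta-learning drug-target interaction prediction system and prediction method | |
WO2021071000A1 (ko) | Method and apparatus for deriving new drug candidate substances | |
WO2021049706A1 (ko) | System and method for ensemble question answering | |
Linard et al. | Ten years of collaborative progress in the quest for orthologs | |
WO2021095987A1 (ko) | Knowledge completion method and apparatus based on multi-type entities | |
CN108491228A (zh) | Binary vulnerability code clone detection method and system | |
WO2021149913A1 (ko) | Method and apparatus for selecting disease-related genes in NGS analysis | |
WO2018212396A1 (ko) | Method, apparatus, and computer program for analyzing data | |
Zhang et al. | CIPHER-SC: disease-gene association inference using graph convolution on a context-aware network with single-cell data | |
Arendsee et al. | Fagin: synteny-based phylostratigraphy and finer classification of young genes | |
WO2020184816A1 (ko) | Data processing method for deriving new drug candidate substances | |
CN114386511B (zh) | Malware family classification method based on multi-dimensional feature fusion and model ensembling | |
WO2020138589A1 (ko) | Multi-omics data processing apparatus and method for discovering new drug candidate substances | |
WO2019117400A1 (ko) | Apparatus and method for constructing a gene network | |
WO2020138588A1 (ko) | Data processing apparatus and method for discovering new drug candidate substances | |
Dutta et al. | SpliceViNCI: Visualizing the splicing of non-canonical introns through recurrent neural networks | |
WO2022080583A1 (ko) | Deep learning-based Bitcoin block data prediction system considering time-series distribution characteristics | |
KR102187594B1 (ko) | Multi-omics data processing apparatus and method for discovering new drug candidate substances |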
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19919418 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19919418 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/05/2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19919418 Country of ref document: EP Kind code of ref document: A1 |