CN109585024B

CN109585024B - Data mining method and device, storage medium and electronic equipment

Info

Publication number: CN109585024B
Application number: CN201811351545.1A
Authority: CN
Inventors: 闫峻; 王磊
Original assignee: Golden Panda Ltd
Current assignee: Golden Panda Ltd
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2021-03-09
Anticipated expiration: 2038-11-14
Also published as: CN109585024A

Abstract

The present disclosure relates to the field of computer technologies, and in particular to a data mining method and apparatus, a storage medium, and an electronic device. The method comprises: mining relationship templates in a corpus according to known disease-drug relationships; acquiring new disease-drug relationships in the corpus according to the relationship templates; constructing a disease-drug relationship network according to the new disease-drug relationships; mining potential disease-drug relationships in the disease-drug relationship network; and verifying the potential disease-drug relationships against disease-drug relationship records in real medical data, and determining each verified potential disease-drug relationship as a target disease-drug relationship. The method and the device greatly reduce the cost of mining potential disease-drug relationships and improve the mining efficiency and accuracy; they also improve the accuracy and efficiency of verifying potential disease-drug relationships and reduce the verification cost.

Description

Data mining method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data mining method and apparatus, a storage medium, and an electronic device.

Background

Knowledge discovery based on literature refers to a process of mining the internal relationship between certain data from publicly published literature, proposing a new hypothesis based on the internal relationship, verifying the new hypothesis through experiments by researchers, and then mining new knowledge.

Currently, data mining techniques play an increasingly important role in knowledge discovery in the medical field. For example, disease-drug adverse reaction relationships can be mined from a large number of documents, and potential disease-drug adverse reaction relationships can be predicted from the mined ones. Mining disease-drug relationships and predicting potential disease-drug relationships from the mining results can provide direction for research on drugs and diseases, save research cost and time, help ensure drug safety, and provide more options for treating diseases.

Generally, each disease-drug relationship is labeled manually, one by one, across a large number of documents; potential disease-drug relationships are then derived from the labeled relationships and verified manually. This approach has two drawbacks. First, because manual labeling is required, mining disease-drug relationships from the literature is costly, inefficient, prone to omissions, and difficult to implement, which raises the cost and lowers the efficiency and accuracy of mining potential disease-drug relationships. Second, because potential disease-drug relationships must be verified manually, verification is labor-intensive and slow, and because different people have different experience, the results are strongly affected by human factors, so verification accuracy is low.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure aims to provide a data mining method and apparatus, a storage medium, and an electronic device, so as to overcome, at least to some extent, the problems caused by manual labeling: mining disease-drug relationships from documents is costly, inefficient, prone to omissions, and difficult to implement, which in turn raises the cost and lowers the efficiency and accuracy of mining potential disease-drug relationships.

According to an aspect of the present disclosure, there is provided a data mining method, including:

mining relationship templates in a corpus according to known disease-drug relationships;

acquiring new disease-drug relationships in the corpus according to the relationship templates;

constructing a disease-drug relationship network according to the new disease-drug relationships;

mining potential disease-drug relationships in the disease-drug relationship network;

and verifying the potential disease-drug relationships against disease-drug relationship records in real medical data, and determining each verified potential disease-drug relationship as a target disease-drug relationship.

In an exemplary embodiment of the disclosure, the mining of relationship templates in a corpus according to known disease-drug relationships comprises:

expanding the known disease-drug relationships according to the hypernym-hyponym (upper-lower) relationships of the diseases and of the drugs;

and acquiring relationship templates in the corpus according to the expanded known disease-drug relationships.

In an exemplary embodiment of the present disclosure, the building a disease drug relationship network according to the new disease drug relationship comprises:

converting the diseases and the drugs in the new disease-drug relationships into nodes respectively, and setting edges between the nodes that have a disease-drug relationship, to construct the disease-drug relationship network.

In an exemplary embodiment of the present disclosure, the disease-drug relationship network includes: nodes characterizing the disease and nodes characterizing the drug and edges characterizing the disease-drug relationship between the nodes;

said mining potential disease-drug relationships in said disease-drug relationship network comprises:

mapping each node in the disease-drug relationship network into a vector;

calculating the similarity between vectors corresponding to two nodes where the edge does not exist, wherein one node of the two nodes represents the disease and the other node represents the drug;

and mining the potential disease-drug relationships by comparing the similarity between the vectors corresponding to the two nodes where the edge does not exist with a preset similarity.

In an exemplary embodiment of the present disclosure, the similarity calculation formula is:

[Formula not reproduced in this text.]

wherein one vector in the formula is the vector corresponding to the node characterizing the disease, the other vector is the vector corresponding to the node characterizing the drug, and the value of the formula is the similarity between these two vectors, which correspond to the two nodes where the edge does not exist.

In an exemplary embodiment of the disclosure, the validating the potential disease-drug relationship through the disease-drug relationship record in the real medical data comprises:

matching the potential disease drug relationship with each disease drug relationship in a disease drug relationship record in the real medical data respectively, and acquiring the occurrence frequency of the disease drug relationship matched with the potential disease drug relationship in the disease drug relationship record;

and verifying the relation of the potential disease drugs according to the occurrence frequency and by combining a preset frequency.

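In an exemplary embodiment of the present disclosure, the verifying the potential disease-drug relationship through the disease-drug relationship record in the real medical data comprises:

converting, by means of a parallel corpus, the potential disease-drug relationship into a potential disease-drug relationship in the same language as the real medical data;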

and verifying the potential disease drug relationship in the same language as the real medical data through the disease drug relationship record in the real medical data.

In an exemplary embodiment of the present disclosure, the method further comprises: constructing the parallel corpus; wherein the constructing the parallel corpus comprises:

translating the disease names in different languages into the disease names in the same language as the real medical data;

translating the drug names in different languages into drug names in the same language as the real medical data;

constructing the parallel corpus according to the disease names in different languages, the corresponding disease names in the same language as the real medical data, the medicine names in different languages and the corresponding medicine names in the same language as the real medical data; wherein:

the disease name in one language corresponds to at least one disease name in the same language as the real medical data, and the drug name in one language corresponds to at least one drug name in the same language as the real medical data.

According to an aspect of the present disclosure, there is provided a data mining apparatus including:

a template acquisition module, configured to mine relationship templates in a corpus according to known disease-drug relationships;

a relationship extraction module, configured to acquire new disease-drug relationships in the corpus according to the relationship templates;

a network construction module, configured to construct a disease-drug relationship network according to the new disease-drug relationships;

a relationship mining module, configured to mine potential disease-drug relationships in the disease-drug relationship network;

and a relationship verification module, configured to verify the potential disease-drug relationships against disease-drug relationship records in real medical data and to determine each verified potential disease-drug relationship as a target disease-drug relationship.

According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data mining method as described in any one of the above.

According to an aspect of the present disclosure, there is provided an electronic device including:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the data mining method of any of the above via execution of the executable instructions.

The present disclosure provides a data mining method and device, a storage medium, and electronic equipment. The method mines relationship templates in a corpus according to known disease-drug relationships, acquires new disease-drug relationships in the corpus according to the relationship templates, constructs a disease-drug relationship network according to the new disease-drug relationships, mines potential disease-drug relationships in the network, verifies the potential disease-drug relationships against disease-drug relationship records in real medical data, and determines each verified potential disease-drug relationship as a target disease-drug relationship. First, because the relationship templates are mined from the corpus and the new disease-drug relationships are acquired with those templates rather than by manual labeling, acquisition is automatic and simple to implement; this improves acquisition efficiency and lowers acquisition cost, which greatly reduces the cost and improves the efficiency of mining potential disease-drug relationships. Second, because the new disease-drug relationships are acquired automatically from the corpus rather than labeled by hand, the problem of missing new disease-drug relationships is avoided, which improves the accuracy of mining potential disease-drug relationships. Third, because the potential disease-drug relationships are verified against real medical data rather than manually, the influence of human factors on the verification result is avoided, so verification accuracy and efficiency are improved and verification cost is reduced.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:

FIG. 1 is a flow chart of a data mining method of the present disclosure;

FIG. 2 is a flowchart of mining potential disease-drug relationships in a disease-drug relationship network, provided by an exemplary embodiment of the present disclosure;

FIG. 3 is a block diagram of a data mining device of the present disclosure;

FIG. 4 is a block diagram view of an electronic device in an exemplary embodiment of the disclosure;

FIG. 5 is a schematic diagram illustrating a program product in an exemplary embodiment of the disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.

First, a data mining method is disclosed in the present exemplary embodiment, and as shown in fig. 1, the data mining method may include the following steps:

step S110, mining relationship templates in a corpus according to known disease-drug relationships;

step S120, acquiring new disease-drug relationships in the corpus according to the relationship templates;

step S130, constructing a disease-drug relationship network according to the new disease-drug relationships;

step S140, mining potential disease-drug relationships in the disease-drug relationship network;

and step S150, verifying the potential disease-drug relationships against disease-drug relationship records in real medical data, and determining each verified potential disease-drug relationship as a target disease-drug relationship.

The data mining method in the exemplary embodiment has three advantages. First, because the relationship templates are mined from the corpus according to known disease-drug relationships and the new disease-drug relationships are acquired with those templates rather than by manual labeling, acquisition is automatic and simple to implement; this improves acquisition efficiency and lowers acquisition cost, which greatly reduces the cost and improves the efficiency of mining potential disease-drug relationships. Second, because the new disease-drug relationships are acquired automatically from the corpus rather than labeled by hand, the problem of missing new disease-drug relationships is avoided, which improves the accuracy of mining potential disease-drug relationships. Third, because the potential disease-drug relationships are verified against real medical data rather than manually, the influence of human factors on the verification result is avoided, so verification accuracy and efficiency are improved and verification cost is reduced.

Next, the data mining method in the present exemplary embodiment will be further explained with reference to fig. 1.

In step S110, a relationship template is mined from a corpus according to known disease-drug relationships.

In the present exemplary embodiment, the disease-drug relationship may be a disease-drug adverse reaction relationship or a disease-drug therapeutic effect relationship. Specifically, if a drug can cause a disease, the drug and the disease are in a disease-drug adverse reaction relationship; if a drug has a therapeutic effect on a disease, the disease and the drug are in a disease-drug therapeutic effect relationship. For example, trastuzumab can cause cardiomyopathies, so trastuzumab and cardiomyopathies are in a disease-drug adverse reaction relationship.

Since the CTD (Comparative Toxicogenomics Database) provides about 1,300,000 known disease-drug relationships, in this exemplary embodiment a large number of known disease-drug relationships may be obtained from the CTD. Known disease-drug relationships may of course also be obtained from other databases, which is not limited in this exemplary embodiment.

The corpus may be a medical document retrieval service system, for example PubMed (a medical document retrieval library). It should be noted that the corpus may also be another medical document retrieval system, which is not particularly limited in this exemplary embodiment.

The process of mining relationship templates in a corpus based on a large number of known disease-drug relationships may include: for each known disease-drug relationship in turn, querying the corpus with the disease name and the drug name together, so as to obtain, from each medical document in the corpus, the sentences that include both the disease name and the drug name; mining relationship templates from the obtained sentences through relationship extraction; collecting all the relationship templates; deduplicating the collected relationship templates so that only distinct templates remain; and determining the remaining relationship templates as the finally obtained relationship templates.

The following describes the process of obtaining relationship templates from a known disease-drug relationship, taking cardiomyopathies and trastuzumab as an example. Cardiomyopathies and trastuzumab are input into the corpus together, and the sentences that include both cardiomyopathies and trastuzumab are acquired from each medical document in the corpus. The acquired sentences include: "trastuzumab induced cardiomyopathies", "trastuzumab caused cardiomyopathies", and "cardiomyopathies is caused by trastuzumab". Relationship extraction on these three sentences yields the relationships "DRUG induced DISEASE", "DRUG caused DISEASE", and "DISEASE is caused by DRUG", each of which is determined as a relationship template.

It should be noted that the process of obtaining relationship templates from the other known disease-drug relationships is the same as the above process, and therefore is not described again here. Finally, the relationship templates obtained from all the known disease-drug relationships are collected and deduplicated, and the remaining relationship templates are the finally obtained relationship templates.
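The template mining step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the sentences containing both names have already been retrieved from the corpus, and it stands in for relationship extraction with plain name substitution. The function name `mine_templates` and the `DRUG`/`DISEASE` placeholder convention are hypothetical choices made for this sketch.

```python
import re

def mine_templates(sentences, known_relations):
    """Replace the drug and disease names of each known relation with
    placeholders and return the distinct templates in first-seen order."""
    templates = []
    seen = set()
    for drug, disease in known_relations:
        for sentence in sentences:
            lowered = sentence.lower()
            # Keep only sentences that mention both names.
            if drug.lower() not in lowered or disease.lower() not in lowered:
                continue
            template = re.sub(re.escape(drug), "DRUG", sentence, flags=re.IGNORECASE)
            template = re.sub(re.escape(disease), "DISEASE", template, flags=re.IGNORECASE)
            template = " ".join(template.split())  # normalize whitespace
            if template not in seen:  # deduplicate the collected templates
                seen.add(template)
                templates.append(template)
    return templates

sentences = [
    "trastuzumab induced cardiomyopathies",
    "trastuzumab caused cardiomyopathies",
    "cardiomyopathies is caused by trastuzumab",
    "Trastuzumab induced cardiomyopathies",  # differs only in case
]
templates = mine_templates(sentences, [("trastuzumab", "cardiomyopathies")])
print(templates)
```

The fourth sentence differs from the first only in capitalization, so it produces a template that is already present and is dropped by the deduplication step.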

In order to increase the diversity of the relationship templates and make the new disease-drug relationships obtained from them more comprehensive, the mining of relationship templates in the corpus according to the known disease-drug relationships may include: expanding the known disease-drug relationships according to the hypernym-hyponym (upper-lower) relationships of the diseases and of the drugs; and acquiring relationship templates in the corpus according to the expanded known disease-drug relationships.

In the exemplary embodiment, the diseases in a hypernym-hyponym relationship with each disease and the drugs in a hypernym-hyponym relationship with each drug are obtained first. Then, in each known disease-drug relationship, the disease is replaced with a disease in the corresponding hypernym-hyponym relationship and/or the drug is replaced with a drug in the corresponding hypernym-hyponym relationship, which gives a replaced disease-drug relationship; the known disease-drug relationships are expanded with the replaced disease-drug relationships. For example, suppose trastuzumab and a specific gastrointestinal condition are in a disease-drug relationship. Because that condition is a kind of gastrointestinal reaction, that is, a hyponym of gastrointestinal reaction, trastuzumab and gastrointestinal reaction are also in a disease-drug relationship. The process of obtaining relationship templates in the corpus according to the expanded known disease-drug relationships is the same as the process of obtaining relationship templates in the corpus according to the known disease-drug relationships, and therefore is not described again.

Expanding the known disease-drug relationships through hypernym-hyponym relationships greatly increases their number and makes them more comprehensive, which increases the diversity of the relationship templates and makes the new disease-drug relationships obtained from the templates more comprehensive.
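The expansion step can be sketched as below. This is only an illustration under stated assumptions: the hierarchy is represented as plain dictionaries mapping a term to its hierarchically related terms, and every name in the example (`drug_a`, `disease_x`, `disease_group_g`, `drug_class_c`) is a made-up placeholder, not data from the source.

```python
def expand_relations(known_relations, related_diseases, related_drugs):
    """Add every relation obtained by replacing the disease and/or the drug
    with a term in a hypernym-hyponym relationship with it."""
    expanded = set(known_relations)
    for drug, disease in known_relations:
        # Each term stands for itself plus its hierarchically related terms.
        drug_variants = [drug] + related_drugs.get(drug, [])
        disease_variants = [disease] + related_diseases.get(disease, [])
        for drug_variant in drug_variants:
            for disease_variant in disease_variants:
                expanded.add((drug_variant, disease_variant))
    return expanded

known = [("drug_a", "disease_x")]
related_diseases = {"disease_x": ["disease_group_g"]}  # disease_x is a kind of disease_group_g
related_drugs = {"drug_a": ["drug_class_c"]}           # drug_a belongs to drug_class_c
expanded = expand_relations(known, related_diseases, related_drugs)
print(sorted(expanded))
```

Replacing the disease only, the drug only, or both yields three additional relations on top of the original one, which matches the "and/or" wording of the embodiment.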

It should be noted that the relationship template may be an adverse reaction relationship template or a therapeutic effect reaction relationship template.

In step S120, a new disease-drug relationship is obtained in the corpus according to the relationship template.

In the exemplary embodiment, the relationship templates are respectively input into the corpus to search the sentences including the relationship templates in each medical document in the corpus, and the relationship extraction is performed on the sentences including the relationship templates to obtain a large number of new disease-drug relationships.
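As a rough sketch of this step, each template can be compiled into a regular expression and matched against sentences. The sketch makes several assumptions that are not in the source: names are single tokens, each template contains `DRUG` and `DISEASE` exactly once, and candidate matches are accepted only if both names appear in hypothetical vocabularies `drug_names` and `disease_names`. The sentences and names are invented placeholders.

```python
import re

def template_to_regex(template):
    """Turn a template such as 'DRUG induced DISEASE' into a compiled regex."""
    pattern = ""
    for part in re.split(r"(DRUG|DISEASE)", template):
        if part == "DRUG":
            pattern += r"(?P<drug>\w[\w-]*)"
        elif part == "DISEASE":
            pattern += r"(?P<disease>\w[\w-]*)"
        else:
            pattern += re.escape(part)  # literal text of the template
    return re.compile(pattern, re.IGNORECASE)

def extract_relations(sentences, templates, drug_names, disease_names):
    """Return the (drug, disease) pairs found by any template in any sentence."""
    patterns = [template_to_regex(t) for t in templates]
    relations = set()
    for sentence in sentences:
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                drug = match.group("drug").lower()
                disease = match.group("disease").lower()
                # The vocabulary check discards spurious token matches.
                if drug in drug_names and disease in disease_names:
                    relations.add((drug, disease))
    return relations

templates = ["DRUG induced DISEASE", "DRUG caused DISEASE", "DISEASE is caused by DRUG"]
sentences = [
    "drug_b induced disease_y in some patients",
    "disease_z is caused by drug_c",
    "drug_b was given with drug_c",
]
new_relations = extract_relations(
    sentences, templates,
    drug_names={"drug_b", "drug_c"},
    disease_names={"disease_y", "disease_z"},
)
print(sorted(new_relations))
```

The vocabulary check matters: without it, the template "DRUG caused DISEASE" would match the words "is caused by" in the second sentence and report "is" as a drug.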

In this way, the relationship templates are mined from the corpus according to the known disease-drug relationships, and the new disease-drug relationships are acquired with those templates rather than by manual labeling. Acquisition is therefore automatic and simple to implement, which improves acquisition efficiency and lowers acquisition cost, and in turn greatly reduces the cost and improves the efficiency of mining potential disease-drug relationships. In addition, because the new disease-drug relationships are acquired automatically from the corpus rather than labeled by hand, the problem of missing new disease-drug relationships is avoided, which improves the accuracy of mining potential disease-drug relationships.

In step S130, a disease-drug relationship network is constructed from the new disease-drug relationship.

In the present exemplary embodiment, the disease and the drug in each new disease-drug relationship may be converted into nodes respectively, and edges may be set between the nodes that have a disease-drug relationship, to construct the disease-drug relationship network. That is, the disease in each new disease-drug relationship is first converted into a node, the drug in each new disease-drug relationship is then converted into a node, and finally, according to each new disease-drug relationship, an edge is set between the node corresponding to its disease and the node corresponding to its drug.

It should be noted that, in the disease-drug relationship network, one disease can only correspond to one node, and one drug can only correspond to one node. Different diseases correspond to different nodes, and different drugs correspond to different nodes.
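A minimal sketch of the network construction is shown below, using an adjacency dictionary. The representation is an assumption of this sketch: each node is a `(kind, name)` tuple so that a disease and a drug never share a node, and the relation data are placeholders.

```python
from collections import defaultdict

def build_network(relations):
    """Build an undirected disease-drug network as an adjacency dictionary.
    Each distinct drug or disease name maps to exactly one node."""
    adjacency = defaultdict(set)
    for drug, disease in relations:
        drug_node = ("drug", drug)
        disease_node = ("disease", disease)
        # One edge per disease-drug relationship, stored in both directions.
        adjacency[drug_node].add(disease_node)
        adjacency[disease_node].add(drug_node)
    return dict(adjacency)

relations = [
    ("drug_a", "disease_x"),
    ("drug_a", "disease_y"),
    ("drug_a", "disease_x"),  # repeated relation, adds no new node or edge
]
network = build_network(relations)
print(len(network), sorted(network[("drug", "drug_a")]))
```

Because nodes are dictionary keys, repeating a relation cannot create a second node for the same disease or drug.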

In step S140, potential disease-drug relationships are mined in the disease-drug relationship network.

In this exemplary embodiment, the potential disease-drug relationship is a potential disease-drug adverse reaction relationship or a potential disease-drug therapeutic effect relationship. The disease-drug relationship network may include: nodes characterizing the disease, nodes characterizing the drug, and edges characterizing the disease-drug relationship between the nodes. Based on this, as shown in FIG. 2, the mining of potential disease-drug relationships in the disease-drug relationship network may include the following steps S210 to S230, wherein:

in step S210, each node in the disease-drug relationship network is mapped as a vector.

In the present exemplary embodiment, each node in the disease and drug relationship network may be mapped as a vector by a LINE (Large scale information network embedding) method, or may be mapped as a vector by another method, which is not particularly limited in this exemplary embodiment.

In step S220, a similarity between vectors corresponding to two nodes where the edge does not exist is calculated, wherein one of the two nodes represents the disease and the other node represents the drug.

In the present exemplary embodiment, since the disease-drug relationship corresponding to two nodes where an edge exists is already present in the literature, in order to mine the potential disease-drug relationship, it is necessary to calculate the similarity between vectors corresponding to two nodes where the edge does not exist, wherein one of the two nodes characterizes the disease and the other node characterizes the drug.

Specifically, the similarity calculation formula may be:

[Formula not reproduced in this text.]

wherein one vector in the formula is the vector corresponding to the node characterizing the disease, the other vector is the vector corresponding to the node characterizing the drug, and the value of the formula is the similarity between these two vectors, which correspond to the two nodes where the edge does not exist.

Through the formula, the similarity between the vectors corresponding to any two nodes without an edge in the disease-drug relationship network can be calculated. It should be noted that, of the two nodes without an edge, one represents a disease and the other represents a drug.
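The similarity formula itself is not legible in this text, so the following sketch substitutes cosine similarity, a common choice for comparing embedding vectors. This is an assumption made for illustration and may differ from the formula in the original publication. The input vectors are arbitrary examples, not the output of LINE.

```python
import math

def cosine_similarity(disease_vector, drug_vector):
    """Cosine of the angle between two equal-length vectors.
    ASSUMPTION: a stand-in for the formula that is not reproduced in the text."""
    dot = sum(a * b for a, b in zip(disease_vector, drug_vector))
    norm_disease = math.sqrt(sum(a * a for a in disease_vector))
    norm_drug = math.sqrt(sum(b * b for b in drug_vector))
    if norm_disease == 0 or norm_drug == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_disease * norm_drug)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With this choice, vectors pointing in the same direction score 1 and orthogonal vectors score 0, so a larger value indicates more similar nodes.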

In step S230, the potential disease-drug relationships are mined by comparing the similarity between the vectors corresponding to the two nodes where the edge does not exist with a preset similarity.

In the present exemplary embodiment, the value of the preset similarity may be determined empirically or obtained through experiments, which is not particularly limited in the present exemplary embodiment.

Each of the similarities obtained in step S220 is compared with the preset similarity. When a similarity is greater than the preset similarity, the disease and the drug corresponding to the two edge-free nodes for which that similarity was calculated are determined to be in a potential disease-drug relationship. If the similarity is not greater than the preset similarity, the disease and the drug corresponding to those two edge-free nodes are not in a potential disease-drug relationship.

For example, node A represents disease A, node B represents drug B, and there is no edge between node A and node B. If the similarity between the vector corresponding to node A and the vector corresponding to node B is greater than the preset similarity, disease A and drug B are in a potential disease-drug relationship; if that similarity is not greater than the preset similarity, disease A and drug B are not in a potential disease-drug relationship.
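Steps S220 and S230 together can be sketched as a scan over all disease-drug node pairs that have no edge. The sketch assumes nodes are `(kind, name)` tuples and takes the similarity function as a parameter, since the source formula is not reproduced. The dot product used in the example, the vectors, and the threshold 0.5 are all illustrative assumptions.

```python
def mine_potential_relations(vectors, edges, similarity, preset_similarity):
    """Return (drug, disease) name pairs that have no edge in the network
    and whose vector similarity is greater than the preset similarity."""
    diseases = sorted(node for node in vectors if node[0] == "disease")
    drugs = sorted(node for node in vectors if node[0] == "drug")
    potential = []
    for disease in diseases:
        for drug in drugs:
            if (drug, disease) in edges:
                continue  # relationship already present in the literature
            if similarity(vectors[disease], vectors[drug]) > preset_similarity:
                potential.append((drug[1], disease[1]))
    return potential

def dot_product(u, v):
    # Hypothetical similarity used only for this example.
    return sum(a * b for a, b in zip(u, v))

vectors = {
    ("disease", "A"): [1.0, 0.0],
    ("disease", "D"): [0.0, 1.0],
    ("drug", "B"): [0.9, 0.1],
    ("drug", "C"): [0.0, 1.0],
}
edges = {(("drug", "C"), ("disease", "D"))}  # the only known relationship
potential = mine_potential_relations(vectors, edges, dot_product, 0.5)
print(potential)
```

Only the pair of drug B and disease A exceeds the threshold: drug C and disease D are skipped because an edge already exists, and the remaining pairs score too low.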

In step S150, the potential disease drug relationship is verified through the disease drug relationship record in the real medical data, and the verified potential disease drug relationship is determined as the target disease drug relationship.

In the present exemplary embodiment, the real medical data may be clinical data from major hospitals. The real medical data includes a disease-drug relationship record, which records each disease-drug relationship that is found and the time at which it occurred.

The validating the potential disease-drug relationship through the disease-drug relationship record in the real medical data may include: matching the potential disease drug relationship with each disease drug relationship in a disease drug relationship record in the real medical data respectively, and acquiring the occurrence frequency of the disease drug relationship matched with the potential disease drug relationship in the disease drug relationship record; and verifying the relation of the potential disease drugs according to the occurrence frequency and by combining a preset frequency.

In this exemplary embodiment, when a potential disease-drug relationship is verified, it is matched in turn against each disease-drug relationship in the disease-drug relationship record of the real medical data, and the occurrence frequency of the matching disease-drug relationship in the record is obtained. When the occurrence frequency is greater than a preset frequency, the potential disease-drug relationship passes verification and is determined as a target disease-drug relationship. The preset frequency may be set by a developer according to working experience, which is not particularly limited in the present exemplary embodiment.
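The frequency check can be sketched as a simple count. The record format is an assumption: each entry of `records` is taken to be one `(drug, disease)` pair drawn from the real medical data, and all names and the threshold are placeholders.

```python
from collections import Counter

def verify_by_frequency(potential_relations, records, preset_frequency):
    """Keep the potential relations that occur in the records more often
    than the preset frequency; these become the target relations."""
    counts = Counter(records)
    return [rel for rel in potential_relations if counts[rel] > preset_frequency]

records = [("drug_a", "disease_x")] * 3 + [("drug_b", "disease_y")]
potential_relations = [
    ("drug_a", "disease_x"),  # occurs 3 times
    ("drug_b", "disease_y"),  # occurs once
    ("drug_c", "disease_z"),  # never occurs
]
targets = verify_by_frequency(potential_relations, records, preset_frequency=2)
print(targets)
```

The comparison is strict, following the "greater than a preset frequency" wording, so a relation occurring exactly twice would not pass with a preset frequency of 2.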

In other exemplary embodiments of the present disclosure, the significance level or P-value of the potential disease drug relationship may also be calculated according to the frequency of occurrence of the disease drug relationship in the disease drug relationship record matching the potential disease drug relationship, and the potential disease drug relationship may be verified according to the significance level or P-value of the potential disease drug relationship.
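The disclosure does not state which statistical test is used to obtain the P-value. One possible choice, shown here purely as an assumption, is a one-sided binomial test: under the null hypothesis that the disease and the drug appear in records independently, the probability of a record containing both is the product of their individual rates, and the P-value is the probability of observing at least the actual number of co-occurrences. The function names and the toy record are hypothetical.

```python
from math import comb

def binomial_tail_p_value(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

def cooccurrence_p_value(record, disease, drug):
    """One-sided P-value for the co-occurrence count of (disease, drug),
    assuming independence of disease and drug as the null hypothesis."""
    n = len(record)
    n_disease = sum(1 for d, _ in record if d == disease)
    n_drug = sum(1 for _, m in record if m == drug)
    k = sum(1 for d, m in record if d == disease and m == drug)
    p0 = (n_disease / n) * (n_drug / n)  # expected co-occurrence rate under the null
    return binomial_tail_p_value(k, n, p0)

# Toy records with placeholder names (hypothetical data).
associated = [("disease_A", "drug_X")] * 10 + [("disease_B", "drug_Y")] * 30
independent = [(d, m) for d in ("disease_A", "disease_B")
               for m in ("drug_X", "drug_Y")] * 10

p_assoc = cooccurrence_p_value(associated, "disease_A", "drug_X")
p_indep = cooccurrence_p_value(independent, "disease_A", "drug_X")
print(p_assoc < 0.05, p_indep < 0.05)
```

A potential relationship would then pass verification when its P-value falls below a chosen significance level, which, like the preset frequency, is a parameter to be set by the developer.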

It should be noted that, when there are a plurality of potential disease-drug relationships, each potential disease-drug relationship can be verified separately in the above manner. Further, considering that one disease may have a plurality of names and one drug may have a plurality of names, in the above matching process the match can be considered successful as long as the two disease names refer to the same disease and the two drug names refer to the same drug.
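Name-insensitive matching of this kind can be sketched by mapping every name to a canonical form before comparison. The synonym tables, the function names and the example pair below are assumptions made for illustration; the disclosure does not prescribe how synonyms are resolved.

```python
def canonical(name, synonym_table):
    """Map a name to its canonical form; unknown names map to themselves."""
    key = name.strip().lower()
    return synonym_table.get(key, key)

def relations_match(candidate, recorded, disease_synonyms, drug_synonyms):
    """True when both relations name the same disease and the same drug."""
    same_disease = canonical(candidate[0], disease_synonyms) == canonical(recorded[0], disease_synonyms)
    same_drug = canonical(candidate[1], drug_synonyms) == canonical(recorded[1], drug_synonyms)
    return same_disease and same_drug

# Hypothetical synonym tables: alternative name -> canonical name.
DISEASE_SYNONYMS = {"high blood pressure": "hypertension"}
DRUG_SYNONYMS = {"acetylsalicylic acid": "aspirin"}

# The pair is used only to demonstrate name matching, not as a medical claim.
matched = relations_match(
    ("High blood pressure", "aspirin"),
    ("hypertension", "Acetylsalicylic acid"),
    DISEASE_SYNONYMS,
    DRUG_SYNONYMS,
)
print(matched)
```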

According to the method, the potential disease drug relationship is verified through the real medical data instead of adopting a manual verification mode, so that the influence of human factors on the verification result is avoided, the verification accuracy is higher, the verification efficiency is improved, and the verification cost is reduced.

Furthermore, if the language of the potential disease-drug relationship differs from the language of the real medical data, the potential disease-drug relationship cannot be verified directly against the real medical data. To solve this problem, verifying the potential disease-drug relationship through the disease-drug relationship record in the real medical data may include: converting the potential disease-drug relationship, by means of a parallel corpus, into a potential disease-drug relationship in the same language as the real medical data; and verifying the converted potential disease-drug relationship through the disease-drug relationship record in the real medical data.

In this exemplary embodiment, the process of constructing the parallel corpus may include: translating disease names in other languages into disease names in the same language as the real medical data; translating drug names in other languages into drug names in the same language as the real medical data; and constructing the parallel corpus from the disease names in the other languages, the corresponding disease names in the language of the real medical data, the drug names in the other languages, and the corresponding drug names in the language of the real medical data. Since a disease may have a plurality of disease names and a drug may have a plurality of drug names, in the parallel corpus a disease name in one language corresponds to at least one disease name in the language of the real medical data, and a drug name in one language corresponds to at least one drug name in the language of the real medical data.

Based on the parallel corpus, the potential disease-drug relationship can be translated into a potential disease-drug relationship in the same language as the real medical data. It should be noted that, since a disease may have a plurality of disease names and a drug may have a plurality of drug names, one potential disease-drug relationship may be translated into a plurality of potential disease-drug relationships in the language of the real medical data. When matching is performed against the disease-drug relationship record in the real medical data, a candidate potential disease-drug relationship may be determined among the plurality of translated potential disease-drug relationships, and the candidate may then be verified against the disease-drug relationship record in the real medical data. Since the verification method here is the same as the verification method described above, the details are not repeated.
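The one-to-many translation through a parallel corpus can be illustrated with a short sketch. The dictionary layout, the function names, and the decision to sum the occurrences of all translated candidates are assumptions made for this example; the disclosure only states that a candidate is determined among the translated relationships and then verified as before.

```python
from itertools import product

def translate_relation(relation, disease_map, drug_map):
    """Expand one (disease, drug) relation into every combination of
    target-language disease names and drug names."""
    disease, drug = relation
    return list(product(disease_map.get(disease, []), drug_map.get(drug, [])))

def matched_frequency(candidates, record):
    """Count the record entries that equal any translated candidate."""
    wanted = set(candidates)
    return sum(1 for entry in record if entry in wanted)

# Hypothetical parallel corpus: source-language name -> target-language names.
DISEASE_MAP = {"高血压": ["hypertension", "high blood pressure"]}
DRUG_MAP = {"阿司匹林": ["aspirin", "acetylsalicylic acid"]}

candidates = translate_relation(("高血压", "阿司匹林"), DISEASE_MAP, DRUG_MAP)

# Toy record used only to demonstrate matching, not as medical evidence.
record = [
    ("hypertension", "aspirin"),
    ("high blood pressure", "aspirin"),
    ("asthma", "aspirin"),
]
frequency = matched_frequency(candidates, record)
print(len(candidates), frequency)
```

A name that is missing from the parallel corpus yields no candidates, so the corresponding relationship simply cannot be verified against records in the other language.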

In conclusion, potential disease-drug adverse reaction relationships can be mined from known disease-drug adverse reaction relationships, and potential disease-drug efficacy relationships can be mined from known disease-drug efficacy relationships.

It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

In an exemplary embodiment of the present disclosure, a data mining apparatus is also provided. As shown in fig. 3, the data mining apparatus 300 may include a template acquisition module 301, a relation extraction module 302, a network construction module 303, a relationship mining module 304 and a relationship verification module 305, wherein:

the template acquisition module 301 is used for mining a relation template in a corpus according to a known disease drug relation;

a relation extraction module 302, configured to obtain a new disease-drug relation in the corpus according to the relation template;

a network construction module 303, configured to construct a disease drug relationship network according to the new disease drug relationship;

a relationship mining module 304 for mining potential disease-drug relationships in the disease-drug relationship network;

a relationship verification module 305, configured to verify the potential disease drug relationship through a disease drug relationship record in the real medical data, and determine the verified potential disease drug relationship as a target disease drug relationship.
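The way the five modules hand results to one another can be sketched as a simple composition. The class name, the field names and the stub callables below are hypothetical and serve only to show the data flow from module 301 to module 305; they do not reflect an actual implementation of the apparatus.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataMiningApparatus:
    acquire_templates: Callable   # template acquisition module 301
    extract_relations: Callable   # relation extraction module 302
    build_network: Callable       # network construction module 303
    mine_relations: Callable      # relationship mining module 304
    verify_relations: Callable    # relationship verification module 305

    def run(self, known_relations, corpus, medical_record):
        templates = self.acquire_templates(known_relations, corpus)
        new_relations = self.extract_relations(templates, corpus)
        network = self.build_network(new_relations)
        potential = self.mine_relations(network)
        return self.verify_relations(potential, medical_record)

# Stub modules that only pass placeholder values along the pipeline.
apparatus = DataMiningApparatus(
    acquire_templates=lambda known, corpus: ["<template>"],
    extract_relations=lambda templates, corpus: [("disease_A", "drug_X")],
    build_network=lambda relations: {"edges": set(relations)},
    mine_relations=lambda network: [("disease_A", "drug_Y")],
    verify_relations=lambda potential, record: [r for r in potential if r in record],
)
result = apparatus.run([], [], [("disease_A", "drug_Y")])
print(result)
```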

The specific details of each module of the data mining apparatus have already been described in the corresponding data mining method, and are therefore not repeated here.

It should be noted that although several modules or units of the apparatus for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 400 according to this embodiment of the invention is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.

As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, a bus 430 connecting different system components (including the storage unit 420 and the processing unit 410), and a display unit 440.

The storage unit 420 stores program code that is executable by the processing unit 410 to cause the processing unit 410 to perform the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section of the present specification. For example, the processing unit 410 may execute the steps shown in fig. 1: step S110, mining a relation template in a corpus according to a known disease-drug relationship; step S120, acquiring a new disease-drug relationship in the corpus according to the relation template; step S130, constructing a disease-drug relationship network according to the new disease-drug relationship; step S140, mining potential disease-drug relationships in the disease-drug relationship network; and step S150, verifying the potential disease-drug relationship through the disease-drug relationship record in the real medical data, and determining the verified potential disease-drug relationship as a target disease-drug relationship.

The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203.

The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 430 may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor bus or local bus using any of a variety of bus architectures.

The electronic device 400 may also communicate with one or more external devices 470 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 400 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 450. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 460. As shown, the network adapter 460 communicates with the other modules of the electronic device 400 over the bus 430. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

Referring to fig. 5, a program product 500 for implementing the above method according to an embodiment of the present invention is described. The program product may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this regard. In this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims

1. A method of data mining, comprising:

expanding a known disease drug relationship according to the hypernym-hyponym (superordinate and subordinate) relations of the disease and of the drug;

acquiring a relation template in a corpus according to the expanded known disease drug relation; the relationship template is used for identifying the disease drug relationship;

inputting each relation template into the corpus so as to search each medical document in the corpus for sentences comprising the relation template, and performing relation extraction on the sentences comprising the relation template so as to obtain new disease drug relationships;

respectively converting the diseases and the medicines in the new disease and medicine relations into nodes, and setting edges among the nodes with the disease and medicine relations to construct a disease and medicine relation network; wherein the disease-drug relationship network comprises nodes characterizing the disease and nodes characterizing the drug and edges characterizing the disease-drug relationship between nodes;

mapping each node in the disease-drug relationship network into a vector;

mining a potential disease drug relationship according to a preset similarity and the similarity between the vectors corresponding to two nodes between which the edge does not exist;

matching the potential disease drug relationship in real medical data, and acquiring the occurrence frequency of the disease drug relationship matched with the potential disease drug relationship in the disease drug relationship record;

calculating the significance level of the potential disease drug relationship according to the occurrence frequency, verifying the potential disease drug relationship according to the significance level, and determining the potential disease drug relationship as a target disease drug relationship when the occurrence frequency is greater than a preset frequency.

2. The data mining method of claim 1, wherein the similarity calculation formula is:

[formula not reproduced in the source text]

wherein the two vectors in the formula are the vector corresponding to the node characterizing the disease and the vector corresponding to the node characterizing the drug, these being the vectors corresponding to the two nodes between which the edge does not exist, and the formula gives the similarity between the two vectors.

3. The data mining method of claim 1, wherein the matching of the potential disease drug relationships in real medical data comprises:

matching the potential disease drug relationships in the same language as the real medical data by a disease drug relationship record in the real medical data.

4. The data mining method of claim 3, further comprising: constructing the parallel corpus; wherein the constructing the parallel corpus comprises:

5. A data mining device, comprising:

the template acquisition module is used for expanding the relation of the known diseases and the medicines according to the upper and lower relations of the diseases and the medicines; acquiring a relation template in a corpus according to the expanded known disease drug relation; the relationship template is used for identifying the disease drug relationship;

the relation extraction module is used for respectively inputting the relation templates into the corpus so as to search sentences comprising the relation templates in each medical document in the corpus and extract the relations of the sentences comprising the relation templates so as to obtain new medicine disease relations;

the relationship mining module is used for respectively converting the diseases and the medicines in the new disease and medicine relationship into nodes and setting edges among the nodes with the disease and medicine relationship to construct a disease and medicine relationship network; wherein the disease-drug relationship network comprises nodes characterizing the disease and nodes characterizing the drug and edges characterizing the disease-drug relationship between nodes; mapping each node in the disease-drug relationship network into a vector; calculating the similarity between vectors corresponding to two nodes where the edge does not exist, wherein one node of the two nodes represents the disease and the other node represents the drug; according to the similarity between vectors corresponding to the two nodes without the edge, and a preset similarity, mining the relation of potential disease drugs;

the relationship verification module is used for matching the potential disease medicine relationship in real medical data, acquiring the occurrence frequency of the disease medicine relationship matched with the potential disease medicine relationship in the disease medicine relationship record, and determining the potential medicine disease relationship as a target disease medicine relationship when the occurrence frequency is greater than a preset frequency; and calculating the significance level of the potential disease drug relationship according to the occurrence frequency, verifying the potential disease drug relationship according to the significance level, and determining the potential disease drug relationship as a target disease drug relationship when the occurrence frequency is greater than a preset frequency.

6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data mining method of any one of claims 1 to 4.

7. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the data mining method of any of claims 1-4 via execution of the executable instructions.