WO2021155684A1

WO2021155684A1 - Gene-disease relationship knowledge base construction method and apparatus, and computer device

Info

Publication number: WO2021155684A1
Application number: PCT/CN2020/125143
Authority: WO
Inventors: 张圣; 顾大中
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-09-09
Filing date: 2020-10-30
Publication date: 2021-08-12
Also published as: CN112036151B; CN112036151A

Abstract

A gene-disease relationship knowledge base construction method and apparatus, and a computer device. The method comprises: performing dependency analysis on a designated number of natural sentences to acquire dependency relationships; determining path descriptors of the natural sentences on the basis of the dependency relationships; generating rule templates on the basis of the path descriptors and constructing a rule template database; and using the rule templates to perform knowledge extraction on the full body of medical documents to acquire gene-disease relationships and construct a gene-disease relationship knowledge base. The method automatically learns a large number of rule templates and then uses them to automatically extract gene-disease relationship knowledge from medical documents without high manual labor costs. In addition, the amount of knowledge extracted is large, the extraction effect is good, and the method transfers well to the extraction of other relationships between medical entities. The method further relates to blockchain technology: the rule templates, the gene-disease relationship knowledge base, and the like can be stored in a blockchain.

Description

Method, device and computer equipment for constructing genetic disease relationship knowledge base

This application claims priority to a Chinese patent application filed with the Chinese Patent Office on September 9, 2020, with application number 2020109416427 and the invention title "Methods, Apparatus, and Computer Equipment for Constructing Genetic Disease Relation Knowledge Base", the entire contents of which are incorporated in this application by reference.

Technical field

This application relates to the field of artificial intelligence and smart medical care, in particular to a method, device and computer equipment for constructing a genetic disease relationship knowledge base.

Background technique

Medical literature contains a large number of natural sentences that describe target relationships between diseases and genes. Disease target genes are of great significance for basic medical research, disease diagnosis and treatment, and targeted drug development. Existing high-quality disease-gene target relationships are basically constructed manually by experts. However, with the exponential growth of medical literature, a method that relies solely on manual editing and review by experts cannot realistically produce a relatively complete knowledge base.

The inventor found that there are also technical solutions that use computer technology to automatically obtain medical entity relationships from medical literature. These solutions fall into two main types: medical entity relationship extraction based on manually designed rules, and medical entity relationship extraction based on machine learning. Current rule-based solutions require domain experts to summarize usable high-quality rules, and the amount of knowledge that can be obtained depends entirely on the quality and quantity of those rules. At present, most rule-based solutions have a low recall rate; their precision is higher, but so is their cost. Among machine learning approaches, the best models at present are relationship extraction models based on deep learning, but even these perform relatively poorly on medical relationship extraction, and a large gap remains before they are practical. In addition, training deep learning models requires large, high-quality labeled data sets, and high-quality labeled data for medical relationship extraction must be annotated manually by experts.

Technical problem

The main purpose of this application is to provide a method, device and computer equipment for constructing a gene-disease relationship knowledge base, aiming to solve the high cost and poor results of current approaches to building gene-disease relationship knowledge bases.

Technical solutions

In order to achieve the above-mentioned purpose of the invention, the first aspect of this application proposes a method for constructing a genetic disease relationship knowledge base, which includes:

Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the dependency path between the gene entity and the disease entity;

Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;

The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.

In the second aspect, this application also provides a device for constructing a genetic disease relationship knowledge base, including:

The dependency analysis module is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

A path descriptor determining module, configured to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;

A rule template generation module, which is used to generate a rule template according to the path descriptor of each natural sentence, and build a rule template library;

The knowledge extraction module is used for extracting knowledge from the full amount of medical literature by using the rule templates in the rule template library, obtaining genetic disease relationships, and establishing a genetic disease relationship knowledge base.

In a third aspect, the present application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, a method for constructing a gene-disease relationship knowledge base is realized, wherein the method for constructing the gene-disease relationship knowledge base includes:

In a fourth aspect, the present application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, a method for constructing a gene-disease relationship knowledge base is realized, wherein the method for constructing the gene-disease relationship knowledge base includes:

Beneficial effect

The method, device and computer equipment for constructing a gene-disease relationship knowledge base of this application automatically learn a large number of rule templates by analyzing a specified number of natural sentences containing gene-disease entity pairs, and then use the rule templates to automatically extract gene-disease relationship knowledge from medical literature. This requires no high labor cost, yields a large amount of extracted knowledge with good extraction quality, and transfers well, so it can be applied to the extraction of more medical entity relationships.

Description of the drawings

FIG. 1 is a schematic flow chart of a method for constructing a gene-disease relationship knowledge base according to an embodiment of the application;

FIG. 2 is a schematic diagram of an example of the dependency relationship of natural sentences according to an embodiment of the application;

FIG. 3 is a schematic diagram of an example of the dependency relationship of natural sentences according to another embodiment of the application;

FIG. 4 is a schematic block diagram of the structure of an apparatus for constructing a gene-disease relationship knowledge base according to an embodiment of the application;

FIG. 5 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.

The best mode of the present invention

In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

This application relates to the field of artificial intelligence, as well as the field of smart medical care in smart cities. Referring to Fig. 1, an embodiment of the present application provides a method for constructing a gene-disease relationship knowledge base, which includes the following steps:

S1. Perform a dependency relationship analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtain the dependency relationship of each natural sentence;

S2. Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, where the path descriptor refers to the sequence of all words in the natural sentence that lie on the dependency path between the gene entity and the disease entity;

S3. Generate a rule template according to the path descriptor of each natural sentence, and establish a rule template library;

S4. Use the rule template in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.

The task of medical relationship extraction is to determine the relationship between a given gene-disease entity pair based on the semantic information in a medical text containing that pair. In this embodiment, a scheme based on rule templates is used to extract gene-disease relationship knowledge from a large amount of medical literature. The rule templates in this embodiment are not constructed by experts. Expert-constructed rule templates require high labor costs and are few in number, so medical relationship knowledge extracted with expert-built rules is small in scale and expensive to obtain. In this embodiment, a large number of high-quality usable rule templates can be automatically learned from a specified number of natural sentences containing entity pairs. These templates are then used for knowledge extraction, so that a large amount of medical relationship knowledge can be obtained from the full body of medical literature and a knowledge base can be built.

As described in step S1 above, the rule templates are designed and extracted based on the dependency relationships of natural sentences. Dependency analysis, also known as dependency parsing, is one of the key technologies in natural language processing. It is the process of analyzing an input sentence to obtain its syntactic structure. Commonly used dependency analysis tools include the StanfordNLP toolkit, HanLP, spaCy, and FudanNLP of Fudan University. Case 1 below serves as a specific example.

Case 1: "The profile of the apelin makes it a therapeutic target for ischemic heart disease." The dependency relationship is shown in Figure 2, where each arrow represents a dependency between two words in the sentence, and the label on the arrow (such as det, nsubj, case, nmod) indicates the specific type of dependency. The dependency types of natural sentences have widely recognized, standardized classifications. In the figure, GENE represents apelin and DISE represents ischemic heart disease.

It can be seen from the figure that the dependency path between the given GENE entity and DISE entity in this sentence is GENE←profile←makes→target→DISE, and that makes is the root node (ROOT) of the path.

As described in step S2 above, the path descriptor can be determined from the dependency relationship. Taking Case 1 as an example, arranging all words on the dependency path between the given GENE entity and DISE entity in the order in which they appear in the natural sentence gives "profile GENE makes target DISE". This sequence is called the path descriptor.
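The derivation of a path descriptor from a dependency tree can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the token list and head indices for Case 1 are written by hand to match the path GENE←profile←makes→target→DISE described above (they are not parser output), and the function names path_to_root and path_descriptor are hypothetical.

```python
from typing import List

def path_to_root(heads: List[int], start: int) -> List[int]:
    """Return token indices from `start` up to the root, inclusive."""
    path = [start]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def path_descriptor(tokens: List[str], heads: List[int], gene: int, dise: int) -> str:
    """Join the words on the GENE-DISE dependency path in sentence order."""
    up_gene = path_to_root(heads, gene)
    up_dise = path_to_root(heads, dise)
    dise_ancestors = set(up_dise)
    # Lowest common ancestor: first node above GENE that is also above DISE.
    lca = next(i for i in up_gene if i in dise_ancestors)
    on_path = set(up_gene[: up_gene.index(lca) + 1])
    on_path |= set(up_dise[: up_dise.index(lca) + 1])
    return " ".join(tokens[i] for i in sorted(on_path))

# Case 1 with the entity mentions already replaced by GENE and DISE.
TOKENS = ["The", "profile", "of", "the", "GENE", "makes", "it",
          "a", "therapeutic", "target", "for", "DISE", "."]
# Hand-written 0-based head index per token; -1 marks the root ("makes").
HEADS = [1, 5, 4, 4, 1, -1, 5, 9, 9, 5, 11, 9, 5]

print(path_descriptor(TOKENS, HEADS, gene=4, dise=11))
```

With these assumed heads, the upward paths from GENE and DISE meet at makes, and sorting the path tokens by position reproduces the descriptor given for Case 1.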

As described in step S3 above, performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs yields a large number of path descriptors, which are then deduplicated to obtain candidate rule templates. The candidate rule templates are sorted, path descriptors whose number of supporting cases is less than a preset value are filtered out, and the quality of the remaining path descriptors is evaluated. Path descriptors that pass the evaluation are saved as rule templates and stored in the rule template library.

As described in step S4 above, after the rule template library is established, knowledge extraction is performed on the full amount of medical documents, a large amount of genetic disease relationships are obtained, the acquired genetic disease relationships are stored, and a genetic disease relationship knowledge base is established.

In one embodiment, before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtaining the dependency of each natural sentence, the step includes:

S01. Obtain natural sentences containing gene-disease entity pairs in a designated medical database;

S02. Randomly select a specified number of natural sentences containing gene-disease entity pairs.

As mentioned above, establishing rule templates first requires dependency analysis on a specified number of natural sentences containing gene-disease entity pairs; that is, the rule templates are learned automatically from given natural sentences. In this embodiment, the designated medical database is PubMed, the largest medical literature database; as of 2019, PubMed contained more than 30 million documents. The gene entity database is NCBI's gene entity database, and the disease entity database is the MeSH disease entity database. Both are widely recognized in the medical field for their high quality and wide coverage, and they provide the standard English names and aliases of genes and diseases. These names are used to extract sentences that mention both a gene and a disease from the medical literature. For example, in "Breastfeeding and the risk of breast cancer in BRCA1 mutation carriers.", breast cancer is the name of a disease in the disease entity database, and BRCA1 is the name of a gene in the gene entity database. A set of sentences containing both gene and disease entities is obtained from PubMed, a specified number of natural sentences is then drawn from it for dependency analysis, and the rule templates are finally obtained. More specifically, the specified number is 1 million.
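Steps S01 and S02 can be sketched as a dictionary lookup followed by random sampling. The sketch below assumes simple case-insensitive whole-word matching against small in-memory name lists; the function names, the toy name lists, and the fixed random seed are illustrative assumptions, and matching against the real gene and disease dictionaries would likely need more careful handling of aliases and ambiguous symbols.

```python
import random
import re
from typing import List, Sequence, Tuple

def find_entity_pairs(sentence: str,
                      gene_names: Sequence[str],
                      disease_names: Sequence[str]) -> List[Tuple[str, str]]:
    """Return every (gene, disease) pair whose names both occur in the sentence."""
    def mentioned(names: Sequence[str]) -> List[str]:
        return [n for n in names
                if re.search(r"\b" + re.escape(n) + r"\b", sentence, flags=re.IGNORECASE)]
    return [(g, d) for g in mentioned(gene_names) for d in mentioned(disease_names)]

def sample_sentences(sentences: Sequence[str],
                     gene_names: Sequence[str],
                     disease_names: Sequence[str],
                     k: int,
                     seed: int = 0) -> List[str]:
    """S01: keep sentences with at least one gene-disease pair. S02: sample k of them."""
    candidates = [s for s in sentences if find_entity_pairs(s, gene_names, disease_names)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))

# Toy stand-ins for the gene and disease entity databases.
GENES = ["BRCA1", "ACE"]
DISEASES = ["breast cancer", "heart failure"]
SENTENCE = "Breastfeeding and the risk of breast cancer in BRCA1 mutation carriers."

print(find_entity_pairs(SENTENCE, GENES, DISEASES))
```

In the embodiment the sample size k would be 1 million; the sketch caps it at the number of available candidates so that it also runs on small inputs.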

In an embodiment, the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtaining the dependency of each natural sentence includes:

S11. Use the natural language processing toolkit StanfordNLP to perform a dependency relationship analysis on each natural sentence to obtain the dependency relationship of each natural sentence.

As mentioned above, in this embodiment StanfordNLP is selected as the tool for dependency analysis. The StanfordNLP toolkit supports a complete text analysis pipeline in multiple languages, including tokenization, part-of-speech tagging, lemmatization, and dependency parsing. It also provides a Python interface to CoreNLP, so that a local Python implementation can be set up easily.

In one embodiment, after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating rule templates according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:

S21: Calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor; and,

S22. Identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.

As mentioned in step S21 above, the path descriptors obtained in step S2 contain a lot of redundancy. For example, the path descriptors {"GENE target in DISE", "GENE target on DISE", "GENE targets in DISE", "GENE targets on DISE"} are effectively redundant. To address this, this embodiment calculates the edit distance between different path descriptors; if the edit distance is less than or equal to the first specified value, the descriptors are treated as the same path descriptor. Subsequent statistics show that clustering path descriptors with the first specified value set to 2 (the preferred value) reduces the number of rule templates by 60%, removing a large amount of redundancy. Edit distance is the minimum number of editing operations needed to transform one of two given strings into the other, where the editing operations are deletion, insertion, and replacement. For example, "GENE target in DISE" can be changed to "GENE targets in DISE" through one insertion (insert s), and then to "GENE targets on DISE" through one replacement (i is replaced by o). Therefore, the edit distance between "GENE target in DISE" and "GENE targets on DISE" is 2.
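The edit distance computation and the clustering of step S21 can be sketched as follows. The distance function is the standard character-level Levenshtein dynamic program. The clustering strategy is an assumption, since the text does not specify one: each descriptor is greedily assigned to the first cluster whose representative is within the threshold. The names edit_distance and cluster_descriptors are hypothetical.

```python
from typing import Dict, Iterable, List

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # replace ca with cb
        prev = curr
    return prev[-1]

def cluster_descriptors(descriptors: Iterable[str],
                        max_distance: int = 2) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative to the descriptors within range."""
    clusters: Dict[str, List[str]] = {}
    for d in descriptors:
        for rep in clusters:
            if edit_distance(d, rep) <= max_distance:
                clusters[rep].append(d)
                break
        else:
            clusters[d] = [d]  # no close representative: start a new cluster
    return clusters

DESCRIPTORS = [
    "GENE target in DISE",
    "GENE target on DISE",
    "GENE targets in DISE",
    "GENE targets on DISE",
    "profile GENE makes target DISE",
]

print(edit_distance("GENE target in DISE", "GENE targets on DISE"))  # 2
print(len(cluster_descriptors(DESCRIPTORS)))                         # 2
```

Greedy assignment depends on input order and compares only against representatives, so it is one possible reading of "cluster into the same type" rather than the only one.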

As described in step S22 above, negation (neg) cannot be detected from the path descriptor alone. A specific example illustrates this.

Case 2: "The profile of the apelin did not make it a therapeutic target for ischemic heart disease." The dependency relationship is shown in Figure 3.

It can be seen that the dependency path between the given GENE and DISE here is GENE←profile←make→target→DISE, and the corresponding path descriptor is "profile GENE make target DISE", where make is the root node (ROOT) of the path. Judged by their path descriptors alone, Case 2 and Case 1 ("profile GENE makes target DISE") appear to express the same semantics. In fact, the negative semantics of Case 2 can be found in the dependency relationships of its root node make, which include a negation relation (neg). Therefore, among all the dependency relationships of the root node, if neg is present, the example is filtered out when generating the rule templates.
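The negation check of step S22 can be sketched as a scan over the dependents of the path's root node. The edge lists below are hand-written approximations of the parses of Case 1 and Case 2, not parser output, and the relation label neg follows the text; other annotation schemes may mark negation differently. The name root_is_negated is hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (head word, relation type, dependent word)

def root_is_negated(edges: List[Edge], root: str) -> bool:
    """True if any dependency leaving the root node has relation type 'neg'."""
    return any(head == root and rel == "neg" for head, rel, _ in edges)

# Hand-written edges for Case 1 (affirmative).
CASE1_EDGES: List[Edge] = [
    ("makes", "nsubj", "profile"),
    ("profile", "nmod", "GENE"),
    ("makes", "xcomp", "target"),
    ("target", "nmod", "DISE"),
]

# Hand-written edges for Case 2 ("... did not make it ...").
CASE2_EDGES: List[Edge] = [
    ("make", "nsubj", "profile"),
    ("profile", "nmod", "GENE"),
    ("make", "aux", "did"),
    ("make", "neg", "not"),
    ("make", "xcomp", "target"),
    ("target", "nmod", "DISE"),
]

print(root_is_negated(CASE1_EDGES, "makes"))  # False
print(root_is_negated(CASE2_EDGES, "make"))   # True
```

A sentence for which the check returns True would have its path descriptor dropped before rule templates are generated.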

In an embodiment, the step of generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library includes:

S31. Count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than a second specified value;

S32. Perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.

As mentioned above, the path descriptor is obtained in step S2. Taking Case 1 as an example, the sentence is "The profile of the apelin makes it a therapeutic target for ischemic heart disease.", the given GENE is apelin, and the given DISE is ischemic heart disease. This yields the path descriptor "profile GENE makes target DISE" (here, the path descriptor is the candidate rule template). After each data sample is processed, the following information is obtained: {data sample, entity pair in the data sample, corresponding path descriptor}.

For example, in case 1, you can get {"The profile of the apelin makes it a therapeutic target for ischemic heart disease.", <apelin, ischemic heart disease>, "profile GENE makes target DISE"}

Then the edit distance between every two path descriptors is calculated. If the edit distance is less than or equal to 2, the two are considered the same type of path descriptor, which resolves the redundancy problem. The resulting path descriptors are the candidate rule templates. For example, the path descriptor "profile GENE makes target DISE" in Case 1 can be simplified to "profile GENE make target DISE". After simplification, the new path descriptors corresponding to all data are obtained. After all data samples are processed, the data takes this form:

{case 1, genetic disease entity pair, path descriptor 1},..., {case n, genetic disease entity pair, path descriptor m}

The above data format can get the cases of each path descriptor through simple statistics:

{Path descriptor 1, corresponding to all cases, corresponding to all entity pair set 1},..., {path descriptor m, corresponding to all cases, corresponding to all entity pair set m}.

According to the cases corresponding to each path descriptor, the number of data samples corresponding to each path descriptor can be obtained through simple statistics. The format is as follows: {path descriptor 1, corresponding case number},...,{path descriptor m, corresponding case number}. Sort according to the case number of each path descriptor, and filter out path descriptors whose case number is less than the second specified value (here the second specified value is set to 3). This improves the universality and accuracy of the extracted path descriptors.
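The statistics of step S31 can be sketched with plain dictionaries. The record layout (sentence, entity pair, path descriptor) follows the data format described above; the function names and the placeholder records (s1, G1, D1, and so on) are hypothetical and carry no real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, Tuple[str, str], str]  # (sentence, (gene, disease), descriptor)

def group_by_descriptor(records: Iterable[Record]):
    """Collect the cases and the entity pair set of every path descriptor."""
    cases: Dict[str, List[str]] = defaultdict(list)
    pairs: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for sentence, pair, descriptor in records:
        cases[descriptor].append(sentence)
        pairs[descriptor].add(pair)
    return cases, pairs

def filter_by_case_count(cases: Dict[str, List[str]], min_cases: int = 3) -> List[str]:
    """Sort descriptors by case count and keep those with at least min_cases cases."""
    ranked = sorted(cases.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [descriptor for descriptor, sents in ranked if len(sents) >= min_cases]

# Placeholder records for illustration only.
RECORDS: List[Record] = [
    ("s1", ("G1", "D1"), "profile GENE make target DISE"),
    ("s2", ("G2", "D2"), "profile GENE make target DISE"),
    ("s3", ("G3", "D3"), "profile GENE make target DISE"),
    ("s4", ("G4", "D4"), "GENE in DISE"),
]

cases, pairs = group_by_descriptor(RECORDS)
print(filter_by_case_count(cases, min_cases=3))
```

The default min_cases=3 mirrors the second specified value given in the text: descriptors with fewer than 3 cases are dropped.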

Then, quality evaluation is performed on the filtered path descriptors, and the path descriptors that pass are saved as rule templates to establish the rule template library. Quality evaluation can use methods such as manual crowdsourcing or supervised learning.

In a specific embodiment, the step of performing quality evaluation on the filtered path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and establishing a rule template library includes:

S321. Count the entity pair sets corresponding to the path descriptors to be evaluated;

S322: Count the number of entity pairs in the entity pair set that exist in the CTD;

S323. If the number of existences is greater than the specified number threshold, or the ratio of the number of existences to the total number of entity pairs in the entity pair set is greater than the specified ratio threshold, the path descriptor to be evaluated is retained as an available rule template and stored to establish the rule template library.

As mentioned above, in this embodiment the rule templates are evaluated based on the idea of distant supervision. The core idea of distant supervision is that if a knowledge triple exists in an existing knowledge base (such as <ACE, target, heart failure>, which means that the gene ACE and the disease heart failure have a target relationship), then text that mentions the entity pair (such as ACE and heart failure) has a high probability of describing the target semantics of that pair. Specifically, the existing knowledge base used is CTD (Comparative Toxicogenomics Database), a widely recognized knowledge base in the medical field.

A path descriptor i is randomly selected from the path descriptor set 1 to m, together with its corresponding entity pair set i, and the number of entity pairs in entity pair set i that exist in the CTD knowledge base is counted. If this number is greater than the specified number threshold (preferably 4), or the ratio of this number to the total number of entity pairs in the entity pair set is greater than the specified ratio threshold (preferably 0.5), path descriptor i is retained as an available rule template. All available rule templates are saved to build the rule template library.
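The acceptance test of steps S321 to S323 can be sketched as a set intersection followed by two threshold checks. The set KNOWN_PAIRS below is a toy stand-in for the gene-disease pairs of the reference knowledge base and does not reproduce real CTD content; apart from the ACE and heart failure example taken from the text, the pair names are placeholders, and the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (gene, disease)

def passes_distant_supervision(entity_pairs: Iterable[Pair],
                               known_pairs: Set[Pair],
                               count_threshold: int = 4,
                               ratio_threshold: float = 0.5) -> bool:
    """Keep a descriptor if enough of its entity pairs appear in the knowledge base."""
    pairs = set(entity_pairs)
    if not pairs:
        return False
    hits = len(pairs & known_pairs)
    return hits > count_threshold or hits / len(pairs) > ratio_threshold

# Toy stand-in for the reference knowledge base (not real CTD content).
KNOWN_PAIRS: Set[Pair] = {("ACE", "heart failure"), ("G1", "D1")}

good = [("ACE", "heart failure"), ("G1", "D1"), ("G2", "D2")]             # 2 of 3 known
weak = [("ACE", "heart failure"), ("G2", "D2"), ("G3", "D3"), ("G4", "D4")]  # 1 of 4 known

print(passes_distant_supervision(good, KNOWN_PAIRS))  # True
print(passes_distant_supervision(weak, KNOWN_PAIRS))  # False
```

The defaults 4 and 0.5 are the preferred thresholds stated in the text; the two conditions are combined with "or", so a descriptor with many entity pairs can pass on the count alone.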

In one embodiment, the step of using the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base includes:

S41. Perform entity recognition on natural sentences in a full amount of medical literature, and obtain natural sentences containing gene-disease entity pairs;

S42: Perform a dependency relationship analysis on all natural sentences including gene-disease entity pairs to obtain the dependency relationship of each natural sentence;

S43: Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;

S44: Determine whether the path descriptor is in the rule template library;

S45. If yes, obtain the genetic disease relationship according to the path descriptor, and save the genetic disease relationship in the genetic disease relationship knowledge base.

As described above, in steps S1 to S3 the path descriptors are obtained by analyzing the dependency relationships of natural sentences, the path descriptors are then quality-evaluated to obtain rule templates, and the rule template library is established. One million natural sentences containing gene-disease entity pairs were selected when establishing the rule template library. For extraction, gene and disease entities are recognized in the natural sentences of the full body of medical literature to obtain all natural sentences that contain gene-disease entity pairs. The toolkit is then used to analyze these sentences in turn, obtaining the dependency relationship of each sentence and determining its path descriptor. Next, it is determined whether the path descriptor is in the rule template library created through steps S1 to S3. If so, the relationship between the gene and the disease (such as target in Case 1) is obtained according to the path descriptor and stored in the gene-disease relationship knowledge base.
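Steps S44 and S45 can be sketched as a lookup of each sentence's path descriptor in the template library. The sketch assumes that entity recognition, dependency analysis, and descriptor generation (S41 to S43) have already produced one tuple per sentence, and that the library maps each template to a relation label; that mapping, the function name extract_relations, and the second sample sentence are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Parsed = Tuple[str, str, str, str]  # (sentence, gene, disease, path descriptor)

def extract_relations(parsed: Iterable[Parsed],
                      template_library: Dict[str, str]) -> List[dict]:
    """S44/S45: keep sentences whose descriptor is a known rule template."""
    knowledge_base = []
    for sentence, gene, disease, descriptor in parsed:
        relation = template_library.get(descriptor)
        if relation is not None:
            knowledge_base.append({
                "gene": gene,
                "relation": relation,
                "disease": disease,
                "evidence": sentence,
            })
    return knowledge_base

# Assumed library: rule template -> relation label.
TEMPLATES = {"profile GENE makes target DISE": "target"}

PARSED: List[Parsed] = [
    ("The profile of the apelin makes it a therapeutic target for ischemic heart disease.",
     "apelin", "ischemic heart disease", "profile GENE makes target DISE"),
    ("Breastfeeding and the risk of breast cancer in BRCA1 mutation carriers.",
     "BRCA1", "breast cancer", "risk DISE GENE carriers"),  # descriptor made up for illustration
]

for entry in extract_relations(PARSED, TEMPLATES):
    print(entry["gene"], entry["relation"], entry["disease"])
```

Only the first sentence matches a template, so a single triple is stored. Keeping the evidence sentence alongside each triple is a design choice of the sketch, not something the text requires.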

In one embodiment, the medical database, the rule templates, the gene-disease relationship knowledge base, and the like are stored in nodes of a blockchain, and the above method for constructing the gene-disease relationship knowledge base is implemented in the blockchain.

As mentioned above, blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer. The underlying platform can include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including maintaining public and private key generation (account management), key management, and maintaining the correspondence between a user's real identity and blockchain address (authority management); where authorized, it also supervises and audits the transactions of certain real identities and provides risk control rule configuration (risk control audit). The basic service module is deployed on all blockchain node devices to verify the validity of business requests and, after consensus is reached on a valid request, record it to storage. For a new business request, the basic service first performs interface adaptation, parsing, and authentication (interface configuration), then encrypts the business information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution. Developers can define contract logic in a programming language and publish it to the blockchain (contract registration); execution is triggered by a key or other event according to the logic of the contract terms to complete the contract logic, and the module also provides contract upgrade and cancellation functions. The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings, cloud adaptation, and the visual output of real-time status during product operation, such as alarms, monitoring network conditions, and monitoring the health status of node devices.

The method for constructing the gene-disease relationship knowledge base of this embodiment automatically learns a large number of rule templates by analyzing the dependency relationships of a specified number of natural sentences containing gene-disease entity pairs, and then uses the rule templates to automatically extract a large amount of gene-disease relationship knowledge from the medical literature. It does not require high labor costs, the amount of extracted knowledge is large, the extraction quality is good, and it transfers well, so it can be applied to the extraction of more medical entity relationships.

Referring to Fig. 4, an embodiment of the present application also provides an apparatus for constructing a genetic disease relationship knowledge base, including:

The dependency analysis module 1 is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

The path descriptor determining module 2 is used to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the dependency path between the gene entity and the disease entity;

The rule template generation module 3 is configured to generate a rule template according to the path descriptor of each natural sentence, and establish a rule template library;

The knowledge extraction module 4 is configured to use the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.

In an embodiment, the gene-disease relationship knowledge base construction device further includes:

The natural sentence acquisition module is used to acquire natural sentences containing gene-disease entity pairs in the designated medical database;

The selection module is used to randomly select a specified number of natural sentences containing gene-disease entity pairs.

In an embodiment, the dependency analysis module 1 includes:

The dependence relationship analysis unit is used to analyze the dependence relationship of each natural sentence by using the natural language processing toolkit StanfordNLP to obtain the dependence relationship of each natural sentence.

In an embodiment, the gene-disease relationship knowledge base construction device further includes:

The clustering module is used to calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor;

The filtering module is used to identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
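The clustering and filtering modules can be sketched as follows. This is only an illustration under stated assumptions: the edit distance is computed over tokens rather than characters, the clustering is a simple greedy pass that compares each descriptor with the first member of each existing cluster, and negation is detected with a small hypothetical cue-word list (`NEGATION_CUES`) instead of inspecting the dependency relations. None of these choices is specified by the application.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cluster_descriptors(descriptors, max_distance):
    """Greedy clustering: a descriptor joins the first cluster whose
    representative (first member) is within max_distance of it."""
    clusters = []
    for descriptor in descriptors:
        for cluster in clusters:
            if edit_distance(descriptor, cluster[0]) <= max_distance:
                cluster.append(descriptor)
                break
        else:
            clusters.append([descriptor])
    return clusters


# Hypothetical cue words; a real system might rely on the parser's output.
NEGATION_CUES = {"not", "no", "never", "without"}


def has_negation(sentence_tokens):
    """Return True if the sentence contains one of the negation cue words."""
    return any(token.lower() in NEGATION_CUES for token in sentence_tokens)


descriptors = [
    ("is", "associated", "with"),
    ("was", "associated", "with"),
    ("causes",),
]
print(cluster_descriptors(descriptors, max_distance=1))
```

Here `max_distance` plays the role of the first specified value; sentences for which `has_negation` returns True would have their path descriptors discarded before template generation.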

In an embodiment, the rule template generation module 3 includes:

The statistics module is used to count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;

The quality evaluation module is used to perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
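The counting step of the statistics module can be illustrated with a short sketch. It assumes each natural sentence has already been reduced to one path descriptor (a tuple of words); the function name `filter_by_support` and the parameter `min_cases`, which stands in for the second specified value, are hypothetical.

```python
from collections import Counter


def filter_by_support(sentence_descriptors, min_cases):
    """Count how many sentences share each path descriptor and keep only
    the descriptors supported by at least min_cases sentences."""
    counts = Counter(sentence_descriptors)
    return {d: n for d, n in counts.items() if n >= min_cases}


observed = [("cause",), ("cause",), ("cause",), ("in",)]
print(filter_by_support(observed, min_cases=2))
# -> {('cause',): 3}
```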

In an embodiment, the quality evaluation module includes:

The first statistics unit is used to count the entity pair sets corresponding to the path descriptors to be evaluated;

The second statistical unit is used to count the number of entity pairs in the entity pair set that exist in the CTD;

A processing unit, configured to retain the path descriptor to be evaluated as an available rule template if the number of existences is greater than the specified number threshold or the ratio of the number of existences to the total number of entities in the entity pair set is greater than the specified ratio threshold, and store it to build a rule template library.
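The acceptance test applied by the processing unit can be sketched as below. The sketch assumes that the CTD content is available as a set of (gene, disease) pairs and that the ratio is taken over the number of distinct entity pairs collected for the descriptor; the function name, the thresholds, and the sample pairs (including the placeholder `GENE_X`) are hypothetical illustration values, not data from the application or from the CTD.

```python
def passes_quality(entity_pairs, ctd_pairs, count_threshold, ratio_threshold):
    """Decide whether a path descriptor is kept as a rule template.

    entity_pairs: (gene, disease) pairs extracted with this descriptor.
    ctd_pairs:    set of (gene, disease) pairs assumed to come from the CTD.
    """
    pairs = set(entity_pairs)
    if not pairs:
        return False
    hits = len(pairs & ctd_pairs)  # pairs that also exist in the CTD
    # Either condition is sufficient, mirroring the "or" in the text above.
    return hits > count_threshold or hits / len(pairs) > ratio_threshold


pairs = [("BRCA1", "breast cancer"), ("TP53", "lung cancer"), ("GENE_X", "asthma")]
known = {("BRCA1", "breast cancer"), ("TP53", "lung cancer")}
print(passes_quality(pairs, known, count_threshold=5, ratio_threshold=0.5))
# -> True (2 of 3 pairs are known, and 2/3 exceeds 0.5)
```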

In an embodiment, the knowledge extraction module 4 includes:

The entity recognition unit is used for entity recognition of natural sentences in a full amount of medical literature, and obtaining natural sentences containing gene-disease entity pairs;

The dependency analysis unit is used to perform dependency analysis on all natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

A path descriptor determining unit, configured to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;

A judging unit, configured to judge whether the path descriptor is in the rule template library;

The obtaining unit is configured to, if so, obtain the genetic disease relationship according to the path descriptor, and save the genetic disease relationship in the genetic disease relationship knowledge base.
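The judging and obtaining units amount to a lookup of each sentence's path descriptor in the rule template library. The sketch below assumes that entity recognition, dependency analysis, and path descriptor determination have already produced (gene, disease, descriptor) triples, and that the library maps each descriptor to a relation label; both the data layout and the labels are hypothetical.

```python
def extract_relations(candidates, template_library):
    """Keep the gene-disease pairs whose path descriptor matches a template.

    candidates:       iterable of (gene, disease, descriptor) triples.
    template_library: dict mapping descriptor -> relation label (assumed).
    """
    knowledge_base = set()
    for gene, disease, descriptor in candidates:
        relation = template_library.get(descriptor)
        if relation is not None:  # descriptor is in the rule template library
            knowledge_base.add((gene, relation, disease))
    return knowledge_base


library = {("mutations", "cause"): "causes"}
candidates = [
    ("BRCA1", "breast cancer", ("mutations", "cause")),
    ("TP53", "asthma", ("studied", "alongside")),
]
print(extract_relations(candidates, library))
# -> {('BRCA1', 'causes', 'breast cancer')}
```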

As described above, it can be understood that the components of the genetic disease relationship knowledge base construction device proposed in this application can realize the functions of any one of the above-mentioned methods for constructing a genetic disease relationship knowledge base, and the specific structure will not be repeated here.

Referring to FIG. 5, an embodiment of the present application also provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data such as rule templates. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to realize a method for constructing a genetic disease relationship knowledge base.

The above-mentioned processor executes the above-mentioned method for constructing a genetic disease relationship knowledge base, including:

The present application also provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. A computer program is stored thereon, and when the computer program is executed by a processor, a method for constructing a genetic disease relationship knowledge base is implemented, including the following steps:

The above-mentioned method for constructing the genetic disease relationship knowledge base performs dependency relationship analysis on a specified number of natural sentences containing gene-disease entity pairs, automatically learns a large number of rule templates, and then uses the rule templates to automatically extract gene-disease relationship knowledge from medical literature. It does not require high labor costs, extracts a large amount of knowledge with good extraction quality, and has good transferability and applicability, so it can be used to extract more relationships between medical entities.

A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when the computer program is executed, it may include the processes of the above-mentioned method embodiments. Any reference to memory, storage, database or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, but also includes other elements not explicitly listed, or elements inherent to the process, device, article, or method. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

The above are only the preferred embodiments of this application, and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of this application, or applied directly or indirectly in other related technical fields, is equally included in the scope of patent protection of this application.

Claims

A method for constructing a knowledge base of genetic disease relationships, including:

Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;

Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;

The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence, the method further includes:

Obtain natural sentences containing gene-disease entity pairs in the designated medical database;

Randomly select a specified number of natural sentences containing gene-disease entity pairs.
The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence comprises:

A natural language processing toolkit StanfordNLP is used to analyze the dependence relationship of each natural sentence to obtain the dependence relationship of each natural sentence.
The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:

Calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor; and,

Identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library comprises:

Count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;

Perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
The method for constructing a genetic disease relationship knowledge base according to claim 5, wherein the step of performing quality evaluation on the filtered path descriptors, saving the path descriptors that pass the quality evaluation as a rule template, and establishing a rule template library includes:

Count the set of entity pairs corresponding to the path descriptors to be evaluated;

Count the number of entity pairs in the entity pair set that exist in the CTD;

If the number of existences is greater than the specified number threshold or the ratio of the number of existences to the total number of entities in the entity pair set is greater than the specified ratio threshold, the path descriptor to be evaluated is retained as an available rule template, and the rule template is stored to build a rule template library.
The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein the step of using the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base includes:

Perform entity recognition of natural sentences in a full amount of medical literature, and obtain natural sentences containing gene-disease entity pairs;

Perform dependency analysis on all natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

Determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;

Judging whether the path descriptor is in the rule template library;

If so, the genetic disease relationship is obtained according to the path descriptor, and the genetic disease relationship is stored in the genetic disease relationship knowledge base.
A device for constructing a genetic disease relationship knowledge base, including:

The dependency analysis module is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

The path descriptor determining module is used to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;

A rule template generation module, which is used to generate a rule template according to the path descriptor of each natural sentence, and build a rule template library;

The knowledge extraction module is used for extracting knowledge from the full amount of medical literature by using the rule templates in the rule template library, obtaining genetic disease relationships, and establishing a genetic disease relationship knowledge base.
A computer device includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, a method for constructing a genetic disease relationship knowledge base is realized, wherein the method for constructing a genetic disease relationship knowledge base includes:

Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;

Generate a rule template according to the path descriptor of each natural sentence, and establish a rule template library;

Use the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
The computer device according to claim 9, wherein before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence, the method further comprises:

Obtain natural sentences containing gene-disease entity pairs in the designated medical database;

Randomly select a specified number of natural sentences containing gene-disease entity pairs.
The computer device according to claim 9, wherein the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence comprises:

A natural language processing toolkit StanfordNLP is used to analyze the dependence relationship of each natural sentence to obtain the dependence relationship of each natural sentence.
The computer device according to claim 9, wherein after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:

Calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor; and,

Identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
The computer device according to claim 9, wherein the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library comprises:

Count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;

Perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
The computer device according to claim 13, wherein the step of performing quality evaluation on the filtered path descriptors, saving the path descriptors passing the quality evaluation as a rule template, and establishing a rule template library comprises:

Count the set of entity pairs corresponding to the path descriptors to be evaluated;

Count the number of entity pairs in the entity pair set that exist in the CTD;

If the number of existences is greater than the specified number threshold or the ratio of the number of existences to the total number of entities in the entity pair set is greater than the specified ratio threshold, the path descriptor to be evaluated is retained as an available rule template, and the rule template is stored to build a rule template library.
A computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, a method for constructing a genetic disease relationship knowledge base is realized, wherein the method for constructing a genetic disease relationship knowledge base includes:

Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;

Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;

Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;

The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
The computer-readable storage medium according to claim 15, wherein before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence, the method further comprises:

Obtain natural sentences containing gene-disease entity pairs in the designated medical database;

Randomly select a specified number of natural sentences containing gene-disease entity pairs.
The computer-readable storage medium according to claim 15, wherein the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence comprises:

A natural language processing toolkit StanfordNLP is used to analyze the dependence relationship of each natural sentence to obtain the dependence relationship of each natural sentence.
The computer-readable storage medium according to claim 15, wherein after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:

Calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor; and,

Identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
The computer-readable storage medium according to claim 15, wherein the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library comprises:

Count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;

Perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
The computer-readable storage medium according to claim 19, wherein the step of performing quality evaluation on the filtered path descriptors, saving the path descriptors that pass the quality evaluation as a rule template, and establishing a rule template library comprises:

Count the set of entity pairs corresponding to the path descriptors to be evaluated;

Count the number of entity pairs in the entity pair set that exist in the CTD;

If the number of existences is greater than the specified number threshold or the ratio of the number of existences to the total number of entities in the entity pair set is greater than the specified ratio threshold, the path descriptor to be evaluated is retained as an available rule template, and the rule template is stored to build a rule template library.