WO2021155684A1 - Gene-disease relationship knowledge base construction method and apparatus, and computer device - Google Patents

Gene-disease relationship knowledge base construction method and apparatus, and computer device

Info

Publication number
WO2021155684A1
Authority
WO
WIPO (PCT)
Prior art keywords
rule template
path
natural
natural sentence
relationship
Prior art date
Application number
PCT/CN2020/125143
Other languages
French (fr)
Chinese (zh)
Inventor
张圣
顾大中
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021155684A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • This application relates to the fields of artificial intelligence and smart healthcare, and in particular to a method, a device and computer equipment for constructing a gene-disease relationship knowledge base.
  • Current rule-based solutions require domain experts to summarize high-quality rules, so the amount of knowledge that can be obtained depends entirely on the quality and quantity of those rules. At present, most rule-based solutions have a low recall rate; their precision is relatively high, but so is their cost.
  • The best models at present are relationship extraction models based on deep learning, but even these perform relatively poorly on medical relationship extraction, and a large gap remains before they are practical.
  • In addition, training deep learning models requires a large amount of high-quality labeled data, and high-quality labels for medical relationship extraction must be produced manually by experts.
  • The main purpose of this application is to provide a method, a device and computer equipment for constructing a gene-disease relationship knowledge base, aiming to solve the high cost and poor effect of current approaches to building such a knowledge base.
  • the first aspect of this application proposes a method for constructing a genetic disease relationship knowledge base, which includes:
  • determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • the rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  • this application also provides a device for constructing a genetic disease relationship knowledge base, including:
  • the dependency analysis module is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
  • a path descriptor determining module configured to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence
  • a rule template generation module which is used to generate a rule template according to the path descriptor of each natural sentence, and build a rule template library;
  • the knowledge extraction module is used for extracting knowledge from the full amount of medical literature by using the rule templates in the rule template library, obtaining genetic disease relationships, and establishing a genetic disease relationship knowledge base.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, a method for constructing a genetic disease relationship knowledge base is realized, wherein
  • the methods for constructing the knowledge base of the genetic disease relationship include:
  • determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • the rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  • The present application also provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, a method for constructing a gene-disease relationship knowledge base is realized, wherein the method includes:
  • determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • the rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  • The method, device and computer equipment of this application automatically learn a large number of rule templates by analyzing a specified number of natural sentences containing gene-disease entity pairs, and then use the rule templates to automatically extract gene-disease relationship knowledge from medical literature. No high labor cost is required, the amount of extracted knowledge is large, the extraction effect is good, and the approach transfers well, so it can be applied to extracting relationships between other medical entities.
  • FIG. 1 is a schematic flow chart of a method for constructing a gene-disease relationship knowledge base according to an embodiment of the application
  • FIG. 2 is a schematic diagram of an example of the dependency relationship of natural sentences according to an embodiment of the application
  • FIG. 3 is a schematic diagram of an example of the dependency relationship of natural sentences according to another embodiment of the application.
  • FIG. 4 is a schematic block diagram of the structure of an apparatus for constructing a gene-disease relationship knowledge base according to an embodiment of the application;
  • FIG. 5 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
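  • An embodiment of the present application provides a method for constructing a gene-disease relationship knowledge base, which includes the following steps:
  • S1: performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
  • S2: determining the path descriptor of each natural sentence according to its dependency relationship, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • S3: generating rule templates according to the path descriptor of each natural sentence, and establishing a rule template library;
  • S4: using the rule templates in the rule template library to extract knowledge from the full body of medical literature, obtaining gene-disease relationships, and establishing a gene-disease relationship knowledge base.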
  • the task of medical relationship extraction is to judge the relationship between a given gene and disease entity pair based on the semantic information in a medical text containing a given gene-disease entity pair.
  • a scheme based on a rule template is used to extract gene-disease relationship knowledge from a large amount of medical literature.
  • The rule templates in this embodiment are not constructed by experts. Expert-constructed rule templates incur high labor costs and are few in number, so the medical relationship knowledge extracted with expert-built rules is small in scale and expensive to obtain.
  • In this embodiment, a large number of high-quality, usable rule templates are learned automatically from a specified number of natural sentences containing entity pairs. These templates are then used for knowledge extraction, so that a large amount of medical relationship knowledge can be obtained from the full body of medical literature to build a knowledge base.
  • the rule template is designed and extracted based on the dependency relationship of the natural sentence.
  • Dependency analysis, also known as dependency parsing, is one of the key technologies in natural language processing. It analyzes an input sentence to obtain its syntactic structure.
  • Commonly used dependency analysis tools include the StanfordNLP toolkit, HanLP, spaCy and FudanNLP from Fudan University. Case 1 below serves as an example.
  • Case 1: "The profile of the apelin makes it a therapeutic target for ischemic heart disease."
  • The dependency relationship is shown in FIG. 2, where each arrow represents a dependency between two words in the sentence, and the label on the arrow (such as det, nsubj, case or nmod) indicates the specific dependency type. Dependency types for natural sentences have widely recognized, standardized classifications.
  • GENE in the figure represents apelin, and DISE represents ischemic heart disease.
  • The path descriptor can be determined from the dependency relationship. Taking Case 1 as an example, arranging all words on the dependency path between the given GENE entity and DISE entity in the order in which they appear in the natural sentence yields "profile GENE makes target DISE"; this string is called the path descriptor.
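  • The construction of a path descriptor can be sketched as follows. This is a minimal illustration, not the patent's implementation: the head indices for Case 1 are written by hand (assumed, since FIG. 2 is not reproduced here) rather than produced by a parser, and the function names `path_to_root` and `path_descriptor` are hypothetical.

```python
def path_to_root(heads, index):
    """Return the token indices from `index` up to the root (inclusive)."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def path_descriptor(tokens, heads, gene_index, dise_index):
    """Join the words on the GENE-DISE dependency path in sentence order."""
    gene_up = path_to_root(heads, gene_index)
    dise_up = path_to_root(heads, dise_index)
    dise_ancestors = set(dise_up)
    # The lowest common ancestor is the first node on the GENE side
    # that also lies on the DISE side.
    lca = next(i for i in gene_up if i in dise_ancestors)
    on_path = set(gene_up[: gene_up.index(lca) + 1])
    on_path |= set(dise_up[: dise_up.index(lca) + 1])
    return " ".join(tokens[i] for i in sorted(on_path))


# Case 1 with the entity mentions already replaced by GENE and DISE.
tokens = ["The", "profile", "of", "the", "GENE", "makes", "it", "a",
          "therapeutic", "target", "for", "DISE", "."]
# heads[i] is the index of the head of token i; -1 marks the root.
# These heads are hand-written for illustration, not parser output.
heads = [1, 5, 4, 4, 1, -1, 9, 9, 9, 5, 11, 11, 5]

descriptor = path_descriptor(tokens, heads,
                             tokens.index("GENE"), tokens.index("DISE"))
print(descriptor)  # profile GENE makes target DISE
```

  • Sorting the path nodes by token index is what turns the tree path into a descriptor that follows the word order of the sentence.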
  • In step S3, performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs yields a large number of path descriptors, which are deduplicated to obtain candidate rule templates. The candidate rule templates are then sorted by the number of cases (natural sentences) from which each path descriptor was extracted, and path descriptors whose case number is below a preset value are filtered out. The quality of the remaining path descriptors is evaluated, and those that pass are saved as rule templates in the rule template library.
  • In step S4, after the rule template library is established, knowledge extraction is performed on the full body of medical literature to obtain a large number of gene-disease relationships, which are stored to establish the gene-disease relationship knowledge base.
  • Before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence, the method includes:
  • the establishment of a rule template first needs to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, that is, it is necessary to automatically learn a rule template from a given natural sentence.
  • The designated medical database is PubMed, the largest medical literature database. As of 2019, PubMed contained more than 30 million documents.
  • The gene entity database is NCBI's gene entity database, and the disease entity database is the MeSH disease entity database.
  • Both entity databases are widely recognized in the medical field for their high quality and wide coverage.
  • The entity databases provide the standard English names and aliases of genes and diseases.
  • For example, breast cancer is the name of a disease in the disease entity database, and BRCA1 is the name of a gene in the gene entity database.
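  • Finding sentences that contain a gene-disease entity pair can be sketched as a dictionary lookup against the names and aliases in the two entity databases. The sketch below is an assumption about how such matching might be done, not the method stated in the source: the toy name sets, the placeholder strings and the functions `build_pattern` and `tag_entities` are all hypothetical.

```python
import re


def build_pattern(names):
    """Compile one case-insensitive pattern; longer names are tried first."""
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def tag_entities(sentence, gene_names, disease_names):
    """Replace dictionary matches with GENE / DISE placeholders.

    Returns the tagged sentence and whether it contains a gene-disease pair.
    """
    tagged, n_dise = build_pattern(disease_names).subn("DISE", sentence)
    tagged, n_gene = build_pattern(gene_names).subn("GENE", tagged)
    return tagged, (n_gene > 0 and n_dise > 0)


# Toy dictionaries standing in for the gene and disease entity databases.
gene_names = {"apelin"}
disease_names = {"ischemic heart disease", "heart disease"}

sentence = ("The profile of the apelin makes it a therapeutic target "
            "for ischemic heart disease.")
tagged, has_pair = tag_entities(sentence, gene_names, disease_names)
print(tagged)
print(has_pair)
```

  • Trying longer names first keeps "ischemic heart disease" from being matched only as the shorter alias "heart disease".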
  • the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtaining the dependency of each natural sentence includes:
  • StanfordNLP is selected as a tool for dependency analysis.
  • The StanfordNLP toolkit supports a complete text analysis pipeline in multiple languages, including tokenization, part-of-speech tagging, lemmatization and dependency parsing.
  • It also provides a Python interface to CoreNLP, which makes a local Python setup straightforward.
  • Before the step of generating rule templates according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:
  • The path descriptors obtained in step S2 contain a lot of redundancy. For example, the path descriptors {"GENE target in DISE", "GENE target on DISE", "GENE targets in DISE", "GENE targets on DISE"} are effectively duplicates of one another.
  • The edit distance between different path descriptors is calculated. If the edit distance is less than or equal to a first specified value, the descriptors are treated as the same path descriptor.
  • Edit distance refers to the minimum number of editing operations for two given strings to switch from one to the other.
  • the editing operations here can be delete, insert, and replace operations.
  • For example, "GENE target in DISE" can be changed into "GENE targets in DISE" by one insert operation (inserting "s"), and that in turn can be changed into "GENE targets on DISE" by one replace operation ("i" is replaced by "o"). The edit distance between "GENE target in DISE" and "GENE targets on DISE" is therefore 2.
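  • The edit distance described above is the Levenshtein distance, and the merging of near-identical descriptors can be sketched with a simple greedy clustering. The clustering strategy and the threshold value of 2 are assumptions for illustration; the source only states that descriptors within a "first specified value" are treated as the same.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and replace."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def cluster_descriptors(descriptors, max_distance):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for descriptor in descriptors:
        for cluster in clusters:
            if edit_distance(descriptor, cluster[0]) <= max_distance:
                cluster.append(descriptor)
                break
        else:
            clusters.append([descriptor])
    return clusters


descriptors = ["GENE target in DISE", "GENE target on DISE",
               "GENE targets in DISE", "GENE targets on DISE"]
print(edit_distance(descriptors[0], descriptors[3]))          # 2
print(len(cluster_descriptors(descriptors, max_distance=2)))  # 1
```

  • With a threshold of 2, all four redundant descriptors collapse into a single cluster; a stricter threshold of 1 would leave "GENE targets on DISE" in a cluster of its own under this greedy scheme.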
  • Negation information (neg) cannot be found in the path descriptor itself.
  • In Case 2, the dependency path between the given GENE and DISE is GENE - profile - make - target - DISE, and the corresponding path descriptor is "profile GENE make target DISE", where make is the root node (ROOT) of the path.
  • Judged by their path descriptors alone, Case 2 and Case 1 would appear to express the same semantics. In fact, Case 2 expresses a negation, which can be found in the dependency relations of its root node make: a negation dependency (neg) is attached to that node.
  • Such an example is filtered out when generating rule templates.
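  • The negation check can be sketched as a scan for a `neg` relation attached to any word on the dependency path. The sentence and parse below are hypothetical (they are not the wording of Case 2, which is not reproduced here), the parse is written by hand, and the sketch assumes the parser labels negation with a `neg` relation as in the text above.

```python
def has_negation(heads, deprels, path_indices):
    """True if any token on the dependency path governs a `neg` dependent."""
    on_path = set(path_indices)
    return any(rel == "neg" and heads[i] in on_path
               for i, rel in enumerate(deprels))


# Hypothetical negated sentence with entity placeholders already inserted.
tokens = ["The", "profile", "of", "the", "GENE", "does", "not", "make",
          "it", "a", "therapeutic", "target", "for", "DISE", "."]
# heads[i] is the index of the head of token i; -1 marks the root (make).
heads = [1, 7, 4, 4, 1, 7, 7, -1, 11, 11, 11, 7, 13, 13, 7]
deprels = ["det", "nsubj", "case", "det", "nmod", "aux", "neg", "root",
           "nsubj", "det", "amod", "xcomp", "case", "nmod", "punct"]

# Indices of the path words: profile, GENE, make, target, DISE.
path_indices = [1, 4, 7, 11, 13]

negated = has_negation(heads, deprels, path_indices)
print(negated)  # True, so this sentence's descriptor would be filtered out

# The same parse without the neg relation is kept.
affirmative = ["advmod" if rel == "neg" else rel for rel in deprels]
print(has_negation(heads, affirmative, path_indices))  # False
```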
  • the step of generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library includes:
  • the path descriptor can be obtained in step S2.
  • Take Case 1 as an example.
  • The sentence of Case 1 is "The profile of the apelin makes it a therapeutic target for ischemic heart disease."
  • In this example, the given GENE is apelin and the given DISE is ischemic heart disease.
  • The path descriptor obtained is "profile GENE makes target DISE" (here the path descriptor is the candidate rule template). After each data sample is processed, the following information is available: {data sample, entity pair in the data sample, corresponding path descriptor}.
  • Grouping by path descriptor gives: {path descriptor 1, all corresponding cases, corresponding entity pair set 1}, ..., {path descriptor m, all corresponding cases, corresponding entity pair set m}.
  • The number of data samples corresponding to each path descriptor can then be obtained by simple counting.
  • The format is as follows: {path descriptor 1, corresponding case number}, ..., {path descriptor m, corresponding case number}. The path descriptors are sorted by case number, and those whose case number is less than a second specified value (set to 3 here) are filtered out. This improves the generality and accuracy of the extracted path descriptors.
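  • The grouping, counting and filtering described above can be sketched as follows. The record layout, the toy records and the function names are assumptions made for illustration; only the threshold of 3 comes from the text.

```python
from collections import defaultdict


def group_by_descriptor(records):
    """records: iterable of (sentence_id, (gene, disease), descriptor)."""
    cases = defaultdict(list)   # descriptor -> sentence ids
    pairs = defaultdict(set)    # descriptor -> set of entity pairs
    for sentence_id, entity_pair, descriptor in records:
        cases[descriptor].append(sentence_id)
        pairs[descriptor].add(entity_pair)
    return cases, pairs


def filter_by_case_count(cases, min_cases=3):
    """Sort descriptors by case number and drop those below `min_cases`."""
    counts = {descriptor: len(ids) for descriptor, ids in cases.items()}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(descriptor, n) for descriptor, n in ranked if n >= min_cases]


# Toy records; G1..G4 and D1..D4 are placeholder entity names.
records = [
    (1, ("G1", "D1"), "profile GENE makes target DISE"),
    (2, ("G2", "D2"), "profile GENE makes target DISE"),
    (3, ("G3", "D1"), "profile GENE makes target DISE"),
    (4, ("G1", "D3"), "GENE associated DISE"),
    (5, ("G4", "D4"), "GENE associated DISE"),
]

cases, pairs = group_by_descriptor(records)
kept = filter_by_case_count(cases, min_cases=3)
print(kept)  # [('profile GENE makes target DISE', 3)]
```

  • The entity pair sets collected in `pairs` are kept alongside the counts because the later quality evaluation works on them.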
  • Quality evaluation can be done by manual crowdsourcing, supervised learning and similar methods.
  • The step of performing quality evaluation on the filtered path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and establishing a rule template library includes:
  • counting the entity pair set corresponding to the path descriptor to be evaluated, and counting how many entity pairs in that set exist in CTD; if that number is greater than a specified number threshold, or its ratio to the total number of entity pairs in the set is greater than a specified ratio threshold, the path descriptor to be evaluated is retained as a usable rule template and stored to establish the rule template library.
  • In this embodiment, rule templates are evaluated based on the idea of distant supervision.
  • The core idea of distant supervision is that if a knowledge triple exists in an existing knowledge base (such as <ACE, target, heart failure>, which means that the gene ACE and the disease heart failure have a target relationship), then a text that mentions the entity pair (such as ACE and heart failure) is highly likely to describe the target semantics of that entity pair.
  • The existing knowledge base used is CTD, the Comparative Toxicogenomics Database.
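  • The distant-supervision check can be sketched as a set-membership count against the reference knowledge base, following the retention rule given later for the processing unit of the apparatus (count above a number threshold, or ratio above a ratio threshold). The threshold values, the toy reference pairs and the function name `passes_quality_check` are assumptions; the source does not state concrete thresholds.

```python
def passes_quality_check(entity_pairs, known_pairs, min_count=2, min_ratio=0.5):
    """Keep a descriptor if enough of its entity pairs occur in the reference KB."""
    if not entity_pairs:
        return False
    hits = sum(1 for pair in entity_pairs if pair in known_pairs)
    return hits > min_count or hits / len(entity_pairs) > min_ratio


# Toy stand-in for the gene-disease pairs recorded in the reference KB.
known_pairs = {("ACE", "heart failure"), ("G1", "D1"), ("G2", "D2")}

# Entity pair sets collected per candidate descriptor (placeholder names).
candidate_pairs = {
    "profile GENE makes target DISE": {("G1", "D1"), ("G2", "D2"), ("G3", "D1")},
    "GENE associated DISE": {("G1", "D3"), ("G4", "D4"), ("G5", "D5")},
}

rule_templates = [descriptor
                  for descriptor, pairs in candidate_pairs.items()
                  if passes_quality_check(pairs, known_pairs)]
print(rule_templates)  # ['profile GENE makes target DISE']
```

  • The first descriptor passes on the ratio criterion (2 of its 3 pairs are known), while the second has no known pairs and is discarded.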
  • the step of using the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base includes:
  • Path descriptors are obtained by analyzing the dependency relationships of natural sentences; quality evaluation is then performed on the path descriptors to obtain rule templates and establish the rule template library.
  • One million natural sentences containing gene-disease entity pairs were selected when establishing the rule template library. For knowledge extraction, gene and disease entities are recognized in the natural sentences of the full body of medical literature to obtain all natural sentences that contain gene-disease entity pairs, and the toolkit is then used to analyze these sentences in turn to obtain the dependency relationship of each one.
  • The path descriptor of each sentence is determined and checked against the rule template library created through steps S1 to S3. If it is in the library, the relationship between the gene and the disease (such as target in Case 1) is obtained according to the path descriptor, and the gene-disease relationship is stored in the gene-disease relationship knowledge base.
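  • The template matching in the extraction step can be sketched as a lookup of each sentence's path descriptor in the rule template library. How a relation label is derived from a matching template is not spelled out in the source, so the mapping from template to label used here is an assumption, as are the function name and the toy inputs.

```python
def extract_relations(tagged_sentences, template_library):
    """Collect (gene, relation, disease) triples for sentences that match a template.

    tagged_sentences: iterable of (gene, disease, path_descriptor).
    template_library: dict mapping a rule template to a relation label
    (a hypothetical representation of the rule template library).
    """
    knowledge_base = set()
    for gene, disease, descriptor in tagged_sentences:
        relation = template_library.get(descriptor)
        if relation is not None:
            knowledge_base.add((gene, relation, disease))
    return knowledge_base


template_library = {"profile GENE makes target DISE": "target"}

sentences = [
    ("apelin", "ischemic heart disease", "profile GENE makes target DISE"),
    ("G9", "D9", "GENE mentioned DISE"),  # no matching template, ignored
]

knowledge_base = extract_relations(sentences, template_library)
print(knowledge_base)  # {('apelin', 'target', 'ischemic heart disease')}
```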
  • The medical database, the rule templates, the gene-disease relationship knowledge base and so on are stored in the nodes of a blockchain, and the above method for constructing the gene-disease relationship knowledge base is implemented on the blockchain.
  • Blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
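  • The cryptographic linking of data blocks can be illustrated with a minimal hash chain. This sketch shows only the tamper-evidence idea; it has no network, consensus or smart contracts, and the block layout and function names are assumptions rather than anything specified in the source.

```python
import hashlib
import json


def make_block(payload, previous_hash):
    """Build a block whose hash covers its payload and the previous block's hash."""
    body = json.dumps({"payload": payload, "previous_hash": previous_hash},
                      sort_keys=True)
    return {"payload": payload,
            "previous_hash": previous_hash,
            "hash": hashlib.sha256(body.encode("utf-8")).hexdigest()}


def is_valid_chain(chain):
    """Check every block's own hash and its link to the preceding block."""
    for position, block in enumerate(chain):
        expected = make_block(block["payload"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if position > 0 and block["previous_hash"] != chain[position - 1]["hash"]:
            return False
    return True


genesis = make_block({"templates": ["profile GENE makes target DISE"]},
                     previous_hash="0" * 64)
second = make_block({"relations": [["apelin", "target", "ischemic heart disease"]]},
                    previous_hash=genesis["hash"])
chain = [genesis, second]
print(is_valid_chain(chain))  # True

# Tampering with stored data invalidates the recorded hash.
second["payload"]["relations"][0][1] = "unrelated"
print(is_valid_chain(chain))  # False
```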
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the underlying platform of the blockchain can include processing modules such as user management, basic services, smart contracts, and operation monitoring.
  • The user management module is responsible for managing the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and maintaining the correspondence between a user's real identity and blockchain address (authority management). Where authorized, it supervises and audits the transactions of certain real identities and provides risk control rule configuration (risk control audit).
  • The basic service module is deployed on all blockchain node devices to verify the validity of business requests and, after consensus is reached on a valid request, record it to storage. For a new business request, the basic service first performs interface adaptation, parsing and authentication (interface configuration), and then encrypts the business information through the consensus algorithm (consensus management).
  • The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution. Developers can define contract logic in a programming language and publish it to the blockchain (contract registration); according to the logic of the contract terms, a key or other event triggers execution to complete the contract logic. The module also provides contract upgrade and cancellation.
  • The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings, cloud adaptation, and the visual output of real-time status during product operation, such as alarms, network condition monitoring and node device health monitoring.
  • The method for constructing the gene-disease relationship knowledge base of this embodiment automatically learns a large number of rule templates by analyzing the dependency relationships of a specified number of natural sentences containing gene-disease entity pairs, and then uses the rule templates to automatically extract gene-disease relationship knowledge from medical literature. It does not require high labor costs, the amount of extracted knowledge is large, the extraction effect is good, and the method transfers well, so it can be used to extract relationships between other medical entities.
  • an embodiment of the present application also provides an apparatus for constructing a genetic disease relationship knowledge base, including:
  • the dependency analysis module 1 is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
  • the path descriptor determining module 2 is used to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • the rule template generation module 3 is configured to generate a rule template according to the path descriptor of each natural sentence, and establish a rule template library;
  • the knowledge extraction module 4 is configured to use the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  • the gene-disease relationship knowledge base construction device further includes:
  • the natural sentence acquisition module is used to acquire natural sentences containing gene-disease entity pairs in the designated medical database
  • the selection module is used to randomly select a specified number of natural sentences containing gene-disease entity pairs.
  • the dependency analysis module 1 includes:
  • the dependence relationship analysis unit is used to analyze the dependence relationship of each natural sentence by using the natural language processing toolkit StanfordNLP to obtain the dependence relationship of each natural sentence.
  • the gene-disease relationship knowledge base construction device further includes:
  • the clustering module is used to calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor;
  • the filtering module is used to identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
  • the rule template generation module 3 includes:
  • the statistics module is used to count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;
  • the quality evaluation module is used to perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
  • the quality assessment module includes:
  • the first statistics unit is used to count the entity pair sets corresponding to the path descriptors to be evaluated
  • the second statistical unit is used to count the number of entity pairs in the entity pair set that exist in the CTD;
  • a processing unit configured to retain the path descriptor to be evaluated as a usable rule template, and store it to build the rule template library, if the number of entity pairs found in the CTD is greater than a specified number threshold or the ratio of that number to the total number of entity pairs in the entity pair set is greater than a specified ratio threshold.
  • the knowledge extraction module 4 includes:
  • the entity recognition unit is used for entity recognition of natural sentences in a full amount of medical literature, and obtaining natural sentences containing gene-disease entity pairs;
  • the dependency analysis unit is used to perform dependency analysis on all natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
  • a path descriptor determining unit configured to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence
  • a judging unit configured to judge whether the path descriptor is in the rule template library
  • an obtaining unit configured to, if so, obtain the gene-disease relationship according to the path descriptor and save it in the gene-disease relationship knowledge base.
  • the components of the genetic disease relationship knowledge base construction device proposed in this application can realize the function of any one of the above-mentioned genetic disease relationship knowledge database construction methods, and the specific structure will not be repeated.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 5.
  • The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the database of the computer equipment is used to store data such as rule templates.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a method for constructing a genetic disease relationship knowledge base.
  • the above-mentioned processor executes the above-mentioned method for constructing a genetic disease relationship knowledge base, including:
  • determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • the rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • A computer program is stored thereon, and when the computer program is executed by a processor, a method for constructing a gene-disease relationship knowledge base is implemented, which includes the following steps:
  • determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence that lie on the gene-disease entity dependency path;
  • the rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  • The above method for constructing the gene-disease relationship knowledge base analyzes the dependency relationships of a specified number of natural sentences containing gene-disease entity pairs, automatically learns a large number of rule templates, and then uses the rule templates to automatically extract gene-disease relationship knowledge from medical literature. It does not require high labor costs, the amount of extracted knowledge is large, the extraction effect is good, and the method transfers well, so it can be used to extract relationships between other medical entities.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Health & Medical Sciences
  • Physics & Mathematics
  • General Health & Medical Sciences
  • Life Sciences & Earth Sciences
  • Computational Linguistics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Audiology, Speech & Language Pathology
  • Artificial Intelligence
  • Biophysics
  • Bioethics
  • Databases & Information Systems
  • Bioinformatics & Cheminformatics
  • Bioinformatics & Computational Biology
  • Biotechnology
  • Evolutionary Biology
  • Medical Informatics
  • Spectroscopy & Molecular Physics
  • Medical Treatment And Welfare Office Work

Abstract

A gene-disease relationship knowledge base construction method and apparatus, and a computer device. The method comprises: performing dependency analysis on a designated number of natural sentences to obtain their dependency relationships; determining path descriptors of the natural sentences on the basis of the dependency relationships; generating rule templates on the basis of the path descriptors and constructing a rule template library; and using the rule templates to perform knowledge extraction on the full body of medical literature to obtain gene-disease relationships and construct a gene-disease relationship knowledge base. The method can automatically learn a large number of rule templates and then use them to automatically extract gene-disease relationship knowledge from medical literature without high manual labor costs. In addition, the amount of knowledge extracted is large, the extraction quality is good, and the method transfers well to the extraction of relationships between other medical entities. The method further relates to blockchain technology: the rule templates, the gene-disease relationship knowledge base, and the like can be stored in a blockchain.

Description

Method, apparatus, and computer device for constructing a gene-disease relationship knowledge base
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 9, 2020, with application number 2020109416427 and the invention title "Method, apparatus, and computer device for constructing a gene-disease relationship knowledge base", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the fields of artificial intelligence and smart healthcare, and in particular to a method, apparatus, and computer device for constructing a gene-disease relationship knowledge base.
Background
The medical literature contains a massive number of natural sentences that describe target relationships between diseases and genes. The target genes of diseases are of great significance for basic medical research, disease diagnosis and treatment, and targeted drug development. As for constructing knowledge bases of disease target genes, existing high-quality disease-gene target relationships have essentially been built by expert manual effort. With the exponential growth of the medical literature, however, building a medical knowledge base solely through manual curation, editing, and review by experts can hardly produce a reasonably complete knowledge base.
The inventors found that there are also technical solutions that use computer technology to automatically obtain medical entity relationships from the medical literature. These fall mainly into two types: medical entity relationship extraction based on manually designed rules, and medical entity relationship extraction using machine learning. Current rule-based solutions require domain experts to summarize usable high-quality rules, and the amount of knowledge that can be obtained depends entirely on the quality and quantity of those rules. At present, most rule-based solutions have low recall; their precision is relatively high, but so is their cost. Among solutions that extract medical relationships with machine learning algorithms, the best current models are relationship extraction models based on deep learning, but even these still perform relatively poorly on medical relationship extraction and remain far from practically usable. In addition, training deep learning models requires large, high-quality labeled datasets, and high-quality labeled data for medical relationship extraction must be annotated manually by experts.
Technical Problem
The main purpose of this application is to provide a method, apparatus, and computer device for constructing a gene-disease relationship knowledge base, aiming to solve the high cost and poor results of current approaches to constructing gene-disease relationship knowledge bases.
Technical Solution
To achieve the above purpose, a first aspect of this application proposes a method for constructing a gene-disease relationship knowledge base, comprising:
performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining the path descriptor of each natural sentence according to its dependency relationship, wherein the path descriptor is the sequence, in sentence order, of all words on the dependency path between the gene entity and the disease entity in the natural sentence;
generating rule templates according to the path descriptor of each natural sentence, and establishing a rule template library;
using the rule templates in the rule template library to perform knowledge extraction on the full body of medical literature, obtaining gene-disease relationships, and establishing a gene-disease relationship knowledge base.
In a second aspect, this application further provides an apparatus for constructing a gene-disease relationship knowledge base, comprising:
a dependency analysis module, configured to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
a path descriptor determination module, configured to determine the path descriptor of each natural sentence according to its dependency relationship;
a rule template generation module, configured to generate rule templates according to the path descriptor of each natural sentence and establish a rule template library;
a knowledge extraction module, configured to use the rule templates in the rule template library to perform knowledge extraction on the full body of medical literature, obtain gene-disease relationships, and establish a gene-disease relationship knowledge base.
In a third aspect, this application further provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a method for constructing a gene-disease relationship knowledge base, the method comprising:
performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining the path descriptor of each natural sentence according to its dependency relationship, wherein the path descriptor is the sequence, in sentence order, of all words on the dependency path between the gene entity and the disease entity in the natural sentence;
generating rule templates according to the path descriptor of each natural sentence, and establishing a rule template library;
using the rule templates in the rule template library to perform knowledge extraction on the full body of medical literature, obtaining gene-disease relationships, and establishing a gene-disease relationship knowledge base.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method for constructing a gene-disease relationship knowledge base, the method comprising:
performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining the path descriptor of each natural sentence according to its dependency relationship, wherein the path descriptor is the sequence, in sentence order, of all words on the dependency path between the gene entity and the disease entity in the natural sentence;
generating rule templates according to the path descriptor of each natural sentence, and establishing a rule template library;
using the rule templates in the rule template library to perform knowledge extraction on the full body of medical literature, obtaining gene-disease relationships, and establishing a gene-disease relationship knowledge base.
Beneficial Effects
The method, apparatus, and computer device of this application perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, automatically learn a large number of rule templates, and then use the rule templates to automatically extract gene-disease relationship knowledge from the medical literature. This requires no high labor cost, yields a large amount of extracted knowledge with good extraction quality, and has good transferability and applicability, so it can be used to extract relationships between other medical entities.
Description of the Drawings
FIG. 1 is a schematic flowchart of a method for constructing a gene-disease relationship knowledge base according to an embodiment of this application;
FIG. 2 is a schematic diagram of an example of the dependency relationships of a natural sentence according to an embodiment of this application;
FIG. 3 is a schematic diagram of an example of the dependency relationships of a natural sentence according to another embodiment of this application;
FIG. 4 is a schematic structural block diagram of an apparatus for constructing a gene-disease relationship knowledge base according to an embodiment of this application;
FIG. 5 is a schematic structural block diagram of a computer device according to an embodiment of this application.
Best Mode of the Invention
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain this application and do not limit it.
This application relates to the field of artificial intelligence and also to smart healthcare within smart cities. Referring to FIG. 1, an embodiment of this application provides a method for constructing a gene-disease relationship knowledge base, comprising the steps of:
S1. performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
S2. determining the path descriptor of each natural sentence according to its dependency relationship, wherein the path descriptor is the sequence, in sentence order, of all words on the dependency path between the gene entity and the disease entity in the natural sentence;
S3. generating rule templates according to the path descriptor of each natural sentence, and establishing a rule template library;
S4. using the rule templates in the rule template library to perform knowledge extraction on the full body of medical literature, obtaining gene-disease relationships, and establishing a gene-disease relationship knowledge base.
The task of medical relationship extraction is to determine the relationship between a given gene and disease entity pair from the semantic information in medical text that contains the pair. This embodiment uses a rule-template-based scheme to extract gene-disease relationship knowledge from a massive body of medical literature. The rule templates here are not constructed by experts: expert-built rule templates incur high labor costs and are few in number, so medical relationship knowledge extracted with expert-built rules is small in scale and expensive. In this embodiment, a large number of high-quality, usable rule templates can be learned automatically from a specified number of natural sentences containing entity pairs. These templates are then used for knowledge extraction, obtaining a large amount of medical relationship knowledge from the full body of medical literature and building the knowledge base.
As described in step S1 above, the rule templates are designed and extracted on the basis of the dependency relationships of natural sentences. Dependency analysis, also called dependency parsing, is one of the key techniques in natural language processing: it analyzes an input sentence to obtain its syntactic structure. Commonly used dependency analysis tools include Stanford University's StanfordNLP toolkit, HanLP, spaCy, and Fudan University's FudanNLP. An example, Case 1, illustrates this.
Case 1: the dependency relationships of "The profile of the apelin makes it a therapeutic target for ischemic heart disease." are shown in FIG. 2. Each arrow indicates the direction of a dependency between two words in the sentence, and the label on the arrow (for example det, nsubj, case, nmod) indicates the specific dependency type. The dependency types of natural sentences follow a widely accepted, standardized classification. In the figure, GENE stands for apelin and DISE stands for ischemic heart disease.
The figure shows that the dependency path between the given GENE entity and DISE entity in this sentence is GENE←profile←makes→target→DISE, and that makes is the root node (ROOT) of this path.
As described in step S2 above, the path descriptor can be determined from the dependency relationships. Taking Case 1 as an example, arranging all words on the dependency path between the given GENE entity and DISE entity in their order in the natural sentence yields "profile GENE makes target DISE". This string is called the path descriptor.
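The derivation of the Case 1 path descriptor can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `TOKENS` and `HEADS` arrays are a hand-written parse of Case 1 (with the entity mentions already collapsed into the placeholders GENE and DISE), and the function names are hypothetical. A real pipeline would take the head indices from a dependency parser.

```python
from typing import List

# Hand-written parse of Case 1 (an assumption for illustration).
# HEADS[i] is the index of the head of TOKENS[i]; -1 marks the ROOT.
TOKENS = ["The", "profile", "of", "the", "GENE", "makes", "it", "a",
          "therapeutic", "target", "for", "DISE", "."]
HEADS = [1, 5, 4, 4, 1, -1, 5, 9, 9, 5, 11, 9, 5]


def path_to_root(index: int, heads: List[int]) -> List[int]:
    """Follow head links from a token up to the ROOT token."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependency_path(a: int, b: int, heads: List[int]) -> List[int]:
    """Token indices on the tree path from a to b via their lowest common ancestor."""
    up_a = path_to_root(a, heads)
    up_b = path_to_root(b, heads)
    on_a = set(up_a)
    lca = next(i for i in up_b if i in on_a)
    return up_a[:up_a.index(lca) + 1] + list(reversed(up_b[:up_b.index(lca)]))


def path_descriptor(tokens: List[str], heads: List[int],
                    gene: str = "GENE", dise: str = "DISE") -> str:
    """Words on the GENE-DISE dependency path, arranged in sentence order."""
    path = dependency_path(tokens.index(gene), tokens.index(dise), heads)
    return " ".join(tokens[i] for i in sorted(path))


print(" ".join(TOKENS[i] for i in dependency_path(4, 11, HEADS)))
print(path_descriptor(TOKENS, HEADS))
```

The first line printed lists the words in path order (GENE, profile, makes, target, DISE); the second re-sorts them by sentence position, which is what distinguishes the path descriptor from the raw dependency path.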
As described in step S3 above, dependency analysis of a specified number of natural sentences containing gene-disease entity pairs yields a large number of path descriptors. Deduplicating these path descriptors, among other operations, gives the candidate rule templates. The candidates are then ranked, and any path descriptor supported by fewer cases than a preset value is filtered out. The remaining path descriptors undergo quality evaluation, and those that pass are saved as rule templates in the rule template library.
As described in step S4 above, after the rule template library is established, knowledge extraction is performed on the full body of medical literature to obtain a massive number of gene-disease relationships, which are saved to establish the gene-disease relationship knowledge base.
In one embodiment, before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence, the method comprises:
S01. obtaining natural sentences containing gene-disease entity pairs from a designated medical database;
S02. randomly selecting a specified number of the natural sentences containing gene-disease entity pairs.
As described above, establishing rule templates first requires dependency analysis of a specified number of natural sentences containing gene-disease entity pairs; that is, the rule templates are learned automatically from the given natural sentences. In this embodiment, the designated medical database is PubMed, the largest medical literature database, which contained more than 30 million documents as of 2019. The gene entity dictionary is the NCBI gene entity database, and the disease entity dictionary is the MeSH disease entity database; both are widely recognized in the medical field as high-quality, high-coverage entity databases. The entity databases provide the standard English names and aliases of genes and diseases. These names are used to extract from the medical literature the sentences that contain both a gene and a disease, for example "Breastfeeding and the risk of breast cancer in BRCA1 mutation carriers.", where breast cancer is the name of a disease in the disease entity database and BRCA1 is the name of a gene in the gene entity database. The set of sentences containing both gene and disease entities is obtained from PubMed, a specified number of natural sentences is then drawn from it for dependency analysis, and the rule templates are finally obtained. More specifically, the specified number is 1 million.
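Steps S01 and S02 can be sketched as a dictionary match followed by random sampling. This is only an illustration under stated assumptions: the tiny name sets stand in for the NCBI gene and MeSH disease dictionaries, matching is a plain case-insensitive whole-phrase search, and the function names are hypothetical. The source does not specify how names are matched.

```python
import random
import re
from typing import Iterable, List, Optional, Tuple


def find_name(sentence: str, names: Iterable[str]) -> Optional[str]:
    """Return the first dictionary name that occurs as a whole phrase, else None."""
    for name in names:
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            return name
    return None


def select_sentences(sentences: Iterable[str], gene_names: Iterable[str],
                     disease_names: Iterable[str], sample_size: int,
                     seed: int = 0) -> List[Tuple[str, str, str]]:
    """S01: keep sentences with both a gene and a disease; S02: sample them."""
    gene_names, disease_names = list(gene_names), list(disease_names)
    candidates = []
    for sentence in sentences:
        gene = find_name(sentence, gene_names)
        disease = find_name(sentence, disease_names)
        if gene and disease:
            candidates.append((sentence, gene, disease))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(sample_size, len(candidates)))


corpus = [
    "Breastfeeding and the risk of breast cancer in BRCA1 mutation carriers.",
    "Breast cancer screening guidelines were updated.",
]
print(select_sentences(corpus, ["BRCA1"], ["breast cancer"], sample_size=10))
```

Only the first sentence survives, because the second mentions a disease but no gene.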
In one embodiment, the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence comprises:
S11. using the natural language processing toolkit StanfordNLP to perform dependency analysis on each natural sentence to obtain its dependency relationship.
As described above, this embodiment uses StanfordNLP as the dependency analysis tool. The StanfordNLP toolkit supports a complete text analysis pipeline for multiple languages, including tokenization, part-of-speech tagging, lemmatization, and dependency parsing. It also provides a Python interface to CoreNLP, which makes a local Python setup straightforward.
In one embodiment, after the step of determining the path descriptor of each natural sentence according to its dependency relationship, and before the step of generating rule templates according to the path descriptor of each natural sentence and establishing a rule template library, the method further comprises:
S21. calculating the edit distance between different path descriptors, and clustering path descriptors whose edit distance is less than or equal to a first specified value into the same path descriptor; and
S22. identifying whether the dependency relationships of the natural sentence contain negative semantics, and if so, filtering out the path descriptor corresponding to that natural sentence.
As described in step S21 above, the path descriptors obtained in step S2 contain a great deal of redundancy. For example, the path descriptors {"GENE target in DISE", "GENE target on DISE", "GENE targets in DISE", "GENE targets on DISE"} are in effect redundant. To address this, this embodiment calculates the edit distance between different path descriptors; if the edit distance is less than or equal to the first specified value, the descriptors are treated as the same path descriptor. Later statistics showed that clustering path descriptors with an edit distance threshold of 2 (the preferred first specified value) reduces the number of rule templates by 60%, removing a large amount of redundancy. The edit distance between two strings is the minimum number of edit operations needed to transform one into the other, where an edit operation is a deletion, an insertion, or a substitution. For example, "GENE target in DISE" becomes "GENE targets in DISE" through one insertion (inserting s), and then becomes "GENE targets on DISE" through one substitution (replacing i with o). The edit distance between "GENE target in DISE" and "GENE targets on DISE" is therefore 2.
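The edit distance used in step S21 is the standard Levenshtein distance, which can be computed with dynamic programming as sketched below. The clustering routine is an assumption: the source only says that descriptors within the threshold are treated as the same descriptor, so a simple greedy scheme (compare against the first member of each cluster) is used here for illustration, and the function names are hypothetical.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_descriptors(descriptors: List[str], max_distance: int = 2) -> List[List[str]]:
    """Greedy clustering against each cluster's first member (an assumed scheme)."""
    clusters: List[List[str]] = []
    for descriptor in descriptors:
        for cluster in clusters:
            if edit_distance(descriptor, cluster[0]) <= max_distance:
                cluster.append(descriptor)
                break
        else:
            clusters.append([descriptor])
    return clusters


descriptors = ["GENE target in DISE", "GENE target on DISE",
               "GENE targets in DISE", "GENE targets on DISE",
               "profile GENE makes target DISE"]
for cluster in cluster_descriptors(descriptors):
    print(cluster)
```

With a threshold of 2, the four "GENE target(s) in/on DISE" variants fall into one cluster and the Case 1 descriptor forms its own. Greedy clustering depends on input order, so a production system might prefer a scheme with a better-defined cluster representative.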
As described in step S22 above, negation information (neg) cannot be detected from the path descriptor alone. A concrete example illustrates this.
Case 2: the dependency relationships of "The profile of the apelin did not make it a therapeutic target for ischemic heart disease." are shown in FIG. 3.
Here the dependency path between the given GENE and DISE is GENE←profile←make→target→DISE, and the corresponding path descriptor is "profile GENE make target DISE", where make is the root node (ROOT) of the path. The path descriptors of Case 2 and Case 1 ("profile GENE makes target DISE") appear to express the same semantics, but Case 2 actually expresses a negated meaning: the dependency relation of the root node make in Case 2 (Figure PCTCN2020125143-appb-000001) carries negative semantics (neg). Therefore all dependency relations of the root node are examined, and if any of them is neg, the sample is filtered out when generating rule templates.
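The root-node check of step S22 can be sketched as below. The dependency triples are hand-written approximations of the parses of Case 1 and Case 2, not parser output, and the function names are hypothetical. The label `neg` follows the source; some annotation schemes label negation differently, so the label to test for depends on the parser in use.

```python
from typing import Dict, List, Tuple

Dependency = Tuple[str, str, str]  # (head word, relation label, dependent word)

# Hand-written, simplified parses (assumptions for illustration).
CASE1 = {
    "root": "makes",
    "deps": [("makes", "nsubj", "profile"), ("profile", "nmod", "GENE"),
             ("makes", "xcomp", "target"), ("target", "nmod", "DISE")],
}
CASE2 = {
    "root": "make",
    "deps": [("make", "nsubj", "profile"), ("profile", "nmod", "GENE"),
             ("make", "aux", "did"), ("make", "neg", "not"),
             ("make", "xcomp", "target"), ("target", "nmod", "DISE")],
}


def root_has_negation(deps: List[Dependency], root: str, neg_label: str = "neg") -> bool:
    """True if any dependency headed by the root node carries the negation label."""
    return any(head == root and rel == neg_label for head, rel, _ in deps)


def keep_for_templates(samples: List[Dict]) -> List[Dict]:
    """Drop samples whose root node governs a negation, as in step S22."""
    return [s for s in samples if not root_has_negation(s["deps"], s["root"])]


print(root_has_negation(CASE1["deps"], CASE1["root"]))
print(root_has_negation(CASE2["deps"], CASE2["root"]))
print(len(keep_for_templates([CASE1, CASE2])))
```

Only the dependents of the root are inspected, mirroring the description above; negation attached elsewhere on the path would not be caught by this check.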
In one embodiment, the step of generating rule templates according to the path descriptor of each natural sentence and establishing a rule template library comprises:
S31. counting the number of natural sentence cases corresponding to each path descriptor, and filtering out path descriptors whose case count is less than a second specified value;
S32. performing quality evaluation on the remaining path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and establishing the rule template library.
As described above, path descriptors are obtained in step S2. Taking Case 1 as an example, the sentence is "The profile of the apelin makes it a therapeutic target for ischemic heart disease.", where the given GENE is apelin and the given DISE is ischemic heart disease. The resulting path descriptor is "profile GENE makes target DISE" (the path descriptor is the candidate rule template). Processing each data sample yields the following information: {data sample, entity pair in the data sample, corresponding path descriptor}.
For Case 1, this gives {"The profile of the apelin makes it a therapeutic target for ischemic heart disease.", <apelin, ischemic heart disease>, "profile GENE makes target DISE"}.
Next, the edit distance between every two path descriptors is calculated; if it is less than or equal to 2, the two are treated as the same path descriptor, which resolves the redundancy among path descriptors. The full set of path descriptors obtained this way constitutes the candidate rule templates. For example, the Case 1 path descriptor "profile GENE makes target DISE" can be simplified to "profile GENE make target DISE"; after simplification, every data sample has a new corresponding path descriptor. Once all data samples have been processed, the data has this form:
{case 1, gene-disease entity pair, path descriptor 1}, …, {case n, gene-disease entity pair, path descriptor m}
Simple aggregation of this data gives the cases of each path descriptor:
{path descriptor 1, all corresponding cases, entity pair set 1}, …, {path descriptor m, all corresponding cases, entity pair set m}.
Counting the cases of each path descriptor gives the number of data samples per path descriptor, in this format: {path descriptor 1, case count}, …, {path descriptor m, case count}. The path descriptors are sorted by case count, and those with fewer cases than the second specified value (set to 3 here) are filtered out. This improves the generality and accuracy of the extracted path descriptors.
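The aggregation and support filter of step S31 can be sketched with plain dictionaries. The record layout follows the {case, entity pair, path descriptor} triples described above; the sample records are made-up placeholders (the gene and disease names other than apelin and ischemic heart disease are hypothetical), and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[str, Tuple[str, str], str]  # (case sentence, (gene, disease), descriptor)


def group_by_descriptor(records: List[Record]):
    """Collect the cases and the entity-pair set for each path descriptor."""
    cases: Dict[str, List[str]] = defaultdict(list)
    pairs: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for sentence, pair, descriptor in records:
        cases[descriptor].append(sentence)
        pairs[descriptor].add(pair)
    return cases, pairs


def filter_by_support(cases: Dict[str, List[str]], min_cases: int = 3) -> List[str]:
    """Sort descriptors by case count (descending) and drop those below min_cases."""
    ranked = sorted(cases.items(), key=lambda item: len(item[1]), reverse=True)
    return [descriptor for descriptor, found in ranked if len(found) >= min_cases]


# Placeholder records: three cases share one descriptor, one case has another.
records = [
    ("sentence 1", ("apelin", "ischemic heart disease"), "profile GENE make target DISE"),
    ("sentence 2", ("GENE_A", "DISE_A"), "profile GENE make target DISE"),
    ("sentence 3", ("GENE_B", "DISE_B"), "profile GENE make target DISE"),
    ("sentence 4", ("GENE_C", "DISE_C"), "GENE target in DISE"),
]
cases, pairs = group_by_descriptor(records)
print(filter_by_support(cases))
```

With the second specified value set to 3, only the descriptor backed by three cases is kept; the entity-pair sets collected here are what the later quality evaluation consumes.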
然后对过滤后的路径描述符进行质量评估,将通过质量评估的路径描述符保存为规则模板,建立规则模板库。质量评估的方法可以采用人工众包、监督学习等。Then, perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library. The methods of quality evaluation can use manual crowdsourcing, supervised learning, etc.
在一个具体的实施例中,所述将过滤后的路径描述符进行质量评估,将通过质量评估的路径描述符保存为规则模板,建立规则模板库的步骤包括:In a specific embodiment, the quality evaluation is performed on the filtered path descriptors, and the path descriptors passing the quality evaluation are saved as a rule template, and the step of establishing a rule template library includes:
S321、统计待评估的路径描述符所对应的实体对集合;S321. Count the entity pair sets corresponding to the path descriptors to be evaluated;
S322、统计所述实体对集合里的实体对在CTD中存在的数量;S322: Count the number of entity pairs in the entity pair set that exist in the CTD;
S323、若存在的数量大于指定的数量阈值或存在的数量与实体对集合中实体对总数量的比值大于指定的比值阈值,则保留所述待评估的路径描述符为可用规则模板,储存起来建立规则模板库。S323. If the number of existences is greater than the specified number threshold or the ratio of the number of existences to the total number of entities in the entity pair set is greater than the specified ratio threshold, the path descriptor to be evaluated is retained as an available rule template, and stored to establish Rule template library.
如上所述,在本实施例中,采用了基于远程监督的思想对规则模板进行评估。远程监督的核心思想是如果在已有的知识库中已经存在的知识三元组(比如<ACE,target,heart failure>,表示基因ACE和疾病heart failure有target关系),那么在文本中提及到该实体对(比如ACE、heart failure)的文本大概率是在描述该实体对的target语义。具体地,使用的已有知识库是CTD,CTD(Comparative Toxicogenomics Database)是医学领域被广泛的认可的医学知识库。As described above, this embodiment evaluates the rule templates using the idea of distant supervision. The core idea of distant supervision is that if a knowledge triple already exists in an existing knowledge base (for example <ACE, target, heart failure>, meaning that the gene ACE and the disease heart failure have a target relationship), then text that mentions that entity pair (for example ACE and heart failure) is very likely to be describing the target semantics of that pair. Specifically, the existing knowledge base used is CTD (Comparative Toxicogenomics Database), a widely recognized knowledge base in the medical field.
任意选取一个路径描述符集合1~m里的一个路径描述符i以及所述路径描述符i所对应的实体对集合i,统计所述实体对集合i中的实体对在CTD知识库里存在的数量,如果存在的数量大于指定的数量阈值(优选为4)或存在的数量与实体对集合中实体对总数量的比值大于指定的比值阈值(优选为0.5),则保留所述路径描述符i为可用的规则模板,将所有可用的规则模板保存起来,构建成规则模板库。Take any path descriptor i from the path descriptor set 1~m together with its corresponding entity pair set i, and count how many entity pairs in entity pair set i exist in the CTD knowledge base. If that number is greater than the specified count threshold (preferably 4), or the ratio of that number to the total number of entity pairs in the entity pair set is greater than the specified ratio threshold (preferably 0.5), retain path descriptor i as a usable rule template. All usable rule templates are saved to build the rule template library.
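The distant-supervision check in steps S321 to S323 can be sketched roughly as below. The function names and the representation of the knowledge base as a set of (gene, disease) tuples are assumptions for illustration; the default thresholds 4 and 0.5 are the preferred values stated in the text. Treating the entity pairs as a deduplicated set is also an assumption.

```python
def passes_distant_supervision(entity_pairs, known_pairs,
                               count_threshold=4, ratio_threshold=0.5):
    """Return True if a descriptor's entity pairs are sufficiently
    covered by the reference knowledge base.

    `known_pairs` is assumed to be a set of (gene, disease) tuples
    loaded from the reference knowledge base (CTD in the text).
    """
    pairs = set(entity_pairs)  # S321: entity pair set of the descriptor
    if not pairs:
        return False
    hits = len(pairs & known_pairs)  # S322: pairs found in the knowledge base
    # S323: keep the descriptor if either threshold is exceeded.
    return hits > count_threshold or hits / len(pairs) > ratio_threshold


def build_template_library(descriptor_to_pairs, known_pairs):
    """Keep every descriptor that passes the check above."""
    return {
        descriptor
        for descriptor, pairs in descriptor_to_pairs.items()
        if passes_distant_supervision(pairs, known_pairs)
    }
```

Note that both comparisons are strict, matching the "greater than" wording, so a descriptor with exactly half of its pairs in the knowledge base is not kept on the ratio criterion alone.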
在一个实施例中,所述利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库的步骤包括:In one embodiment, the step of using the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base includes:
S41、对全量的医学文献中的自然语句进行实体识别,获取包含基因-疾病实体对的自然语句;S41. Perform entity recognition on natural sentences in a full amount of medical literature, and obtain natural sentences containing gene-disease entity pairs;
S42、分别对所有包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;S42: Perform a dependency relationship analysis on all natural sentences including gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
S43、根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符;S43: Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
S44、判断所述路径描述符是否在所述规则模板库中;S44: Determine whether the path descriptor is in the rule template library;
S45、若是,则根据所述路径描述符获取基因疾病关系,将所述基因疾病关系保存在基因疾病关系知识库中。S45. If yes, obtain the genetic disease relationship according to the path descriptor, and save the genetic disease relationship in the genetic disease relationship knowledge base.
如上所述,在步骤S1~S3中,通过对自然语句进行依存关系分析,得到路径描述符,然后对路径描述符进行质量评估等操作得到了规则模板,并建立了规则模板库。建立规则模板库的过程中选用了100万条包含基因-疾病实体对的自然语句。对全量的医学文献中的自然语句进行基因、疾病实体识别,获取所有包含基因-疾病实体的自然语句,然后利用工具包对这些自然语句依次进行依存关系分析,获取每个自然语句的依存关系,确定路径描述符,然后判断所述路径描述符是否在通过步骤S1~S3创建的规则模板库中,若是,则根据路径描述符获取基因和疾病之间的关系(如case 1中的target),将所述基因疾病关系保存在基因疾病关系知识库中。As described above, in steps S1 to S3 the path descriptors are obtained by dependency analysis of natural sentences, and the rule templates are then obtained through operations such as quality evaluation of the path descriptors, from which the rule template library is built. One million natural sentences containing gene-disease entity pairs were used to build the rule template library. For extraction, gene and disease entity recognition is performed on the natural sentences in the full body of medical literature to obtain all natural sentences containing gene-disease entities. The toolkit is then used to run dependency analysis on each of these sentences in turn, obtaining the dependency relationship of each sentence and determining its path descriptor. It is then determined whether the path descriptor is in the rule template library created through steps S1 to S3. If so, the relationship between the gene and the disease (such as the target in case 1) is obtained from the path descriptor and saved in the gene-disease relationship knowledge base.
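Steps S44 and S45 amount to a lookup of each sentence's path descriptor in the template library. The sketch below assumes, purely for illustration, that the library is a dictionary mapping a descriptor to a relation label and that entity recognition and dependency parsing (S41 to S43) are hidden behind a caller-supplied `describe` function; neither detail is specified in the source.

```python
def extract_relations(mentions, template_library, describe):
    """Build a set of (gene, relation, disease) triples.

    mentions         -- iterable of (gene, disease, sentence) tuples,
                        i.e. the output of entity recognition (S41)
    template_library -- dict mapping path descriptor -> relation label
                        (an assumed representation)
    describe         -- hypothetical callable standing in for S42-S43;
                        returns the path descriptor of a sentence
    """
    knowledge_base = set()
    for gene, disease, sentence in mentions:
        descriptor = describe(gene, disease, sentence)
        relation = template_library.get(descriptor)  # S44: in the library?
        if relation is not None:                     # S45: save the relation
            knowledge_base.add((gene, relation, disease))
    return knowledge_base
```

Sentences whose descriptor is not in the library contribute nothing, which is what makes the quality of the template library decisive for precision.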
在一个实施例中,所述医学资料库、所述规则模板和所述基因疾病关系知识库等储存于区块链的节点中,在区块链中实现如上所述的基因疾病关系知识库构建方法。In one embodiment, the medical database, the rule template, the genetic disease relationship knowledge base, etc. are stored in the nodes of the blockchain, and the above-mentioned genetic disease relationship knowledge base construction is realized in the blockchain. method.
如上所述,区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。区块链底层平台可以包括用户管理、基础服务、智能合约以及运营监控等处理模块。其中,用户管理模块负责所有区块链参与者的身份信息管理,包括维护公私钥生成(账户管理)、密钥管理以及用户真实身份和区块链地址对应关系维护(权限管理)等,并且在授权的情况下,监管和审计某些真实身份的交易情况,提供风险控制的规则配置(风控审计);基础服务模块部署在所有区块链节点设备上,用来验证业务请求的有效性,并对有效请求完成共识后记录到存储上,对于一个新的业务请求,基础服务先对接口适配解析和鉴权处理(接口配置),然后通过共识算法将业务信息加密(共识管理),在加密之后完整一致的传输至共享账本上(网络通信),并进行记录存储;智能合约模块负责合约的注册发行以及合约触发和合约执行,开发人员可以通过某种编程语言定义合约逻辑,发布到区块链上(合约注册),根据合约条款的逻辑,调用密钥或者其它的事件触发执行,完成合约逻辑,同时还提供对合约升级注销的功能;运营监控模块主要负责产品发布过程中的部署、配置的修改、合约设置、云适配以及产品运行中的实时状态的可视化输出,例如:告警、监控网络情况、监控节点设备健康状态等。As mentioned above, blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer. The underlying platform can include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including public and private key generation (account management), key management, and maintenance of the correspondence between a user's real identity and blockchain address (authority management); where authorized, it also supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk control audit). The basic service module is deployed on all blockchain node devices to verify the validity of business requests and to record valid requests to storage after consensus is reached. For a new business request, the basic service first performs interface adaptation, parsing, and authentication (interface configuration), then encrypts the business information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution: developers can define contract logic in a programming language and publish it to the blockchain (contract registration); execution is triggered by a key call or another event according to the logic of the contract terms to complete the contract logic, and the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration changes, contract settings, cloud adaptation, and visual output of real-time status during product operation, such as alarms, network monitoring, and node device health monitoring.
本申请实施例的基因疾病关系知识库构建方法,通过对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,自动学习出大量的规则模板,然后利用规则模板从医学文献中自动抽取基因疾病的关系知识,无需高昂的人工成本,而且抽取到的知识数量多,抽取效果好,并且具有良好的迁移性和适用性,可用于更多的医学实体间关系抽取。The method for constructing a gene-disease relationship knowledge base in the embodiments of this application performs dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, automatically learns a large number of rule templates, and then uses the rule templates to automatically extract gene-disease relationship knowledge from medical literature. It does not require high labor costs, extracts a large amount of knowledge with good results, and has good transferability and applicability, so it can be used to extract relationships between other kinds of medical entities.
参照图4,本申请实施例中还提供一种基因疾病关系知识库构建装置,包括:Referring to Fig. 4, an embodiment of the present application also provides an apparatus for constructing a genetic disease relationship knowledge base, including:
依存关系分析模块1,用于对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;The dependency analysis module 1 is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
路径描述符确定模块2,用于根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符,其中所述路径描述符是指自然语句中在基因疾病实体依存路径上所有词的排列顺序;The path descriptor determining module 2 is used to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the order of all words on the gene-disease entity dependency path in the natural sentence;
规则模板生成模块3,用于根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库;The rule template generation module 3 is configured to generate a rule template according to the path descriptor of each natural sentence, and establish a rule template library;
知识抽取模块4,用于利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库。The knowledge extraction module 4 is configured to use the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
在一个实施例中,所述基因疾病关系知识库构建装置还包括:In an embodiment, the gene-disease relationship knowledge base construction device further includes:
自然语句获取模块,用于在指定的医学资料库中获取包含基因-疾病实体对的自然语句;The natural sentence acquisition module is used to acquire natural sentences containing gene-disease entity pairs in the designated medical database;
选择模块,用于随机选取指定数量的包含基因-疾病实体对的自然语句。The selection module is used to randomly select a specified number of natural sentences containing gene-disease entity pairs.
在一个实施例中,所述依存关系分析模块1包括:In an embodiment, the dependency analysis module 1 includes:
依存关系分析单元,用于利用自然语言处理工具包StanfordNLP对每个所述自然语句进行依存关系分析,获得每个所述自然语句的依存关系。The dependency analysis unit is used to perform dependency analysis on each natural sentence using the natural language processing toolkit StanfordNLP, to obtain the dependency relationship of each natural sentence.
在一个实施例中,所述基因疾病关系知识库构建装置还包括:In an embodiment, the gene-disease relationship knowledge base construction device further includes:
聚类模块,用于计算不同的路径描述符之间的编辑距离,将编辑距离小于等于第一指定值的路径描述符聚类为同一种路径描述符;The clustering module is used to calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor;
过滤模块,用于识别所述自然语句中的依存关系中是否存在否定语义,若存在,则过滤掉该所述自然语句对应的路径描述符。The filtering module is used to identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
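The clustering and filtering modules can be illustrated with the sketch below. The Levenshtein routine works on any sequence, so it applies whether descriptors are compared word by word or character by character; the source does not say which. The greedy "first matching cluster" strategy, the default `max_distance=1` standing in for the first specified value, and the use of a `neg` dependency label to detect negation are all assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word tuples)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def cluster_descriptors(descriptors, max_distance=1):
    """Greedily group descriptors whose distance to a cluster's first
    member is at most max_distance."""
    clusters = []
    for descriptor in descriptors:
        for cluster in clusters:
            if edit_distance(descriptor, cluster[0]) <= max_distance:
                cluster.append(descriptor)
                break
        else:
            clusters.append([descriptor])
    return clusters


NEGATION_LABELS = {"neg"}  # assumed dependency label marking negation


def has_negation(dependency_labels):
    """Return True if any dependency label of the sentence marks negation."""
    return any(label in NEGATION_LABELS for label in dependency_labels)
```

Because the clustering is greedy, the result can depend on input order; a production system might prefer a proper clustering algorithm, but the threshold comparison itself would be the same.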
在一个实施例中,所述规则模板生成模块3包括:In an embodiment, the rule template generation module 3 includes:
统计模块,用于统计同一个路径描述符对应的自然语句case数量,过滤掉所述case数量小于第二指定值的路径描述符;The statistics module is used to count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;
质量评估模块,用于将过滤后的路径描述符进行质量评估,将通过质量评估的路径描述符保存为规则模板,建立规则模板库。The quality evaluation module is used to perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
在一个实施例中,所述质量评估模块包括:In an embodiment, the quality assessment module includes:
第一统计单元,用于统计待评估的路径描述符所对应的实体对集合;The first statistics unit is used to count the entity pair sets corresponding to the path descriptors to be evaluated;
第二统计单元,用于统计所述实体对集合里的实体对在CTD中存在的数量;The second statistical unit is used to count the number of entity pairs in the entity pair set that exist in the CTD;
处理单元,用于若存在的数量大于指定的数量阈值或存在的数量与实体对集合中实体对总数量的比值大于指定的比值阈值,则保留所述待评估的路径描述符为可用规则模板,储存起来建立规则模板库。A processing unit, configured to retain the path descriptor under evaluation as a usable rule template, and store it to build the rule template library, if the number found is greater than the specified count threshold or the ratio of the number found to the total number of entity pairs in the entity pair set is greater than the specified ratio threshold.
在一个实施例中,所述知识抽取模块4包括:In an embodiment, the knowledge extraction module 4 includes:
实体识别单元,用于对全量的医学文献中的自然语句进行实体识别,获取包含基因-疾病实体对的自然语句;The entity recognition unit is used for entity recognition of natural sentences in a full amount of medical literature, and obtaining natural sentences containing gene-disease entity pairs;
依存关系分析单元,用于分别对所有包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;The dependency analysis unit is used to perform dependency analysis on all natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
路径描述符确定单元,用于根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符;A path descriptor determining unit, configured to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
判断单元,用于判断所述路径描述符是否在所述规则模板库中;A judging unit, configured to judge whether the path descriptor is in the rule template library;
获取单元,若是,则根据所述路径描述符获取基因疾病关系,将所述基因疾病关系保存在基因疾病关系知识库中。The obtaining unit, if yes, obtains the genetic disease relationship according to the path descriptor, and saves the genetic disease relationship in the genetic disease relationship knowledge base.
如上所述,可以理解地,本申请中提出的所述基因疾病关系知识库构建装置的各组成部分可以实现如上所述基因疾病关系知识库构建方法任一项的功能,具体结构不再赘述。As mentioned above, it is understandable that the components of the genetic disease relationship knowledge base construction device proposed in this application can realize the function of any one of the above-mentioned genetic disease relationship knowledge database construction methods, and the specific structure will not be repeated.
参照图5,本申请实施例中还提供一种计算机设备,该计算机设备可以是服务器,其内部结构可以如图5所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于储存规则模板等数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种基因疾病关系知识库构建方法。Referring to FIG. 5, an embodiment of the present application also provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data such as rule templates. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements a method for constructing a gene-disease relationship knowledge base.
上述处理器执行上述的基因疾病关系知识库构建方法,包括:The above-mentioned processor executes the above-mentioned method for constructing a genetic disease relationship knowledge base, including:
对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符,其中所述路径描述符是指自然语句中在基因疾病实体依存路径上所有词的排列顺序;Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;
根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库;Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库。The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
本申请还提供一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性的,其上存储有计算机程序,计算机程序被处理器执行时实现一种基因疾病关系知识库构建方法,包括步骤:The present application also provides a computer-readable storage medium, which may be non-volatile or volatile and on which a computer program is stored. When executed by a processor, the computer program implements a method for constructing a gene-disease relationship knowledge base, including the steps:
对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符,其中所述路径描述符是指自然语句中在基因疾病实体依存路径上所有词的排列顺序;Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;
根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库;Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库。The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
上述执行的基因疾病关系知识库构建方法,通过对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,自动学习出大量的规则模板,然后利用规则模板从医学文献中自动抽取基因疾病的关系知识,无需高昂的人工成本,而且抽取到的知识数量多,抽取效果好,并且具有良好的迁移性和适用性,可用于更多的医学实体间关系抽取。The method for constructing a gene-disease relationship knowledge base executed above performs dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, automatically learns a large number of rule templates, and then uses the rule templates to automatically extract gene-disease relationship knowledge from medical literature. It does not require high labor costs, extracts a large amount of knowledge with good results, and has good transferability and applicability, so it can be used to extract relationships between other kinds of medical entities.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的和实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可以包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDR SDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that in this article, the terms "include", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, It also includes other elements not explicitly listed, or elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。The above are only the preferred embodiments of this application, and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of this application, or directly or indirectly applied to other related The technical field is equally included in the scope of patent protection of this application.

Claims (20)

  1. 一种基因疾病关系知识库构建方法,包括:A method for constructing a knowledge base of genetic disease relationships, including:
    对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
    根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符,其中所述路径描述符是指自然语句中在基因疾病实体依存路径上所有词的排列顺序;Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;
    根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库;Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
    利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库。The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  2. 根据权利要求1所述的基因疾病关系知识库构建方法,其中,所述对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系步骤之前,包括:The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein before the step of performing dependency relationship analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtaining the dependency relationship of each natural sentence, include:
    在指定的医学资料库中获取包含基因-疾病实体对的自然语句;Obtain natural sentences containing gene-disease entity pairs in the designated medical database;
    随机选取指定数量的包含基因-疾病实体对的自然语句。Randomly select a specified number of natural sentences containing gene-disease entity pairs.
  3. 根据权利要求1所述的基因疾病关系知识库构建方法,其中,所述对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系的步骤包括:The method for constructing a genetic disease relation knowledge base according to claim 1, wherein the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtaining the dependency relation of each natural sentence comprises :
    利用自然语言处理工具包StanfordNLP对每个所述自然语句进行依存关系分析,获得每个所述自然语句的依存关系。A natural language processing toolkit StanfordNLP is used to analyze the dependence relationship of each natural sentence to obtain the dependence relationship of each natural sentence.
  4. 根据权利要求1所述的基因疾病关系知识库构建方法,其中,所述根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符的步骤之后,所述根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库的步骤之前还包括:The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein, after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further comprises:
    计算不同的路径描述符之间的编辑距离,将编辑距离小于等于第一指定值的路径描述符聚类为同一种路径描述符;以及,Calculate the edit distance between different path descriptors, and cluster the path descriptors whose edit distance is less than or equal to the first specified value into the same type of path descriptor; and,
    识别所述自然语句中的依存关系中是否存在否定语义,若存在,则过滤掉该所述自然语句对应的路径描述符。Identify whether there is negative semantics in the dependency relationship in the natural sentence, and if so, filter out the path descriptor corresponding to the natural sentence.
  5. 根据权利要求1所述的基因疾病关系知识库构建方法,其中,所述根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库的步骤包括:The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein said generating a rule template according to said path descriptor of each said natural sentence, and the step of establishing a rule template library comprises:
    统计同一个路径描述符对应的自然语句case数量,过滤掉所述case数量小于第二指定值的路径描述符;Count the number of natural sentence cases corresponding to the same path descriptor, and filter out path descriptors whose case number is less than the second specified value;
    将过滤后的路径描述符进行质量评估,将通过质量评估的路径描述符保存为规则模板,建立规则模板库。Perform quality evaluation on the filtered path descriptors, save the path descriptors that pass the quality evaluation as a rule template, and establish a rule template library.
  6. 根据权利要求5所述的基因疾病关系知识库构建方法,其中,所述将过滤后的路径描述符进行质量评估,将通过质量评估的路径描述符保存为规则模板,建立规则模板库的步骤包括:The method for constructing a genetic disease relationship knowledge base according to claim 5, wherein the step of performing quality evaluation on the filtered path descriptors, saving the path descriptors passing the quality evaluation as a rule template, and establishing a rule template library includes :
    统计待评估的路径描述符所对应的实体对集合;Count the set of entity pairs corresponding to the path descriptors to be evaluated;
    统计所述实体对集合里的实体对在CTD中存在的数量;Count the number of entity pairs in the entity pair set that exist in the CTD;
    若存在的数量大于指定的数量阈值或存在的数量与实体对集合中实体对总数量的比值大于指定的比值阈值,则保留所述待评估的路径描述符为可用规则模板,储存起来建立规则模板库。If the number found is greater than a specified count threshold, or the ratio of the number found to the total number of entity pairs in the entity pair set is greater than a specified ratio threshold, retain the path descriptor under evaluation as a usable rule template and store it to build the rule template library.
  7. 根据权利要求1所述的基因疾病关系知识库构建方法,其中,所述利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库的步骤包括:The method for constructing a genetic disease relationship knowledge base according to claim 1, wherein the step of using the rule templates in the rule template library to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base comprises:
    对全量的医学文献中的自然语句进行实体识别,获取包含基因-疾病实体对的自然语句;Perform entity recognition of natural sentences in a full amount of medical literature, and obtain natural sentences containing gene-disease entity pairs;
    分别对所有包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;Perform dependency analysis on all natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
    根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符;Determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
    判断所述路径描述符是否在所述规则模板库中;Judging whether the path descriptor is in the rule template library;
    若是,则根据所述路径描述符获取基因疾病关系,将所述基因疾病关系保存在基因疾病关系知识库中。If so, the genetic disease relationship is obtained according to the path descriptor, and the genetic disease relationship is stored in the genetic disease relationship knowledge base.
  8. 一种基因疾病关系知识库构建装置,包括:A device for constructing a genetic disease relationship knowledge base, including:
    依存关系分析模块,用于对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;The dependency analysis module is used to perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
    路径描述符确定模块,用于根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符,其中所述路径描述符是指自然语句中在基因疾病实体依存路径上所有词的排列顺序;The path descriptor determining module is used to determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to all the words in the natural sentence on the path of genetic disease entity dependence The order of arrangement;
    规则模板生成模块,用于根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库;A rule template generation module, which is used to generate a rule template according to the path descriptor of each natural sentence, and build a rule template library;
    知识抽取模块,用于利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库。The knowledge extraction module is used for extracting knowledge from the full amount of medical literature by using the rule templates in the rule template library, obtaining genetic disease relationships, and establishing a genetic disease relationship knowledge base.
  9. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现一种基因疾病关系知识库构建方法,其中所述基因疾病关系知识库构建方法包括:A computer device includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, a method for constructing a genetic disease relationship knowledge base is realized, wherein the method for constructing a genetic disease relationship knowledge base include:
    对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系;Perform dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency of each natural sentence;
    根据每个所述自然语句的依存关系确定每个所述自然语句的路径描述符,其中所述路径描述符是指自然语句中在基因疾病实体依存路径上所有词的排列顺序;Determine the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the sequence of all words in the natural sentence on the path of genetic disease entity dependence;
    根据每个所述自然语句的所述路径描述符生成规则模板,建立规则模板库;Generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
    利用所述规则模板库中的所述规则模板对全量的医学文献进行知识抽取,获取基因疾病关系,建立基因疾病关系知识库。The rule template in the rule template library is used to extract knowledge from a full amount of medical literature, obtain genetic disease relationships, and establish a genetic disease relationship knowledge base.
  10. 根据权利要求9所述的计算机设备,其中,所述对指定数量的包含基因-疾病实体对的自然语句进行依存关系分析,获得每个所述自然语句的依存关系步骤之前,包括:The computer device according to claim 9, wherein before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and obtaining the dependency relation of each natural sentence, the method comprises:
    在指定的医学资料库中获取包含基因-疾病实体对的自然语句;Obtain natural sentences containing gene-disease entity pairs in the designated medical database;
    随机选取指定数量的包含基因-疾病实体对的自然语句。Randomly select a specified number of natural sentences containing gene-disease entity pairs.
  11. The computer device according to claim 9, wherein the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relations of each natural sentence comprises:
    performing dependency analysis on each natural sentence using the natural language processing toolkit StanfordNLP to obtain the dependency relations of each natural sentence.
  12. The computer device according to claim 9, wherein, after the step of determining the path descriptor of each natural sentence according to its dependency relations, and before the step of generating rule templates according to the path descriptor of each natural sentence and building a rule template library, the method further comprises:
    calculating the edit distance between different path descriptors, and clustering path descriptors whose edit distance is less than or equal to a first specified value into the same kind of path descriptor; and
    identifying whether negative semantics exist in the dependency relations of the natural sentence and, if so, filtering out the path descriptor corresponding to that natural sentence.
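The edit-distance clustering step can be sketched as follows. The claims do not specify the clustering procedure, so this example assumes a simple greedy scheme that compares each descriptor with the first member of every existing cluster; descriptors are treated as word sequences, and `max_distance` stands in for the "first specified value".

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: tuples of words)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_descriptors(descriptors, max_distance):
    """Greedily group descriptors whose distance to a cluster's first
    member is less than or equal to max_distance."""
    clusters = []
    for descriptor in descriptors:
        for cluster in clusters:
            if edit_distance(descriptor, cluster[0]) <= max_distance:
                cluster.append(descriptor)
                break
        else:
            clusters.append([descriptor])
    return clusters


descriptors = [
    ("GENE", "cause", "DISEASE"),
    ("GENE", "causes", "DISEASE"),
    ("GENE", "is", "associated", "with", "DISEASE"),
]
for cluster in cluster_descriptors(descriptors, max_distance=1):
    print(cluster)
```

Greedy clustering depends on input order; a different scheme (for example, hierarchical clustering) could be substituted without changing the distance function.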
  13. The computer device according to claim 9, wherein the step of generating rule templates according to the path descriptor of each natural sentence and building a rule template library comprises:
    counting the number of natural-sentence cases corresponding to the same path descriptor, and filtering out path descriptors whose case count is less than a second specified value;
    performing quality evaluation on the remaining path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and building the rule template library.
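The case-count filter in this claim amounts to a frequency threshold. A minimal sketch, assuming each natural sentence has already been reduced to one descriptor; the function name and the `min_cases` parameter (standing in for the "second specified value") are hypothetical.

```python
from collections import Counter


def filter_by_case_count(sentence_descriptors, min_cases):
    """Keep descriptors supported by at least min_cases sentences.

    sentence_descriptors holds one descriptor per natural sentence, so a
    descriptor's frequency in the list is its case count.
    """
    counts = Counter(sentence_descriptors)
    return {d: n for d, n in counts.items() if n >= min_cases}


observed = [
    ("GENE", "cause", "DISEASE"),
    ("GENE", "cause", "DISEASE"),
    ("GENE", "cause", "DISEASE"),
    ("GENE", "studied", "DISEASE"),
]
print(filter_by_case_count(observed, min_cases=2))
```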
  14. The computer device according to claim 13, wherein the step of performing quality evaluation on the remaining path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and building the rule template library comprises:
    collecting the set of entity pairs corresponding to the path descriptor to be evaluated;
    counting the number of entity pairs in the set that exist in CTD;
    if that number is greater than a specified count threshold, or the ratio of that number to the total number of entity pairs in the set is greater than a specified ratio threshold, retaining the path descriptor to be evaluated as a usable rule template and storing it to build the rule template library.
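The quality check in this claim can be sketched as a set intersection followed by two threshold tests. The example assumes the CTD entries have already been loaded into a set of (gene, disease) tuples; the function name, the parameter names, and the placeholder pairs are illustrative only.

```python
def passes_quality_check(entity_pairs, ctd_pairs, count_threshold, ratio_threshold):
    """Return True if enough of a descriptor's entity pairs appear in CTD.

    A descriptor passes when the number of its pairs found in ctd_pairs
    exceeds count_threshold, or when the fraction found exceeds
    ratio_threshold.
    """
    pairs = set(entity_pairs)
    if not pairs:
        return False
    found = len(pairs & ctd_pairs)
    return found > count_threshold or found / len(pairs) > ratio_threshold


# Placeholder reference set and candidate pairs (not real CTD data).
ctd_pairs = {("GENE_A", "disease_x"), ("GENE_B", "disease_y")}
candidate_pairs = [
    ("GENE_A", "disease_x"),
    ("GENE_B", "disease_y"),
    ("GENE_C", "disease_z"),
]
print(passes_quality_check(candidate_pairs, ctd_pairs,
                           count_threshold=5, ratio_threshold=0.5))
```

Because the two conditions are joined by "or", a descriptor with few supporting pairs can still pass on ratio alone, and a very frequent descriptor can pass on count alone.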
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method for constructing a gene-disease relationship knowledge base, the method comprising:
    performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relations of each natural sentence;
    determining a path descriptor for each natural sentence according to its dependency relations, wherein the path descriptor is the ordered sequence of all words in the natural sentence that lie on the dependency path between the gene entity and the disease entity;
    generating rule templates according to the path descriptor of each natural sentence, and building a rule template library;
    performing knowledge extraction on the full body of medical literature using the rule templates in the rule template library, obtaining gene-disease relationships, and building a gene-disease relationship knowledge base.
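The final extraction step can be illustrated with a short sketch. The claims do not state how a sentence is matched against a rule template, so this example assumes exact equality between a sentence's path descriptor and a stored template; the function name, the tuple layout, and the placeholder entities are hypothetical.

```python
def extract_relations(parsed_sentences, template_library):
    """Collect (gene, disease) pairs whose sentence matches a template.

    parsed_sentences yields (gene, disease, descriptor) tuples, where the
    descriptor was computed the same way as the stored templates.
    """
    knowledge_base = set()
    for gene, disease, descriptor in parsed_sentences:
        if descriptor in template_library:
            knowledge_base.add((gene, disease))
    return knowledge_base


template_library = {("GENE", "cause", "DISEASE")}
parsed_sentences = [
    ("GENE_A", "disease_x", ("GENE", "cause", "DISEASE")),
    ("GENE_B", "disease_y", ("GENE", "studied", "in", "DISEASE")),
]
print(extract_relations(parsed_sentences, template_library))
```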
  16. The computer-readable storage medium according to claim 15, wherein, before the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relations of each natural sentence, the method further comprises:
    obtaining natural sentences containing gene-disease entity pairs from a designated medical database;
    randomly selecting a specified number of the natural sentences containing gene-disease entity pairs.
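The random selection step can be sketched with the Python standard library. The function name and the optional `seed` parameter are assumptions added so that a sample can be reproduced; the claim itself only requires random selection of a specified number of sentences.

```python
import random


def sample_sentences(sentences, n, seed=None):
    """Randomly pick up to n distinct sentences from the candidate list."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))


# Placeholder candidate sentences, each containing a gene-disease pair.
candidates = [f"sentence {i}" for i in range(10)]
print(sample_sentences(candidates, n=3, seed=0))
```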
  17. The computer-readable storage medium according to claim 15, wherein the step of performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relations of each natural sentence comprises:
    performing dependency analysis on each natural sentence using the natural language processing toolkit StanfordNLP to obtain the dependency relations of each natural sentence.
  18. The computer-readable storage medium according to claim 15, wherein, after the step of determining the path descriptor of each natural sentence according to its dependency relations, and before the step of generating rule templates according to the path descriptor of each natural sentence and building a rule template library, the method further comprises:
    calculating the edit distance between different path descriptors, and clustering path descriptors whose edit distance is less than or equal to a first specified value into the same kind of path descriptor; and
    identifying whether negative semantics exist in the dependency relations of the natural sentence and, if so, filtering out the path descriptor corresponding to that natural sentence.
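The negation filter can be sketched as a scan over the dependency triples of each sentence. The claims do not say how negative semantics are detected, so this example assumes two simple signals: a `neg` relation label (as used in some dependency schemes) or a dependent word from a small cue list. The cue list and function names are assumptions for illustration.

```python
NEGATION_CUES = {"not", "no", "never", "n't", "without", "neither", "nor"}


def has_negation(dependencies):
    """Return True if any dependency triple signals negation.

    dependencies is an iterable of (head_word, relation, dependent_word).
    """
    for _head, relation, dependent in dependencies:
        if relation == "neg" or dependent.lower() in NEGATION_CUES:
            return True
    return False


def filter_negated(parsed):
    """Drop descriptors that come from sentences containing negation.

    parsed is a list of (descriptor, dependencies) pairs.
    """
    return [descriptor for descriptor, deps in parsed if not has_negation(deps)]


parsed = [
    (("GENE", "cause", "DISEASE"),
     [("cause", "nsubj", "GENE"), ("cause", "obj", "DISEASE")]),
    (("GENE", "cause", "DISEASE"),
     [("cause", "nsubj", "GENE"), ("cause", "advmod", "not"),
      ("cause", "obj", "DISEASE")]),
]
print(filter_negated(parsed))
```

Running this filter before templates are generated keeps sentences such as "GENE does not cause DISEASE" from contributing support to an affirmative template.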
  19. The computer-readable storage medium according to claim 15, wherein the step of generating rule templates according to the path descriptor of each natural sentence and building a rule template library comprises:
    counting the number of natural-sentence cases corresponding to the same path descriptor, and filtering out path descriptors whose case count is less than a second specified value;
    performing quality evaluation on the remaining path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and building the rule template library.
  20. The computer-readable storage medium according to claim 19, wherein the step of performing quality evaluation on the remaining path descriptors, saving the path descriptors that pass the quality evaluation as rule templates, and building the rule template library comprises:
    collecting the set of entity pairs corresponding to the path descriptor to be evaluated;
    counting the number of entity pairs in the set that exist in CTD;
    if that number is greater than a specified count threshold, or the ratio of that number to the total number of entity pairs in the set is greater than a specified ratio threshold, retaining the path descriptor to be evaluated as a usable rule template and storing it to build the rule template library.
PCT/CN2020/125143 2020-09-09 2020-10-30 Gene-disease relationship knowledge base construction method and apparatus, and computer device WO2021155684A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010941642.7A CN112036151B (en) 2020-09-09 2020-09-09 Gene disease relation knowledge base construction method, device and computer equipment
CN202010941642.7 2020-09-09

Publications (1)

Publication Number Publication Date
WO2021155684A1 (en) 2021-08-12

Family

ID=73584487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125143 WO2021155684A1 (en) 2020-09-09 2020-10-30 Gene-disease relationship knowledge base construction method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN112036151B (en)
WO (1) WO2021155684A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626567A (en) * 2021-07-28 2021-11-09 上海基绪康生物科技有限公司 Method for mining information related to genes and diseases from biomedical literature
CN114997398B (en) * 2022-03-09 2023-05-26 哈尔滨工业大学 Knowledge base fusion method based on relation extraction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140297264A1 (en) * 2012-11-19 2014-10-02 University of Washington through it Center for Commercialization Open language learning for information extraction
CN109902301A (en) * 2019-02-26 2019-06-18 广东工业大学 Relation inference method, device and equipment based on deep neural network
CN111291568A (en) * 2020-03-06 2020-06-16 西南交通大学 Automatic entity relationship labeling method applied to medical texts

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3266246B2 (en) * 1990-06-15 2002-03-18 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Natural language analysis apparatus and method, and knowledge base construction method for natural language analysis
CN105138507A (en) * 2015-08-06 2015-12-09 电子科技大学 Pattern self-learning based Chinese open relationship extraction method
CN106021281A (en) * 2016-04-29 2016-10-12 京东方科技集团股份有限公司 Method for establishing medical knowledge graph, device for same and query method for same
CN106897568A (en) * 2017-02-28 2017-06-27 北京大数医达科技有限公司 The treating method and apparatus of case history structuring
CN107797991B (en) * 2017-10-23 2020-11-24 南京云问网络技术有限公司 Dependency syntax tree-based knowledge graph expansion method and system
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
CN109545373A (en) * 2018-11-08 2019-03-29 新博卓畅技术(北京)有限公司 A kind of automatic abstracting method of human body diseases symptom characteristic, system and equipment
CN110032649B (en) * 2019-04-12 2021-10-01 北京科技大学 Method and device for extracting relationships between entities in traditional Chinese medicine literature
CN111339774B (en) * 2020-02-07 2022-11-29 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111428036B (en) * 2020-03-23 2022-05-27 浙江大学 Entity relationship mining method based on biomedical literature
CN111554360A (en) * 2020-04-27 2020-08-18 大连理工大学 Drug relocation prediction method based on biomedical literature and domain knowledge data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG, MIN ET AL.: "PKDE4J: Entity and relation extraction for public knowledge discovery", JOURNAL OF BIOMEDICAL INFORMATICS, vol. 57, 12 August 2015 (2015-08-12), XP029961006, DOI: 10.1016/j.jbi.2015.08.008 *

Also Published As

Publication number Publication date
CN112036151B (en) 2024-04-05
CN112036151A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
WO2021139252A1 (en) Operation and maintenance fault root cause identification method and apparatus, computer device, and storage medium
CN112016279B (en) Method, device, computer equipment and storage medium for structuring electronic medical record
CN112347310A (en) Event processing information query method and device, computer equipment and storage medium
WO2021155684A1 (en) Gene-disease relationship knowledge base construction method and apparatus, and computer device
WO2021139282A1 (en) Medical field knowledge graph construction method and apparatus, device and storage medium
CN107680661A (en) System and method for estimating medical resource demand
WO2021012904A1 (en) Data updating method and related device
WO2021120688A1 (en) Medical misdiagnosis detection method and apparatus, electronic device and storage medium
WO2021151291A1 (en) Disease risk analysis method, apparatus, electronic device, and computer storage medium
CN112132624A (en) Medical claims data prediction system
CN112634889B (en) Electronic case input method, device, terminal and medium based on artificial intelligence
WO2022178946A1 (en) Melanoma image recognition method and apparatus, computer device, and storage medium
CN110534185A (en) Labeled data acquisition methods divide and examine method, apparatus, storage medium and equipment
WO2022001517A1 (en) Information sending method and apparatus based on rumor prediction model, and computer device
CN112365939A (en) Data management method and system based on medical health big data
CN113724815A (en) Information pushing method and device based on decision grouping model
WO2023029507A1 (en) Data analysis-based service distribution method and apparatus, device, and storage medium
CN111813946A (en) Medical information feedback method, device, equipment and readable storage medium
WO2021159758A1 (en) Method and apparatus for drug discovery based on relationship extraction and knowledgeable inference, and device
CN113139876A (en) Risk model training method and device, computer equipment and readable storage medium
CN111242779B (en) Financial data characteristic selection and prediction method, device, equipment and storage medium
CN112395401A (en) Adaptive negative sample pair sampling method and device, electronic equipment and storage medium
CN111415285A (en) Specific personnel information management method and terminal based on hierarchical administrative regions
CN112991079B (en) Multi-card co-occurrence medical treatment fraud detection method, system, cloud end and medium
CN112364136B (en) Keyword generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20917445; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20917445; Country of ref document: EP; Kind code of ref document: A1)