WO2021139247A1 - Method, device, equipment, and storage medium for constructing a knowledge graph in the medical field - Google Patents

Method, device, equipment, and storage medium for constructing a knowledge graph in the medical field

Info

Publication number
WO2021139247A1
WO2021139247A1 PCT/CN2020/118499 CN2020118499W WO2021139247A1 WO 2021139247 A1 WO2021139247 A1 WO 2021139247A1 CN 2020118499 W CN2020118499 W CN 2020118499W WO 2021139247 A1 WO2021139247 A1 WO 2021139247A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
medical field
vector
identified
recognized
Prior art date
Application number
PCT/CN2020/118499
Other languages
English (en)
French (fr)
Inventor
张圣
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • This application relates to the field of digital medical technology, and in particular to a method, device, equipment and storage medium for constructing a knowledge graph in the medical field.
  • Medical knowledge graphs are of great significance for basic medical research, smart medical care, and clinical diagnosis decision-making.
  • Medical knowledge graphs are also widely used, for example in intelligent search, intelligent question answering, intelligent recommendation, and auxiliary diagnosis based on medical knowledge graphs.
  • The main existing approach to constructing a medical knowledge graph is to extract knowledge from medical literature by relation extraction, but the inventor found that annotating the labeled datasets for relation extraction models also requires a great deal of expert labor, and the best current deep-learning-based relation extraction is still far from practically usable.
  • This application provides a method, device, equipment, and storage medium for constructing a knowledge graph in the medical field, which can automatically identify medical field knowledge from existing massive, high-quality general knowledge graphs, so that a high-quality medical field knowledge graph covering various types of medical knowledge can be constructed automatically, with high efficiency, low labor cost, and wide coverage.
  • One technical solution adopted in this application is to provide a method for constructing a knowledge graph in the medical field, including:
  • using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities, where the network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer that are connected in sequence;
  • when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple;
  • Another technical solution adopted in this application is to provide a device for constructing a knowledge graph in the medical field, including:
  • an obtaining module, used to obtain the set of all triples in the general knowledge graph, obtain a triple to be identified from the triple set, and determine the first entity to be identified and the second entity to be identified from the triple to be identified;
  • a recognition module, used to identify separately, with the medical field entity recognition model, whether the first entity to be identified and the second entity to be identified are medical field entities;
  • a determining module, used to determine that the triple to be identified is a target triple when the first entity to be identified and the second entity to be identified are both medical field entities;
  • a graph construction module, used to insert the target triple into the medical field triple set to form a new medical field triple set, and to construct a medical field knowledge graph from the new medical field triple set.
  • A further technical solution adopted in this application is to provide a computer device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer program:
  • using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities, where the network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer that are connected in sequence;
  • when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple;
  • A further technical solution adopted in this application is to provide a computer storage medium storing a program file capable of implementing the above construction of a medical field knowledge graph, where the program file implements the following steps when executed by a processor:
  • using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities, where the network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer that are connected in sequence;
  • when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple;
  • The beneficial effect of this application is that, based on the medical field entity recognition model, medical field knowledge is automatically identified from existing massive, high-quality general knowledge graphs, so that a high-quality medical field knowledge graph covering various types of medical knowledge can be constructed automatically. This solves the problems of existing expert-built medical knowledge graphs: high labor cost, small knowledge scale, and limited coverage of types of medical knowledge.
  • FIG. 1 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to a first embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to a second embodiment of the present application;
  • FIG. 3 is a schematic diagram of the network structure of a medical field entity recognition model according to an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of the training steps of a medical field entity recognition model according to an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a device for constructing a knowledge graph in the medical field according to a first embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a device for constructing a knowledge graph in the medical field according to a second embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of a computer storage medium according to an embodiment of the present application.
  • The terms "first", "second", and "third" in this application are used only for descriptive purposes and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first", "second", or "third" may explicitly or implicitly include at least one such feature.
  • In the description of this application, "a plurality of" means at least two, for example two or three, unless otherwise specifically defined. All directional indications (such as up, down, left, right, front, back, ...) in the embodiments of this application are used only to explain the relative positional relationships, movements, and so on between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • FIG. 1 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the method includes steps:
  • Step S101: Obtain the set of all triples in the general knowledge graph, obtain a triple to be identified from the triple set, and determine the first entity to be identified and the second entity to be identified from the triple to be identified.
  • In step S101, general knowledge graphs include English general knowledge graphs and Chinese general knowledge graphs.
  • Chinese general knowledge graphs include Baidu Knowledge Graph, Sogou Knowledge Cube, zhishime, Fudan CN-DBpedia, etc.
  • English general knowledge graphs include Freebase, Wikidata, Probase, etc.
  • The storage format of a knowledge graph is the triple, and each piece of knowledge in the knowledge graph is called a triple.
  • In this embodiment, a triple has the form (entity, relation, entity); therefore, the entities to be identified include the first entity to be identified and the second entity to be identified.
  • Step S102: Use the medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities. The network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence.
  • In step S102, the medical field entity recognition model is used to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities.
  • The first entity to be identified and the second entity to be identified are identified in no particular order.
  • The embedding layer of this embodiment performs word embedding and part-of-speech embedding on the description text corresponding to the first entity to be identified and the second entity to be identified; the splicing layer splices the word embedding result and the part-of-speech embedding result; the recurrent neural network layer performs deep learning on the splicing result; the attention mechanism layer performs feature extraction on the deep learning result; and the fully connected layer uses the activation function of the classification task to classify the feature extraction result and output the recognition result.
  • In the step of identifying whether the first entity is a medical field entity, when the recognition result of the medical field entity recognition model is "1", the first entity to be identified is determined to be a medical field entity; when the recognition result is "0", the first entity to be identified is determined to be a non-medical-field entity.
  • In the step of identifying whether the second entity is a medical field entity, when the recognition result of the medical field entity recognition model is "1", the second entity to be identified is determined to be a medical field entity; when the recognition result is "0", the second entity to be identified is determined to be a non-medical-field entity.
  • Step S103: When the first entity to be identified and the second entity to be identified are both medical field entities, determine that the triple to be identified is a target triple.
  • In step S103, the triple to be identified is determined to be a target triple only when the first entity to be identified and the second entity to be identified are both medical field entities; when only one of them is determined to be a medical field entity, the triple to be identified is a non-target triple.
  • Step S104: Insert the target triple into the medical field triple set to form a new medical field triple set, and construct a medical field knowledge graph from the new medical field triple set.
  • The method of the first embodiment of the present application uses the medical field entity recognition model to identify medical field knowledge automatically from existing massive, high-quality general knowledge graphs, so that a high-quality medical field knowledge graph covering various types of medical knowledge can be constructed automatically. This solves the problems of existing expert-built medical knowledge graphs: high labor cost, small knowledge scale, and limited coverage of types of medical knowledge.
  • The construction method is also highly transferable: besides the medical field, it can be transferred to other fields, such as entertainment, finance, and law.
  • FIG. 2 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the method includes steps:
  • Step S201: Construct a medical field entity recognition model.
  • The network structure of the medical field entity recognition model includes a word embedding layer 31, a part-of-speech embedding layer 32, a splicing layer 33 connected to the word embedding layer 31 and the part-of-speech embedding layer 32, a recurrent neural network layer 34 connected to the splicing layer 33, an attention mechanism layer 35 connected to the recurrent neural network layer 34, and a fully connected layer 36 connected to the attention mechanism layer 35.
  • The word embedding layer 31 converts each word in the description text corresponding to the entity to be identified into a word vector. The word embedding layer 31 of this embodiment uses a pre-trained BERT model instead of a Word2vec model. BERT models are usually pre-trained on general-corpus text and perform only moderately on NLP tasks in the medical field; in this embodiment, the BERT model is pre-trained on a medical literature corpus of 10 million texts so that it is suited to NLP tasks in the medical field.
  • The part-of-speech embedding layer 32 converts the part of speech of each word in the description text into a part-of-speech vector. Obtaining the part of speech of each word requires part-of-speech tagging of each word. The part-of-speech embedding layer 32 of this embodiment uses a Word2vec model.
  • The splicing layer 33 concatenates the word vector and the part-of-speech vector of each word in series to obtain the word's splicing vector. After splicing, the dimension of each word's vector equals the dimension of its word embedding plus the dimension of its part-of-speech embedding.
  • The recurrent neural network layer 34 uses a Bi-GRU model. The GRU is a commonly used core unit of recurrent neural networks and an improvement on the LSTM. The Bi-GRU model can learn the forward and backward semantics (contextual semantics) of each word in a sentence.
  • The attention mechanism layer 35 combines the learned semantics of all words in the sentence to obtain a deeper semantic representation.
  • The fully connected layer 36 uses the activation function of the classification task to classify the output of the attention mechanism layer 35 and output the recognition result.
  • Step S202: Train the medical field entity recognition model.
  • Referring to FIG. 4, step S202 includes the following steps:
  • Step S401: Obtain the description text of the first entity to be identified or the second entity to be identified, where the description text includes multiple words;
  • Step S402: Perform embedding on the description text to obtain the word vector and the part-of-speech vector of each word;
  • In step S402, each word is input into the word embedding model to obtain its word vector; part-of-speech tagging is performed on each word, and the tagging result is input into the part-of-speech embedding model to obtain its part-of-speech vector.
  • Step S403: Concatenate the word vector and the part-of-speech vector in series to obtain a splicing vector;
  • Step S404: Input the splicing vector into the recurrent neural network to learn the contextual semantics of each word and obtain the word's hidden vector;
  • Step S405: Use the attention mechanism to perform feature extraction on the hidden vectors to obtain the attention vector;
  • In step S405, the weight of each hidden vector is calculated first; the weighted sum of the hidden vectors under these weights is then calculated to obtain the attention vector.
  • Step S406: Input the attention vector into the fully connected network for classification, and output the recognition result.
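  • The recognition result is given by $y = \mathrm{softmax}(w \cdot s)$, where $w$ is the parameter, $s$ is the attention vector, $\mathrm{softmax}$ is the activation function of the classification task, and $y$ is the output recognition result, with $y$ equal to 0 or 1.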
  • Step S201 and step S202 in this embodiment may be before step S203 or after step S203.
  • Step S203: Obtain the set of all triples in the general knowledge graph, obtain a triple to be identified from the triple set, and determine the first entity to be identified and the second entity to be identified from the triple to be identified.
  • Step S203 in FIG. 2 is similar to step S101 in FIG. 1.
  • Step S204: Use the medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities. The network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence.
  • Step S204 in FIG. 2 is similar to step S102 in FIG. 1.
  • Step S205: When the first entity to be identified and the second entity to be identified are both medical field entities, determine that the triple to be identified is a target triple.
  • Step S205 in FIG. 2 is similar to step S103 in FIG. 1.
  • Step S206: Insert the target triple into the medical field triple set to form a new medical field triple set, and construct a medical field knowledge graph from the new medical field triple set.
  • Step S206 in FIG. 2 is similar to step S104 in FIG. 1.
  • Building on the first embodiment, the method of the second embodiment of the present application designs and trains a deep-learning-based medical field entity recognition model that can determine whether an entity is a medical field entity.
  • The recognition model uses several structures, such as a recurrent neural network and an attention mechanism, and integrates several kinds of embedding information, so it can accurately and quickly identify medical field knowledge automatically from existing massive, high-quality general knowledge graphs.
  • FIG. 5 is a schematic structural diagram of a device for constructing a knowledge graph in the medical field according to the first embodiment of the present application.
  • The device 50 includes an obtaining module 51, a recognition module 52, a determining module 53, and a graph construction module 54.
  • The obtaining module 51 is used to obtain the set of all triples in the general knowledge graph, obtain a triple to be identified from the triple set, and determine the first entity to be identified and the second entity to be identified from the triple to be identified.
  • The recognition module 52 is coupled to the obtaining module 51 and is used to identify separately, with the medical field entity recognition model, whether the first entity to be identified and the second entity to be identified are medical field entities. The network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence.
  • The determining module 53 is coupled to the recognition module 52 and is used to determine that the triple to be identified is a target triple when the first entity to be identified and the second entity to be identified are both medical field entities.
  • The graph construction module 54 is coupled to the determining module 53 and is used to insert the target triple into the medical field triple set to form a new medical field triple set, and to construct a medical field knowledge graph from the new medical field triple set.
  • FIG. 6 is a schematic structural diagram of a device for constructing a knowledge graph in the medical field according to a second embodiment of the present application.
  • The device 60 includes a model construction module 61, a model training module 62, an obtaining module 63, a recognition module 64, a determining module 65, and a graph construction module 66.
  • The model construction module 61 is used to construct the medical field entity recognition model.
  • The model training module 62 is coupled to the model construction module 61 and is used to train the medical field entity recognition model.
  • The obtaining module 63 is used to obtain the set of all triples in the general knowledge graph, obtain a triple to be identified from the triple set, and determine the first entity to be identified and the second entity to be identified from the triple to be identified.
  • The recognition module 64 is coupled to both the model training module 62 and the obtaining module 63 and is used to identify separately, with the medical field entity recognition model, whether the first entity to be identified and the second entity to be identified are medical field entities. The network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence.
  • The determining module 65 is coupled to the recognition module 64 and is used to determine that the triple to be identified is a target triple when the first entity to be identified and the second entity to be identified are both medical field entities.
  • The graph construction module 66 is coupled to the determining module 65 and is used to insert the target triple into the medical field triple set to form a new medical field triple set, and to construct a medical field knowledge graph from the new medical field triple set.
  • FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the application.
  • The computer device 70 includes a memory 71, a processor 72, and a computer program stored in the memory 71 and runnable on the processor 72.
  • The processor 72 implements the above method for constructing a knowledge graph in the medical field when executing the computer program.
  • FIG. 8 is a schematic structural diagram of a computer storage medium according to an embodiment of the present application.
  • The computer storage medium of the embodiment of the present application stores a program file 81 that can implement all of the above methods.
  • The program file 81 can be stored in the above computer storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The computer storage medium can be non-volatile or volatile.
  • The aforementioned computer storage media include media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc, as well as terminal devices such as computers, servers, mobile phones, and tablets.
  • The disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • The displayed or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • The functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

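A method, device, equipment, and storage medium for constructing a knowledge graph in the medical field. The construction method includes: obtaining the set of all triples in a general knowledge graph, obtaining a triple to be identified from the triple set, and determining a first entity to be identified and a second entity to be identified from the triple to be identified (S101); using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities (S102); when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple (S103); and inserting the target triple into a medical field triple set to form a new medical field triple set and constructing a medical field knowledge graph (S104). The method can automatically identify medical field knowledge from a general knowledge graph and automatically construct a high-quality medical field knowledge graph covering various types of knowledge, with high efficiency, low labor cost, and wide coverage.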

Description

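Method, device, equipment, and storage medium for constructing a knowledge graph in the medical field
This application claims priority to the Chinese patent application No. 202010785288.3, filed with the China Patent Office on August 6, 2020 and entitled "Method, device, equipment, and storage medium for constructing a knowledge graph in the medical field", the entire contents of which are incorporated into this application by reference.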
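Technical Field
This application relates to the field of digital medical technology, and in particular to a method, device, equipment, and storage medium for constructing a knowledge graph in the medical field.
Background
Medical knowledge graphs are of great significance for basic medical research, smart healthcare, clinical diagnostic decision-making, and related areas. Medical knowledge graphs are also widely applied, for example in intelligent search, intelligent question answering, intelligent recommendation, and auxiliary diagnosis based on medical knowledge graphs.
However, there are currently few high-quality medical field knowledge graphs on the market. Most are knowledge graphs for medical subfields, such as gene-disease-target knowledge graphs and gene-substance-interaction knowledge bases; there is not yet a high-quality medical knowledge graph that comprehensively covers the various types of medical knowledge. High-quality medical knowledge graphs are still built mainly by experts, and although expert-built knowledge graphs are of high quality, they cover very little medical knowledge. The main existing approach to constructing a medical knowledge graph is to perform relation extraction on medical literature to obtain knowledge, but the inventor found that annotating the labeled datasets for relation extraction models likewise requires a great deal of expert labor, and the best current deep-learning-based relation extraction is still far from practically usable.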
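Summary
This application provides a method, device, equipment, and storage medium for constructing a knowledge graph in the medical field, which can automatically identify medical field knowledge from existing massive, high-quality general knowledge graphs, so that a high-quality medical field knowledge graph covering various types of medical knowledge can be constructed automatically, with high efficiency, low labor cost, and wide coverage.
To solve the above technical problem, one technical solution adopted in this application is to provide a method for constructing a knowledge graph in the medical field, including:
obtaining the set of all triples in a general knowledge graph, obtaining a triple to be identified from the triple set, and determining a first entity to be identified and a second entity to be identified from the triple to be identified;
using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities, where the network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence;
when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple;
inserting the target triple into a medical field triple set to form a new medical field triple set, and constructing a medical field knowledge graph from the new medical field triple set.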
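To solve the above technical problem, another technical solution adopted in this application is to provide a device for constructing a knowledge graph in the medical field, including:
an obtaining module, used to obtain the set of all triples in a general knowledge graph, obtain a triple to be identified from the triple set, and determine a first entity to be identified and a second entity to be identified from the triple to be identified;
a recognition module, used to identify separately, with a medical field entity recognition model, whether the first entity to be identified and the second entity to be identified are medical field entities;
a determining module, used to determine that the triple to be identified is a target triple when the first entity to be identified and the second entity to be identified are both medical field entities;
a graph construction module, used to insert the target triple into a medical field triple set to form a new medical field triple set, and to construct a medical field knowledge graph from the new medical field triple set.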
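To solve the above technical problem, a further technical solution adopted in this application is to provide a computer device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer program:
obtaining the set of all triples in a general knowledge graph, obtaining a triple to be identified from the triple set, and determining a first entity to be identified and a second entity to be identified from the triple to be identified;
using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities, where the network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence;
when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple;
inserting the target triple into a medical field triple set to form a new medical field triple set, and constructing a medical field knowledge graph from the new medical field triple set.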
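To solve the above technical problem, a further technical solution adopted in this application is to provide a computer storage medium storing a program file capable of implementing the above construction of a medical field knowledge graph, where the program file implements the following steps when executed by a processor:
obtaining the set of all triples in a general knowledge graph, obtaining a triple to be identified from the triple set, and determining a first entity to be identified and a second entity to be identified from the triple to be identified;
using a medical field entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical field entities, where the network structure of the medical field entity recognition model includes an embedding layer, a splicing layer, a recurrent neural network layer, an attention mechanism layer, and a fully connected layer connected in sequence;
when the first entity to be identified and the second entity to be identified are both medical field entities, determining that the triple to be identified is a target triple;
inserting the target triple into a medical field triple set to form a new medical field triple set, and constructing a medical field knowledge graph from the new medical field triple set.
The beneficial effect of this application is that, based on the medical field entity recognition model, medical field knowledge is automatically identified from existing massive, high-quality general knowledge graphs, so that a high-quality medical field knowledge graph covering various types of medical knowledge can be constructed automatically. This solves the problems of existing expert-built medical knowledge graphs: high labor cost, small knowledge scale, and limited coverage of types of medical knowledge.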
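Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to a first embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to a second embodiment of the present application;
FIG. 3 is a schematic diagram of the network structure of a medical field entity recognition model according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of the training steps of a medical field entity recognition model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a device for constructing a knowledge graph in the medical field according to a first embodiment of the present application;
FIG. 6 is a schematic structural diagram of a device for constructing a knowledge graph in the medical field according to a second embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer storage medium according to an embodiment of the present application.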
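Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments herein without creative effort fall within the scope of protection of the present application.
The terms "first", "second" and "third" in the present application are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. A feature qualified by "first", "second" or "third" may therefore explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, for example two or three, unless expressly and specifically defined otherwise. All directional indications in the embodiments of the present application (such as up, down, left, right, front, back and so on) are used only to explain the relative positions and movements of the components in a particular posture (as shown in the drawings); if that posture changes, the directional indication changes accordingly. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. Occurrences of the phrase at various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.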
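The present application relates to medical informatization within the field of digital healthcare. Fig. 1 is a schematic flowchart of the method for constructing a medical-domain knowledge graph according to the first embodiment of the present application. Note that, provided substantially the same result is obtained, the method of the present application is not limited to the order of the flow shown in Fig. 1. As shown in Fig. 1, the method comprises the following steps:
Step S101: acquire the set of all triples in a general knowledge graph, acquire a triple to be identified from the set of triples, and determine a first entity to be identified and a second entity to be identified from the triple to be identified.
In step S101, general knowledge graphs include English and Chinese general knowledge graphs. Chinese general knowledge graphs include the Baidu knowledge graph, Sogou Zhilifang (搜狗知立方), zhishime and Fudan CN-DBpedia; English general knowledge graphs include freebase, wikidata and probase. A knowledge graph is stored as triples, and each piece of knowledge in a knowledge graph is called a triple. In this embodiment a triple has the form (entity, relation, entity), so the entities to be identified comprise a first entity to be identified and a second entity to be identified.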
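The (entity, relation, entity) storage format of step S101 can be sketched as follows. This is a minimal illustration only: the `Triple` type, the `entities_to_identify` helper and the sample data are hypothetical names and values introduced here, not part of the application or of any real knowledge graph dump.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Triple(NamedTuple):
    """One piece of knowledge stored as (entity, relation, entity)."""
    head: str      # first entity to be identified
    relation: str
    tail: str      # second entity to be identified


def entities_to_identify(triples: Iterable[Triple]) -> Iterator[Tuple[Triple, str, str]]:
    """Yield each triple together with its first and second entity."""
    for t in triples:
        yield t, t.head, t.tail


# Hypothetical sample data, not taken from any real knowledge graph.
general_kg = [
    Triple("aspirin", "treats", "headache"),
    Triple("Paris", "capital_of", "France"),
]

pairs = [(h, t) for _, h, t in entities_to_identify(general_kg)]
```

Each pair in `pairs` holds the two entities that the recognition model would then classify separately.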
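Step S102: use a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence.
In step S102, the medical-domain entity recognition model is used to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities. The two entities may be identified in either order. In this embodiment, the embedding layer performs word embedding and part-of-speech embedding on the description text corresponding to the first entity to be identified and the second entity to be identified; the concatenation layer concatenates the word embedding result and the part-of-speech embedding result; the recurrent neural network layer performs deep learning on the concatenated result; the attention mechanism layer extracts features from the deep learning result; and the fully connected layer applies the activation function of a classification task to the extracted features to classify them and output the recognition result.
In the step of using the medical-domain entity recognition model to identify whether the first entity to be identified is a medical-domain entity, when the recognition result of the model is "1", the first entity to be identified is determined to be a medical-domain entity; when the recognition result is "0", the first entity to be identified is determined to be a non-medical-domain entity.
In the step of using the medical-domain entity recognition model to identify whether the second entity to be identified is a medical-domain entity, when the recognition result of the model is "1", the second entity to be identified is determined to be a medical-domain entity; when the recognition result is "0", the second entity to be identified is determined to be a non-medical-domain entity.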
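Step S103: when the first entity to be identified and the second entity to be identified are both medical-domain entities, determine the triple to be identified to be a target triple.
In step S103, the triple to be identified is determined to be a target triple only when both the first entity to be identified and the second entity to be identified are medical-domain entities. When only one of the two entities is determined to be a medical-domain entity, the triple to be identified is a non-target triple.
Step S104: insert the target triple into a medical-domain triple set to form a new medical-domain triple set, and construct the medical-domain knowledge graph from the new medical-domain triple set.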
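Steps S103 and S104 amount to a filter followed by a set insertion, which the following sketch illustrates. The function `build_medical_triples`, the keyword-based `toy_is_medical` stand-in for the trained model and the sample triples are all assumptions made for illustration; the application itself uses the neural recognition model described in step S102.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (first entity, relation, second entity)


def build_medical_triples(
    candidates: Iterable[Triple],
    is_medical: Callable[[str], int],
    medical_triples: Set[Triple],
) -> Set[Triple]:
    """Return a new set that also holds every triple whose two entities are both medical."""
    updated = set(medical_triples)
    for head, relation, tail in candidates:
        # A triple is a target triple only if both entities are classified as 1.
        if is_medical(head) == 1 and is_medical(tail) == 1:
            updated.add((head, relation, tail))
    return updated


# Hypothetical stand-in for the trained recognition model: a keyword lookup.
MEDICAL_TERMS = {"aspirin", "headache", "insulin"}


def toy_is_medical(entity: str) -> int:
    return 1 if entity in MEDICAL_TERMS else 0


candidates = [
    ("aspirin", "treats", "headache"),        # both medical -> kept
    ("insulin", "discovered_in", "Toronto"),  # only one medical -> dropped
    ("Paris", "capital_of", "France"),        # neither medical -> dropped
]
result = build_medical_triples(candidates, toy_is_medical, set())
```

Returning a new set rather than mutating the input mirrors the wording "form a new medical-domain triple set"; an in-place update would work equally well.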
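The method of the first embodiment uses the medical-domain entity recognition model to identify medical-domain knowledge automatically in existing, massive, high-quality general knowledge graphs, so that a high-quality medical-domain knowledge graph covering all types of medical knowledge can be constructed automatically. This addresses the high labor cost, small scale of knowledge and narrow range of medical knowledge types of existing expert-built medical knowledge graphs. The construction method also transfers well: besides the medical domain, it can be applied to other domains such as entertainment, finance and law.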
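Fig. 2 is a schematic flowchart of the method for constructing a medical-domain knowledge graph according to the second embodiment of the present application. Note that, provided substantially the same result is obtained, the method of the present application is not limited to the order of the flow shown in Fig. 2. As shown in Fig. 2, the method comprises the following steps:
Step S201: construct the medical-domain entity recognition model.
In step S201, referring to Fig. 3, the network structure of the medical-domain entity recognition model comprises a word embedding layer 31, a part-of-speech embedding layer 32, a concatenation layer 33 connected to the word embedding layer 31 and the part-of-speech embedding layer 32, a recurrent neural network layer 34 connected to the concatenation layer 33, an attention mechanism layer 35 connected to the recurrent neural network layer 34, and a fully connected layer 36 connected to the attention mechanism layer 35. In this embodiment, the word embedding layer 31 converts the words in the description text of the entity to be identified into word vectors. The word embedding layer 31 uses a pre-trained Bert model rather than a Word2vec model. A Bert model pre-trained on general-corpus text performs only moderately on medical-domain NLP tasks, so this embodiment pre-trains the Bert model on a 10-million-scale medical literature corpus to adapt it to medical-domain NLP tasks. The part-of-speech embedding layer 32 converts the part of speech of each word in the description text into a part-of-speech vector; obtaining the part of speech requires part-of-speech tagging of each word, and the part-of-speech embedding layer 32 of this embodiment uses a Word2vec model. The concatenation layer 33 concatenates the word vector and the part-of-speech vector of each word in series to obtain the concatenated vector of that word; after concatenation, the dimension of each word equals the dimension of its word embedding plus the dimension of its part-of-speech embedding. The recurrent neural network layer 34 uses a Bi-GRU model; GRU is a commonly used recurrent neural network cell and an improvement on LSTM, and a Bi-GRU model learns well the forward and backward semantics (the contextual semantics) of each word in a sentence. The attention mechanism layer 35 aggregates the semantics learned for all words in the sentence to obtain a deeper semantic representation. The fully connected layer 36 applies the activation function of a classification task to the output of the attention mechanism layer 35 to classify it and output the recognition result.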
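Step S202: train the medical-domain entity recognition model.
Step S202, referring to Fig. 4, comprises the following steps:
Step S401: acquire the description text of the first entity to be identified or the second entity to be identified, the description text comprising a plurality of words;
Step S402: perform embedding on the description text to obtain the word vector and the part-of-speech vector of each word;
In step S402, the words are fed into a word embedding model to obtain the word vectors; the words are part-of-speech tagged, and the tagging result is fed into a part-of-speech embedding model to obtain the part-of-speech vectors.
Step S403: concatenate the word vector and the part-of-speech vector in series to obtain the concatenated vector;
In step S403, the concatenated vector of each word is $e_i = (\mathrm{e\_word}_i : \mathrm{e\_pos}_i)$, where $e_i$ is the concatenated vector, $i$ is the index of the word, $i = 1, \dots, n$, $n$ is the number of words, $\mathrm{e\_word}_i$ denotes the word vector and $\mathrm{e\_pos}_i$ denotes the part-of-speech vector.
The dimension of each word is then $\dim(e_i) = \dim(\mathrm{e\_word}_i) + \dim(\mathrm{e\_pos}_i)$, where $i$ is the index of the word, $i = 1, \dots, n$.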
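The series concatenation of step S403 and the resulting dimension rule can be checked with a small sketch. The lookup tables `WORD_EMB` and `POS_EMB` below are hypothetical toy values; in the application the word vectors come from a Bert model and the part-of-speech vectors from a Word2vec model.

```python
from typing import Dict, List

# Hypothetical toy lookup tables; real embeddings would come from trained models.
WORD_EMB: Dict[str, List[float]] = {
    "aspirin": [0.1, 0.2, 0.3],
    "relieves": [0.4, 0.5, 0.6],
    "pain": [0.7, 0.8, 0.9],
}
POS_EMB: Dict[str, List[float]] = {
    "NOUN": [1.0, 0.0],
    "VERB": [0.0, 1.0],
}


def concat_embeddings(words: List[str], pos_tags: List[str]) -> List[List[float]]:
    """Build e_i = (e_word_i : e_pos_i) for every word i = 1..n."""
    if len(words) != len(pos_tags):
        raise ValueError("each word needs exactly one part-of-speech tag")
    # List addition places the part-of-speech vector after the word vector.
    return [WORD_EMB[w] + POS_EMB[p] for w, p in zip(words, pos_tags)]


e = concat_embeddings(["aspirin", "relieves", "pain"], ["NOUN", "VERB", "NOUN"])
```

With a 3-dimensional word vector and a 2-dimensional part-of-speech vector, every concatenated vector has 3 + 2 = 5 components, in line with the dimension formula.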
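Step S404: feed the concatenated vectors into the recurrent neural network to learn the contextual semantics of each word, obtaining the hidden vector of each word;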
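To make the forward and backward passes of a bidirectional GRU concrete, here is a dependency-free sketch. It is an illustration under several assumptions not stated in the application: the weights are random and untrained, bias terms are omitted, the hidden size of 4 is arbitrary, and the state update uses one common convention, h' = (1 - z) * h + z * candidate. A real system would use a deep learning framework instead.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """GRU cell without bias terms (illustrative, untrained weights)."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random):
        self.hidden_dim = hidden_dim
        # One input matrix W and one recurrent matrix U per gate (z, r) and candidate (h).
        self.W = {g: _rand_matrix(hidden_dim, input_dim, rng) for g in "zrh"}
        self.U = {g: _rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrh"}

    def step(self, x: Vector, h: Vector) -> Vector:
        z = [_sigmoid(a + b) for a, b in zip(_matvec(self.W["z"], x), _matvec(self.U["z"], h))]
        r = [_sigmoid(a + b) for a, b in zip(_matvec(self.W["r"], x), _matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(_matvec(self.W["h"], x), _matvec(self.U["h"], rh))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def bi_gru(inputs: List[Vector], fwd: GRUCell, bwd: GRUCell) -> List[Vector]:
    """Return one hidden vector per word: forward state followed by backward state."""
    h = [0.0] * fwd.hidden_dim
    forward = []
    for x in inputs:
        h = fwd.step(x, h)
        forward.append(h)
    h = [0.0] * bwd.hidden_dim
    backward = []
    for x in reversed(inputs):
        h = bwd.step(x, h)
        backward.append(h)
    backward.reverse()  # realign with the original word order
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
inputs = [[0.1, 0.2, 0.3, 1.0, 0.0], [0.4, 0.5, 0.6, 0.0, 1.0], [0.7, 0.8, 0.9, 1.0, 0.0]]
hidden = bi_gru(inputs, GRUCell(5, 4, rng), GRUCell(5, 4, rng))
```

Each word ends up with a hidden vector of size 2 x 4 = 8 that combines what precedes it (forward pass) and what follows it (backward pass).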
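Step S405: use the attention mechanism to extract features from the hidden vectors, obtaining the attention vector;
In step S405, the weight of each hidden vector is computed first; the weighted sum of the weights and the hidden vectors is then computed to obtain the attention vector.
The weight of each hidden vector is computed according to the following formula:
[Formula image PCTCN2020118499-appb-000001]
where $e$ is the concatenated vector, $i$ is the index of the word, $i = 1, \dots, n$, $n$ is the number of words, $a$ is the weight of the hidden vector and $h$ is the hidden vector.
The weighted sum of the weights and the hidden vectors, which gives the attention vector, is computed as $s = \sum_{i=1}^{n} a_i h_i$, where $s$ denotes the attention vector, $i$ is the index of the word, $i = 1, \dots, n$, $a_i$ is the weight of the $i$-th hidden vector and $h_i$ is the $i$-th hidden vector.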
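The two-stage computation of step S405 (weights first, then the weighted sum s = Σ a_i h_i) can be sketched as below. The weighted sum follows the text; the weight formula does not, because it is given only as an image in the source. The softmax over dot-product scores with a `query` vector used here is a common choice and purely an assumption, as are all names and sample values.

```python
import math
from typing import List

Vector = List[float]


def attention_weights(hidden: List[Vector], query: Vector) -> List[float]:
    """Softmax over dot-product scores; this scoring function is an assumption."""
    scores = [sum(q * x for q, x in zip(query, h_i)) for h_i in hidden]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    return [v / total for v in exps]


def attention_vector(hidden: List[Vector], weights: List[float]) -> Vector:
    """Compute s = sum_i a_i * h_i component by component."""
    dim = len(hidden[0])
    return [sum(a_i * h_i[d] for a_i, h_i in zip(weights, hidden)) for d in range(dim)]


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # hypothetical learned parameter; zeros give uniform weights
a = attention_weights(hidden, query)
s = attention_vector(hidden, a)
```

With an all-zero `query` every score is equal, so the weights are uniform and `s` is simply the mean of the hidden vectors; a trained query would instead emphasise the words most indicative of a medical entity.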
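Step S406: feed the attention vector into the fully connected network for classification and output the recognition result.
Step S406 is carried out according to the formula $y = \mathrm{softmax}(w \cdot s)$, where $w$ is a parameter, $s$ is the attention vector, softmax is the activation function of the classification task and $y$ is the output recognition result. $y$ takes the value 0 or 1: $y = 0$ indicates that the entity to be identified is a non-medical-domain entity, and $y = 1$ indicates that the entity to be identified is a medical-domain entity.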
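The classification of step S406 can be sketched as follows. Softmax itself returns a probability per class, so this sketch assumes that the 0/1 result is obtained by taking the class with the larger probability; that reading, the shape of `w` (one row per class) and the sample numbers are assumptions for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def classify(w: List[Vector], s: Vector) -> Tuple[int, Vector]:
    """Return (y, probabilities), where y = 1 means medical-domain entity."""
    logits = [sum(wi * si for wi, si in zip(row, s)) for row in w]
    probs = softmax(logits)
    y = max(range(len(probs)), key=probs.__getitem__)  # index of the larger probability
    return y, probs


# Hypothetical 2 x 2 parameter matrix: row 0 scores "non-medical", row 1 "medical".
w = [[1.0, 0.0], [0.0, 1.0]]
y, probs = classify(w, [0.2, 0.9])
```

Here the second logit is larger, so `y` is 1 and the entity would be treated as a medical-domain entity.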
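In this embodiment, steps S201 and S202 may be performed either before or after step S203.
Step S203: acquire the set of all triples in a general knowledge graph, acquire a triple to be identified from the set of triples, and determine a first entity to be identified and a second entity to be identified from the triple to be identified.
In this embodiment, step S203 in Fig. 2 is similar to step S101 in Fig. 1 and, for brevity, is not described again here.
Step S204: use the medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence.
In this embodiment, step S204 in Fig. 2 is similar to step S102 in Fig. 1 and, for brevity, is not described again here.
Step S205: when the first entity to be identified and the second entity to be identified are both medical-domain entities, determine the triple to be identified to be a target triple.
In this embodiment, step S205 in Fig. 2 is similar to step S103 in Fig. 1 and, for brevity, is not described again here.
Step S206: insert the target triple into a medical-domain triple set to form a new medical-domain triple set, and construct the medical-domain knowledge graph from the new medical-domain triple set.
In this embodiment, step S206 in Fig. 2 is similar to step S104 in Fig. 1 and, for brevity, is not described again here.
Building on the first embodiment, the method of the second embodiment designs and trains a deep-learning-based medical-domain entity recognition model that determines whether an entity is a medical-domain entity. The model uses several structures, including a recurrent neural network and an attention mechanism, and fuses several kinds of embedding information, so it can identify medical-domain knowledge accurately and quickly in existing, massive, high-quality general knowledge graphs.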
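Fig. 5 is a schematic structural diagram of the apparatus for constructing a medical-domain knowledge graph according to the first embodiment of the present application. As shown in Fig. 5, the apparatus 50 comprises an acquisition module 51, a recognition module 52, a determination module 53 and a graph construction module 54.
The acquisition module 51 is configured to acquire the set of all triples in a general knowledge graph, acquire a triple to be identified from the set of triples, and determine a first entity to be identified and a second entity to be identified from the triple to be identified.
The recognition module 52 is coupled to the acquisition module 51 and is configured to use a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence.
The determination module 53 is coupled to the recognition module 52 and is configured to determine the triple to be identified to be a target triple when the first entity to be identified and the second entity to be identified are both medical-domain entities.
The graph construction module 54 is coupled to the determination module 53 and is configured to insert the target triple into a medical-domain triple set to form a new medical-domain triple set, and to construct the medical-domain knowledge graph from the new medical-domain triple set.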
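Fig. 6 is a schematic structural diagram of the apparatus for constructing a medical-domain knowledge graph according to the second embodiment of the present application. As shown in Fig. 6, the apparatus 60 comprises a model construction module 61, a model training module 62, an acquisition module 63, a recognition module 64, a determination module 65 and a graph construction module 66.
The model construction module 61 is configured to construct the medical-domain entity recognition model.
The model training module 62 is coupled to the model construction module 61 and is configured to train the medical-domain entity recognition model.
The acquisition module 63 is configured to acquire the set of all triples in a general knowledge graph, acquire a triple to be identified from the set of triples, and determine a first entity to be identified and a second entity to be identified from the triple to be identified.
The recognition module 64 is coupled to the model training module 62 and to the acquisition module 63, and is configured to use the medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence.
The determination module 65 is coupled to the recognition module 64 and is configured to determine the triple to be identified to be a target triple when the first entity to be identified and the second entity to be identified are both medical-domain entities.
The graph construction module 66 is coupled to the determination module 65 and is configured to insert the target triple into a medical-domain triple set to form a new medical-domain triple set, and to construct the medical-domain knowledge graph from the new medical-domain triple set.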
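Referring to Fig. 7, Fig. 7 is a schematic structural diagram of the computer device according to an embodiment of the present application. The computer device 70 comprises a memory 71, a processor 72, and a computer program stored in the memory 71 and executable on the processor 72; the processor 72 implements the above method for constructing a medical-domain knowledge graph when executing the computer program.
Referring to Fig. 8, Fig. 8 is a schematic structural diagram of the computer storage medium according to an embodiment of the present application. The computer storage medium stores a program file 81 capable of implementing all of the above methods. The program file 81 may be stored in the computer storage medium in the form of a software product and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The computer storage medium may be non-volatile or volatile, and includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, or a terminal device such as a computer, a server, a mobile phone or a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, several units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
Furthermore, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.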

Claims (20)

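  1. A method for constructing a medical-domain knowledge graph, comprising:
    acquiring the set of all triples in a general knowledge graph, acquiring a triple to be identified from the set of triples, and determining a first entity to be identified and a second entity to be identified from the triple to be identified;
    using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence;
    when the first entity to be identified and the second entity to be identified are both medical-domain entities, determining the triple to be identified to be a target triple;
    inserting the target triple into a medical-domain triple set to form a new medical-domain triple set, and constructing the medical-domain knowledge graph from the new medical-domain triple set.
  2. The construction method according to claim 1, wherein the step of using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities comprises:
    using the medical-domain entity recognition model to identify whether the first entity to be identified is a medical-domain entity;
    when the medical-domain entity recognition model outputs a first preset threshold, determining that the first entity to be identified is a medical-domain entity, and when the medical-domain entity recognition model outputs a second preset threshold, determining that the first entity to be identified is a non-medical-domain entity;
    using the medical-domain entity recognition model to identify whether the second entity to be identified is a medical-domain entity;
    when the medical-domain entity recognition model outputs the first preset threshold, determining that the second entity to be identified is a medical-domain entity, and when the medical-domain entity recognition model outputs the second preset threshold, determining that the second entity to be identified is a non-medical-domain entity.
  3. The construction method according to claim 1, further comprising, before the step of using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities:
    constructing the medical-domain entity recognition model;
    training the medical-domain entity recognition model.
  4. The construction method according to claim 3, wherein the step of training the medical-domain entity recognition model comprises:
    acquiring the description text of the first entity to be identified or the second entity to be identified, the description text comprising a plurality of words;
    performing embedding on the description text to obtain the word vector and the part-of-speech vector of each word;
    concatenating the word vector and the part-of-speech vector in series to obtain a concatenated vector;
    feeding the concatenated vector into a recurrent neural network to learn the contextual semantics of each word, obtaining the hidden vector of the word;
    using an attention mechanism to extract features from the hidden vectors, obtaining an attention vector;
    feeding the attention vector into a fully connected network for classification and outputting a recognition result.
  5. The construction method according to claim 4, wherein the step of performing embedding on the description text to obtain the word vector and the part-of-speech vector of each word comprises:
    feeding the words into a word embedding model to obtain the word vectors;
    performing part-of-speech tagging on the words and feeding the tagging result into a part-of-speech embedding model to obtain the part-of-speech vectors.
  6. The construction method according to claim 4, wherein the step of using an attention mechanism to extract features from the hidden vectors, obtaining an attention vector, comprises:
    computing the weight of each hidden vector;
    computing the weighted sum of the weights and the hidden vectors to obtain the attention vector.
  7. The construction method according to claim 4, wherein the step of feeding the attention vector into a fully connected network for classification and outputting a recognition result is carried out according to the following formula:
    y = softmax(w·s), where w is a parameter, s is the attention vector, softmax is the activation function of the classification task and y is the output recognition result; y takes the value 0 or 1, where y = 0 indicates that the first entity to be identified or the second entity to be identified is a non-medical-domain entity, and y = 1 indicates that the first entity to be identified or the second entity to be identified is a medical-domain entity.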
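  8. An apparatus for constructing a medical-domain knowledge graph, comprising:
    an acquisition module, configured to acquire the set of all triples in a general knowledge graph, acquire a triple to be identified from the set of triples, and determine a first entity to be identified and a second entity to be identified from the triple to be identified;
    a recognition module, configured to use a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities;
    a determination module, configured to determine the triple to be identified to be a target triple when the first entity to be identified and the second entity to be identified are both medical-domain entities;
    a graph construction module, configured to insert the target triple into a medical-domain triple set to form a new medical-domain triple set, and to construct the medical-domain knowledge graph from the new medical-domain triple set.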
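  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    acquiring the set of all triples in a general knowledge graph, acquiring a triple to be identified from the set of triples, and determining a first entity to be identified and a second entity to be identified from the triple to be identified;
    using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence;
    when the first entity to be identified and the second entity to be identified are both medical-domain entities, determining the triple to be identified to be a target triple;
    inserting the target triple into a medical-domain triple set to form a new medical-domain triple set, and constructing the medical-domain knowledge graph from the new medical-domain triple set.
  10. The computer device according to claim 9, wherein the step of using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities comprises:
    using the medical-domain entity recognition model to identify whether the first entity to be identified is a medical-domain entity;
    when the medical-domain entity recognition model outputs a first preset threshold, determining that the first entity to be identified is a medical-domain entity, and when the medical-domain entity recognition model outputs a second preset threshold, determining that the first entity to be identified is a non-medical-domain entity;
    using the medical-domain entity recognition model to identify whether the second entity to be identified is a medical-domain entity;
    when the medical-domain entity recognition model outputs the first preset threshold, determining that the second entity to be identified is a medical-domain entity, and when the medical-domain entity recognition model outputs the second preset threshold, determining that the second entity to be identified is a non-medical-domain entity.
  11. The computer device according to claim 9, wherein the steps further comprise, before the step of using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities:
    constructing the medical-domain entity recognition model;
    training the medical-domain entity recognition model.
  12. The computer device according to claim 11, wherein the step of training the medical-domain entity recognition model comprises:
    acquiring the description text of the first entity to be identified or the second entity to be identified, the description text comprising a plurality of words;
    performing embedding on the description text to obtain the word vector and the part-of-speech vector of each word;
    concatenating the word vector and the part-of-speech vector in series to obtain a concatenated vector;
    feeding the concatenated vector into a recurrent neural network to learn the contextual semantics of each word, obtaining the hidden vector of the word;
    using an attention mechanism to extract features from the hidden vectors, obtaining an attention vector;
    feeding the attention vector into a fully connected network for classification and outputting a recognition result.
  13. The computer device according to claim 12, wherein the step of performing embedding on the description text to obtain the word vector and the part-of-speech vector of each word comprises:
    feeding the words into a word embedding model to obtain the word vectors;
    performing part-of-speech tagging on the words and feeding the tagging result into a part-of-speech embedding model to obtain the part-of-speech vectors.
  14. The computer device according to claim 12, wherein the step of using an attention mechanism to extract features from the hidden vectors, obtaining an attention vector, comprises:
    computing the weight of each hidden vector;
    computing the weighted sum of the weights and the hidden vectors to obtain the attention vector.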
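  15. A computer storage medium storing a program file capable of implementing construction of a medical-domain knowledge graph, wherein the program file implements the following steps when executed by a processor:
    acquiring the set of all triples in a general knowledge graph, acquiring a triple to be identified from the set of triples, and determining a first entity to be identified and a second entity to be identified from the triple to be identified;
    using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities, wherein the network structure of the medical-domain entity recognition model comprises an embedding layer, a concatenation layer, a recurrent neural network layer, an attention mechanism layer and a fully connected layer connected in sequence;
    when the first entity to be identified and the second entity to be identified are both medical-domain entities, determining the triple to be identified to be a target triple;
    inserting the target triple into a medical-domain triple set to form a new medical-domain triple set, and constructing the medical-domain knowledge graph from the new medical-domain triple set.
  16. The computer storage medium according to claim 15, wherein the step of using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities comprises:
    using the medical-domain entity recognition model to identify whether the first entity to be identified is a medical-domain entity;
    when the medical-domain entity recognition model outputs a first preset threshold, determining that the first entity to be identified is a medical-domain entity, and when the medical-domain entity recognition model outputs a second preset threshold, determining that the first entity to be identified is a non-medical-domain entity;
    using the medical-domain entity recognition model to identify whether the second entity to be identified is a medical-domain entity;
    when the medical-domain entity recognition model outputs the first preset threshold, determining that the second entity to be identified is a medical-domain entity, and when the medical-domain entity recognition model outputs the second preset threshold, determining that the second entity to be identified is a non-medical-domain entity.
  17. The computer storage medium according to claim 15, wherein the steps further comprise, before the step of using a medical-domain entity recognition model to identify separately whether the first entity to be identified and the second entity to be identified are medical-domain entities:
    constructing the medical-domain entity recognition model;
    training the medical-domain entity recognition model.
  18. The computer storage medium according to claim 17, wherein the step of training the medical-domain entity recognition model comprises:
    acquiring the description text of the first entity to be identified or the second entity to be identified, the description text comprising a plurality of words;
    performing embedding on the description text to obtain the word vector and the part-of-speech vector of each word;
    concatenating the word vector and the part-of-speech vector in series to obtain a concatenated vector;
    feeding the concatenated vector into a recurrent neural network to learn the contextual semantics of each word, obtaining the hidden vector of the word;
    using an attention mechanism to extract features from the hidden vectors, obtaining an attention vector;
    feeding the attention vector into a fully connected network for classification and outputting a recognition result.
  19. The computer storage medium according to claim 18, wherein the step of performing embedding on the description text to obtain the word vector and the part-of-speech vector of each word comprises:
    feeding the words into a word embedding model to obtain the word vectors;
    performing part-of-speech tagging on the words and feeding the tagging result into a part-of-speech embedding model to obtain the part-of-speech vectors.
  20. The computer storage medium according to claim 18, wherein the step of using an attention mechanism to extract features from the hidden vectors, obtaining an attention vector, comprises:
    computing the weight of each hidden vector;
    computing the weighted sum of the weights and the hidden vectors to obtain the attention vector.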
PCT/CN2020/118499 2020-08-06 2020-09-28 Method, apparatus, device and storage medium for constructing a medical-domain knowledge graph WO2021139247A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010785288.3 2020-08-06
CN202010785288.3A CN111949802B (zh) 2020-08-06 2020-08-06 Method, apparatus, device and storage medium for constructing a medical-domain knowledge graph

Publications (1)

Publication Number Publication Date
WO2021139247A1 true WO2021139247A1 (zh) 2021-07-15

Family

ID=73331761

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118499 WO2021139247A1 (zh) 2020-08-06 2020-09-28 Method, apparatus, device and storage medium for constructing a medical-domain knowledge graph

Country Status (2)

Country Link
CN (1) CN111949802B (zh)
WO (1) WO2021139247A1 (zh)

Also Published As

Publication number Publication date
CN111949802A (zh) 2020-11-17
CN111949802B (zh) 2022-11-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911846

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911846

Country of ref document: EP

Kind code of ref document: A1