WO2021139282A1 - Method, device, equipment and storage medium for constructing a knowledge graph in the medical field - Google Patents

Method, device, equipment and storage medium for constructing a knowledge graph in the medical field

Info

Publication number
WO2021139282A1
Authority
WO
WIPO (PCT)
Prior art keywords
knowledge
entity
data
medical field
knowledge graph
Prior art date
Application number
PCT/CN2020/119374
Other languages
English (en)
French (fr)
Inventor
张圣
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139282A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • This application relates to the field of smart medical care in smart cities, and in particular to a method, device, equipment, and storage medium for constructing a knowledge graph in the medical field.
  • A knowledge graph expresses knowledge as triples (entity, relationship/attribute, attribute value), an organizational form that humans can easily understand, and uses a graph as the data structure for representing knowledge, hence the name knowledge graph.
  • The nodes of the graph represent the concepts and entities of the objective world or their attribute values, and the edges between nodes represent the relationships or attributes between concepts and entities.
  • Each node-edge-node path constitutes a statement that represents knowledge and facts.
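The triple representation described above can be illustrated with a minimal Python sketch. The example triples, the relation names, and the helper functions `build_graph` and `statements` are hypothetical and chosen only for illustration; they do not come from the application.

```python
from collections import defaultdict

# Hypothetical example triples: (entity, relationship/attribute, attribute value).
TRIPLES = [
    ("influenza", "type", "disease"),
    ("influenza", "prevented_by", "influenza vaccine"),
    ("influenza vaccine", "type", "vaccine"),
]


def build_graph(triples):
    """Index triples as node -> list of (edge label, neighbouring node)."""
    graph = defaultdict(list)
    for head, edge, tail in triples:
        graph[head].append((edge, tail))
    return dict(graph)


def statements(graph):
    """Render every node-edge-node path as one readable statement."""
    return [f"{h} --{e}--> {t}" for h, edges in graph.items() for e, t in edges]


graph = build_graph(TRIPLES)
for line in statements(graph):
    print(line)
```

Each printed line corresponds to one node-edge-node statement; a production system would normally keep the triples in a graph database rather than an in-memory dictionary.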
  • Applying a knowledge graph represents the knowledge and facts of the objective world at the semantic level, which makes it possible to build various intelligent applications, and it has the characteristics of integration and accumulation.
  • A high-quality medical knowledge graph is an important foundation for smart medicine and precise intelligent medicine.
  • The inventor realized that there are currently few high-quality medical-field knowledge graphs on the market, because data source selection in the construction of highly specialized knowledge graphs is limited: knowledge is generally extracted only from vertical websites related to the field, while the relevant knowledge data in encyclopedia websites is ignored.
  • Encyclopedia websites hold a large amount of knowledge data from all fields, so extracting knowledge from them is relatively complicated and cumbersome.
  • the main purpose of this application is to provide a method, device, equipment and storage medium for constructing a knowledge graph in the medical field, aiming to solve the technical problem of how to construct a sound knowledge graph in the medical field.
  • this application proposes a method for constructing a knowledge graph in the medical field, including:
  • the medical domain knowledge graph is applied to medical-related knowledge intelligent question answering.
  • an embodiment of the present application also provides a device for constructing a knowledge graph in the medical field, including:
  • the first knowledge extraction unit is used to extract knowledge from vertical websites related to the medical field and store it in the knowledge base;
  • the second knowledge extraction unit is used to extract knowledge from encyclopedia websites, perform entity text recognition on the extracted knowledge data, input the recognized entity text into a pre-trained entity domain recognition model, and store in the knowledge base the knowledge data corresponding to the entity text whose recognition result is a medical domain entity;
  • the knowledge processing unit is used to perform knowledge processing on the data in the knowledge base;
  • the quality evaluation unit is used to evaluate the quality of the knowledge data after knowledge processing;
  • the construction unit is used to construct the knowledge data that has passed the quality assessment into a knowledge graph in the medical field;
  • the intelligent question answering unit is used to apply the medical domain knowledge graph to the intelligent question answering of medical related knowledge.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, a method for constructing a knowledge graph in the medical field is implemented, wherein:
  • the methods for constructing knowledge graphs in the medical field include:
  • the medical domain knowledge graph is applied to medical-related knowledge intelligent question answering.
  • the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, a method for constructing a knowledge graph in the medical field is realized, wherein the method for constructing a knowledge graph in the medical field includes:
  • the medical domain knowledge graph is applied to medical-related knowledge intelligent question answering.
  • the method, device, equipment and storage medium for constructing a knowledge graph in the medical field of the present application can construct a high-quality medical-field knowledge graph and can update the knowledge graph in real time at a relatively small cost; the method also has good transferability and is applicable to the construction and update of knowledge graphs in other fields.
  • FIG. 1 is a schematic flowchart of a method for constructing a knowledge graph in the medical field according to an embodiment of the application;
  • FIG. 2 is a schematic block diagram of the structure of an apparatus for constructing a knowledge graph in the medical field according to an embodiment of the application;
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • an embodiment of the present application provides a method for constructing a knowledge graph in the medical field, which includes the steps:
  • the establishment of the knowledge graph must first extract knowledge from the original data of the data source.
  • raw data is divided into structured data, semi-structured data and unstructured data.
  • different methods are used for processing.
  • the high-quality data sources in the general medical field are vertical websites and encyclopedia websites in the corresponding field.
  • the data of these data sources is generally semi-structured or unstructured. Therefore, the knowledge graph construction method in this application mainly aims to extract knowledge from semi-structured and unstructured knowledge sources.
  • a pre-trained entity domain recognition model is used here, which performs domain recognition on the entities in the data; the knowledge data corresponding to the entity text whose recognition result is a medical domain entity is stored in the knowledge base, which ensures both the breadth and the professionalism of the data sources of the knowledge graph.
  • Knowledge processing refers to the process of integrating the knowledge in multiple knowledge bases into a single knowledge base. Different knowledge bases focus on collecting different knowledge. For the same entity, some knowledge bases may focus on describing a certain aspect of the entity itself, while others may focus on describing the relationships between the entity and other entities. The purpose of knowledge processing is to integrate the descriptions of an entity from different knowledge bases into a complete description of the entity. For example, the descriptions of the historical figure Cao Cao differ somewhat between knowledge bases such as Baidu Encyclopedia, Interactive Encyclopedia and Wikipedia.
  • the era of Cao Cao is given as the Eastern Han Dynasty on Baidu Encyclopedia, the Eastern Han Dynasty on Interactive Encyclopedia, and the end of the Eastern Han Dynasty on Wikipedia.
  • the main achievements of Cao Cao are given on Baidu Encyclopedia as "implemented the tuntian (military farming) system, pacified the refugees and eliminated rival warlords, unified the north, laid the foundation for the Cao Wei regime, founded Jian'an literature, and promoted simple burial", on Interactive Encyclopedia as "unified the north", and on Wikipedia as "unified the core area of the Eastern Han Empire". Different knowledge bases therefore still differ in how they describe the same entity.
  • the difference in the description of the era lies in how specific the period is, and the difference in the main achievements lies in the range of achievements covered. Through knowledge processing, the knowledge in different knowledge bases can be complemented and merged to form a comprehensive, accurate and complete entity description.
  • the main work involved is entity standardization, including attribute standardization, value standardization, and processing of multi-valued attributes, which can be achieved through similarity calculations, manual crowdsourcing, heuristic rules and other methods.
  • the quality evaluation is to evaluate the final result data, and put the qualified data into the knowledge graph.
  • Quality evaluation can be done by cross-checking the knowledge data after knowledge processing by using data from the data source, or it can be evaluated manually by means of manual crowdsourcing.
  • the knowledge data that has passed the quality assessment is constructed into a knowledge graph.
  • a top-down construction method is generally adopted.
  • the top-down construction method refers to first determining the data model of the knowledge graph, and then filling in specific data according to the model, and finally forming the knowledge graph of the medical field.
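As a hedged illustration of the top-down approach described above, the following Python sketch first fixes a small data model and then admits only records that fit it. The schema contents, the record layout `(entity, entity type, attribute, value)`, and the function names `fill_graph` and `answer` are assumptions made for this example, not details taken from the application.

```python
# Hypothetical data model: the attributes allowed for each entity type.
SCHEMA = {
    "disease": {"symptom", "treated_by"},
    "vaccine": {"principle_of_action", "prevents"},
}


def fill_graph(schema, records):
    """Keep only (entity, type, attribute, value) records that fit the schema."""
    graph = {}
    rejected = []
    for entity, entity_type, attribute, value in records:
        if attribute in schema.get(entity_type, set()):
            graph.setdefault(entity, {}).setdefault(attribute, []).append(value)
        else:
            rejected.append((entity, entity_type, attribute, value))
    return graph, rejected


def answer(graph, entity, attribute):
    """Look up the values of one attribute, as a question-answering step might."""
    return graph.get(entity, {}).get(attribute, [])


records = [
    ("influenza", "disease", "symptom", "fever"),
    ("influenza", "disease", "symptom", "cough"),
    ("influenza", "disease", "era", "unknown"),  # attribute not in the schema
]
kg, rejected = fill_graph(SCHEMA, records)
print(answer(kg, "influenza", "symptom"))
```

Defining the schema first means that malformed or off-model records are rejected at load time instead of surfacing later as inconsistent answers.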
  • the medical field knowledge graph can be applied to medical-related knowledge questions and answers to provide help for patients and doctors.
  • the step of inputting the recognized entity text into a pre-trained entity domain recognition model includes:
  • As described above, each word in the text description of the entity is input into the Token Embedding layer, which converts the words into word vectors e_1, e_2, ..., e_n that carry the semantic information of the words. The word vectors are then input into the LSTM layer to obtain hidden vectors h_1, h_2, ..., h_n, each of which carries part of the sentence information. An Attention operation is then performed to obtain the characterization vector of the current sentence, and finally the characterization vector is input into the fully connected layer to obtain the output label, which is the category recognized by the model. Specifically, the output is classified as 1 or 0, where 1 represents an entity in the medical field (vaccine, disease, gene, protein, etc.) and 0 represents an entity in another field.
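The last two stages of this pipeline (Attention over the hidden vectors, then a fully connected layer) can be sketched in plain Python. The text here does not reproduce the Attention formula, so the sketch assumes a common dot-product form with a learned query vector and a softmax over the scores; the hidden vectors, the query, the fully connected weights, and the 0.5 decision threshold are all made-up values for illustration, and the embedding and LSTM layers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attention_pool(hidden, query):
    """Weighted sum of hidden vectors; weights are softmax(h_i . query).

    Assumed attention form, not necessarily the one used in the application.
    """
    scores = [sum(h_k * q_k for h_k, q_k in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    v = [sum(w * h[k] for w, h in zip(weights, hidden)) for k in range(dim)]
    return v, weights


def classify(v, fc_weights, fc_bias):
    """Single-output fully connected layer with sigmoid; 1 = medical, 0 = other."""
    logit = sum(w * x for w, x in zip(fc_weights, v)) + fc_bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return (1 if prob >= 0.5 else 0), prob


# Hypothetical hidden vectors h_1..h_3 and parameters, made up for illustration.
hidden = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
query = [1.0, 0.0]
v, weights = attention_pool(hidden, query)
label, prob = classify(v, fc_weights=[2.0, -1.0], fc_bias=0.0)
print(label, round(prob, 3))
```

In a trained model the query vector and the fully connected parameters would be learned jointly with the embedding and LSTM layers.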
  • the step of performing knowledge processing on the extracted data includes:
  • the normalization of attributes and attribute values refers to the normalization of descriptions in entity triples.
  • For example, "principle of vaccine action" and "mechanism of vaccine action" are the same type of relationship and can be unified as "principle of vaccine action". This is the process of attribute normalization. The attribute values are normalized in the same way.
  • Multi-value attribute processing can use a value segmentation algorithm that divides the attribute value into multiple parts at separator characters such as punctuation and spaces, and scores the value before and after segmentation: if a segmented attribute value corresponds to an entity, points are added; otherwise, points are subtracted. Whether to split is then decided based on the score.
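A minimal sketch of the value segmentation idea follows. The exact scoring rule is not specified in the text, so this example assumes one point added for every part found in a set of known entity names and one point subtracted otherwise, and it splits only when the segmented score beats the unsegmented score. The separator set, the `known` vocabulary and the function names are hypothetical.

```python
import re

# Assumed separator characters: ASCII and CJK punctuation plus whitespace.
SEPARATORS = re.compile(r"[,;、，；\s]+")


def score(parts, known_entities):
    """Add one point per part that matches a known entity, subtract one otherwise."""
    return sum(1 if p in known_entities else -1 for p in parts)


def split_multi_value(value, known_entities):
    """Return the segmented parts if they score higher than the whole value."""
    parts = [p for p in SEPARATORS.split(value) if p]
    if len(parts) < 2:
        return [value]
    if score(parts, known_entities) > score([value], known_entities):
        return parts
    return [value]


# Hypothetical vocabulary of values that correspond to entities.
known = {"fever", "cough", "headache", "hepatitis B vaccine"}
print(split_multi_value("fever, cough, headache", known))
print(split_multi_value("hepatitis B vaccine", known))
```

The second call shows why scoring matters: splitting on spaces alone would break a single multi-word value into meaningless fragments.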
  • the method for normalizing the attributes and attribute values of the extracted entity data includes:
  • S311 Use the method of text similarity calculation to normalize the attributes and attribute values of the extracted entity data; or,
  • S312 Use the method of manual crowdsourcing to normalize the attributes and attribute values of the extracted entity data.
  • the attributes and attribute values of the entity data can be normalized by text similarity calculation.
  • with a machine learning method, a neural network model can be used to perform semantic analysis on the attributes and attribute values of the entity data.
  • the algorithm calculates the similarity between attributes of the entity data and the similarity between attribute values, and attributes or attribute values whose similarity reaches a preset threshold are normalized to the same one.
  • the frequency with which each attribute or attribute value of the entity data occurs can be counted during data extraction.
  • attributes or attribute values whose similarity reaches the preset threshold can then be normalized to the attribute or attribute value that occurred most frequently during data extraction.
  • manual crowdsourcing can also be used to normalize the attributes and attribute values of the extracted entity data, which ensures the accuracy of the knowledge graph to a greater extent.
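The similarity-plus-frequency rule described above can be sketched as follows. The text mentions neural semantic analysis; this example swaps in character-level similarity from Python's standard `difflib.SequenceMatcher` purely to stay self-contained, and the threshold of 0.65 and the greedy grouping strategy are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a semantic model."""
    return SequenceMatcher(None, a, b).ratio()


def normalize(names, threshold=0.65):
    """Map each name to the most frequent name it is sufficiently similar to."""
    counts = Counter(names)
    canonical = []  # chosen representatives, most frequent first
    mapping = {}
    for name, _ in counts.most_common():
        match = next((c for c in canonical if similarity(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Hypothetical attribute names observed during extraction.
observed = (
    ["principle of vaccine action"] * 3
    + ["mechanism of vaccine action"]
    + ["dosage"] * 2
)
mapping = normalize(observed)
print(mapping["mechanism of vaccine action"])
```

Because names are visited in descending frequency order, the representative of each group is always its most frequently observed member.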
  • the step of evaluating the quality of the knowledge data after knowledge processing includes:
  • quality evaluation assesses the quality of the knowledge after knowledge processing, so as to ensure the quality of the knowledge in the knowledge graph.
  • the data of the data source is first used to cross-check the knowledge data after knowledge processing.
  • the cross-checking process can be realized through a pre-trained neural network model.
  • the knowledge data that passes the cross-check can be saved for constructing the knowledge graph, and the knowledge data that fails the cross-check can be evaluated manually by means of manual crowdsourcing.
  • the crowdsourcing algorithm refers to an algorithm that assigns crowdsourcing tasks to humans. Although manual crowdsourcing has a higher cost, it offers better professionalism and higher accuracy. Quality evaluation that cannot be completed by a machine can therefore be handled by manual crowdsourcing, so that the overall process both improves efficiency and guarantees quality.
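The two-stage evaluation described above (automatic cross-check first, manual review for the remainder) can be sketched as a simple routing function. The text says the cross-check can be realized with a pre-trained neural network; this example substitutes a much simpler agreement count across sources, and the source names, the triples and the `min_agreement` parameter are hypothetical.

```python
def cross_check(triple, sources, min_agreement=2):
    """Pass a triple if at least `min_agreement` sources contain it (assumed rule)."""
    return sum(1 for facts in sources.values() if triple in facts) >= min_agreement


def route(triples, sources, min_agreement=2):
    """Split triples into accepted ones and ones queued for manual crowdsourcing."""
    accepted, manual_queue = [], []
    for triple in triples:
        if cross_check(triple, sources, min_agreement):
            accepted.append(triple)
        else:
            manual_queue.append(triple)
    return accepted, manual_queue


# Hypothetical facts per data source.
sources = {
    "vertical_site": {("influenza", "symptom", "fever"), ("influenza", "symptom", "cough")},
    "encyclopedia": {("influenza", "symptom", "fever")},
}
candidates = [("influenza", "symptom", "fever"), ("influenza", "symptom", "cough")]
accepted, manual_queue = route(candidates, sources)
print(accepted, manual_queue)
```

Only the items in `manual_queue` incur the cost of human review, which reflects the efficiency argument made in the text.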
  • the method further includes updating the knowledge graph, wherein the updating method is:
  • the entity data in the knowledge graph comes from various data sources, and how often the data in a data source changes determines the update frequency of the entity.
  • Estimate(e) is the update frequency estimate of entity e, obtained by the entity update frequency estimation algorithm.
  • the update frequency estimate is based on a statistical assumption: changes of the event (here, changes of the data) obey a Poisson distribution.
  • under this assumption, the total number of changes divided by the time interval is an effective estimate of the change frequency.
  • T(e) represents the time period over which entity e has existed, and X(e) represents the number of times entity e changes within the time period T(e).
  • S621 Determine the update period of the entity according to the update frequency.
  • S622 Determine the next update time of the attribute value corresponding to the entity based on the current time and the update period of the entity.
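The update scheduling steps can be sketched in Python as follows. The frequency estimate follows the statement above that the number of changes divided by the time interval, X(e)/T(e), is an effective estimate; taking the update period as the reciprocal of that frequency, capping it at 365 days, and measuring time in days are assumptions of this sketch, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_frequency(change_count, observed_days):
    """Estimate(e) = X(e) / T(e): changes per day over the observed period."""
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    return change_count / observed_days


def update_period(frequency, max_period_days=365.0):
    """Assumed rule: period = 1 / frequency, capped for rarely changing entities."""
    if frequency <= 0:
        return timedelta(days=max_period_days)
    return timedelta(days=min(1.0 / frequency, max_period_days))


def next_update_time(now, change_count, observed_days):
    """Next update time = current time + update period of the entity."""
    return now + update_period(estimate_frequency(change_count, observed_days))


# Hypothetical entity that changed 4 times during 8 days of observation.
print(next_update_time(datetime(2021, 1, 1), change_count=4, observed_days=8))
```

Entities that change often are therefore revisited sooner, while static entities are checked only at the capped interval, which keeps the update cost low.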
  • the models used for entity recognition and entity domain recognition, the data of the constructed medical-field knowledge graph, and other related information can all be stored in a blockchain, and the above-mentioned method for constructing a knowledge graph in the medical field can be implemented in a blockchain network.
  • Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the underlying platform of the blockchain can include processing modules such as user management, basic services, smart contracts, and operation monitoring.
  • the user management module is responsible for the identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the correspondence between a user's real identity and blockchain address (authority management); where authorized, it supervises and audits the transactions of certain real identities and provides risk control rule configuration (risk control audit).
  • the basic service module is deployed on all blockchain node devices to verify the validity of business requests and, after consensus is reached on a valid request, to record it to storage. For a new business request, the basic service first performs interface adaptation parsing and authentication (interface adaptation), then encrypts the business information through the consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it.
  • the smart contract module is responsible for contract registration and issuance, contract triggering and contract execution.
  • the operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract settings, cloud adaptation, and visual output of real-time status during product operation, such as alarms, monitoring of network conditions, and monitoring of the health status of node devices.
  • the method for constructing and updating the knowledge graph in the medical field can construct the medical-field knowledge graph automatically, which effectively reduces labor costs while ensuring the quality of the graph, and it updates the knowledge graph at a relatively low cost. The method has good transferability, is also applicable to the construction and update of knowledge graphs in other fields, and has been applied to many actual knowledge graphs with good results.
  • an embodiment of the present application also provides a device for constructing a knowledge graph in the medical field, including:
  • the first knowledge extraction unit 1 is used to extract knowledge from vertical websites related to the medical field and store it in the knowledge base;
  • the second knowledge extraction unit 2 is used to extract knowledge from encyclopedia websites, perform entity text recognition on the extracted knowledge data, input the recognized entity text into a pre-trained entity domain recognition model, and store in the knowledge base the knowledge data corresponding to the entity text whose recognition result is a medical domain entity;
  • the knowledge processing unit 3 is used to perform knowledge processing on the data in the knowledge base;
  • the quality evaluation unit 4 is used to evaluate the quality of the knowledge data after knowledge processing;
  • Construction unit 5 is used to construct the knowledge data that has passed the quality assessment into a knowledge graph in the medical field;
  • the intelligent question answering unit 6 is used for applying the medical field knowledge graph to intelligent question answering of medical related knowledge.
  • the second knowledge extraction unit 2 includes:
  • the word vector obtaining unit is used to perform word segmentation on the entity text and input the result into the TokenEmbedding layer to obtain word vectors e_1, e_2, ..., e_n;
  • the hidden vector obtaining unit is used to input the word vectors e_1, e_2, ..., e_n into the LSTM layer to obtain hidden vectors h_1, h_2, ..., h_n;
  • the characterization vector acquisition unit is used to perform Attention calculation on the hidden vector to obtain the characterization vector v;
  • the output result obtaining unit is used to input the characterization vector v into the fully connected layer to obtain an output result.
  • the knowledge processing unit 3 includes:
  • the normalization unit is used to normalize the attributes and attribute values of the extracted entity data
  • the multi-value attribute processing unit is used to perform multi-value attribute processing on the extracted entity data.
  • the normalization unit includes:
  • the similarity calculation unit is used to normalize the attributes and attribute values of the extracted entity data using the method of text similarity calculation
  • the manual crowdsourcing unit is used to normalize the attributes and attribute values of the extracted entity data by using the manual crowdsourcing method.
  • the quality evaluation unit 4 includes:
  • the inspection unit is used to cross-check the knowledge data after knowledge processing by using the data from the data source;
  • the allocation unit is used to allocate the knowledge data that fails the cross-check to manual evaluation through a crowdsourcing algorithm.
  • the device for constructing a knowledge graph in the medical field further includes an update unit for updating the knowledge graph.
  • the update unit includes:
  • the update frequency prediction unit is used to predict the update frequency of entities in the knowledge graph by using a statistical Poisson distribution formula
  • the intelligent update unit is used to intelligently update the entity data in the knowledge graph according to the update frequency.
  • the components of the medical field knowledge graph construction device proposed in this application can realize the functions of any one of the above medical field knowledge graph construction methods, and the specific structure will not be repeated.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the database of the computer device is used to store knowledge graph related data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for constructing a knowledge graph in the medical field is realized.
  • the processor executes the above-mentioned method for constructing a knowledge graph in the medical field, including: extracting knowledge from vertical websites related to the medical field and storing it in a knowledge base; extracting knowledge from encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to the entity text whose recognition result is a medical domain entity; performing knowledge processing on the data in the knowledge base; performing quality evaluation on the data after knowledge processing; constructing the data that has passed the quality evaluation into a medical-field knowledge graph; and applying the medical-field knowledge graph to intelligent question answering on medical-related knowledge.
  • the step of inputting the recognized entity text into a pre-trained entity domain recognition model includes: performing word segmentation on the entity text and inputting the result into the TokenEmbedding layer to obtain word vectors e_1, e_2, ..., e_n; inputting the word vectors e_1, e_2, ..., e_n into the LSTM layer to obtain hidden vectors h_1, h_2, ..., h_n; and performing the Attention calculation on the hidden vectors to obtain the characterization vector v, where the Attention is calculated as follows:
  • the step of performing knowledge processing on the extracted data includes: normalizing the attributes and attribute values of the extracted entity data; and performing multi-value attribute processing on the extracted entity data.
  • the method for normalizing the attributes and attribute values of the extracted entity data includes: normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or normalizing the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
  • the step of evaluating the quality of the knowledge data after the knowledge processing includes: using data from the data source to cross-check the knowledge data after the knowledge processing; and assigning the knowledge data that fails the cross-check to humans for evaluation through a crowdsourcing algorithm.
  • the method further includes updating the knowledge graph, wherein the update method is: using a statistical Poisson distribution formula to predict the update frequency of the entity in the knowledge graph, where Estimate(e) is the update frequency of the entity, T(e) represents the time period over which the entity has existed, and X(e) represents the number of times the entity e changes in the time period T(e); and intelligently updating the entity data in the knowledge graph according to the update frequency.
  • the step of intelligently updating the entity data in the knowledge graph according to the update frequency includes: determining the update period of the entity according to the update frequency; determining the next update time of the attribute value corresponding to the entity based on the current time and the update period of the entity; and updating the attribute value corresponding to the entity in the knowledge graph according to the next update time of the attribute value.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • a computer program is stored thereon. When the computer program is executed by a processor, the method for constructing a knowledge graph in the medical field is realized, including the steps of: extracting knowledge from vertical websites related to the medical field and storing it in the knowledge base; extracting knowledge from encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to the entity text whose recognition result is a medical domain entity; performing knowledge processing on the data in the knowledge base; performing quality evaluation on the data after knowledge processing; constructing the data that has passed the quality evaluation into a medical-field knowledge graph; and applying the medical-field knowledge graph to intelligent question answering on medical-related knowledge.
  • the step of inputting the recognized entity text into a pre-trained entity domain recognition model includes: performing word segmentation on the entity text and inputting the result into the TokenEmbedding layer to obtain word vectors e_1, e_2, ..., e_n; inputting the word vectors e_1, e_2, ..., e_n into the LSTM layer to obtain hidden vectors h_1, h_2, ..., h_n; and performing the Attention calculation on the hidden vectors to obtain the characterization vector v, where the Attention is calculated as follows:
  • the step of performing knowledge processing on the extracted data includes: normalizing the attributes and attribute values of the extracted entity data; and performing multi-value attribute processing on the extracted entity data.
  • the method for normalizing the attributes and attribute values of the extracted entity data includes: normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or normalizing the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
  • the step of evaluating the quality of the knowledge data after the knowledge processing includes: using data from the data source to cross-check the knowledge data after the knowledge processing; and assigning the knowledge data that fails the cross-check to humans for evaluation through a crowdsourcing algorithm.
  • the method further includes updating the knowledge graph, wherein the update method is: using a statistical Poisson distribution formula to predict the update frequency of the entity in the knowledge graph, where Estimate(e) is the update frequency of the entity, T(e) represents the time period over which the entity has existed, and X(e) represents the number of times the entity e changes in the time period T(e); and intelligently updating the entity data in the knowledge graph according to the update frequency.
  • the step of intelligently updating the entity data in the knowledge graph according to the update frequency includes: determining the update period of the entity according to the update frequency; determining the next update time of the attribute value corresponding to the entity based on the current time and the update period of the entity; and updating the attribute value corresponding to the entity in the knowledge graph according to the next update time of the attribute value.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

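This application relates to the smart healthcare field within smart cities and discloses a medical field knowledge graph construction method, including: performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity; performing knowledge processing on the data in the knowledge base; performing quality evaluation on the data after knowledge processing; and constructing the data that passes the quality evaluation into a knowledge graph. The domain recognition model and the completed knowledge graph can be stored in and applied to a blockchain. The method can build a sound medical field knowledge graph, transfers well, and can be used to construct knowledge graphs in other fields.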

Description

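Medical field knowledge graph construction method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202010592333.3, filed with the China Patent Office on June 24, 2020 and entitled "Medical field knowledge graph construction method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the smart healthcare field within smart cities, and in particular to a medical field knowledge graph construction method, apparatus, device, and storage medium.
Background
In recent years, question answering systems based on knowledge graphs have become a hot topic of research and application across industries. A knowledge graph expresses knowledge as triples (entity, relationship/attribute, attribute value), an organization that people can readily understand; because it uses a graph as the data structure for representing knowledge, it is called a knowledge graph. Nodes of the graph represent concepts and entities of the objective world, or their attribute values; edges between nodes represent the relationships or attributes of those concepts and entities; and each node-edge-node combination forms a statement that expresses a piece of knowledge or a fact. Representing the knowledge and facts of the objective world at the semantic level with a knowledge graph makes it possible to build a variety of intelligent applications, and such a graph can be integrated and accumulated over time. Building a question answering system on a knowledge graph has the following data-related advantages: (1) the degree of data association addresses how intelligently semantics are understood; (2) data precision addresses answer accuracy; (3) the structured triple format improves question retrieval efficiency.
High-quality medical knowledge graphs are an important foundation for smart healthcare and precise intelligent medicine. The inventor realized that few high-quality medical knowledge graphs are currently available, because data-source selection for highly specialized knowledge graphs is limited: extraction is generally performed only on domain-specific vertical websites, while relevant knowledge data on encyclopedia websites is ignored. Encyclopedia websites hold large amounts of knowledge data from every field, which makes knowledge extraction from them comparatively complex and tedious.
Technical Problem
The main purpose of this application is to provide a medical field knowledge graph construction method, apparatus, device, and storage medium, aiming to solve the technical problem of how to construct a sound medical field knowledge graph.
Technical Solution
To achieve the above purpose, in a first aspect, this application proposes a medical field knowledge graph construction method, including:
performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
performing knowledge processing on the data in the knowledge base;
performing quality evaluation on the data after knowledge processing;
constructing the data that passes the quality evaluation into a medical field knowledge graph; and
applying the medical field knowledge graph to intelligent question answering on medical knowledge.
In a second aspect, an embodiment of this application further provides a medical field knowledge graph construction apparatus, including:
a first knowledge extraction unit, configured to perform knowledge extraction on vertical websites related to the medical field and store the extracted knowledge in a knowledge base;
a second knowledge extraction unit, configured to perform knowledge extraction on encyclopedia websites, perform entity text recognition on the extracted knowledge data, input the entity text into a pre-trained entity domain recognition model, and store in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
a knowledge processing unit, configured to perform knowledge processing on the data in the knowledge base;
a quality evaluation unit, configured to perform quality evaluation on the knowledge data after knowledge processing;
a construction unit, configured to construct the knowledge data that passes the quality evaluation into a medical field knowledge graph; and
an intelligent question answering unit, configured to apply the medical field knowledge graph to intelligent question answering on medical knowledge.
In a third aspect, this application further provides a computer device, including a memory and a processor, the memory storing a computer program, and the processor implementing a medical field knowledge graph construction method when executing the computer program, wherein the medical field knowledge graph construction method includes:
performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
performing knowledge processing on the data in the knowledge base;
performing quality evaluation on the data after knowledge processing;
constructing the data that passes the quality evaluation into a medical field knowledge graph; and
applying the medical field knowledge graph to intelligent question answering on medical knowledge.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a medical field knowledge graph construction method, wherein the medical field knowledge graph construction method includes:
performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
performing knowledge processing on the data in the knowledge base;
performing quality evaluation on the data after knowledge processing;
constructing the data that passes the quality evaluation into a medical field knowledge graph; and
applying the medical field knowledge graph to intelligent question answering on medical knowledge.
Beneficial Effects
The medical field knowledge graph construction method, apparatus, device, and storage medium of this application can build a high-quality medical field knowledge graph and keep it updated in real time at low cost. They also transfer well and can be used to construct and update knowledge graphs in other fields.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a medical field knowledge graph construction method according to an embodiment of this application;
FIG. 2 is a schematic structural block diagram of a medical field knowledge graph construction apparatus according to an embodiment of this application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
Best Mode for Carrying Out the Invention
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application and not to limit it.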
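This application relates to the smart healthcare field within smart cities. Referring to FIG. 1, an embodiment of this application provides a medical field knowledge graph construction method, including the steps of:
S1. performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
S2. performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
S3. performing knowledge processing on the data in the knowledge base;
S4. performing quality evaluation on the data after knowledge processing;
S5. constructing the data that passes the quality evaluation into a medical field knowledge graph;
S6. applying the medical field knowledge graph to intelligent question answering on medical knowledge.
As described in steps S1 to S2, building a knowledge graph starts with knowledge extraction from the raw data of the data sources. Raw data generally falls into structured, semi-structured, and unstructured data, and each type is handled with a different method. High-quality data sources for the medical field are generally vertical websites and encyclopedia websites covering the field, whose data is usually semi-structured or unstructured, so the construction method of this application mainly targets knowledge extraction from semi-structured and unstructured sources. For a vertical website on vaccines, most of the content relates to vaccine knowledge, so the data can be extracted directly. Encyclopedia websites also contain a large number of entities from medicine and related fields, but they contain many entities from other fields as well, so the medical field entities have to be identified. A pre-trained entity domain recognition model is used here to recognize the domain of each entity in the data, and the knowledge data corresponding to entity text whose recognition result is a medical field entity is stored in the knowledge base. This ensures that the data sources of the knowledge graph are both broad and specialized.
As described in step S3, knowledge processing must be performed on the knowledge data extracted from the data sources. Knowledge processing is the process of integrating the knowledge in multiple knowledge bases into a single knowledge base. Different knowledge bases collect knowledge with different emphases: for the same entity, one knowledge base may focus on describing a particular aspect of the entity itself, while another may focus on the entity's relationships with other entities. The purpose of knowledge processing is to integrate the descriptions of an entity from different knowledge bases so as to obtain a complete description. Take the historical figure Cao Cao as an example: his descriptions in Baidu Baike, Hudong Baike, and Wikipedia differ somewhat. For the era he belonged to, Baidu Baike gives the Eastern Han, Hudong Baike gives the last years of the Eastern Han, and Wikipedia gives the late Eastern Han. For his main achievements, Baidu Baike gives "implemented the tuntian system, pacified displaced people and eliminated rival warlords, unified the north, laid the foundation of the Cao Wei regime, pioneered Jian'an literature, and advocated simple burials", Hudong Baike gives "unified the north", and Wikipedia gives "unified the core region of the Eastern Han empire". Different knowledge bases thus describe the same entity differently: the era descriptions differ in how specific the period is, and the achievement descriptions differ in scope. Through knowledge processing, the knowledge in different knowledge bases can be merged so that it complements itself, forming a comprehensive, accurate, and complete entity description. The main work in knowledge processing is entity normalization, which also covers attribute normalization, value normalization, and the handling of multi-valued attributes; it can be carried out through similarity calculation, manual crowdsourcing, heuristic rules, and similar methods.
As described in step S4, quality evaluation assesses the final result data and places only qualified data into the knowledge graph. Quality evaluation can cross-check the knowledge data after knowledge processing against data from the data sources, or it can be performed manually through crowdsourcing.
As described in steps S5 to S6, the knowledge data that passes the quality evaluation is constructed into a knowledge graph. A medical field knowledge graph is generally built top-down: the data model of the knowledge graph is determined first, and concrete data is then filled in according to that model to form the medical field knowledge graph. The medical field knowledge graph can be applied to question answering on medical knowledge, helping both patients and doctors.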
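The top-down construction and question answering in steps S5 to S6 can be sketched as follows. This is a minimal illustration, not the implementation used in the application: the `SCHEMA` dictionary, the `KnowledgeGraph` class, and the relation names are all hypothetical. The data model is fixed first, and triples are accepted only when they fit it.

```python
from collections import defaultdict

# Hypothetical data model: relation types allowed for each entity type.
SCHEMA = {"vaccine": {"indication", "mechanism_of_action"}}


class KnowledgeGraph:
    """Toy triple store that is filled top-down against a fixed schema."""

    def __init__(self, schema):
        self.schema = schema
        self.entity_types = {}          # entity name -> entity type
        self.edges = defaultdict(set)   # (entity, relation) -> attribute values

    def add_entity(self, name, entity_type):
        if entity_type not in self.schema:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.entity_types[name] = entity_type

    def add_triple(self, entity, relation, value):
        # Reject triples whose relation is not part of the data model.
        allowed = self.schema[self.entity_types[entity]]
        if relation not in allowed:
            raise ValueError(f"relation {relation!r} not allowed for {entity!r}")
        self.edges[(entity, relation)].add(value)

    def answer(self, entity, relation):
        # A question such as "What does X treat?" reduces to a triple lookup.
        return sorted(self.edges.get((entity, relation), set()))


kg = KnowledgeGraph(SCHEMA)
kg.add_entity("MMR vaccine", "vaccine")
for disease in ("measles", "mumps", "rubella"):
    kg.add_triple("MMR vaccine", "indication", disease)
print(kg.answer("MMR vaccine", "indication"))
```

A real question answering system would first map the natural-language question to an (entity, relation) pair; that step is outside this sketch.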
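In a specific embodiment, the step of inputting the recognized entity text into the pre-trained entity domain recognition model includes:
S21. segmenting the entity text into words and inputting them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$;
S22. inputting the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$;
S23. performing an Attention calculation on the hidden vectors to obtain a representation vector v, the Attention calculation being as follows:
Figure PCTCN2020119374-appb-000001
$v = \sum_i \alpha_i h_i,\ i = 1, \dots, n$
S24. inputting the representation vector v into a fully connected layer to obtain the output result, specifically y = sigmoid(W*v), where y is the recognition result, which takes the value 1 or 0, corresponding to a medical field entity and a non-medical field entity respectively, W is a parameter, and sigmoid is the activation function.
As described above, each word in the entity's text description is input into the Token Embedding layer, which converts it into a word vector; the word vectors $e_1, e_2, \dots, e_n$ carry the semantic information of the words. The word vectors are then input into the LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$, each of which carries part of the sentence information. An Attention operation then yields the representation vector of the current sentence, and finally the representation vector is input into the fully connected layer to obtain the output label, which is the category recognized by the model. Specifically, the output is classified as 1 or 0, where 1 denotes an entity in the medical field (vaccines, diseases, genes, proteins, and so on) and 0 denotes an entity in another field.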
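Steps S23 and S24 can be sketched in plain Python as below. The hidden vectors are taken as given, so the embedding and LSTM layers are not reproduced. The formula for the attention weights is an image in the source and is not recoverable, so this sketch assumes the common choice of a softmax over dot products with a learned query vector. The names `query` and `w` and the 0.5 decision threshold are likewise assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden, query):
    """Return v = sum_i alpha_i * h_i and the weights alpha.

    Assumption: alpha = softmax(h_i . query); the source does not show
    how the weights are computed.
    """
    alphas = softmax([dot(h, query) for h in hidden])
    dim = len(hidden[0])
    v = [sum(a * h[d] for a, h in zip(alphas, hidden)) for d in range(dim)]
    return v, alphas


def classify(hidden, query, w):
    """y = sigmoid(W * v); label 1 = medical field entity, 0 = other."""
    v, _ = attention_pool(hidden, query)
    y = 1.0 / (1.0 + math.exp(-dot(w, v)))
    return y, int(y >= 0.5)


hidden = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for LSTM outputs h_1, h_2
v, alphas = attention_pool(hidden, query=[0.0, 0.0])
score, label = classify(hidden, query=[0.0, 0.0], w=[2.0, 2.0])
```

With a zero query vector the weights are uniform, so v is simply the mean of the hidden vectors; a trained query would shift weight toward the more informative tokens.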
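In a specific embodiment, the step of performing knowledge processing on the extracted data includes:
S31. normalizing the attributes and attribute values of the extracted entity data;
S32. performing multi-valued attribute processing on the extracted entity data.
As described above, knowledge processing must be performed on the knowledge data extracted from the data sources. Different websites collect knowledge with different emphases: for the same entity, the information on one website may focus on describing a particular aspect of the entity itself, while another website may focus on the entity's relationships with other entities. The purpose of knowledge processing is to integrate the descriptions of an entity from different knowledge bases so as to obtain a complete description. Normalizing attributes and attribute values means normalizing the descriptions used in entity triples. For example, "vaccine principle of action" (疫苗作用原理) and "vaccine mechanism of action" (疫苗作用机制) are the same relationship type and can be unified as "vaccine principle of action"; this is attribute normalization. Attribute values need to be normalized in the same way. For entities and entity attributes that have multiple attribute values, the multiple values need to be processed to make the knowledge easier to store; leaving them unprocessed would also affect downstream applications of the knowledge graph. For example, the knowledge about the indications of the measles-mumps-rubella (MMR) vaccine (MMR vaccine, indication, "measles, mumps, rubella") is processed into (MMR vaccine, indication, measles), (MMR vaccine, indication, mumps), and (MMR vaccine, indication, rubella). Multi-valued attribute processing can use a value-splitting algorithm that divides the attribute value into parts at delimiters such as punctuation marks and spaces and scores the result: points are added when a part produced by the split corresponds to an entity and deducted otherwise, and the score determines whether the split is applied.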
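The value-splitting heuristic described above can be sketched as follows. The delimiter set, the +1/-1 scoring, and the "split only if the score is positive" rule are assumptions chosen for illustration; the source states only that matching parts add points and non-matching parts deduct them. `known_entities` is a hypothetical lookup set.

```python
import re

# Assumed delimiter set: common Chinese and Western punctuation plus whitespace.
DELIMITERS = re.compile(r"[、,,;;\s]+")


def split_multi_value(value, known_entities):
    """Split an attribute value only when the parts look like real entities."""
    parts = [p for p in DELIMITERS.split(value) if p]
    if len(parts) < 2:
        return [value]
    # +1 for every part that matches a known entity, -1 otherwise.
    score = sum(1 if p in known_entities else -1 for p in parts)
    return parts if score > 0 else [value]


def expand_triple(triple, known_entities):
    """Turn one multi-valued triple into one triple per value."""
    entity, attribute, value = triple
    return [(entity, attribute, v) for v in split_multi_value(value, known_entities)]


known = {"measles", "mumps", "rubella"}
split = expand_triple(("MMR vaccine", "indication", "measles, mumps, rubella"), known)
kept = expand_triple(("MMR vaccine", "dosage", "0.5 mL, subcutaneous injection"), known)
```

The second call leaves the value intact because none of its fragments is a known entity. Since whitespace is treated as a delimiter here, multi-word entity names would need a longest-match lookup instead.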
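In a specific embodiment, the method for normalizing the attributes and attribute values of the extracted entity data includes:
S311. normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or
S312. normalizing the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
As described above, the attributes and attribute values of entity data can be normalized with a text similarity calculation method: a machine learning approach uses a neural network model to analyze the semantics of the attributes and attribute values, a cosine similarity algorithm computes the similarity between attributes and between attribute values, and attributes or attribute values whose similarity reaches a preset threshold are normalized to the same form. The frequency with which each attribute or attribute value occurs during data extraction can be counted, and attributes or attribute values whose similarity reaches the preset threshold are normalized to the one that occurred most frequently during extraction. For a knowledge graph in a specific field such as vaccines, the attributes and attribute values of the extracted entity data can also be normalized by manual crowdsourcing, which better ensures the accuracy of the knowledge graph.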
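The similarity-based normalization can be sketched as below. The source describes semantic vectors from a neural network; to stay self-contained, this sketch substitutes character-bigram count vectors, which is an assumption and a much weaker signal. The 0.5 threshold is also an assumed value. The most frequent name in each similar group becomes the canonical form, as in the paragraph above.

```python
import math
from collections import Counter


def bigrams(text):
    """Character-bigram counts (a stand-in for a learned semantic vector)."""
    text = text.lower().replace(" ", "")
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[k] * b[k] for k in shared)
    den = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return num / den if den else 0.0


def normalize(name_counts, threshold=0.5):
    """Map each name to the most frequent name that is similar enough to it."""
    # Visit names from most to least frequent so canonical forms come first.
    ordered = sorted(name_counts, key=lambda n: (-name_counts[n], n))
    canonical, mapping = [], {}
    for name in ordered:
        vec = bigrams(name)
        match = next((c for c in canonical if cosine(vec, bigrams(c)) >= threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Attribute names and how often each was seen during extraction (made-up counts).
counts = {"疫苗作用原理": 5, "疫苗作用机制": 2, "适用症": 3}
mapping = normalize(counts)
```

Here 疫苗作用机制 is mapped to 疫苗作用原理 (three of five bigrams shared, similarity about 0.6), while 适用症 stays separate.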
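In a specific embodiment, the step of performing quality evaluation on the knowledge data after knowledge processing includes:
S41. cross-checking the knowledge data after knowledge processing against data from the data sources;
S42. assigning the knowledge data that fails the cross-check to human reviewers for evaluation through a crowdsourcing algorithm.
As described above, quality evaluation assesses the quality of the knowledge after knowledge processing, thereby guaranteeing the quality of the knowledge in the knowledge graph. In this embodiment, the knowledge data after knowledge processing is first cross-checked against data from the data sources. The cross-check can be implemented with a pre-trained neural network model. Knowledge data that passes the cross-check can be saved for building the knowledge graph, while knowledge data that fails it can be evaluated manually through crowdsourcing. The crowdsourcing algorithm is the algorithm that assigns crowdsourcing tasks to human reviewers. Manual crowdsourcing costs more, but it offers better expertise and higher accuracy. Leaving to human reviewers only the evaluations that the machine cannot complete both improves efficiency and ensures quality.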
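The two-stage evaluation in S41 and S42 can be sketched as follows. Both rules are assumptions made for illustration: the cross-check here passes a triple when at least `min_support` sources contain it (the source mentions a pre-trained neural network instead), and the "crowdsourcing algorithm" is reduced to round-robin assignment over a hypothetical reviewer list.

```python
from itertools import cycle


def cross_check(triple, source_triples, min_support=2):
    """Assumed rule: a triple passes when enough sources agree on it."""
    support = sum(1 for triples in source_triples.values() if triple in triples)
    return support >= min_support


def evaluate(candidates, source_triples, reviewers, min_support=2):
    """Accept cross-checked triples; route the rest to human reviewers."""
    accepted = []
    assignments = {r: [] for r in reviewers}
    next_reviewer = cycle(reviewers)  # round-robin stand-in for task allocation
    for triple in candidates:
        if cross_check(triple, source_triples, min_support):
            accepted.append(triple)
        else:
            assignments[next(next_reviewer)].append(triple)
    return accepted, assignments


t1 = ("MMR vaccine", "indication", "measles")
t2 = ("MMR vaccine", "indication", "mumps")
t3 = ("MMR vaccine", "indication", "influenza")
sources = {"site_a": {t1, t2}, "site_b": {t1}}
accepted, assignments = evaluate([t1, t2, t3], sources, ["reviewer_1", "reviewer_2"])
```

Only `t1` is supported by both sources and is accepted automatically; `t2` and `t3` are queued for manual review.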
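In one embodiment, after the step of constructing the data that passes the quality evaluation into a medical field knowledge graph, the method further includes updating the knowledge graph, wherein the update method is:
S61. using the formula based on the statistical Poisson distribution
$\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-000002)
to predict the update frequency of an entity in the knowledge graph, where Estimation(e) is the update frequency of the entity, T(e) is the time period over which the entity has existed, and X(e) is the number of times the entity e changed within the time period T(e);
S62. intelligently updating the entity data in the knowledge graph according to the update frequency.
As described above, knowledge in the real world changes constantly. Without timely updates, the knowledge in a knowledge graph becomes outdated, which affects its downstream applications. The most common approach, and the update strategy of many knowledge graphs, is a periodic full update, which consumes a great deal of time and network bandwidth. This application predicts the update frequency of the entities in the knowledge graph, which makes it possible to identify changed and newly appearing entities effectively and to keep the knowledge graph updated in real time at low cost. The entity data in the knowledge graph comes from various data sources, and the data in those sources is related to how often the entities are updated. An entity update frequency estimation scheme based on a statistical Poisson distribution assumption is used here, where Estimation(e), the estimated update frequency of entity e, is produced by the entity update frequency estimation algorithm. The estimate rests on the statistical assumption that the changes (here, changes to the data) follow a Poisson distribution. The total number of changes divided by the time interval is then a valid estimate of the change frequency, with the formula:
$\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-000003)
where T(e) is the time period over which the entity has existed and X(e) is the number of times the entity e changed within the time period T(e).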
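The estimate "total number of changes divided by the time interval" can be sketched as below. The function name, the use of days as the time unit, and the representation of the change history as a list of timestamps are assumptions for illustration.

```python
from datetime import datetime


def estimate_update_frequency(change_times, first_seen, now):
    """Estimation(e) = X(e) / T(e), expressed in changes per day.

    X(e): number of observed changes within the entity's lifetime.
    T(e): length of that lifetime in days.
    """
    lifetime_days = (now - first_seen).total_seconds() / 86400
    if lifetime_days <= 0:
        raise ValueError("entity must have existed for a positive time span")
    changes = sum(1 for t in change_times if first_seen <= t <= now)
    return changes / lifetime_days


first_seen = datetime(2020, 1, 1)
now = datetime(2020, 1, 31)
history = [datetime(2020, 1, 5), datetime(2020, 1, 12), datetime(2020, 1, 20)]
freq = estimate_update_frequency(history, first_seen, now)  # 3 changes over 30 days
```

Under the Poisson assumption this ratio is the maximum-likelihood estimate of the change rate, provided every change in the interval was actually observed.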
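In a specific embodiment, the step of intelligently updating the entity data in the knowledge graph according to the update frequency includes:
S621. determining the update period of the entity according to the update frequency;
S622. determining, based on the current time and the update period of the entity, the next update time of the attribute value corresponding to the entity;
S623. updating the attribute value corresponding to the entity in the knowledge graph according to the next update time of the attribute value.
As described above, knowledge changes constantly, and without timely updates the knowledge in the knowledge graph becomes outdated, which affects its downstream applications. Suppose the model predicts that a given entity is updated about once a month: after one update, the entity only needs to be updated again a month later. In this way, each update of the knowledge graph only has to refresh the small portion of entities that have changed in order to keep the whole graph fresh.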
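Steps S621 to S623 can be sketched as a small scheduler. Taking the update period as the reciprocal of the estimated frequency is an assumption (the source does not state the mapping), as is the one-year cap for entities that never change; all function names are hypothetical.

```python
from datetime import datetime, timedelta


def update_period(frequency_per_day, max_period_days=365.0):
    """S621: assumed mapping, period = 1 / frequency, capped for static entities."""
    if frequency_per_day <= 0:
        return timedelta(days=max_period_days)
    return timedelta(days=min(1.0 / frequency_per_day, max_period_days))


def next_update_time(last_update, frequency_per_day):
    """S622: next update = time of the last update + update period."""
    return last_update + update_period(frequency_per_day)


def due_entities(schedule, now):
    """S623: only entities whose next update time has arrived are refreshed."""
    return sorted(e for e, when in schedule.items() if when <= now)


start = datetime(2020, 1, 1)
schedule = {
    "MMR vaccine": next_update_time(start, 0.5),   # changes about every 2 days
    "measles": next_update_time(start, 0.0),       # no observed changes
}
to_refresh = due_entities(schedule, datetime(2020, 1, 4))
```

Only the frequently changing entity is due after three days, which is the point of the scheme: each pass touches a small subset of the graph instead of re-crawling everything.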
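In one embodiment, the models used for entity recognition and entity domain recognition, the data related to the completed medical field knowledge graph, and similar information can all be stored in a blockchain, and the medical field knowledge graph construction method described above can be implemented in a blockchain network.
As described above, blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
The underlying blockchain platform can include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module manages the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and the mapping between users' real identities and blockchain addresses (permission management); when authorized, it supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and to record valid requests to storage once consensus on them is reached. For a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module handles contract registration and issuance as well as contract triggering and execution: developers can define contract logic in a programming language and publish it to the blockchain (contract registration), and the contract is triggered by a key call or another event according to the logic of its terms to complete the contract logic; the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration changes, contract settings, cloud adaptation, and visual output of real-time status while the product is running, for example alarms, network monitoring, and monitoring the health of node devices.
The medical field knowledge graph construction and update method of the embodiments of this application can build a medical field knowledge graph automatically, effectively reducing labor costs while ensuring the quality of the graph, and can update the knowledge graph at low cost. The method transfers well and is equally applicable to constructing and updating knowledge graphs in other fields; it has already been applied to several real knowledge graphs with good results.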
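Referring to FIG. 2, an embodiment of this application further provides a medical field knowledge graph construction apparatus, including:
a first knowledge extraction unit 1, configured to perform knowledge extraction on vertical websites related to the medical field and store the extracted knowledge in a knowledge base;
a second knowledge extraction unit 2, configured to perform knowledge extraction on encyclopedia websites, perform entity text recognition on the extracted knowledge data, input the entity text into a pre-trained entity domain recognition model, and store in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
a knowledge processing unit 3, configured to perform knowledge processing on the data in the knowledge base;
a quality evaluation unit 4, configured to perform quality evaluation on the knowledge data after knowledge processing;
a construction unit 5, configured to construct the knowledge data that passes the quality evaluation into a medical field knowledge graph;
an intelligent question answering unit 6, configured to apply the medical field knowledge graph to intelligent question answering on medical knowledge.
In a specific embodiment, the second knowledge extraction unit 2 includes:
a word vector acquisition unit, configured to segment the entity text into words and input them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$;
a hidden vector acquisition unit, configured to input the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$;
a representation vector acquisition unit, configured to perform an Attention calculation on the hidden vectors to obtain a representation vector v;
an output result acquisition unit, configured to input the representation vector v into a fully connected layer to obtain the output result.
In a specific embodiment, the knowledge processing unit 3 includes:
a normalization unit, configured to normalize the attributes and attribute values of the extracted entity data;
a multi-valued attribute processing unit, configured to perform multi-valued attribute processing on the extracted entity data.
In a specific embodiment, the normalization unit includes:
a similarity calculation unit, configured to normalize the attributes and attribute values of the extracted entity data by using a text similarity calculation method;
a manual crowdsourcing unit, configured to normalize the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
In a specific embodiment, the quality evaluation unit 4 includes:
a checking unit, configured to cross-check the knowledge data after knowledge processing against data from the data sources;
an assignment unit, configured to assign the knowledge data that fails the cross-check to human reviewers for evaluation through a crowdsourcing algorithm.
In one embodiment, the medical field knowledge graph construction apparatus further includes an update unit, configured to update the knowledge graph.
In a specific embodiment, the update unit includes:
an update frequency prediction unit, configured to predict the update frequency of entities in the knowledge graph by using the formula based on the statistical Poisson distribution;
an intelligent update unit, configured to intelligently update the entity data in the knowledge graph according to the update frequency.
As described above, it can be understood that the components of the medical field knowledge graph construction apparatus proposed in this application can implement the functions of any of the medical field knowledge graph construction methods described above; the specific structure is not described again.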
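Referring to FIG. 3, an embodiment of this application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data related to the knowledge graph. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a medical field knowledge graph construction method.
The processor executes the above medical field knowledge graph construction method, which includes: performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity; performing knowledge processing on the data in the knowledge base; performing quality evaluation on the data after knowledge processing; constructing the data that passes the quality evaluation into a medical field knowledge graph; and applying the medical field knowledge graph to intelligent question answering on medical knowledge.
In one embodiment, the step of inputting the recognized entity text into the pre-trained entity domain recognition model includes: segmenting the entity text into words and inputting them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$; inputting the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$; and performing an Attention calculation on the hidden vectors to obtain a representation vector v, the Attention calculation being as follows:
Figure PCTCN2020119374-appb-000004
$v = \sum_i \alpha_i h_i,\ i = 1, \dots, n$
inputting the representation vector v into a fully connected layer to obtain the output result, specifically y = sigmoid(W*v), where y is the recognition result, which takes the value 1 or 0, corresponding to a medical field entity and a non-medical field entity respectively, W is a parameter, and sigmoid is the activation function.
In a specific embodiment, the step of performing knowledge processing on the extracted data includes: normalizing the attributes and attribute values of the extracted entity data; and performing multi-valued attribute processing on the extracted entity data.
In a specific embodiment, the method for normalizing the attributes and attribute values of the extracted entity data includes: normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or normalizing them by using manual crowdsourcing.
In a specific embodiment, the step of performing quality evaluation on the knowledge data after knowledge processing includes: cross-checking the knowledge data after knowledge processing against data from the data sources; and assigning the knowledge data that fails the cross-check to human reviewers for evaluation through a crowdsourcing algorithm.
In one embodiment, after the step of constructing the data that passes the quality evaluation into a medical field knowledge graph, the method further includes updating the knowledge graph, wherein the update method is: using the formula based on the statistical Poisson distribution
$\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-000005)
to predict the update frequency of an entity in the knowledge graph, where Estimation(e) is the update frequency of the entity, T(e) is the time period over which the entity has existed, and X(e) is the number of times the entity e changed within the time period T(e); and intelligently updating the entity data in the knowledge graph according to the update frequency.
In a specific embodiment, the step of intelligently updating the entity data in the knowledge graph according to the update frequency includes: determining the update period of the entity according to the update frequency; determining, based on the current time and the update period of the entity, the next update time of the attribute value corresponding to the entity; and updating the attribute value corresponding to the entity in the knowledge graph according to the next update time of the attribute value.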
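An embodiment of this application further provides a computer-readable storage medium, which may be non-volatile or volatile and on which a computer program is stored. When executed by a processor, the computer program implements a medical field knowledge graph construction method including the steps of: performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity; performing knowledge processing on the data in the knowledge base; performing quality evaluation on the data after knowledge processing; constructing the data that passes the quality evaluation into a medical field knowledge graph; and applying the medical field knowledge graph to intelligent question answering on medical knowledge.
In one embodiment, the step of inputting the recognized entity text into the pre-trained entity domain recognition model includes: segmenting the entity text into words and inputting them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$; inputting the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$; and performing an Attention calculation on the hidden vectors to obtain a representation vector v, the Attention calculation being as follows:
Figure PCTCN2020119374-appb-000006
$v = \sum_i \alpha_i h_i,\ i = 1, \dots, n$
inputting the representation vector v into a fully connected layer to obtain the output result, specifically y = sigmoid(W*v), where y is the recognition result, which takes the value 1 or 0, corresponding to a medical field entity and a non-medical field entity respectively, W is a parameter, and sigmoid is the activation function.
In a specific embodiment, the step of performing knowledge processing on the extracted data includes: normalizing the attributes and attribute values of the extracted entity data; and performing multi-valued attribute processing on the extracted entity data.
In a specific embodiment, the method for normalizing the attributes and attribute values of the extracted entity data includes: normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or normalizing them by using manual crowdsourcing.
In a specific embodiment, the step of performing quality evaluation on the knowledge data after knowledge processing includes: cross-checking the knowledge data after knowledge processing against data from the data sources; and assigning the knowledge data that fails the cross-check to human reviewers for evaluation through a crowdsourcing algorithm.
In one embodiment, after the step of constructing the data that passes the quality evaluation into a medical field knowledge graph, the method further includes updating the knowledge graph, wherein the update method is: using the formula based on the statistical Poisson distribution
$\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-000007)
to predict the update frequency of an entity in the knowledge graph, where Estimation(e) is the update frequency of the entity, T(e) is the time period over which the entity has existed, and X(e) is the number of times the entity e changed within the time period T(e); and intelligently updating the entity data in the knowledge graph according to the update frequency.
In a specific embodiment, the step of intelligently updating the entity data in the knowledge graph according to the update frequency includes: determining the update period of the entity according to the update frequency; determining, based on the current time and the update period of the entity, the next update time of the attribute value corresponding to the entity; and updating the attribute value corresponding to the entity in the knowledge graph according to the next update time of the attribute value.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium provided in this application and used in the embodiments can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that includes it.
The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.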

Claims (20)

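  1. A medical field knowledge graph construction method, comprising:
    performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
    performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
    performing knowledge processing on the data in the knowledge base;
    performing quality evaluation on the data after knowledge processing;
    constructing the data that passes the quality evaluation into a medical field knowledge graph;
    applying the medical field knowledge graph to intelligent question answering on medical knowledge.
  2. The medical field knowledge graph construction method according to claim 1, wherein the step of inputting the recognized entity text into the pre-trained entity domain recognition model comprises:
    segmenting the entity text into words and inputting them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$;
    inputting the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$;
    performing an Attention calculation on the hidden vectors to obtain a representation vector v, the Attention calculation being as follows:
    Figure PCTCN2020119374-appb-100001
    $v = \sum_i \alpha_i h_i,\ i = 1, \dots, n$;
    inputting the representation vector v into a fully connected layer to obtain the output result, specifically y = sigmoid(W*v), where y is the recognition result, which takes the value 1 or 0, corresponding to a medical field entity and a non-medical field entity respectively, W is a parameter, and sigmoid is the activation function.
  3. The medical field knowledge graph construction method according to claim 1, wherein the step of performing knowledge processing on the extracted data comprises:
    normalizing the attributes and attribute values of the extracted entity data;
    performing multi-valued attribute processing on the extracted entity data.
  4. The medical field knowledge graph construction method according to claim 3, wherein the method for normalizing the attributes and attribute values of the extracted entity data comprises:
    normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or
    normalizing the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
  5. The medical field knowledge graph construction method according to claim 1, wherein the step of performing quality evaluation on the knowledge data after knowledge processing comprises:
    cross-checking the knowledge data after knowledge processing against data from the data sources;
    assigning the knowledge data that fails the cross-check to human reviewers for evaluation through a crowdsourcing algorithm.
  6. The medical field knowledge graph construction method according to claim 1, further comprising, after the step of constructing the data that passes the quality evaluation into a medical field knowledge graph, updating the knowledge graph, wherein the update method is:
    using the formula based on the statistical Poisson distribution
    $\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-100002)
    to predict the update frequency of an entity in the knowledge graph, where Estimation(e) is the update frequency of the entity, T(e) is the time period over which the entity has existed, and X(e) is the number of times the entity e changed within the time period T(e);
    intelligently updating the entity data in the knowledge graph according to the update frequency.
  7. The medical field knowledge graph construction method according to claim 6, wherein the step of intelligently updating the entity data in the knowledge graph according to the update frequency comprises:
    determining the update period of the entity according to the update frequency;
    determining, based on the current time and the update period of the entity, the next update time of the attribute value corresponding to the entity;
    updating the attribute value corresponding to the entity in the knowledge graph according to the next update time of the attribute value.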
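  8. A medical field knowledge graph construction apparatus, comprising:
    a first knowledge extraction unit, configured to perform knowledge extraction on vertical websites related to the medical field and store the extracted knowledge in a knowledge base;
    a second knowledge extraction unit, configured to perform knowledge extraction on encyclopedia websites, perform entity text recognition on the extracted knowledge data, input the entity text into a pre-trained entity domain recognition model, and store in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
    a knowledge processing unit, configured to perform knowledge processing on the data in the knowledge base;
    a quality evaluation unit, configured to perform quality evaluation on the knowledge data after knowledge processing;
    a construction unit, configured to construct the knowledge data that passes the quality evaluation into a medical field knowledge graph;
    an intelligent question answering unit, configured to apply the medical field knowledge graph to intelligent question answering on medical knowledge.
  9. The medical field knowledge graph construction apparatus according to claim 8, wherein the second knowledge extraction unit comprises:
    a word vector acquisition unit, configured to segment the entity text into words and input them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$;
    a hidden vector acquisition unit, configured to input the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$;
    a representation vector acquisition unit, configured to perform an Attention calculation on the hidden vectors to obtain a representation vector v;
    an output result acquisition unit, configured to input the representation vector v into a fully connected layer to obtain the output result.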
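  10. A computer device, comprising a memory and a processor, the memory storing a computer program, and the processor implementing a medical field knowledge graph construction method when executing the computer program, wherein the medical field knowledge graph construction method comprises:
    performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
    performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
    performing knowledge processing on the data in the knowledge base;
    performing quality evaluation on the data after knowledge processing;
    constructing the data that passes the quality evaluation into a medical field knowledge graph;
    applying the medical field knowledge graph to intelligent question answering on medical knowledge.
  11. The computer device according to claim 10, wherein the step of inputting the recognized entity text into the pre-trained entity domain recognition model comprises:
    segmenting the entity text into words and inputting them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$;
    inputting the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$;
    performing an Attention calculation on the hidden vectors to obtain a representation vector v, the Attention calculation being as follows:
    Figure PCTCN2020119374-appb-100003
    $v = \sum_i \alpha_i h_i,\ i = 1, \dots, n$;
    inputting the representation vector v into a fully connected layer to obtain the output result, specifically y = sigmoid(W*v), where y is the recognition result, which takes the value 1 or 0, corresponding to a medical field entity and a non-medical field entity respectively, W is a parameter, and sigmoid is the activation function.
  12. The computer device according to claim 10, wherein the step of performing knowledge processing on the extracted data comprises:
    normalizing the attributes and attribute values of the extracted entity data;
    performing multi-valued attribute processing on the extracted entity data.
  13. The computer device according to claim 12, wherein the method for normalizing the attributes and attribute values of the extracted entity data comprises:
    normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or
    normalizing the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
  14. The computer device according to claim 10, wherein the step of performing quality evaluation on the knowledge data after knowledge processing comprises:
    cross-checking the knowledge data after knowledge processing against data from the data sources;
    assigning the knowledge data that fails the cross-check to human reviewers for evaluation through a crowdsourcing algorithm.
  15. The computer device according to claim 1, further comprising, after the step of constructing the data that passes the quality evaluation into a medical field knowledge graph, updating the knowledge graph, wherein the update method is:
    using the formula based on the statistical Poisson distribution
    $\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-100004)
    to predict the update frequency of an entity in the knowledge graph, where Estimation(e) is the update frequency of the entity, T(e) is the time period over which the entity has existed, and X(e) is the number of times the entity e changed within the time period T(e);
    intelligently updating the entity data in the knowledge graph according to the update frequency.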
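  16. A computer-readable storage medium storing a computer program which, when executed by a processor, implements a medical field knowledge graph construction method, wherein the medical field knowledge graph construction method comprises:
    performing knowledge extraction on vertical websites related to the medical field and storing the extracted knowledge in a knowledge base; and
    performing knowledge extraction on encyclopedia websites, performing entity text recognition on the extracted knowledge data, inputting the recognized entity text into a pre-trained entity domain recognition model, and storing in the knowledge base the knowledge data corresponding to entity text whose recognition result is a medical field entity;
    performing knowledge processing on the data in the knowledge base;
    performing quality evaluation on the data after knowledge processing;
    constructing the data that passes the quality evaluation into a medical field knowledge graph;
    applying the medical field knowledge graph to intelligent question answering on medical knowledge.
  17. The computer-readable storage medium according to claim 16, wherein the step of inputting the recognized entity text into the pre-trained entity domain recognition model comprises:
    segmenting the entity text into words and inputting them into a Token Embedding layer to obtain word vectors $e_1, e_2, \dots, e_n$;
    inputting the word vectors $e_1, e_2, \dots, e_n$ into an LSTM layer to obtain hidden vectors $h_1, h_2, \dots, h_n$;
    performing an Attention calculation on the hidden vectors to obtain a representation vector v, the Attention calculation being as follows:
    Figure PCTCN2020119374-appb-100005
    $v = \sum_i \alpha_i h_i,\ i = 1, \dots, n$;
    inputting the representation vector v into a fully connected layer to obtain the output result, specifically y = sigmoid(W*v), where y is the recognition result, which takes the value 1 or 0, corresponding to a medical field entity and a non-medical field entity respectively, W is a parameter, and sigmoid is the activation function.
  18. The computer-readable storage medium according to claim 16, wherein the step of performing knowledge processing on the extracted data comprises:
    normalizing the attributes and attribute values of the extracted entity data;
    performing multi-valued attribute processing on the extracted entity data.
  19. The computer-readable storage medium according to claim 18, wherein the method for normalizing the attributes and attribute values of the extracted entity data comprises:
    normalizing the attributes and attribute values of the extracted entity data by using a text similarity calculation method; or
    normalizing the attributes and attribute values of the extracted entity data by using manual crowdsourcing.
  20. The computer-readable storage medium according to claim 16, further comprising, after the step of constructing the data that passes the quality evaluation into a medical field knowledge graph, updating the knowledge graph, wherein the update method is:
    using the formula based on the statistical Poisson distribution
    $\mathrm{Estimation}(e) = X(e) / T(e)$ (Figure PCTCN2020119374-appb-100006)
    to predict the update frequency of an entity in the knowledge graph, where Estimation(e) is the update frequency of the entity, T(e) is the time period over which the entity has existed, and X(e) is the number of times the entity e changed within the time period T(e);
    intelligently updating the entity data in the knowledge graph according to the update frequency.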
PCT/CN2020/119374 2020-06-24 2020-09-30 Medical field knowledge graph construction method, apparatus, device, and storage medium WO2021139282A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010592333.3 2020-06-24
CN202010592333.3A CN111831908A (zh) 2020-06-24 2020-06-24 Medical field knowledge graph construction method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021139282A1 true WO2021139282A1 (zh) 2021-07-15

Family

ID=72899410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119374 WO2021139282A1 (zh) 2020-06-24 2020-09-30 Medical field knowledge graph construction method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111831908A (zh)
WO (1) WO2021139282A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023061377A1 (zh) * 2021-10-13 2023-04-20 浙江大学 Multi-center knowledge graph joint decision support method and system

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
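CN113656692B (zh) * 2021-08-17 2023-05-30 中国平安财产保险股份有限公司 Product recommendation method, apparatus, device, and medium based on a knowledge transfer algorithm
CN113990473B (zh) * 2021-10-28 2022-09-30 上海昆亚医疗器械股份有限公司 Medical equipment operation and maintenance information collection and analysis system and method of use thereof
CN115080762A (zh) * 2022-06-17 2022-09-20 瀚云瑞科技(北京)有限公司 Method and system for establishing examination knowledge graph relationships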

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180218126A1 (en) * 2017-01-31 2018-08-02 Pager, Inc. Determining Patient Symptoms and Medical Recommendations Based on Medical Information
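CN109271530A (zh) * 2018-10-17 2019-01-25 长沙瀚云信息科技有限公司 Disease knowledge graph construction method, platform system, device, and storage medium
CN109471948A (zh) * 2018-11-08 2019-03-15 威海天鑫现代服务技术研究院有限公司 Method for constructing a knowledge question answering system for the elderly health field
CN109543047A (zh) * 2018-11-21 2019-03-29 焦点科技股份有限公司 Knowledge graph construction method based on medical field websites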

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103636B (zh) * 2011-01-18 2013-08-07 南京信息工程大学 Incremental information acquisition method for deep web pages
WO2018077906A1 (en) * 2016-10-25 2018-05-03 Koninklijke Philips N.V. Knowledge graph-based clinical diagnosis assistant
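CN106776711B (zh) * 2016-11-14 2020-04-07 浙江大学 Deep learning-based method for constructing a Chinese medical knowledge graph
CN110019839B (zh) * 2018-01-03 2021-11-05 中国科学院计算技术研究所 Medical knowledge graph construction method and system based on neural networks and distant supervision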

Also Published As

Publication number Publication date
CN111831908A (zh) 2020-10-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912706

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912706

Country of ref document: EP

Kind code of ref document: A1