CN116932774A - Knowledge graph construction method, device, equipment and storage medium

Info

Publication number: CN116932774A
Application number: CN202310829041.0A
Authority: CN
Inventors: 孙小婉; 蔡巍; 张霞
Original assignee: Neusoft Corp; Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Current assignee: Neusoft Corp; Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-10-24

Abstract

The embodiment of the application provides a knowledge graph construction method, a knowledge graph construction device, knowledge graph construction equipment and a storage medium. The method comprises the following steps: constructing a corresponding first knowledge graph according to the element traceability representation information in each structured data source; constructing a corresponding second knowledge graph according to the identified entity and the data source identification in each unstructured data source; and fusing the first knowledge graph and the second knowledge graph to obtain a corresponding target knowledge graph. The embodiment of the application can realize the comprehensive construction of the target knowledge graph on the basis of supporting the traceability of knowledge in the target knowledge graph, so that the target knowledge graph can comprehensively represent the association relationship among all knowledge entities, and the reliability and the integrity of the target knowledge graph are improved.

Description

Knowledge graph construction method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of data processing, in particular to a method, a device, equipment and a storage medium for constructing a knowledge graph.

Background

Within the medical field, the management and exploitation of medical knowledge is critical to improving the effectiveness of diagnosis, treatment and research. Medical knowledge may include not only structured medical data, such as patient electronic medical records and laboratory reports, but also unstructured medical data, such as medical literature, clinical guidelines and drug databases.

Currently, to enable association analysis of medical knowledge, the medical knowledge under each data source is analyzed and the various medical knowledge under a single data source undergoes simple format conversion, so that a separate medical knowledge graph is constructed for each data source. However, a medical knowledge graph constructed for a single data source cannot comprehensively represent the association relationships among medical knowledge, so such graphs have certain limitations.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a storage medium for constructing a knowledge graph, which are used for realizing comprehensive construction of a target knowledge graph, ensuring traceability of knowledge in the target knowledge graph and improving reliability of the target knowledge graph.

In a first aspect, an embodiment of the present application provides a method for constructing a knowledge graph, where the method includes:

constructing a corresponding first knowledge graph according to the element traceability representation information in each structured data source;

constructing a corresponding second knowledge graph according to the identified entity and the data source identification in each unstructured data source;

and fusing the first knowledge graph and the second knowledge graph to obtain a corresponding target knowledge graph.

In a second aspect, an embodiment of the present application provides a device for constructing a knowledge graph, where the device includes:

the first knowledge graph construction module is used for constructing a corresponding first knowledge graph according to the element traceability representation information in each structured data source;

the second knowledge graph construction module is used for constructing a corresponding second knowledge graph according to the identified entity and the data source identifier in each unstructured data source;

and the knowledge graph fusion module is used for fusing the first knowledge graph and the second knowledge graph to obtain a corresponding target knowledge graph.

In a third aspect, an embodiment of the present application provides an electronic device, including:

a processor and a memory, wherein the memory is used for storing a computer program, and the processor is used for calling and running the computer program stored in the memory to execute the knowledge graph construction method provided in the first aspect of the application.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that causes a computer to execute the knowledge graph construction method as provided in the first aspect of the present application.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program/instruction which, when executed by a processor, implements a method of constructing a knowledge graph as provided in the first aspect of the present application.

According to the method, device, equipment and storage medium for constructing a knowledge graph provided by the embodiments of the application, a first knowledge graph is constructed by analyzing the element traceability representation information in each structured data source, which ensures the traceability of the knowledge in the first knowledge graph. A second knowledge graph is constructed according to the identified entities and the data source identifiers in each unstructured data source, which ensures the traceability of the knowledge in the second knowledge graph. The first knowledge graph and the second knowledge graph are then fused to obtain a corresponding target knowledge graph. The target knowledge graph is thus constructed comprehensively while the knowledge in it remains traceable, so that it can comprehensively represent the association relationships among all knowledge entities, which improves its reliability and integrity.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a knowledge graph construction method according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for a specific construction process of a first knowledge-graph corresponding to a structured data source according to an embodiment of the present application;

FIGS. 3a-3e are exemplary diagrams of a first knowledge subgraph under each primary key in a structured data source, as shown in an embodiment of the application;

FIG. 4 is an exemplary schematic diagram of a first knowledge-graph constructed from structured data sources, in accordance with an embodiment of the application;

FIG. 5 is a flowchart of a method for a specific construction process of a second knowledge-graph corresponding to an unstructured data source according to an embodiment of the present application;

FIG. 6 is a flow chart of a method for entity relationship completion procedure of a first knowledge-graph, according to an embodiment of the application;

FIG. 7 is a schematic block diagram of a knowledge graph construction apparatus according to an embodiment of the present application;

FIG. 8 is a schematic block diagram of an electronic device shown in an embodiment of the application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In order to solve the problem that existing knowledge graphs have certain limitations, the embodiment of the application designs a construction scheme for a target knowledge graph that can comprehensively represent the association relationships among all knowledge entities. For each structured data source, the first knowledge graph can be constructed by analyzing the element traceability representation information in the structured data source, which ensures the traceability of the knowledge in the first knowledge graph. For each unstructured data source, the second knowledge graph can be constructed by analyzing the identified entities and the data source identifiers in the unstructured data source, which ensures the traceability of the knowledge in the second knowledge graph. The first knowledge graph and the second knowledge graph are then fused to obtain the target knowledge graph, so that the target knowledge graph is constructed comprehensively while the knowledge in it remains traceable.

Fig. 1 is a flowchart of a knowledge graph construction method according to an embodiment of the present application. Referring to fig. 1, the method may include the steps of:

S110, constructing a corresponding first knowledge graph according to the element traceability representation information in each structured data source.

In the medical field, in order to comprehensively analyze the association relationships among various kinds of medical knowledge, relevant medical information generally needs to be extracted from a large number of medical data sources, then processed and integrated, and converted into simple and clear "entity, relationship, entity" triples. The entities in each triple are then used as nodes, and the two entities of each triple are connected through the relationship in that triple, thereby forming the corresponding knowledge graph.
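The conversion from triples to a graph described above can be sketched as follows. This is a minimal illustration only: the function name `build_graph`, the relation names and the sample triples are assumptions and do not come from the application.

```python
from collections import defaultdict

def build_graph(triples):
    """Turn (head, relation, tail) triples into a node set and labelled edges."""
    nodes = set()
    edges = defaultdict(list)  # head entity -> list of (relation, tail entity)
    for head, relation, tail in triples:
        nodes.update((head, tail))
        edges[head].append((relation, tail))
    return nodes, dict(edges)

# Hypothetical triples, for illustration only.
triples = [
    ("Visit 01123", "uses_drug", "Drug 7683"),
    ("Drug 7683", "has_name", "Name1"),
]
nodes, edges = build_graph(triples)
```

Each entity appears once as a node even when it occurs in several triples, which is what lets the relationships of separate triples join into one graph.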

Then, in order to ensure knowledge integrity in the knowledge graph, the medical data sources in the present application can be generally classified into structured data sources and unstructured data sources.

The structured data source may be, for example, a relational database storing various relevant medical data, in which the medical data is stored in data tables. The unstructured data sources may be literature such as medical journals, reports and papers describing various medical knowledge, in which the medical data is generally described in text form.

Because the accuracy of medical knowledge in a knowledge graph generally affects whether a medical decision is successful, medical researchers generally need to perform credibility verification on the medical knowledge in the knowledge graph when looking up a certain knowledge graph so as to ensure the true validity of the knowledge graph. The traceable knowledge source is a key concept in the knowledge graph, so that a user can trace and record sources, generation processes, related metadata information and the like of various medical knowledge serving as each entity in the knowledge graph, and various medical knowledge can be comprehensively and thoroughly understood.

Therefore, in order to support verification of the authenticity of the knowledge graph, the application needs to make the knowledge of each entity in the constructed knowledge graph traceable, so that medical researchers looking up the medical knowledge represented by any entity in a knowledge graph can accurately identify the recording source of that entity and analyze the specific content of the medical knowledge from that source.

In the present application, each structured data source may store various medical knowledge in data tables. Each element in the structured data source can then be a medical data item stored in a data table, and the medical data items in any one row record of a data table are associated with one another. For example, each row record of one data table may store information such as the visit identifier, gender and age of the same patient, while each row record of another data table may store the drug identifier of a drug, the visit identifier of the patient using it, and so on. Therefore, an association relationship exists among the elements in each structured data source, and the identification information of the structured data source together with one element in a row record can serve as the tracing indication information for the other elements in that row record. For example, the patient's visit identifier and the drug identifier of a drug may each represent the source of other medical data items.

Therefore, in each structured data source, the row information and the column information of each element can be analyzed to determine the other elements in the same row that are associated with the element, as well as the specific column information that indicates the source of the element. Based on this information, each element can then be converted, according to a preset element traceability representation format, into a piece of record data that shows both the source of the element and the association relationships between the element and other elements, and this record data is used as the element traceability representation information of the element. In this way, the element traceability representation information corresponding to each element in each structured data source can be determined.

Then, each element is taken as an entity, other elements with association relation with the element can be determined by analyzing the element traceability representation information corresponding to the element, and the traceability tag corresponding to the element traceability representation information is stored in the entity where the element is located so as to indicate the source information of the element. Then, by connecting the entities corresponding to the elements with the association relationship, a first knowledge graph can be constructed. The first knowledge graph can describe the association relation among various medical knowledge related in each structured data source, and can support a user to check corresponding element traceability representation information at any time through traceability labels stored in each entity so as to clearly represent source information of the entity, namely the specific position of the structured data source where the entity is located, so that the first knowledge graph supports traceability of various medical knowledge represented by each entity in the first knowledge graph, and accuracy and reliability of knowledge sources in the first knowledge graph are ensured.

S120, constructing a corresponding second knowledge graph according to the identified entity and the data source identification in each unstructured data source.

Each unstructured data source typically records various medical knowledge in text form. Therefore, in order to accurately construct the knowledge graph corresponding to the unstructured data sources, the application performs natural language analysis on the text information of each unstructured data source to extract the medical knowledge points that serve as the identified entities in the application. For example, for the text "please stop the use and give antiviral treatment when the reaction name1 appears", natural language analysis can extract the following entities: {adverse reaction: name1} and {method: antiviral}.

Then, through natural language analysis on the specific text content of each two identified entities, whether the corresponding association relationship exists between the two identified entities can be judged, so that each identified entity pair with the association relationship can be determined.

In addition, in order to realize the knowledge traceability of the knowledge graph corresponding to the unstructured data sources, the application also needs to analyze the source information of each identified entity. Therefore, the source information of each identified entity is represented by analyzing which unstructured data sources each identified entity originates from and adding the data source identification of the unstructured data sources to the identified entity.

Then, by connecting two identified entities in each identified entity pair having an association relationship, a corresponding second knowledge-graph can be constructed. Each entity in the second knowledge graph is added with a corresponding data source identifier to clearly express the source information of the entity, so that the second knowledge graph supports traceability of various medical knowledge expressed by each entity in the second knowledge graph, and accuracy and reliability of knowledge sources in the second knowledge graph are ensured.

In some implementations, the entire text content of each unstructured data source is divided into multiple sub-texts, and the identified entities are extracted from each sub-text separately. In this case, the data source identifier in the application can be the text identifier of each sub-text: the text identifier of a sub-text is added to every identified entity extracted from that sub-text, so as to accurately represent the specific source information of the identified entity.
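The sub-text tagging described above can be sketched as follows. This is a minimal illustration under stated assumptions: sub-texts are taken to be sentences, the text identifier format `doc1#1` is invented, and the dictionary lookup in `extract_entities` is only a stand-in for the natural language analysis, which the application does not specify. All function names are hypothetical.

```python
import re

def split_subtexts(doc_id, text):
    """Split a document into sentence-level sub-texts, each with a text identifier."""
    parts = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    return [(f"{doc_id}#{i}", part) for i, part in enumerate(parts, start=1)]

def extract_entities(subtext, lexicon):
    """Stand-in for natural language analysis: look up known terms in the sub-text."""
    return [(etype, term) for term, etype in lexicon.items() if term in subtext]

def tag_entities(doc_id, text, lexicon):
    """Attach the text identifier of the originating sub-text to each identified entity."""
    tagged = []
    for text_id, subtext in split_subtexts(doc_id, text):
        for etype, term in extract_entities(subtext, lexicon):
            tagged.append({"type": etype, "value": term, "source": text_id})
    return tagged

# Hypothetical lexicon and text, loosely modelled on the example entities above.
lexicon = {"name1": "adverse reaction", "antiviral": "method"}
text = "Stop use when the reaction name1 appears. Give antiviral treatment."
entities = tag_entities("doc1", text, lexicon)
```

Because the identifier is attached at extraction time, every entity carries its source before any relationships are drawn between entities.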

It can be understood that S110 and S120 in the present application construct the knowledge graphs corresponding to the structured data sources and the unstructured data sources respectively, and the two operations do not interfere with each other. Therefore, S110 and S120 may be performed simultaneously, and no execution order is required between them.

And S130, fusing the first knowledge graph and the second knowledge graph to obtain a corresponding target knowledge graph.

After the first knowledge graph corresponding to the structured data sources and the second knowledge graph corresponding to the unstructured data sources are obtained, the two graphs may contain the same entities, because the same medical knowledge can be recorded in both structured and unstructured data sources. Therefore, in order to ensure the comprehensiveness of the knowledge graph, the first knowledge graph and the second knowledge graph can be fused by analyzing the entities they have in common, so as to obtain the corresponding target knowledge graph. The target knowledge graph can comprehensively represent the association relationships among all knowledge entities while still supporting the traceability of the knowledge in it, which ensures the integrity of the target knowledge graph.
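The fusion step can be sketched as follows. This is a minimal illustration that assumes two entities are "the same" when their keys are exactly equal; the application does not state how sameness is determined, and the graph layout, function name, relation names and sample values here are all hypothetical.

```python
def fuse_graphs(first, second):
    """Merge two graphs; entities with the same key become one node keeping both sources."""
    nodes = {key: set(sources) for key, sources in first["nodes"].items()}
    for key, sources in second["nodes"].items():
        nodes.setdefault(key, set()).update(sources)
    edges = set(first["edges"]) | set(second["edges"])
    return {"nodes": nodes, "edges": edges}

# Hypothetical graphs: node -> set of source markers, edges as (head, relation, tail).
first = {
    "nodes": {"Name1": {"Arug_info.Drug_name"}, "Real1": {"Arug_info.Drug_rea"}},
    "edges": {("Name1", "Drug_rea", "Real1")},
}
second = {
    "nodes": {"Real1": {"doc1#1"}, "antiviral": {"doc1#1"}},
    "edges": {("Real1", "treated_by", "antiviral")},
}
target = fuse_graphs(first, second)
```

In this sketch the shared entity `Real1` keeps the source markers from both graphs, so the fused node stays traceable to the structured and the unstructured source alike.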

According to the technical scheme provided by the embodiment of the application, the first knowledge graph is constructed by analyzing the element traceability representation information in each structured data source, which ensures the traceability of the knowledge in the first knowledge graph. The second knowledge graph is constructed according to the identified entities and the data source identifiers in each unstructured data source, which ensures the traceability of the knowledge in the second knowledge graph. The first knowledge graph and the second knowledge graph are then fused to obtain a corresponding target knowledge graph. The target knowledge graph is thus constructed comprehensively while the knowledge in it remains traceable, so that it can comprehensively represent the association relationships among all knowledge entities, which improves its reliability and integrity.

As an optional implementation in the application, considering that a structured data source generally stores medical data related to various medical knowledge in data tables, information such as the header, primary key and foreign key in the structured data source can describe the specific attribute information of each element in the structured data source. Therefore, in order to ensure the accuracy of the first knowledge graph, the specific construction process of the first knowledge graph corresponding to the structured data source is explained below.

Fig. 2 is a flowchart of a method for a specific construction process of a first knowledge graph corresponding to a structured data source according to an embodiment of the present application. As shown in fig. 2, the method may include the steps of:

S210, performing traceability representation on each element in each structured data source according to the structured header in each structured data source and a preset element traceability representation format to obtain corresponding element traceability representation information.

For each structured data source, attribute information for each column of elements in the structured data source may be determined by analyzing structured header information in the structured data source. Moreover, each element in the same row in the structured data source can represent numerical information of the same medical object under different attributes, so that a certain association relationship between each element in the same row can be determined.

Because each element in the structured data source can represent a medical object related to medical knowledge, each element in the structured data source can be used as an entity when the first knowledge graph is constructed, and the first knowledge graph is formed by judging the association relationship between every two entities to connect the corresponding entities.

Then, to achieve traceability of the knowledge represented by each entity within the first knowledge graph, a specific source of each element in the structured data source needs to be analyzed. Therefore, the application can preset an element traceability representation format according to the structured header in the structured data source. The element traceability representation format may include the information of the structured data source where each element is located, the specific location attribute information in the structured data source, and other element information related to the element, so as to represent the specific source of medical knowledge corresponding to the element.

Illustratively, the element traceability representation format of each element in the present application may be (table name, column name, element value, [(primary key column name, primary key value), (foreign key 1 column name, foreign key 1 value), …, (foreign key n column name, foreign key n value)]).

Specific information such as the table name, column name, primary key column name and foreign key column names of each element in each structured data source can be determined through the structured header in that structured data source. Then, according to the preset element traceability representation format, each element in the structured data source can be converted into a piece of record data that represents the source of the element and the association relationships between the element and other elements, and this record data is used as the element traceability representation information corresponding to the element. In addition, the application can define a traceability label for each piece of element traceability representation information, so that the traceability label can be added to the entity represented by the corresponding element in the first knowledge graph to clearly represent the source information of the entity.

Taking a structured data source represented by the following three data tables as an example, the element traceability representation information in the structured data source is described:

Table 1 Basic information of patient visits (Basic)

Visit identifier (Visit_ID)	Sex (Sex)	Marital status (Marry)
01123	1	1
02123	1	1
06456	2	0

Table 2 Medication for the visit (Drug)

Number (ID)	Visit identifier (Visit_ID)	Drug identifier (Drug_ID)
001	01123	7683
002	01123	9578
003	06456	4596
004	06456	7683

Table 3 Drug usage information (Arug_info)

When constructing the first knowledge graph for the visit identifier Visit_ID: 01123, the element traceability representation information corresponding to each element related to Visit_ID: 01123 in the three data tables is determined, and may be as follows:

Table 1:

①(Basic,Visit_ID,01123,[(Visit_ID,01123),(Sex,1),(Marry,1)])

②(Basic,Sex,1,[(Visit_ID,01123),(Sex,1),(Marry,1)])

③(Basic,Marry,1,[(Visit_ID,01123),(Sex,1),(Marry,1)])

Table 2:

④(Drug,ID,001,[(ID,001),(Visit_ID,01123),(Drug_ID,7683)])

⑤(Drug,Visit_ID,01123,[(ID,001),(Visit_ID,01123),(Drug_ID,7683)])

⑥(Drug,Drug_ID,7683,[(ID,001),(Visit_ID,01123),(Drug_ID,7683)])

⑦(Drug,ID,002,[(ID,002),(Visit_ID,01123),(Drug_ID,9578)])

⑧(Drug,Visit_ID,01123,[(ID,002),(Visit_ID,01123),(Drug_ID,9578)])

⑨(Drug,Drug_ID,9578,[(ID,002),(Visit_ID,01123),(Drug_ID,9578)])

Table 3:

⑩(Arug_info,Drug_ID,7683,[(Drug_ID,7683),(Drug_name,Name1),(Drug_fre,bid),(Drug_rea,Real1)])

⑪(Arug_info,Drug_name,Name1,[(Drug_ID,7683),(Drug_name,Name1),(Drug_fre,bid),(Drug_rea,Real1)])

⑫(Arug_info,Drug_fre,bid,[(Drug_ID,7683),(Drug_name,Name1),(Drug_fre,bid),(Drug_rea,Real1)])

⑬(Arug_info,Drug_rea,Real1,[(Drug_ID,7683),(Drug_name,Name1),(Drug_fre,bid),(Drug_rea,Real1)])

⑭(Arug_info,Drug_ID,9578,[(Drug_ID,9578),(Drug_name,Name2),(Drug_fre,bid),(Drug_rea,Real2)])

⑮(Arug_info,Drug_name,Name2,[(Drug_ID,9578),(Drug_name,Name2),(Drug_fre,bid),(Drug_rea,Real2)])

⑯(Arug_info,Drug_fre,bid,[(Drug_ID,9578),(Drug_name,Name2),(Drug_fre,bid),(Drug_rea,Real2)])

⑰(Arug_info,Drug_rea,Real2,[(Drug_ID,9578),(Drug_name,Name2),(Drug_fre,bid),(Drug_rea,Real2)])

The above is the element traceability representation information corresponding to each element related to Visit_ID: 01123. The markers ① through ⑰ are the traceability labels of the respective pieces of element traceability representation information. When the first knowledge graph is constructed, each label is added to the entity represented by the corresponding element, so that the knowledge corresponding to each entity in the first knowledge graph is traceable.
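The records listed for Table 1 can be reproduced with a short sketch. This is only an illustration: the function name `trace_records` is hypothetical, and it assumes that the column list is ordered with the primary key first and that every column of the row is listed in the key list, as in the Table 1 records above.

```python
def trace_records(table_name, columns, row):
    """Return one traceability record per element of `row`.

    Each record is (table name, column name, element value, [(column, value), ...]).
    Assumption: `columns` lists the primary key column first.
    """
    keys = [(col, row[col]) for col in columns]
    return [(table_name, col, row[col], keys) for col in columns]

# First row of the Basic table from the example.
basic_columns = ["Visit_ID", "Sex", "Marry"]
row = {"Visit_ID": "01123", "Sex": "1", "Marry": "1"}
records = trace_records("Basic", basic_columns, row)

# Number the records so each number can serve as a traceability label.
labels = {i: rec for i, rec in enumerate(records, start=1)}
```

Values are kept as strings so that identifiers with leading zeros, such as 01123, are preserved exactly.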

S220, constructing a corresponding first knowledge graph according to the primary key and the foreign key in the element traceability representation information.

By analyzing the primary key and the foreign key in the traceability representation information of each element, a certain association relationship between the primary key element and each foreign key element in the same row in each structured data source can be determined. Moreover, according to the structured header of the primary key and the foreign key, the actual association relationship between the primary key element and each foreign key element can be determined.

And then, taking each element as an entity, and adding the traceability tag of the element traceability representation information corresponding to the element to the entity. And connecting the entities according to the association relation among the elements, so as to obtain a corresponding first knowledge graph. Therefore, the corresponding element traceability representation information can be clearly checked through the traceability label on each entity in the first knowledge graph, so that the traceability of the knowledge represented by each entity is realized.
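Looking up the source of an entity through its traceability labels can be sketched as follows. This is a minimal illustration: the store layout and the function name `trace_sources` are assumptions, as is the choice to let one entity hold several labels when the same element value appears in more than one record.

```python
# Trace store: traceability label -> element traceability representation information.
# Labels 1, 5 and 8 mirror the three Visit_ID records of the worked example.
TRACE_STORE = {
    1: ("Basic", "Visit_ID", "01123",
        [("Visit_ID", "01123"), ("Sex", "1"), ("Marry", "1")]),
    5: ("Drug", "Visit_ID", "01123",
        [("ID", "001"), ("Visit_ID", "01123"), ("Drug_ID", "7683")]),
    8: ("Drug", "Visit_ID", "01123",
        [("ID", "002"), ("Visit_ID", "01123"), ("Drug_ID", "9578")]),
}

# Hypothetical entity store: each entity keeps only its traceability labels.
ENTITIES = {("Visit_ID", "01123"): {"trace_tags": [1, 5, 8]}}

def trace_sources(entity_key):
    """Resolve an entity's labels to (table, column, primary key pair) locations."""
    sources = []
    for tag in ENTITIES[entity_key]["trace_tags"]:
        table, column, _value, keys = TRACE_STORE[tag]
        sources.append((table, column, keys[0]))
    return sources

sources = trace_sources(("Visit_ID", "01123"))
```

Storing only the label on the entity keeps the graph small while the full record remains available on demand.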

In some implementations, constructing the first knowledge graph according to the primary key and the foreign key in the element traceability representation information may specifically be: constructing a first knowledge subgraph under each primary key according to the associated elements under that primary key in the element traceability representation information; and fusing the first knowledge subgraphs under the primary keys according to the foreign keys in the element traceability representation information to obtain the corresponding first knowledge graph.

Because each structured data source has only one primary key, the primary key element in a row is associated with each foreign key element in that row. Therefore, by analyzing the primary key in each piece of element traceability representation information, the elements whose element traceability representation information has the same primary key value can be divided into the same category. The element under the primary key and the other elements under each foreign key in the same category are then used as the associated elements under the same primary key in the application. Next, among the associated elements under the same primary key, the element value under the primary key is connected with the element value under each foreign key, and the association relationship between them is determined according to the primary key name and the foreign key name, so as to construct a first knowledge subgraph under that primary key.

Since there is only one primary key in each structured data source, then in the manner described above, one first knowledge sub-graph can be constructed for each primary key value within each structured data source.
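Grouping records by primary key value to form one subgraph per value can be sketched as follows. This is a minimal illustration: the function name `build_subgraphs` is hypothetical, the first pair of each key list is assumed to be the primary key, and using the column name as the relationship label is a simplification of determining the relationship from the primary key name and the foreign key name.

```python
from collections import defaultdict

def build_subgraphs(records):
    """Group trace records by primary key and link the key value to each other element.

    `records` are (table, column, value, keys) tuples; keys[0] is assumed to be
    the (primary key column, primary key value) pair.
    """
    subgraphs = defaultdict(set)
    for table, column, value, keys in records:
        pk_column, pk_value = keys[0]
        if column == pk_column:
            continue  # the primary key element is the hub of its subgraph
        subgraphs[(table, pk_column, pk_value)].add((pk_value, column, value))
    return dict(subgraphs)

# The Table 2 records related to Visit_ID 01123.
keys_001 = [("ID", "001"), ("Visit_ID", "01123"), ("Drug_ID", "7683")]
keys_002 = [("ID", "002"), ("Visit_ID", "01123"), ("Drug_ID", "9578")]
records = [
    ("Drug", "ID", "001", keys_001),
    ("Drug", "Visit_ID", "01123", keys_001),
    ("Drug", "Drug_ID", "7683", keys_001),
    ("Drug", "ID", "002", keys_002),
    ("Drug", "Visit_ID", "01123", keys_002),
    ("Drug", "Drug_ID", "9578", keys_002),
]
subgraphs = build_subgraphs(records)
```

With these inputs the sketch yields two subgraphs, one for ID 001 and one for ID 002.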

Taking the three data tables above as an example, when constructing the first knowledge graph for the visit identifier Visit_ID: 01123, the elements whose element traceability representation information shares the same primary key value in each data table are connected to construct a first knowledge subgraph for that data table.

Wherein fig. 3a may represent a first knowledge sub-graph constructed by table 1 above when the Visit identifier is visitid 01123. Fig. 3b and 3c may represent a first knowledge sub-graph constructed by table 2 above when the Visit is identified as visitid 01123. Fig. 3d and 3e may represent a first knowledge sub-graph constructed by table 3 above when the Visit is identified as visitid 01123. The square boxes in the knowledge subgraph can represent the primary key names corresponding to each data table, and the circular boxes can represent the entities represented by the elements in each data table.

Then, since there may be a coincidence of the primary and foreign keys in different structured data sources, it is stated that there may be the same entity in the first knowledge sub-graph under each primary key, e.g., visit identifier Visit_ID 01123 belongs to the primary key in Table 1 and the foreign key in Table 2.

Therefore, according to the foreign keys in the traceability representation information of each element, whether the entities represented by the same element exist in the first knowledge subgraph under each main key can be analyzed. And then, fusing the first knowledge subgraphs under each main key according to the entities represented by the same elements in each first knowledge subgraph, and obtaining the corresponding first knowledge graph.
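The fusion of subgraphs at shared entities might be sketched as below. The sketch assumes that every node is identified by a (key name, value) tuple, so that the same element appearing as a primary key in one table and as a foreign key in another maps to the same node. The helper names and the `Order_ID`, `O77`, `Patient_ID` and `P001` values are hypothetical.

```python
def nodes_of(triples):
    """Collect every head and tail node that appears in a set of triples."""
    return {node for head, _, tail in triples for node in (head, tail)}

def fuse_subgraphs(subgraphs):
    """Union the triple sets; identical (key name, value) nodes become one node."""
    fused = set()
    for triples in subgraphs:
        fused |= triples
    return fused

# Nodes are (key name, value) tuples; everything except Visit_ID 01123 is invented.
visit = ("Visit_ID", "01123")
table1_subgraph = {(visit, "Visit_ID->Patient_ID", ("Patient_ID", "P001"))}
table2_subgraph = {(("Order_ID", "O77"), "Order_ID->Visit_ID", visit)}

# The shared entity is the point at which the two subgraphs are joined.
shared = nodes_of(table1_subgraph) & nodes_of(table2_subgraph)
first_graph = fuse_subgraphs([table1_subgraph, table2_subgraph])
```

Because the visit node is the same object in both subgraphs, the union yields a single connected graph with three nodes and two edges.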

Taking the first knowledge subgraphs corresponding to the three data tables as an example, a first knowledge graph as shown in fig. 4 can be obtained.

According to the technical solution provided by this embodiment of the application, each element in each structured data source is given a traceability representation according to the structured header of that data source and a preset element traceability representation format, yielding the corresponding element traceability representation information. The first knowledge graph is then constructed from this information, which ensures that the knowledge in the first knowledge graph is traceable and improves the accuracy of the first knowledge graph.

As an alternative implementation of the present application, unstructured data sources typically record medical knowledge in text form, so text analysis is needed to identify the entities they contain. To ensure the accuracy of the second knowledge graph, the specific construction process of the second knowledge graph corresponding to the unstructured data sources is described below.

Fig. 5 is a flowchart of a method for a specific construction process of a second knowledge-graph corresponding to an unstructured data source according to an embodiment of the present application. As shown in fig. 5, the method may include the steps of:

S510, inputting each unstructured data source into a pre-trained named entity recognition model to obtain recognized entities in the unstructured data sources, and adding the data source identification of the unstructured data sources to the recognized entities.

In order to ensure accurate identification of entities in unstructured data sources, a named entity identification model can be trained in advance. The named entity recognition model can recognize entities corresponding to various medical knowledge in the text content by analyzing various text content in the unstructured data sources.

The named entity recognition model may be composed of a language characterization model (namely a BERT model), a Long Short-Term Memory (LSTM) model, and a Conditional Random Field (CRF) model.

In the application, for each unstructured data source, the unstructured data source can be input into a trained named entity recognition model, and specific text content in the unstructured data source is subjected to corresponding word segmentation processing, semantic analysis and the like through the named entity recognition model, so that various description objects related to medical knowledge can be extracted and used as recognized entities in the unstructured data source.

In addition, in order to ensure the traceability of the knowledge represented by each entity in the second knowledge graph, the application can determine the data source identification of each unstructured data source, wherein the data source identification can be the text identification of each sub-text after the unstructured data source is divided.

The source information for each identified entity is then represented by analyzing which unstructured data sources each identified entity originated from and adding the data source identification of the unstructured data sources to the identified entity.

S520, the associated entity pairs in the unstructured data sources are input into a pre-trained entity relationship recognition model, and a corresponding entity relationship set is obtained.

By analyzing the types of the identified entities, the identified entities extracted from each unstructured data source can be combined in pairs, so that a plurality of associated entity pairs are obtained. For example, the identified entities {adverse reaction: name1} and {method: antiviral}, extracted from the text "when reaction name1 appears, please stop the use and give antiviral treatment", may constitute an associated entity pair.
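The tagging of entities with a data source identifier and the pairwise combination can be sketched as follows. This is only an illustration: the `Entity` structure, the identifier format `doc1#s3`, the entity `rest`, and the rule of pairing only entities of different types from the same sub-text are assumptions not stated in the application.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Entity:
    entity_type: str
    text: str
    source_id: str  # data source identifier kept for traceability

def candidate_pairs(entities):
    """Pair up entities that come from the same source and differ in type."""
    return [
        (a, b)
        for a, b in combinations(entities, 2)
        if a.source_id == b.source_id and a.entity_type != b.entity_type
    ]

# The source identifiers and the third entity are invented for this sketch.
entities = [
    Entity("adverse reaction", "name1", "doc1#s3"),
    Entity("method", "antiviral", "doc1#s3"),
    Entity("method", "rest", "doc2#s1"),
]
pairs = candidate_pairs(entities)
```

Here only the two entities taken from the same sentence form an associated entity pair, and each keeps its `source_id` so that its origin remains traceable.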

Then, in order to ensure accurate analysis of entity relationships, the application can pre-train an entity relationship recognition model. The entity relationship identification model can accurately identify whether an association relationship exists between any two entities. The entity relationship recognition model in the application can be a Bi-directional Long Short-Term Memory (BiLSTM) model.

Therefore, each associated entity pair in the unstructured data source can be input into a trained entity relationship identification model, and whether the associated relationship exists between the two entities in the associated entity pair or not is analyzed through the entity relationship identification model, so that the associated relationship of each associated entity pair is output, and a corresponding entity relationship set is obtained.

If no association exists between two entities in a certain association entity pair, the association relationship outputted by the entity relationship identification model for the association entity pair is null. If there is an association relationship between two entities in a certain association entity pair, the entity relationship recognition model may output a specific association relationship name for the association entity pair.

S530, constructing a corresponding second knowledge graph according to the identified entity and the entity relationship set.

According to the entity relationship of each associated entity pair in the entity relationship set, the identified entity with the associated relationship can be determined. And then, connecting the identified entities with the association relationship to obtain a corresponding second knowledge graph. Each identified entity in the second knowledge graph is added with a corresponding data source identifier so as to realize the traceability of the knowledge of the second knowledge graph.
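A minimal sketch of this construction step is given below. It assumes that the entity relationship set is a list of (head, tail, relation) records, that a null relationship is represented by `None`, and that each entity is a (text, data source identifier) tuple. The relation name `treated by` and the identifiers are hypothetical.

```python
def build_second_graph(relation_set):
    """Keep only the entity pairs whose predicted relation is not null."""
    edges = []
    for head, tail, relation in relation_set:
        if relation is None:
            # A null output means the model found no association for this pair.
            continue
        edges.append((head, relation, tail))
    return edges

# Each entity is (text, data source identifier); all values are illustrative.
name1 = ("name1", "doc1#s3")
antiviral = ("antiviral", "doc1#s3")
rest = ("rest", "doc2#s1")

relation_set = [
    (name1, antiviral, "treated by"),
    (name1, rest, None),
]
second_graph = build_second_graph(relation_set)
```

Since every node carries its data source identifier, each edge in `second_graph` can be traced back to the sub-text from which its entities were extracted.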

According to the technical scheme provided by the embodiment of the application, the second knowledge graph is constructed according to the identified entity in each unstructured data source and the data source identification, so that the traceability of the knowledge of the second knowledge graph is ensured, and the association relation among the knowledge entities can be comprehensively represented by the second knowledge graph.

According to one or more embodiments of the present application, association relationships may exist between different elements in different structured data sources without being expressed by the primary keys and foreign keys of those data sources, so the association relationships between entities in the initially constructed first knowledge graph are incomplete. Therefore, to ensure the integrity of the knowledge graph, the entity relationships between entities in the first knowledge graph also need to be completed before the first knowledge graph and the second knowledge graph are fused to obtain the corresponding target knowledge graph.

As shown in fig. 6, the present application may explain the specific process of entity relationship completion of the first knowledge-graph in detail. The entity relationship completion process of the first knowledge graph may include the steps of:

S610, determining unconnected entity pairs in the first knowledge graph.

Because an association relationship has already been determined between any two connected entities in the first knowledge graph, the connections in the first knowledge graph can be analyzed to find pairs of entities that are not connected. Each such pair forms an unconnected entity pair, for which the existence of an association relationship is analyzed subsequently.

S620, determining entity relationships between the unconnected entity pairs according to the entity relationship set, so as to complete the entity relationships of the first knowledge graph.
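Finding the unconnected entity pairs can be sketched as below, assuming the first knowledge graph is available as a set of entity names and a list of (head, relation, tail) edges. The function name and the sample graph are illustrative only.

```python
from itertools import combinations

def unconnected_pairs(entities, edges):
    """Return every pair of entities that has no edge in either direction."""
    connected = {frozenset((head, tail)) for head, _, tail in edges}
    return [
        (a, b)
        for a, b in combinations(sorted(entities), 2)
        if frozenset((a, b)) not in connected
    ]

# A toy graph with three entities and one existing edge.
entities = {"drug 9578", "nausea", "Visit_ID 01123"}
edges = [("drug 9578", "adverse effect", "nausea")]
pairs = unconnected_pairs(entities, edges)
```

Enumerating all pairs is quadratic in the number of entities; a real system would probably restrict the candidates, for example by entity type, but that choice is outside what the text specifies.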

When the unstructured data sources are used for constructing the second knowledge graph, the entity relations among the identified entities in the unstructured data sources are analyzed, so that a corresponding entity relation set is obtained. The entity relation set comprehensively comprises entity relations among various medical knowledge.

Thus, by analyzing the two identified entities within the set of entity relationships with which each entity relationship is associated in the unstructured data sources, the type of entity pair for which each entity relationship is oriented can be determined. Then, for each unconnected entity pair, the similarity between two entities in the unconnected entity pair and the entity pair type facing each entity relationship in the entity relationship set can be judged, so that the possibility that each entity relationship in the entity relationship set is used as the association relationship of the unconnected entity pair is analyzed.

If every entity relationship in the entity relationship set has a low likelihood of being the association relationship of an unconnected entity pair, this indicates that there may be no association relationship between the two entities in that pair. If some entity relationship in the set has a high likelihood of being the association relationship of the unconnected entity pair, that entity relationship can be used as the association relationship between the two entities. In this way, it can be determined whether an association relationship exists between the two entities in each unconnected entity pair. The two entities of each unconnected entity pair found to have an association relationship are then connected, which completes the entity relationships in the first knowledge graph and ensures its integrity.

As an alternative implementation in the present application, in order to accurately analyze the similarity between the two entities in an unconnected entity pair and the two entities in the second knowledge graph targeted by each entity relationship in the entity relationship set, the application may complete the entity relationships of the unconnected entity pairs in the first knowledge graph according to the entity relationship set through the following steps:

First, determine entity representation vectors in the first knowledge graph according to the entity semantic text of the connected entity pairs in the first knowledge graph.

Analyzing the similarity between an unconnected entity pair and the two entities targeted by each entity relationship in the entity relationship set is equivalent to analyzing the similarity of two texts composed of different entities, and text similarity can typically be analyzed using text representation vectors.

Therefore, in order to accurately analyze whether an association relationship exists between each unconnected entity pair in the first knowledge graph, the application can determine the entity semantic text of each connected entity pair from the actual semantics of the two entities in that pair and the entity relationship between them. Natural language analysis is then performed on the entity semantic text of each connected entity pair to obtain a representation vector for each token in the text. Because the two entities of a connected entity pair appear as two tokens in its entity semantic text, the two entity representation vectors of the pair can be determined from the token representation vectors.

It will be appreciated that the entities in the connected entity pairs of the first knowledge graph may together cover every entity in the first knowledge graph. Therefore, once the two entity representation vectors of each connected entity pair have been determined in the above manner, the entity representation vector of each entity in the first knowledge graph can be obtained.

In some implementations, the entity representation vectors in the first knowledge graph may be determined as follows: constructing a corresponding entity semantic text according to the connected entity pairs in the first knowledge graph; inputting the entity semantic text into a pre-trained language characterization model to obtain a representation vector of each character in the entity semantic text; and determining the entity representation vectors in the first knowledge graph according to the representation vectors of the associated characters of each entity in the first knowledge graph.

That is, each connected entity pair in the first knowledge graph can be determined from the connections between the entities in the first knowledge graph, and the two entities in each connected entity pair have a certain association. Therefore, for each connected entity pair in the first knowledge graph, the entity semantic text of that pair can be constructed, following ordinary semantic description habits, from the specific semantics of its two entities and the entity relationship between them.

Illustratively, the entity semantic text for each connected entity pair may be constructed in the form of: the "entity relationship" of one entity in the connected entity pair as the "subject" is the other entity in the connected entity pair as the "object". For example, a connected entity pair includes an entity "drug 9578" and an entity "nausea", where the entity relationship is "adverse effect", and then the entity semantic text of the connected entity pair may be "adverse effect of drug 9578 is nausea".
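The sentence template above can be sketched as a small helper. The function name is hypothetical, and the English word order simply mirrors the translated example; the template used in the application itself may differ.

```python
def entity_semantic_text(subject, relation, obj):
    """Render a connected entity pair as '<relation> of <subject> is <object>'."""
    return f"{relation} of {subject} is {obj}"

# The connected entity pair from the example in the text.
text = entity_semantic_text("drug 9578", "adverse effect", "nausea")
```

The resulting string is what would be passed to the language characterization model in the next step.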

After determining the entity semantic text of each connected entity pair, in order to ensure the vector representation accuracy of the entity semantic text, the application can pre-train a language characterization model, namely a BERT model. The language characterization model may bi-directionally encode any semantic text to represent each character therein as a corresponding vector.

Therefore, the application can sequentially input the entity semantic text of each connected entity pair into the trained language representation model, and bi-directionally encode each character in the entity semantic text through the language representation model, thereby obtaining the representation vector of each character in the entity semantic text.

Since each entity in the first knowledge-graph is composed of a plurality of characters, each entity may appear in the entity semantic text of each connected entity pair. Therefore, after determining the representation vector of each character in the entity semantic text of each connected entity pair, each associated character included in each entity in the first knowledge-graph can be determined, and the representation vector of each associated character can be found.

The entity representation vector for each entity can then be determined by averaging the representation vectors of the respective associated characters within each entity.

For example, assuming that an entity includes n associated characters, the entity representation vector for that entity may be

$$a = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $x_i$ represents the representation vector of the i-th associated character in the entity.
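The averaging of character vectors into one entity representation vector can be sketched in plain Python as follows. The two-dimensional vectors are toy values chosen for illustration; real character vectors from a language characterization model would have far more dimensions.

```python
def entity_vector(char_vectors):
    """Average the per-character vectors element-wise."""
    n = len(char_vectors)
    if n == 0:
        raise ValueError("an entity needs at least one character vector")
    dim = len(char_vectors[0])
    return [sum(vec[d] for vec in char_vectors) / n for d in range(dim)]

# Three toy character vectors for an entity made of n = 3 characters.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = entity_vector(x)
```

For these inputs the result is `[3.0, 4.0]`, the element-wise mean of the three vectors.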

Second, determine the entity relationship vectors of the unconnected entity pairs in the first knowledge graph according to the entity representation vectors.

After determining the entity representation vector of each entity in the first knowledge-graph, each unconnected entity pair in the first knowledge-graph may be determined. Then, by analyzing the difference vector between the entity representation vectors of the two entities in each unconnected entity pair, the entity relationship vector of the unconnected entity pair can be obtained. According to the mode, the entity relation vector of each unconnected entity pair in the first knowledge graph can be determined.

For example, suppose an unconnected entity pair consists of the i-th entity and the j-th entity in the first knowledge graph, where the entity representation vector of the i-th entity is $a_i$ and that of the j-th entity is $a_j$. The entity relationship vector of the unconnected entity pair may then be $r = a_i - a_j$.
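The difference vector of an unconnected entity pair can be sketched as below. The function name and the toy vectors are illustrative, and the length check is an added safeguard rather than something the text requires.

```python
def relation_vector(a_i, a_j):
    """Return the element-wise difference a_i - a_j of two entity vectors."""
    if len(a_i) != len(a_j):
        raise ValueError("entity vectors must have the same dimension")
    return [p - q for p, q in zip(a_i, a_j)]

# Toy entity representation vectors for the i-th and j-th entities.
r = relation_vector([3.0, 4.0], [1.0, 1.5])
```

Note that the difference is ordered: swapping the two entities negates the vector, so the pair direction has to be kept consistent with the direction used for the reference relationship vectors.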

Third, determine the entity relationships between the unconnected entity pairs according to the similarity between the entity relationship vector of each unconnected entity pair and the reference relationship vector of each entity relationship in the entity relationship set, so as to complete the entity relationships of the first knowledge graph.

In order to accurately analyze whether an association relationship exists between two entities in each unconnected entity pair in the first knowledge graph, the method can firstly acquire the entity relationship set obtained when the second knowledge graph is constructed through the unstructured data source.

Moreover, when the second knowledge graph is constructed through the unstructured data sources, the entity representation vector of each identified entity in the unstructured data sources can be determined by determining the representation vector of each character in the unstructured data sources through the language representation model. Then, for each entity relationship in the entity relationship set, the entity relationship may be an association relationship between some two identified entities connected in the second knowledge-graph. Then, by analyzing the difference vector between the entity representation vectors of the two identified entities for which each entity relationship is oriented in the second knowledge-graph, a reference relationship vector for the entity relationship may be determined.

Then, for the entity relationship vector of each unconnected entity pair in the first knowledge graph, the similarity between the entity relationship vector of the unconnected entity pair and the reference relationship vector of each entity relationship in the entity relationship set can be calculated, and the maximum similarity of the unconnected entity pair is determined.

In addition, in order to ensure accurate analysis of entity relationships in the first knowledge graph, a similarity threshold may be predefined. Whether an association relationship exists between the two entities in each unconnected entity pair is judged by comparing the maximum similarity of that unconnected entity pair with the similarity threshold. Illustratively, the similarity threshold may be defined as 0.8.

If the maximum similarity of an unconnected entity pair in the first knowledge graph is greater than or equal to the similarity threshold, this indicates that an association relationship exists between the two entities in that pair, and the entity relationship corresponding to the maximum similarity in the entity relationship set is determined as the association relationship between them. The two entities in the unconnected entity pair are therefore connected in the first knowledge graph and the corresponding entity relationship is set, so that entity relationship completion in the first knowledge graph is realized.

If the maximum similarity of an unconnected entity pair in the first knowledge graph is smaller than the similarity threshold, this indicates that no association relationship exists between the two entities in that pair, and the two entities are not connected in the first knowledge graph.

According to the technical solution provided by this embodiment of the application, the entity relationships in the first knowledge graph are completed using the entity relationship set obtained from the unstructured data sources, so that the first knowledge graph can comprehensively represent the association relationships among the knowledge entities, and the reliability and integrity of the target knowledge graph are improved.
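The threshold decision can be sketched as follows. The text does not name a similarity measure, so cosine similarity is an assumption here, as are the function names, the relation names and the toy reference vectors. Only the threshold value 0.8 is taken from the example above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

def complete_relation(pair_vector, reference_vectors, threshold=0.8):
    """Return the best-matching relation name, or None if below the threshold."""
    best_name, best_sim = None, -1.0
    for name, ref in reference_vectors.items():
        sim = cosine(pair_vector, ref)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None

# Hypothetical reference relationship vectors, one per entity relationship.
reference_vectors = {"adverse effect": [1.0, 0.0], "treated by": [0.0, 1.0]}
close = complete_relation([0.9, 0.1], reference_vectors)
far = complete_relation([0.7, 0.7], reference_vectors)
```

In this toy setup the first pair is close enough to one reference vector to be connected with that relationship, while the second pair lies between both references, falls below the threshold, and stays unconnected.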

Fig. 7 is a schematic block diagram of a knowledge graph construction apparatus according to an embodiment of the present application. As shown in fig. 7, the apparatus 700 may include:

the first knowledge graph construction module 710 is configured to construct a corresponding first knowledge graph according to the element traceability representation information in each structured data source;

a second knowledge graph construction module 720, configured to construct a corresponding second knowledge graph according to the identified entity and the data source identifier in each unstructured data source;

and a knowledge-graph fusion module 730, configured to fuse the first knowledge-graph and the second knowledge-graph to obtain a corresponding target knowledge-graph.

In some implementations, the first knowledge-graph construction module 710 may include:

the element traceability representation unit is used for carrying out traceability representation on each element in each structured data source according to the structured header in each structured data source and a preset element traceability representation format to obtain corresponding element traceability representation information;

and the first knowledge graph construction unit is used for constructing a corresponding first knowledge graph according to the primary key and the foreign key in the element traceability representation information.

In some implementations, the first knowledge graph construction unit may be specifically configured to:

constructing a first knowledge subgraph under each primary key according to the associated element under the primary key in the element traceability representation information;

and fusing the first knowledge subgraphs under each primary key according to the foreign keys in the element traceability representation information to obtain the corresponding first knowledge graph.

In some implementations, the second knowledge-graph construction module 720 may be specifically configured to:

inputting each unstructured data source into a pre-trained named entity recognition model to obtain recognized entities in the unstructured data sources, and adding a data source identifier of the unstructured data sources on the recognized entities;

inputting the associated entity pairs in the unstructured data sources into a pre-trained entity relationship identification model to obtain corresponding entity relationship sets;

and constructing a corresponding second knowledge graph according to the identified entity and the entity relationship set.

In some implementations, the knowledge graph construction apparatus 700 may further include an entity relationship completion module. The entity relationship completion module may be used to:

determining unconnected entity pairs in the first knowledge-graph;

and determining the entity relationship between the unconnected entity pairs according to the entity relationship set so as to complete the entity relationship of the first knowledge graph.

In some implementations, the entity relationship completion module may include:

the entity representation unit is used for determining entity representation vectors in the first knowledge graph according to entity semantic texts of connected entity pairs in the first knowledge graph;

the entity relationship vector determining unit is used for determining an entity relationship vector of the unconnected entity pair in the first knowledge graph according to the entity representation vector;

and the entity relationship completion unit is used for determining the entity relationship among the unconnected entity pairs according to the similarity between the entity relationship vector of the unconnected entity pairs and the reference relationship vector of each entity relationship in the entity relationship set so as to complete the entity relationship of the first knowledge graph.

In some implementations, the entity representation unit may be specifically configured to:

constructing a corresponding entity semantic text according to the connected entity pairs in the first knowledge graph;

inputting the entity semantic text into a pre-trained language characterization model to obtain a representation vector of each character in the entity semantic text;

and determining entity representation vectors in the first knowledge graph according to the representation vectors of the associated characters of each entity in the first knowledge graph.

In the embodiment of the application, the first knowledge graph is constructed by analyzing the element traceability representation information in each structured data source, so that the traceability of the knowledge of the first knowledge graph is ensured. And constructing a second knowledge graph according to the identified entity in each unstructured data source and the data source identification, and ensuring the traceability of the knowledge of the second knowledge graph. And then, fusing the first knowledge graph and the second knowledge graph to obtain a corresponding target knowledge graph, so that the comprehensive construction of the target knowledge graph is realized on the basis of supporting the traceability of knowledge in the target knowledge graph, the association relationship among all knowledge entities can be comprehensively represented by the target knowledge graph, and the reliability and the integrity of the target knowledge graph are improved.

It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 700 shown in fig. 7 may perform any method embodiment of the present application, and the foregoing and other operations and/or functions of each module in the apparatus 700 are respectively for implementing corresponding flows in each method in the embodiment of the present application, which are not described herein for brevity.

The apparatus 700 of the embodiment of the present application is described above in terms of functional modules in conjunction with the accompanying drawings. It should be understood that the functional modules may be implemented in hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, each step of the method embodiments of the present application may be completed by an integrated logic circuit of hardware in a processor and/or by instructions in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor or by a combination of hardware and software modules in the decoding processor. Optionally, the software modules may be located in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory and, in combination with its hardware, performs the steps in the above method embodiments.

As shown in fig. 8, the electronic device 800 may include:

a memory 810 and a processor 820, the memory 810 being for storing a computer program and transmitting the program code to the processor 820. In other words, the processor 820 may call and run a computer program from the memory 810 to implement the methods in embodiments of the present application.

For example, the processor 820 may be configured to perform the above-described method embodiments according to instructions in the computer program.

In some embodiments of the application, the processor 820 may include, but is not limited to:

a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.

In some embodiments of the application, the memory 810 includes, but is not limited to:

volatile memory and/or nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

In some embodiments of the application, the computer program may be partitioned into one or more modules that are stored in the memory 810 and executed by the processor 820 to perform the methods provided by the application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are used to describe the execution of the computer program in the electronic device.

As shown in fig. 8, the electronic device may further include:

a transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.

Processor 820 may control transceiver 830 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 830 may include a transmitter and a receiver. Transceiver 830 may further include antennas, the number of which may be one or more.

It will be appreciated that the various components in the electronic device are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.

The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.

When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the modules is merely a logical function division, and other divisions are possible in actual implementation. For instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or modules, and may be in electrical, mechanical, or other forms.

The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.

The foregoing describes merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A knowledge graph construction method, characterized by comprising the following steps:

2. The method of claim 1, wherein constructing a corresponding first knowledge-graph from the element traceability representation information in each structured data source comprises:

according to the structured header and a preset element traceability representation format in each structured data source, carrying out traceability representation on each element in the structured data source to obtain corresponding element traceability representation information;

and constructing a corresponding first knowledge graph according to the primary key and the foreign key in the element traceability representation information.
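For illustration only, the two steps of claim 2 can be sketched as follows. The claim does not disclose the preset element traceability representation format, so the `TRACE_FORMAT` string below is a hypothetical stand-in, and the `Table` class, the function names, and the sample data are likewise assumptions made for this sketch rather than the applicant's implementation.

```python
from dataclasses import dataclass

# Hypothetical traceability format (not taken from the application):
# "<data source>:<table>:<column>:<primary-key value>".
TRACE_FORMAT = "{source}:{table}:{column}:{pk}"


@dataclass
class Table:
    source: str         # identifier of the structured data source
    name: str           # table name
    header: list        # structured header (column names)
    primary_key: str    # name of the primary-key column
    foreign_keys: dict  # foreign-key column -> referenced table name
    rows: list          # list of rows, each a list of cell values


def trace_elements(table):
    """Map every element (pk value, column) to its traceability string."""
    traced = {}
    for row in table.rows:
        record = dict(zip(table.header, row))
        pk = record[table.primary_key]
        for column in table.header:
            traced[(pk, column)] = TRACE_FORMAT.format(
                source=table.source, table=table.name, column=column, pk=pk
            )
    return traced


def build_first_graph(tables):
    """Rows become nodes keyed by (table, pk); foreign keys become edges."""
    nodes, edges = set(), set()
    for table in tables:
        for row in table.rows:
            record = dict(zip(table.header, row))
            head = (table.name, record[table.primary_key])
            nodes.add(head)
            for column, target_table in table.foreign_keys.items():
                tail = (target_table, record[column])
                nodes.add(tail)
                edges.add((head, column, tail))
    return nodes, edges


# Toy data, invented for the example.
patients = Table("his_db", "patient", ["patient_id", "name"],
                 "patient_id", {}, [["p1", "Alice"]])
visits = Table("his_db", "visit", ["visit_id", "patient_id"],
               "visit_id", {"patient_id": "patient"}, [["v1", "p1"]])

traces = trace_elements(visits)
nodes, edges = build_first_graph([patients, visits])
```

In this sketch every cell keeps a string that points back to its data source, table, column, and row, which is one plausible way to support traceability; the primary-key and foreign-key columns alone decide which nodes and edges are created.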

3. The method according to claim 2, wherein the constructing a corresponding first knowledge graph according to the primary key and the foreign key in the element traceability representation information includes:

4. The method of claim 1, wherein constructing a corresponding second knowledge-graph from the identified entities in each unstructured data source and the data source identification comprises:

5. The method of claim 4, further comprising, prior to fusing the first knowledge-graph and the second knowledge-graph to obtain the corresponding target knowledge-graph:

determining unconnected entity pairs in the first knowledge-graph;

6. The method of claim 5, wherein determining the entity relationship between the unconnected entity pairs to complement the entity relationship for the first knowledge-graph based on the set of entity relationships comprises:

determining entity representation vectors in the first knowledge graph according to entity semantic texts of connected entity pairs in the first knowledge graph;

determining an entity relation vector of the unconnected entity pair in the first knowledge graph according to the entity representation vector;

and determining the entity relationship among the unconnected entity pairs according to the similarity between the entity relationship vector of the unconnected entity pairs and the reference relationship vector of each entity relationship in the entity relationship set, so as to complement the entity relationship of the first knowledge graph.
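As a minimal sketch of the last two steps of claim 6, the code below derives a relation vector for an unconnected entity pair and picks the entity relationship whose reference relation vector is most similar. The claim does not say how the relation vector is computed from the entity representation vectors or which similarity measure is used; the translation-style difference (tail minus head) and cosine similarity are assumptions made here, and all names and vectors are invented toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relation_vector(head_vec, tail_vec):
    # Assumption: the relation vector is the difference tail - head.
    return [t - h for h, t in zip(head_vec, tail_vec)]


def complete_relation(pair, entity_vecs, reference_vecs):
    """Return the relationship whose reference vector is most similar."""
    head, tail = pair
    rel = relation_vector(entity_vecs[head], entity_vecs[tail])
    scored = ((name, cosine(rel, ref)) for name, ref in reference_vecs.items())
    best_name, _ = max(scored, key=lambda item: item[1])
    return best_name


# Toy entity representation vectors and reference relation vectors.
entity_vecs = {"aspirin": [1.0, 0.0], "headache": [1.0, 1.0]}
reference_vecs = {"treats": [0.0, 1.0], "causes": [1.0, 0.0]}

predicted = complete_relation(("aspirin", "headache"), entity_vecs, reference_vecs)
```

A practical variant would likely also require the best similarity to exceed some threshold before adding an edge, so that unrelated entity pairs stay unconnected; the claim itself does not state such a threshold.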

7. The method of claim 6, wherein the determining the entity representation vector in the first knowledge-graph based on the entity semantic text of the connected entity pairs in the first knowledge-graph comprises:

8. A knowledge graph construction device, characterized by comprising:

9. An electronic device, comprising:

a processor and a memory for storing a computer program, the processor being adapted to invoke and run the computer program stored in the memory to perform the knowledge-graph construction method of any of claims 1-7.

10. A computer-readable storage medium storing a computer program for causing a computer to execute the knowledge-graph construction method according to any one of claims 1 to 7.