CN111400503B - Knowledge graph generation method based on multiple indexes

Info

Publication number: CN111400503B
Application number: CN202010126582.3A
Authority: CN
Inventors: 何宇轩; 牟昊; 徐亚波; 李旭日
Original assignee: Guangzhou Datastory Information Technology Co ltd
Current assignee: Guangzhou Datastory Information Technology Co ltd
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2023-09-22
Anticipated expiration: 2040-02-28
Also published as: CN111400503A

Abstract

The invention discloses a knowledge graph generation method based on multiple indexes. When a new entity appears in the input, a processing scheme for the entity is determined from three indexes: edit distance, text sentence vector and sound volume. When a new relation appears in the input, its processing scheme is determined from the co-occurrence sound volume of the corresponding entities and the sound volume of the relation. When a new entity attribute appears in the input, its processing scheme is determined from the edit distance and sound volume of the attribute. When a new relation attribute appears in the input, its processing scheme is determined from the edit distance and sound volume of the relation attribute. The method performs well in specific fields with little data, improves the quality of domain-specific knowledge graphs with little or no manual participation, identifies input information and updates the knowledge graph with high accuracy, and can improve the accuracy of automatic knowledge graph generation.

Description

Knowledge graph generation method based on multiple indexes

Technical Field

The invention relates to the technical field of computer text processing, in particular to a method for generating a knowledge graph.

Background

Knowledge graphs are common tools for visualizing and storing knowledge. Because knowledge is complex and diverse, building a knowledge graph requires a great deal of manual labor. Automatically generated knowledge graphs generally target fields with large amounts of data and perform poorly in specific fields with little data. A method is therefore needed to improve the quality of knowledge graphs in specific fields with little or no human involvement.

Disclosure of Invention

To address the above problems, the invention provides a knowledge graph generation method based on multiple indexes, which comprises the following steps:

S001, defining a database data structure, wherein the defined data structure comprises four types: entities, relations, entity attributes and relation attributes; an entity comprises at least three attributes: a name, an alias and a document ID; a relation is a directed link between two entities, which starts from a starting entity, points to an ending entity, and comprises at least a name attribute; an entity attribute corresponds to a specific entity and is key-value pair information within that entity; a relation attribute corresponds to a specific relation and is key-value pair information within that relation;

S002, inputting information; the information is one or more of an entity, a relationship, an entity attribute and a relationship attribute;

S003, matching the input information item by item; if the matching succeeds, step S007 is executed directly, and if the matching fails, step S004 is executed;

S004, information matching: generating processing schemes according to the data structure type of the information that failed to match;

S005, calculating the confidence of each processing scheme using the multi-index parameters;

S006, selecting a processing scheme for the information that failed to match according to the confidence;

S007, updating the database, that is, the knowledge graph, with the successfully matched input information or the selected processing scheme; whenever information is subsequently input to update the knowledge graph again, the method restarts from step S002.

As a further explanation of the present invention, the entity, relationship, entity attribute and relationship attribute information input in step S002 is obtained by manual labeling or by data model prediction.

Still further, an information filtering step is included between steps S002 and S003, in which input information whose sound volume is smaller than a set sound volume threshold is filtered out.

Further, in the information matching of step S003, different processing schemes are generated for different input information types, and the method for calculating the confidence of the corresponding processing schemes in the subsequent step S005 differs accordingly.

Further, when the input information type is an entity and the information matching fails, the generated processing schemes fall into four types: fusing into a certain database entity, fusing into a certain new entity, adding a new entity, and discarding; the confidence of the two types of fusion schemes is calculated from three indexes, namely edit distance, text sentence vector and sound volume, by the formula: confidence = (sound volume index + edit distance index + sentence vector index)/3.

Further, when the input information type is a relation and the information matching fails, the generated processing schemes fall into two types: adding a new relation, and discarding; the confidence of the new-relation scheme is calculated from two indexes, namely the co-occurrence sound volume of the starting and ending entities and the sound volume of the relation, by the formula: confidence = (co-occurrence sound volume index + sound volume index)/2.

Further, when the input information type is an entity attribute and the information matching fails, the generated processing schemes fall into two types: correcting or adding the attribute, and discarding; the confidence of the correct-or-add scheme is calculated from two indexes, namely the edit distance and the sound volume of the entity attribute, by the formula: confidence = (edit distance index + sound volume index)/2.

Further, when the input information type is a relation attribute and the information matching fails, the generated processing schemes fall into two types: correcting or adding the attribute, and discarding; the confidence of the correct-or-add scheme is calculated from two indexes, namely the edit distance and the sound volume of the relation attribute, by the formula: confidence = (edit distance index + sound volume index)/2.

Still further, the manner of selecting the processing scheme in step S006 includes manual selection and automatic selection by machine.

Still further, automatic selection by machine comprises inputting a confidence threshold; if, among all types of processing schemes, the scheme with the greatest confidence exceeds the confidence threshold, that scheme is executed automatically; otherwise, discarding is selected.

The invention has the beneficial effects that:

The knowledge graph generation method based on multiple indexes performs well in specific fields with little data and improves the quality of domain-specific knowledge graphs with little or no manual participation. It identifies input information and updates the knowledge graph with high accuracy, can improve the accuracy of automatic knowledge graph generation, and reduces the manual workload when manual intervention is needed.

Drawings

FIG. 1 is an overall flow chart of the method of the present invention;

FIG. 2 is an example of a knowledge-graph database structure according to the present invention;

FIG. 3 is a flowchart of a method for generating a knowledge graph entity according to the present invention;

FIG. 4 is a flowchart of a knowledge graph relationship generation method according to the present invention;

FIG. 5 is a flowchart of a method for generating the entity attribute of the knowledge graph of the present invention;

FIG. 6 is a flowchart of a method for generating knowledge graph relationship attributes according to the present invention.

Detailed Description

Specific embodiments of the invention are described in detail below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention.

As shown in the overall flowchart of FIG. 1, the knowledge graph generation method based on multiple indexes comprises the following steps:

S001, defining a database data structure. The database may be empty or non-empty. If the database is not empty, each entity should contain at least three attributes: name, alias and document ID. The name is the most representative name of the entity, the alias is any other name by which the entity is known, and the document ID is a list of the IDs of the documents in which the entity appears. A relationship is a directed link between two entities, starting from a starting entity and pointing to an ending entity. A relationship contains at least a name attribute. FIG. 2 is an example of this data structure.
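The data structure of step S001 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and field names (`Entity`, `Relation`, `KnowledgeGraph`, `aliases`, `doc_ids`) are hypothetical, the entity's second name attribute is modeled as a set of aliases, and the sample values are invented rather than taken from FIG. 2.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                                                 # most representative name
    aliases: set[str] = field(default_factory=set)            # other names of the entity
    doc_ids: set[str] = field(default_factory=set)            # documents mentioning the entity
    attributes: dict[str, str] = field(default_factory=dict)  # entity attributes (key-value pairs)


@dataclass
class Relation:
    name: str                                                 # relation name, e.g. "sibling"
    start: str                                                # name of the starting entity
    end: str                                                  # name of the ending entity
    attributes: dict[str, str] = field(default_factory=dict)  # relation attributes (key-value pairs)


@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)


# Illustrative content only (hypothetical values).
graph = KnowledgeGraph()
graph.entities["Ming"] = Entity("Ming", aliases={"Xiao Ming"}, doc_ids={"doc-1"})
graph.entities["Pink"] = Entity("Pink", doc_ids={"doc-1", "doc-2"})
graph.relations.append(Relation("sibling", start="Ming", end="Pink"))
```

Storing the start and end of a relation as entity names keeps the link directed, and an empty `KnowledgeGraph()` corresponds to the empty-database case.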

S002, inputting entity, relationship, entity attribute and relationship attribute information.

Wherein a relationship must correspond to a particular starting entity and a particular ending entity, an entity attribute must correspond to a particular entity, and a relationship attribute must correspond to a particular relationship.

The information may come from manual curation or from the recognition results of an algorithm.

Wherein the input information should contain a list of IDs of the documents in which the information is located.

Table 1 is a specific example of input of entities, relationships, entity attributes, and relationship attribute information in this embodiment.

Table 1 input samples:

S003, filtering out information whose sound volume is too low. The sound volume is the number of times an entity, relationship, entity attribute or relationship attribute occurs in the original documents. A sound volume threshold is input for each of the four types; input information whose sound volume is below the corresponding threshold is filtered out and receives no further processing.
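The filtering step can be sketched as below. This is only an illustration: the dictionary layout, the `type` and `volume` keys, and the threshold values are assumptions made for the example, and the volume of each item is taken as already counted.

```python
def filter_by_volume(items, thresholds):
    """Keep only input items whose volume reaches the threshold for their type."""
    return [item for item in items if item["volume"] >= thresholds[item["type"]]]


# Hypothetical per-type thresholds and input items.
thresholds = {"entity": 2, "relation": 2, "entity_attribute": 1, "relation_attribute": 1}
inputs = [
    {"type": "entity", "value": "Ming", "volume": 5},
    {"type": "entity", "value": "Hong", "volume": 1},      # below threshold, filtered out
    {"type": "relation", "value": "friend", "volume": 2},  # equal to threshold, kept
]
kept = filter_by_volume(inputs, thresholds)
```

Only items with a volume strictly smaller than the threshold are dropped, which follows the wording of the step.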

S004, information matching and processing scheme generation. This step comprises four matching flows, for entities, relationships, entity attributes and relationship attributes, whose flowcharts are shown in FIGS. 3 to 6 respectively. Specifically:

The entity matching flow comprises entity matching and entity processing scheme generation.

Entity matching: all input entities are processed one by one. First, it is confirmed whether the entity exists in the database by searching the database: if exactly one entity in the database has a name or alias equal to the input entity, the matching succeeds and the entity exists in the database; otherwise the entity does not exist in the database and the input entity is a new entity.

For the entry in Table 1 and the database in FIG. 2, the entity "Ming" matches successfully, and the entity "Ming" does not.

If the input entity is successfully matched, the document ID of the input entity is added into the document ID of the corresponding entity of the database.

If the input entity fails to match, the input entity is a new entity, and the next operation is performed.
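A minimal sketch of the entity matching rule follows. The dictionary layout and function names are hypothetical, and the entity's second name attribute is treated as a set of aliases; the rule itself (exactly one database entity with an equal name or alias) follows the description above.

```python
def match_entity(input_name, db_entities):
    """Return the single database entity whose name or alias equals input_name, else None."""
    hits = [e for e in db_entities
            if input_name == e["name"] or input_name in e["aliases"]]
    return hits[0] if len(hits) == 1 else None


def process_input_entity(input_name, input_doc_ids, db_entities):
    """Merge document IDs on a successful match; return False for a new entity."""
    hit = match_entity(input_name, db_entities)
    if hit is None:
        return False                      # new entity: continue with scheme generation
    hit["doc_ids"].update(input_doc_ids)  # matched: add the input document IDs
    return True


# Hypothetical database content.
db = [
    {"name": "Ming", "aliases": {"Xiao Ming"}, "doc_ids": {"doc-1"}},
    {"name": "Pink", "aliases": set(), "doc_ids": {"doc-2"}},
]
matched = process_input_entity("Xiao Ming", ["doc-3"], db)
unmatched = process_input_entity("Hong", ["doc-4"], db)
```

Returning `None` when several entities share the same name or alias keeps ambiguous inputs on the new-entity path, where the fusion confidences can separate the candidates.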

Entity processing scheme generation: for a new entity there are four types of processing schemes: fusing into a certain database entity, fusing into a certain new entity, adding a new entity, and discarding. Each of the first two types may contain more than one processing scheme, and each processing scheme is calculated separately.

For the "fuse into a certain database entity" type, all database entities are traversed and the confidence of fusing into each of them is calculated.

The confidence is the average of three indexes:

Confidence = (sound volume index + edit distance index + sentence vector index)/3 (1)

In the formula (1), the sound volume index, the edit distance index, and the sentence vector index are calculated by:

sound volume index = input entity sound volume/total number of documents (2)

In the formula (2), the total number of documents is the total number of documents involved in the information input at this time.

In formula (3), the index is calculated against the name and the alias of the database entity respectively, and the maximum value is taken as the edit distance index.

Sentence vector index = cosine similarity (text sentence vector where input entity is located, text sentence vector where database entity is located) (4)

In formula (4), the texts in which an entity appears are retrieved through its document IDs and combined, and the sentence vector of the combined text is then calculated, which gives the result of formula (4).
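The fusion confidence of formula (1) can be sketched as follows. Two points are assumptions rather than part of the text: formula (3) is not reproduced above, so the edit distance index is modeled here as a normalized Levenshtein similarity, 1 - distance / max(length); and the sentence vectors are passed in as plain lists of floats, with the embedding model left out of scope. All function names are hypothetical.

```python
import math


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_distance_index(input_name, db_names):
    """ASSUMPTION: normalized similarity; maximum over the name and aliases."""
    def similarity(a, b):
        longest = max(len(a), len(b))
        return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
    return max(similarity(input_name, n) for n in db_names)


def cosine_similarity(u, v):
    """Sentence vector index of formula (4)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def volume_index(volume, total_docs):
    """Sound volume index of formula (2)."""
    return volume / total_docs


def fusion_confidence(volume, total_docs, input_name, db_names, input_vec, db_vec):
    """Formula (1): average of the three indexes."""
    return (volume_index(volume, total_docs)
            + edit_distance_index(input_name, db_names)
            + cosine_similarity(input_vec, db_vec)) / 3


conf = fusion_confidence(5, 10, "Ming", ["Ming", "Xiao Ming"], [1.0, 0.0], [1.0, 0.0])
```

With the assumed similarity, all three indexes lie in [0, 1] for non-negative sentence vectors, so their average can be compared directly with a confidence threshold.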

For the "fuse into a certain new entity" type, all new entities known so far are traversed and the confidence of fusing into each of them is calculated, using the same method as for the "fuse into a certain database entity" type.

The "add a new entity" type contains only one processing scheme, namely adding to the database an entity whose name is the input entity name. The confidence of this scheme is 1 - (the maximum confidence among the "fuse into a certain database entity" schemes).

The "discard" type contains only one processing scheme, namely discarding the input entity; it outputs no confidence.
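The four scheme types for a new entity can be assembled as in the sketch below. The scheme dictionaries, the type labels and the function name are hypothetical, the fusion confidences are taken as already computed, and treating an empty database as a maximum fusion confidence of 0 is an assumption the text does not spell out.

```python
def entity_schemes(db_fusion_conf, new_fusion_conf):
    """Build all candidate processing schemes for one unmatched input entity.

    db_fusion_conf:  {database entity name: confidence of fusing into it}
    new_fusion_conf: {known new entity name: confidence of fusing into it}
    """
    schemes = []
    for target, conf in db_fusion_conf.items():
        schemes.append({"type": "fuse_into_db_entity", "target": target, "confidence": conf})
    for target, conf in new_fusion_conf.items():
        schemes.append({"type": "fuse_into_new_entity", "target": target, "confidence": conf})
    # ASSUMPTION: with no database entities, the maximum fusion confidence is taken as 0.
    max_db_conf = max(db_fusion_conf.values(), default=0.0)
    schemes.append({"type": "add_entity", "target": None, "confidence": 1.0 - max_db_conf})
    # Discarding carries no confidence.
    schemes.append({"type": "discard", "target": None, "confidence": None})
    return schemes


# Hypothetical confidences.
schemes = entity_schemes({"Ming": 0.8, "Pink": 0.3}, {"Hong": 0.4})
```

Because the add-entity confidence is the complement of the best database fusion, a strong fusion candidate automatically weakens the case for creating a new entity.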

The relationship matching process flow comprises relationship matching and relationship processing scheme generation.

Relationship matching: all input relationships are processed one by one. First, it is confirmed whether the relationship exists in the database: if the starting entity and the ending entity of the input relationship are both matched successfully and the relationship exists between those two entities in the database, the relationship exists in the database; otherwise it does not, and the input relationship is a new relationship.

For the entry in Table 1 and the database in FIG. 2, the relationship "sibling" from Ming to Pink matches successfully, and the relationship "friend" from Ming to Pink does not match successfully.

If the input relation fails to match, the input relation is a new relation, and the next operation is performed.
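Relationship matching can be sketched as follows. The representation of relationships as `(start, name, end)` tuples, the function names and the sample database are assumptions made for the example; the sample is merely chosen to be consistent with the "sibling" and "friend" outcomes described above.

```python
def match_entity(name, db_entities):
    """Return the single database entity whose name or alias equals name, else None."""
    hits = [e for e in db_entities if name == e["name"] or name in e["aliases"]]
    return hits[0] if len(hits) == 1 else None


def match_relation(start, rel_name, end, db_entities, db_relations):
    """True if both endpoints match and the directed relation already exists."""
    start_entity = match_entity(start, db_entities)
    end_entity = match_entity(end, db_entities)
    if start_entity is None or end_entity is None:
        return False
    return (start_entity["name"], rel_name, end_entity["name"]) in db_relations


# Hypothetical database content.
db_entities = [
    {"name": "Ming", "aliases": {"Xiao Ming"}},
    {"name": "Pink", "aliases": set()},
]
db_relations = {("Ming", "sibling", "Pink")}

sibling_found = match_relation("Ming", "sibling", "Pink", db_entities, db_relations)
friend_found = match_relation("Ming", "friend", "Pink", db_entities, db_relations)
```

Because the tuple is ordered, a relation from Pink to Ming is treated as different from one from Ming to Pink, which reflects the directed links of the data structure.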

Relationship processing scheme generation: for a new relationship there are two types of processing schemes: adding a new relationship, and discarding.

The confidence of the "add a new relationship" scheme is the average of two indexes:

confidence = (co-occurrence sound volume index + sound volume index)/2 (5)

In the formula (5), the co-occurrence sound volume index is calculated by:

the sound volume index is calculated by

The entity attribute matching process flow comprises entity attribute matching and entity attribute processing scheme generation.

Entity attribute matching: all input entity attributes are processed one by one. First, it is confirmed whether the entity attribute exists in the database: if the entity corresponding to the input attribute is matched successfully and the attribute exists on that entity in the database, the matching succeeds; otherwise the matching fails.

An attribute is a key-value pair consisting of two parts: an attribute key and an attribute value.

For the input in Table 1 and the database in FIG. 2, the entity attribute "height: 170cm" fails to match.

If the matching fails, the input entity attribute is a new entity attribute, and the next operation is performed.

Entity attribute processing scheme generation: for a new entity attribute there are two types of processing schemes: correcting or adding the attribute, and discarding.

confidence = (edit distance index+sound volume index)/2 (8)

In the formula (8), the edit distance index is calculated by:

In formula (9), the entities in the database that have the same attribute key are found, the index is calculated for each of them, and the maximum value is taken as the edit distance index.

The sound volume index is calculated by

The "discard" type contains only one processing scheme, namely discarding the input entity attribute; it outputs no confidence.

The relationship attribute matching process flow comprises relationship attribute matching and relationship attribute processing scheme generation.

Relationship attribute matching: all input relationship attributes are processed one by one. First, it is confirmed whether the relationship attribute exists in the database: if the relationship corresponding to the input attribute is matched successfully and the attribute exists on that relationship in the database, the matching succeeds; otherwise the matching fails.

For the input in Table 1 and the database in FIG. 2, the relationship attribute "affinity: low" fails to match.

If the matching fails, the input relationship attribute is a new relationship attribute, and the next operation is performed.

Relationship attribute processing scheme generation: for a new relationship attribute there are two types of processing schemes: correcting or adding the attribute, and discarding.

confidence = (edit distance index+sound volume index)/2 (8)

In the formula (8), the edit distance index is calculated by:

In formula (9), the relationships in the database that have the same attribute key are found, the index is calculated for each of them, and the maximum value is taken as the edit distance index.

The sound volume index is calculated by

The "discard" type contains only one processing scheme, namely discarding the input relationship attribute; it outputs no confidence.

S005, selecting a processing scheme. Two modes are available: manual selection and automatic selection by machine. With manual selection, a scheme is chosen for each input according to the confidence of the schemes and the operator's own experience. With automatic selection, a confidence threshold is input; if, among all processing schemes, the one with the highest confidence exceeds the threshold, it is executed automatically; otherwise discarding is selected.

The selection must proceed in the order: entities, relationships, entity attributes, relationship attributes.

During selection, when the starting entity or ending entity of a relationship is selected to be fused into a certain entity, the starting entity or ending entity of that relationship changes accordingly. When the entity corresponding to an entity attribute is selected to be fused into a certain entity, the entity corresponding to that attribute changes accordingly. When the relationship corresponding to a relationship attribute is selected to be fused into a certain relationship, the relationship corresponding to that attribute changes accordingly.
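The automatic selection rule can be sketched as follows. The scheme dictionaries and the function name are hypothetical; the rule itself (execute the highest-confidence scheme only if it exceeds the threshold, otherwise discard) follows the description of this step.

```python
DISCARD = {"type": "discard", "target": None, "confidence": None}


def auto_select(schemes, threshold):
    """Pick the highest-confidence scheme if it exceeds the threshold, else discard."""
    scored = [s for s in schemes if s["confidence"] is not None]
    if not scored:
        return DISCARD
    best = max(scored, key=lambda s: s["confidence"])
    return best if best["confidence"] > threshold else DISCARD


# Hypothetical candidate schemes for one input entity.
candidates = [
    {"type": "fuse_into_db_entity", "target": "Ming", "confidence": 0.8},
    {"type": "add_entity", "target": None, "confidence": 0.2},
    DISCARD,
]
chosen_low_bar = auto_select(candidates, threshold=0.6)
chosen_high_bar = auto_select(candidates, threshold=0.9)
```

Raising the threshold makes the procedure more conservative: fewer inputs change the graph, and more are discarded or left for manual selection.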

S006, updating the knowledge graph, that is, modifying it according to the selected processing schemes. When a fusion scheme is selected for an input entity, the name of the input entity is added to the alias attribute of the database entity, and its document IDs are added to the document ID attribute. When the add-new-entity scheme is selected for an input entity, a new entity is created in the database from the input entity information. When the add-new-relationship scheme is selected for an input relationship, the relationship is created in the database. When the correct-or-add scheme is selected for an input entity attribute, the attribute is modified or created on the corresponding entity in the database. When the correct-or-add scheme is selected for an input relationship attribute, the attribute is modified or created on the corresponding relationship in the database.

The foregoing describes preferred embodiments of the present invention and is not to be construed as limiting the claims. The invention is not limited to the above embodiments; its specific construction may vary, and all such variations fall within the scope of the invention as defined in the independent claims.
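The entity part of the update step can be sketched as below. The dictionary layout, the scheme labels and the function name are hypothetical, and the attribute that receives the input name on fusion is read here as the entity's alias set; that reading is an assumption of the sketch.

```python
def apply_entity_scheme(db, scheme_type, input_name, input_doc_ids, target=None):
    """Apply the selected scheme for one input entity to the database dict."""
    if scheme_type == "fuse":
        entity = db[target]
        entity["aliases"].add(input_name)        # record the input name as an alias
        entity["doc_ids"].update(input_doc_ids)  # merge the document IDs
    elif scheme_type == "add":
        db[input_name] = {"name": input_name, "aliases": set(),
                          "doc_ids": set(input_doc_ids)}
    elif scheme_type != "discard":
        raise ValueError(f"unknown scheme type: {scheme_type}")
    # "discard" leaves the database unchanged.


# Hypothetical database content.
db = {"Ming": {"name": "Ming", "aliases": set(), "doc_ids": {"doc-1"}}}
apply_entity_scheme(db, "fuse", "Xiao Ming", ["doc-2"], target="Ming")
apply_entity_scheme(db, "add", "Hong", ["doc-3"])
apply_entity_scheme(db, "discard", "Noise", ["doc-4"])
```

Once the fused name is stored as an alias, a later input with the same name matches the database entity directly and no longer goes through scheme generation.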

Claims

1. The knowledge graph generation method based on the multiple indexes is characterized by comprising the following steps of:

s004, information matching: according to the data structure type of the matching failure information, a processing scheme is correspondingly generated; the method comprises four matching processing flows of entities, relations, entity attributes and relation attributes, wherein:

the entity matching processing flow comprises entity matching and entity processing scheme generation;

the relation matching processing flow comprises relation matching and relation processing scheme generation;

the entity attribute matching processing flow comprises entity attribute matching and entity attribute processing scheme generation;

the relationship attribute matching processing flow comprises relationship attribute matching and relationship attribute processing scheme generation;

s006, selecting a processing scheme of the matching failure information according to the confidence coefficient; the method comprises two modes of manual selection and automatic machine execution selection, and the selection is performed according to the sequence of entities, relations, entity attributes and relation attributes; if manual processing is selected, selecting each input scheme according to the confidence level of the scheme and personal experience; if the scheme with the highest confidence is automatically executed by the machine, a confidence threshold is input, if the scheme with the highest confidence is larger than the confidence threshold in all the processing schemes, the processing scheme with the highest confidence is automatically executed, otherwise, discarding is selected;

2. The multi-index-based knowledge-graph generation method according to claim 1, wherein: in the step S002, the entity, the relationship, the entity attribute and the relationship attribute information are input and obtained by manual labeling or data model prediction.

3. The multi-index-based knowledge-graph generation method according to claim 1, wherein: and the step S002 to the step S003 also comprise an information filtering step, wherein the information is filtered through the sound volume parameter of the input information and the set sound volume threshold value, and the input information with the sound volume smaller than the sound volume threshold value is filtered.

4. The multi-index-based knowledge-graph generation method according to claim 1, wherein: in the information matching in step S003, the processing schemes correspondingly generated according to the input information types are different, and the method for calculating the confidence of the corresponding processing scheme in the subsequent step S005 is also different.

5. The multi-index-based knowledge-graph generation method according to claim 4, wherein: when the input information type is an entity and the information matching fails, the correspondingly generated processing scheme comprises four types of fusion into a certain database entity, fusion into a certain new entity, newly added entity and discarding; the confidence coefficient of the two processing schemes fused into a certain database entity and a certain new entity is calculated by three indexes of editing distance, text sentence vector and sound volume, and the calculation formula is as follows: confidence = (sound volume index + edit distance index + sentence vector index)/3.

6. The multi-index-based knowledge-graph generation method according to claim 4, wherein: when the input information types are relations and the information matching fails, the corresponding generated processing scheme comprises a newly added relation and a discarded relation; the confidence of the new relation processing mode is calculated by two indexes of co-occurrence sound quantity and relation sound quantity of the starting entity and the ending entity, and the calculation formula is as follows: confidence = (co-occurrence sound volume index + sound volume index)/2.

7. The multi-index-based knowledge-graph generation method according to claim 4, wherein: when the input information type is entity attribute and the information matching fails, the correspondingly generated processing scheme comprises two types of correction or new attribute and discard; the confidence of the modified or newly added attribute processing mode is calculated by two indexes of editing distance and sound volume of the entity attribute, and the calculation formula is as follows: confidence = (edit distance index+sound volume index)/2.

8. The multi-index-based knowledge-graph generation method according to claim 4, wherein: when the input information type is a relation attribute and the information matching fails, the correspondingly generated processing scheme comprises two types of correction or new attribute and discard; the confidence of the modified or newly added attribute processing mode is calculated by two indexes of editing distance and sound volume of the entity attribute, and the calculation formula is as follows: confidence = (edit distance index+sound volume index)/2.

9. The multi-index-based knowledge-graph generation method according to claim 1, wherein: the machine automatically executes selection, which comprises inputting a confidence threshold, automatically executing the processing scheme with the largest confidence when the scheme with the largest confidence is larger than the confidence threshold in all the processing schemes, otherwise, selecting discarding.