CN112487214B

CN112487214B - Knowledge graph relation extraction method and system based on entity co-occurrence matrix

Info

Publication number: CN112487214B
Application number: CN202011535409.5A
Authority: CN
Inventors: 傅兴玉; 程国艮
Original assignee: Global Tone Communication Technology Co Ltd
Current assignee: Global Tone Communication Technology Co Ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2024-06-04
Anticipated expiration: 2040-12-23
Also published as: CN112487214A

Abstract

The invention discloses a knowledge graph relation extraction method and system based on an entity co-occurrence matrix. The method comprises the following steps: performing named entity recognition on each corpus segment in a segment corpus set to obtain a named entity list; computing an entity co-occurrence matrix; calculating a segmentation threshold for the entity co-occurrence matrix; for each position point whose element value is greater than the segmentation threshold, finding the corresponding entity pair and the segment corpus list in which that entity pair co-occurs; extracting keywords from each segment corpus list to obtain a keyword list for each segment corpus list; and determining the relation of each entity pair according to the mapping relation between each keyword list and a relation dictionary. The method and system can extract the entity relations of a knowledge graph automatically and rapidly, with high efficiency.

Description

Knowledge graph relation extraction method and system based on entity co-occurrence matrix

Technical Field

The invention relates to the technical field of natural language processing, in particular to a knowledge graph relation extraction method and system based on an entity co-occurrence matrix.

Background

Relation extraction is an important task in natural language processing. Particularly against the background of the current information explosion, extracting entities, and the relations between them, from massive amounts of unstructured text is an important method of information dimensionality reduction and a key technology for constructing industry knowledge graphs.

The relation extraction and classification task refers specifically to classifying an entity pair into one of a set of known relations, using documents that mention the entity pair. Currently, supervised relation extraction methods are generally adopted: the relation extraction task is treated as a classification problem, effective features are designed from the training data so that various classification models can be learned, and the trained classifier is then used to predict relations. The problem with supervised relation extraction is that it requires a large amount of manually annotated corpus, and corpus annotation is usually very time-consuming and labor-intensive. Semi-supervised learning methods, for their part, require several seed instances to be set manually for each relation to be extracted, after which relation templates and further instances corresponding to the relation are extracted iteratively from the data. Current relation extraction therefore relies on a large amount of manual labor and is inefficient.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person of ordinary skill in the art.

Disclosure of Invention

The invention aims to provide a knowledge graph relation extraction method and system based on an entity co-occurrence matrix, which can automatically and rapidly extract the entity relation of a knowledge graph and has higher efficiency.

In order to achieve the above object, the present invention provides a knowledge graph relation extraction method based on an entity co-occurrence matrix, which includes: performing named entity recognition on each corpus segment in the segment corpus set to obtain a named entity list; combining the entities in the named entity list pairwise to obtain an entity co-occurrence matrix, wherein the element value in the entity co-occurrence matrix is N_i, and N_i indicates that the i-th pair of entities co-occurs in N corpus segments; calculating a segmentation threshold of the entity co-occurrence matrix; for each position point whose element value is greater than the segmentation threshold, finding the entity pair corresponding to that position point and the segment corpus list in which the entity pair co-occurs; extracting keywords from each segment corpus list to obtain a keyword list for each segment corpus list; and determining the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary.

In an embodiment of the present invention, performing named entity recognition on each corpus segment in the segment corpus set includes: performing named entity recognition on each corpus segment using a conditional random field (CRF) algorithm or a bidirectional long short-term memory (BiLSTM) network model.

In an embodiment of the present invention, calculating the segmentation threshold of the entity co-occurrence matrix includes: calculating the segmentation threshold T of the entity co-occurrence matrix using a first formula, wherein the first formula is T = max(M × α, P), M is the total number of named entities, α is a first preset value set according to manual experience, and P is a second preset value set according to manual experience.

In an embodiment of the present invention, determining the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary includes: establishing a mapping table between the keywords in each keyword list and the relation dictionary; computing a distribution histogram of the keywords over the relation dictionary, with the relation value on the horizontal axis and the number of keyword occurrences on the vertical axis; and taking the abscissa value corresponding to the maximum peak of the distribution histogram as the relation value of the entity pair.

Based on the same inventive concept, the invention also provides a knowledge graph relation extraction system based on an entity co-occurrence matrix, which comprises: an entity recognition module, used for performing named entity recognition on each corpus segment in the segment corpus set to obtain a named entity list; a co-occurrence matrix determining module, coupled with the entity recognition module and used for combining the entities in the named entity list pairwise to obtain an entity co-occurrence matrix, wherein the element value in the entity co-occurrence matrix is N_i, and N_i indicates that the i-th pair of entities co-occurs in N corpus segments; a segmentation threshold determining module, used for calculating a segmentation threshold of the entity co-occurrence matrix; a segment corpus list acquisition module, coupled with the segmentation threshold determining module and the co-occurrence matrix determining module and used for finding, for each position point whose element value is greater than the segmentation threshold, the entity pair corresponding to that position point and the segment corpus list in which the entity pair co-occurs; a keyword extraction module, coupled with the segment corpus list acquisition module and used for extracting keywords from each segment corpus list to obtain a keyword list for each segment corpus list; and a relation determining module, coupled with the keyword extraction module and used for determining the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary.

In an embodiment of the present invention, the entity recognition module is configured to perform named entity recognition on each corpus segment using a conditional random field (CRF) algorithm or a bidirectional long short-term memory (BiLSTM) network model.

In an embodiment of the present invention, the segmentation threshold determining module is configured to calculate the segmentation threshold T of the entity co-occurrence matrix using a first formula, where the first formula is T = max(M × α, P), M is the total number of entities, α is a first preset value set according to manual experience, and P is a second preset value set according to manual experience.

In an embodiment of the present invention, the relation determining module is configured to establish a mapping table between the keywords in each keyword list and the relation dictionary, compute a distribution histogram of the keywords over the relation dictionary with the relation value on the horizontal axis and the number of keyword occurrences on the vertical axis, and take the abscissa value corresponding to the maximum peak of the distribution histogram as the relation value of the entity pair.

Based on the same inventive concept, the invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the knowledge graph relation extraction method in any embodiment when executing the program.

Based on the same inventive concept, the present invention further provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the knowledge graph relation extraction method according to any of the above embodiments.

Compared with the prior art, the knowledge graph relation extraction method and system based on the entity co-occurrence matrix address the difficulty of discriminating relations between entities during knowledge graph construction in specific industry fields where relation-annotated corpora are scarce and knowledge base support is lacking. Named entity recognition is performed on large-scale text data, and an entity co-occurrence knowledge graph is then built by aggregation, yielding a large number of co-occurrence segments for entity pairs. The co-occurrence segments are classified, their core keywords are extracted, and the entity relation is finally determined by keyword mapping, so that relation extraction efficiency is greatly improved without relying on a large amount of manual labor.

Drawings

FIG. 1 is a block diagram of a knowledge-graph relationship extraction method based on entity co-occurrence matrix, according to an embodiment of the present invention;

FIG. 2 is a block diagram of a knowledge-graph relationship extraction system based on entity co-occurrence matrices, in accordance with an embodiment of the invention.

Detailed Description

The following detailed description of embodiments of the invention is to be read in conjunction with the accompanying drawings, and it is to be understood that the scope of the invention is not limited to the specific embodiments.

Throughout the specification and claims, unless explicitly stated otherwise, the term "comprise" or variations thereof such as "comprises" or "comprising", etc. will be understood to include the stated element or component without excluding other elements or components.

In order to improve the entity relation extraction efficiency of the knowledge graph, the invention provides a knowledge graph relation extraction method and system based on an entity co-occurrence matrix.

FIG. 1 shows a knowledge graph relation extraction method based on an entity co-occurrence matrix according to an embodiment of the invention, which includes steps S1 to S6.

In step S1, named entity recognition is performed on each corpus segment in the segment corpus set to obtain a named entity list.

Optionally, a conditional random field (CRF) algorithm or a bidirectional long short-term memory (BiLSTM) network model may be used to perform named entity recognition on each corpus segment to obtain a named entity list E(n, M) = {E_n1, E_n2, …, E_nm}, where the total number of entities is denoted M.
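The shape of the data produced by step S1 can be illustrated with a minimal Python sketch. It does not implement a CRF or BiLSTM tagger; a hypothetical gazetteer (a fixed set of entity names) matched by plain substring lookup is swapped in as a stand-in recognizer. The corpus, the entity names and the function names are all illustrative assumptions, not part of the disclosed method.

```python
def recognize_entities(segments, gazetteer):
    """Return, for each segment, the sorted list of gazetteer entries found in it.

    Stand-in for a trained CRF or BiLSTM tagger: plain substring matching.
    """
    return [sorted(e for e in gazetteer if e in text) for text in segments]


def named_entity_list(per_segment_entities):
    """Merge per-segment results into one de-duplicated entity list,
    keeping the order of first appearance."""
    seen = []
    for found in per_segment_entities:
        for entity in found:
            if entity not in seen:
                seen.append(entity)
    return seen


# Hypothetical toy corpus and gazetteer.
segments = [
    "Acme acquired Beta in 2019.",
    "Beta supplies parts to Gamma.",
    "Acme and Gamma signed a contract.",
]
gazetteer = {"Acme", "Beta", "Gamma"}

per_segment = recognize_entities(segments, gazetteer)
entities = named_entity_list(per_segment)
M = len(entities)  # total number of entities, used later in the threshold formula
```

Keeping the per-segment results alongside the merged list is a design choice of this sketch: the later steps need to know which segments each entity appears in.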

In step S2, the entities in the named entity list are combined pairwise to obtain an entity co-occurrence matrix, where the element value in the entity co-occurrence matrix is N_i, and N_i indicates that the i-th pair of entities co-occurs in N corpus segments.
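One way to build such a matrix is sketched below in Python. The sketch assumes that the output of step S1 is available as a list of entity names per segment; the symmetric list-of-lists layout, the function name and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def cooccurrence_matrix(entities, per_segment_entities):
    """Build a symmetric matrix whose (r, c) entry counts the segments
    in which entities[r] and entities[c] both appear."""
    index = {e: k for k, e in enumerate(entities)}
    size = len(entities)
    matrix = [[0] * size for _ in range(size)]
    for found in per_segment_entities:
        # Each unordered pair is counted at most once per segment.
        for a, b in combinations(sorted(set(found)), 2):
            r, c = index[a], index[b]
            matrix[r][c] += 1
            matrix[c][r] += 1
    return matrix


# Hypothetical toy data.
entities = ["Acme", "Beta", "Gamma"]
per_segment = [["Acme", "Beta"], ["Acme", "Beta", "Gamma"], ["Beta", "Gamma"]]
matrix = cooccurrence_matrix(entities, per_segment)
# matrix == [[0, 2, 1], [2, 0, 2], [1, 2, 0]]
```

For a large entity list a sparse representation (for example a dictionary keyed by entity pair) would likely be preferable, since most pairs never co-occur.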

A segmentation threshold for the entity co-occurrence matrix is calculated in step S3.

Optionally, the segmentation threshold T of the entity co-occurrence matrix may be calculated using a first formula, where the first formula is T = max(M × α, P), M is the total number of named entities, α is a first preset value set according to manual experience, for example 10% or 12%, and P is a second preset value set according to manual experience, for example 3 or 4.
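As a worked example of the first formula, the following Python sketch evaluates T for two hypothetical entity counts. Because T is the larger of the two terms, P acts as a lower bound on the threshold when the entity list is small. The function and parameter names are assumptions; only M, α and P come from the text.

```python
def segmentation_threshold(total_entities, alpha, preset_floor):
    """T = max(M * alpha, P), with M = total_entities and P = preset_floor."""
    return max(total_entities * alpha, preset_floor)


# Large entity list: the proportional term M * alpha dominates (200 * 0.10 = 20).
t_large = segmentation_threshold(200, 0.10, 3)

# Small entity list: M * alpha = 2 is below P, so the threshold is P = 3.
t_small = segmentation_threshold(20, 0.10, 3)
```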

In step S4, for each position point whose element value is greater than the segmentation threshold, the entity pair corresponding to that position point and the segment corpus list in which the entity pair co-occurs are found. A position point is the intersection of a row and a column in the matrix.
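The selection in step S4 can be sketched in Python as follows. The sketch assumes a symmetric matrix stored as a list of lists, so only the upper triangle is scanned, and it represents each segment corpus list as a list of segment indices. These representation choices, the function name and the toy data are assumptions.

```python
def select_pairs(entities, per_segment_entities, matrix, threshold):
    """Return {(entity_a, entity_b): [segment indices]} for every upper-triangle
    position whose co-occurrence count is strictly greater than threshold."""
    selected = {}
    for r in range(len(entities)):
        for c in range(r + 1, len(entities)):
            if matrix[r][c] > threshold:
                a, b = entities[r], entities[c]
                selected[(a, b)] = [
                    k
                    for k, found in enumerate(per_segment_entities)
                    if a in found and b in found
                ]
    return selected


# Hypothetical toy data; a threshold of 1 is used only to keep the example small.
entities = ["Acme", "Beta", "Gamma"]
per_segment = [["Acme", "Beta"], ["Acme", "Beta", "Gamma"], ["Beta", "Gamma"]]
matrix = [[0, 2, 1], [2, 0, 2], [1, 2, 0]]
selected = select_pairs(entities, per_segment, matrix, threshold=1)
# selected == {("Acme", "Beta"): [0, 1], ("Beta", "Gamma"): [1, 2]}
```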

In step S5, keywords are extracted from each segment corpus list to obtain a keyword list for each segment corpus list. Optionally, the keywords may be extracted using the term frequency-inverse document frequency (TF-IDF) algorithm.
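A minimal pure-Python sketch of TF-IDF keyword ranking is given below. It assumes that each segment corpus list has already been tokenized into a non-empty list of tokens and treats each such list as one document. The particular weighting used here (term count divided by document length, multiplied by log(N / df)) is one common variant chosen for illustration; the text does not specify which variant is used.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=3):
    """Rank the tokens of each document by TF-IDF and return the top_k tokens.

    documents: list of non-empty token lists, one per segment corpus list.
    Uses tf = count / document length and idf = log(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    results = []
    for tokens in documents:
        counts = Counter(tokens)
        scores = {
            tok: (cnt / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Sort by descending score, then alphabetically for a deterministic order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Hypothetical tokenized segment lists for two entity pairs.
documents = [
    "acquired shares acquired company merger".split(),
    "supplies parts supplies company contract".split(),
]
keywords = tfidf_keywords(documents, top_k=2)
# keywords == [["acquired", "merger"], ["supplies", "contract"]]
```

The token "company" appears in both documents, so its inverse document frequency is zero and it is never ranked as a keyword, which is the behavior TF-IDF is chosen for.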

In step S6, the relation of each entity pair is determined according to the mapping relation between each keyword list and a relation dictionary. Specifically, a mapping table between the keywords in each keyword list and the relation dictionary is established; a distribution histogram of the keywords over the relation dictionary is computed, with the relation value on the horizontal axis and the number of keyword occurrences on the vertical axis; and the abscissa value corresponding to the maximum peak of the distribution histogram is taken as the relation value of the entity pair.
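The histogram-based decision of step S6 can be sketched in Python as follows. The sketch assumes the relation dictionary is a plain mapping from keyword to relation value; the dictionary contents, the function name and the tie-breaking rule (the relation encountered first wins) are assumptions not stated in the text.

```python
from collections import Counter


def determine_relation(keywords, relation_dictionary):
    """Map keywords to relation values, histogram the hits, return the peak.

    keywords: list of keyword occurrences for one entity pair.
    relation_dictionary: {keyword: relation value}.
    Returns None when no keyword maps to any relation.
    """
    histogram = Counter(
        relation_dictionary[k] for k in keywords if k in relation_dictionary
    )
    if not histogram:
        return None
    # most_common(1) returns the bin with the highest count; among equal
    # counts, the relation encountered first is returned.
    return histogram.most_common(1)[0][0]


# Hypothetical relation dictionary and keyword list.
relation_dictionary = {
    "acquired": "acquisition",
    "merger": "acquisition",
    "supplies": "supply",
    "contract": "supply",
}
relation = determine_relation(
    ["acquired", "merger", "supplies", "shares"], relation_dictionary
)
# relation == "acquisition"  (2 hits, against 1 hit for "supply")
```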

Based on the same inventive concept, as shown in FIG. 2, an embodiment further provides a knowledge graph relation extraction system based on an entity co-occurrence matrix, including: an entity recognition module 10, a co-occurrence matrix determination module 11, a segmentation threshold determination module 12, a segment corpus list acquisition module 13, a keyword extraction module 14 and a relation determination module 15.

The entity recognition module 10 is configured to perform named entity recognition on each corpus segment in the segment corpus set to obtain a named entity list. Optionally, a conditional random field (CRF) algorithm or a bidirectional long short-term memory (BiLSTM) network model is used to perform named entity recognition on each corpus segment.

The co-occurrence matrix determination module 11 is coupled to the entity recognition module 10 and is configured to combine the entities in the named entity list pairwise and compute an entity co-occurrence matrix, where the element value in the entity co-occurrence matrix is N_i, and N_i indicates that the i-th pair of entities co-occurs in N corpus segments.

The segmentation threshold determination module 12 is configured to calculate a segmentation threshold of the entity co-occurrence matrix. Optionally, the segmentation threshold determination module 12 calculates the segmentation threshold T of the entity co-occurrence matrix using a first formula, where the first formula is T = max(M × α, P), M is the total number of entities, α is a first preset value set according to manual experience, and P is a second preset value set according to manual experience.

The segment corpus list acquisition module 13 is coupled to the segmentation threshold determination module 12 and the co-occurrence matrix determination module 11, and is configured to find, for each position point whose element value is greater than the segmentation threshold, the entity pair corresponding to that position point and the segment corpus list in which the entity pair co-occurs.

The keyword extraction module 14 is coupled to the segment corpus list acquisition module 13 and is configured to extract keywords from each segment corpus list to obtain a keyword list for each segment corpus list. Optionally, the keywords may be extracted using the term frequency-inverse document frequency (TF-IDF) algorithm.

The relation determination module 15 is coupled to the keyword extraction module 14 and is configured to determine the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary. Optionally, the relation determination module 15 is configured to establish a mapping table between the keywords in each keyword list and the relation dictionary, compute a distribution histogram of the keywords over the relation dictionary with the relation value on the horizontal axis and the number of keyword occurrences on the vertical axis, and take the abscissa value corresponding to the maximum peak of the distribution histogram as the relation value of the entity pair.

Based on the same inventive concept, an embodiment further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the steps of the knowledge graph relation extraction method described in any one of the above when executing the program.

Based on the same inventive concept, there is also provided in an embodiment a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the knowledge graph relation extraction method of any one of the above.

In summary, the knowledge graph relation extraction method and system based on the entity co-occurrence matrix according to this embodiment address the difficulty of discriminating relations between entities during knowledge graph construction in specific industry fields where relation-annotated corpora are scarce and knowledge base support is lacking. The co-occurrence segments are classified, their core keywords are extracted, and the entity relation is finally determined by keyword mapping, so that relation extraction efficiency is greatly improved without relying on a large amount of manual labor. The invention can discriminate entity relations in an unsupervised manner based on the entity co-occurrence matrix, even when relation corpora are scarce and no professional knowledge base is available for the industry field, and it plays an important role in natural language processing and knowledge graph construction.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A knowledge graph relation extraction method based on an entity co-occurrence matrix, characterized by comprising the following steps:

performing named entity recognition on each corpus segment in the segment corpus set to obtain a named entity list;

combining the entities in the named entity list pairwise to obtain an entity co-occurrence matrix, wherein the element value in the entity co-occurrence matrix is N_i, and N_i indicates that the i-th pair of entities co-occurs in N corpus segments;

calculating a segmentation threshold of the entity co-occurrence matrix;

for each position point whose element value is greater than the segmentation threshold, finding the entity pair corresponding to that position point and the segment corpus list in which the entity pair co-occurs;

extracting keywords from each segment corpus list to obtain a keyword list for each segment corpus list;

determining the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary; wherein determining the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary comprises:

establishing a mapping table between the keywords in each keyword list and the relation dictionary, computing a distribution histogram of the keywords over the relation dictionary with the relation value on the horizontal axis and the number of keyword occurrences on the vertical axis, and taking the abscissa value corresponding to the maximum peak of the distribution histogram as the relation value of the entity pair.

2. The knowledge graph relation extraction method based on an entity co-occurrence matrix according to claim 1, wherein performing named entity recognition on each corpus segment in the segment corpus set comprises:

performing named entity recognition on each corpus segment using a conditional random field (CRF) algorithm or a bidirectional long short-term memory (BiLSTM) network model.

3. The knowledge graph relation extraction method based on an entity co-occurrence matrix according to claim 1, wherein calculating the segmentation threshold of the entity co-occurrence matrix comprises:

calculating the segmentation threshold T of the entity co-occurrence matrix using a first formula, wherein the first formula is T = max(M × α, P), M is the total number of named entities, α is a first preset value set according to manual experience, and P is a second preset value set according to manual experience.

4. A knowledge graph relation extraction system based on an entity co-occurrence matrix, comprising:

an entity recognition module, used for performing named entity recognition on each corpus segment in the segment corpus set to obtain a named entity list;

a co-occurrence matrix determining module, coupled with the entity recognition module and used for combining the entities in the named entity list pairwise to obtain an entity co-occurrence matrix, wherein the element value in the entity co-occurrence matrix is N_i, and N_i indicates that the i-th pair of entities co-occurs in N corpus segments;

a segmentation threshold determining module, used for calculating a segmentation threshold of the entity co-occurrence matrix;

a segment corpus list acquisition module, coupled with the segmentation threshold determining module and the co-occurrence matrix determining module and used for finding, for each position point whose element value is greater than the segmentation threshold, the entity pair corresponding to that position point and the segment corpus list in which the entity pair co-occurs;

a keyword extraction module, coupled with the segment corpus list acquisition module and used for extracting keywords from each segment corpus list to obtain a keyword list for each segment corpus list;

a relation determining module, coupled with the keyword extraction module and used for determining the relation of each entity pair according to the mapping relation between each keyword list and the relation dictionary; wherein the relation determining module is used for establishing a mapping table between the keywords in each keyword list and the relation dictionary, computing a distribution histogram of the keywords over the relation dictionary with the relation value on the horizontal axis and the number of keyword occurrences on the vertical axis, and taking the abscissa value corresponding to the maximum peak of the distribution histogram as the relation value of the entity pair.

5. The knowledge graph relation extraction system based on an entity co-occurrence matrix according to claim 4, wherein the entity recognition module is used for performing named entity recognition on each corpus segment using a conditional random field (CRF) algorithm or a bidirectional long short-term memory (BiLSTM) network model.

6. The knowledge graph relation extraction system based on an entity co-occurrence matrix according to claim 4, wherein the segmentation threshold determining module is configured to calculate the segmentation threshold T of the entity co-occurrence matrix using a first formula, wherein the first formula is T = max(M × α, P), M is the total number of entities, α is a first preset value set according to manual experience, and P is a second preset value set according to manual experience.

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the knowledge graph relation extraction method of any one of claims 1 to 3.

8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the knowledge-graph relationship extraction method of any of claims 1 to 3.