CN114564579A - Entity classification method and system based on massive knowledge graph and graph embedding - Google Patents

Entity classification method and system based on massive knowledge graph and graph embedding

Info

Publication number
CN114564579A
CN114564579A (application CN202210103224.XA)
Authority
CN
China
Prior art keywords
entity
node
graph
knowledge
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210103224.XA
Other languages
Chinese (zh)
Inventor
袁浩斌
何宇轩
牟昊
李旭日
徐亚波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Original Assignee
Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Hengqin Shushushuo Story Information Technology Co ltd filed Critical Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Priority to CN202210103224.XA priority Critical patent/CN114564579A/en
Publication of CN114564579A publication Critical patent/CN114564579A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an entity classification method and system based on a massive knowledge graph and graph embedding, relating to the technical field of natural language processing.

Description

Entity classification method and system based on massive knowledge graph and graph embedding
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an entity classification method and system based on a massive knowledge graph and graph embedding.
Background
In the era of information explosion, a large amount of new data and information is generated every day. Deeply mining the internal patterns of the data, screening and summarizing the effective information, and thereby classifying the data can generate great economic value.
However, putting data to use raises two problems. First, how can the collected information be utilized systematically and organically? Second, how can the internal patterns of the data be mined so that classification tasks can be applied in different scenarios? For the first problem, a database schema for a vertical domain can be constructed; however, such a schema ignores valuable cross-domain data, reduces data diversity, and in turn hampers the mining of internal data patterns and the accuracy and coverage of downstream tasks (such as data classification). This is where the knowledge graph comes in.
A knowledge graph represents proper nouns such as person names, place names, and book titles as entities, represents the internal connections between entities as relations, and combines entities and relations into knowledge triples. Within knowledge graph applications, entity classification means extracting features from free text and analyzing them to obtain the specific type of a given entity. The prior art discloses a method and apparatus for determining entity classes (e.g., "Modeling Relational Data with Graph Convolutional Networks"): a candidate data pair set containing at least one candidate data pair is first obtained; then, based on the attribute information of the entity in the candidate data pair, the attribute values of the entity, and/or the attribute information of the class, a convolutional neural network determines whether at least one entity in the candidate data pair belongs to at least one class in that pair, and if so, that class is marked as the class of the entity. This method determines entity classes by improving the labeling, enlarges the number of entity classes, and improves classification accuracy, thereby addressing the problem of overly coarse and inaccurate classification granularity. However, compared with the rich real world, the knowledge graph is far from complete and must be continuously updated and supplemented; because of its low coverage and low universality over cross-domain massive knowledge, this scheme is poorly suited to screening and summarizing effective information.
Disclosure of Invention
In order to solve the problems of low accuracy and low universality in current knowledge-graph-based entity classification, the invention provides an entity classification method and system based on a massive knowledge graph and graph embedding.
In order to achieve the above technical effects, the technical solution of the invention is as follows (a schematic flow diagram is shown in fig. 4):
A method for entity classification based on a massive knowledge graph and graph embedding comprises the following steps:
s1, acquiring N corpus texts, and labeling entities and entity relations in the corpus texts to obtain a labeled data set;
s2, introducing a named entity recognition model and a relation extraction model, and training and testing the named entity recognition model and the relation extraction model by using a labeled data set so as to recognize entities in the massive corpus text and extract relations; then forming a knowledge graph based on the entities and the relations;
s3, introducing a graph embedding model, and performing a prediction training task of the associated entity nodes in the knowledge graph by using the graph embedding model to obtain graph embedding information of each entity node;
S4, determining entities to be classified, and calculating entity confidence of the entities;
and S5, based on the graph embedding information of each entity node and the entity confidence coefficient of the entity, executing a classification task of the entity to be classified by using a classifier, and reprocessing by using a knowledge graph to obtain a final entity classification result.
In this technical scheme, starting from massive corpus text, a named entity recognition model and a relation extraction model are introduced to extract entities and relations and form a knowledge graph. Graph embedding information for each entity in the knowledge graph is then generated with a graph embedding model; that is, each entity is represented by a vector. Entity nodes are classified on the basis of their graph embedding information and entity confidences, and finally the knowledge graph is used for reprocessing to eliminate noise in the classification results.
Preferably, on the basis of a BERT Chinese pre-training model, named entity recognition and relation extraction tasks for the corpus text are added downstream of the BERT model, yielding a named entity recognition model and a relation extraction model for entity recognition and relation extraction in the corpus text, respectively. When these models perform entity recognition and relation extraction on massive corpus text, the corpus text they face and the knowledge graph formed from the extracted entities and relations are continuously updated, which ensures the richness of cross-domain knowledge information and gives the constructed knowledge graph better universality and coverage.
Preferably, in step S2, the entities and relations in the corpus text can instead be extracted simultaneously with a joint entity-relation extraction model.
Preferably, the graph embedding model is DeepWalk, RandomWalk, or LINE.
Preferably, the process of performing the prediction training task of the associated entity node in the knowledge graph by using the graph embedding model in step S3 is as follows:
S31, selecting an initial entity node in the knowledge graph, and predicting the termination tail node from the initial entity node by using the graph embedding model in combination with the relation types of the initial entity node in the knowledge graph;
S32, comparing the node type, node name, and inter-node relations of the predicted termination tail node with those of the real termination node to judge whether they are correct; if all are correct, keeping the current parameter weights of the graph embedding model and executing step S33; otherwise, adjusting the parameter weights of the graph embedding model and returning to step S31;
S33, randomly walking from each entity node to adjacent entity nodes in the knowledge graph based on the graph embedding model, and acquiring the local structure of each entity node in the knowledge graph so as to obtain the graph embedding information of each entity node.
Preferably, in step S4, after the entities to be classified are determined, the entity confidence with the entity as the initial entity node is calculated as follows:
First, the relation confidence of the entity is calculated as
relation confidence = min(100, volume contribution + knowledge confidence contribution),
where volume contribution = volume weight × (ln(vol) × 10), the volume weight is a preset value, ln is the natural logarithm, and vol is the mention volume of the entity in the knowledge graph data;
knowledge confidence contribution = knowledge confidence weight × knowledge confidence, where the knowledge confidence weight is a preset value of 0.7;
knowledge confidence = index distance, where index distance = 110 − abs(initial node index − termination node index), abs is the absolute value function, and 110 is the algorithm's preset maximum index distance.
Then, all relation confidences of the entity are sorted in descending order, the top three are taken, and their average is used as the entity confidence of the entity.
Preferably, after step S4 and before step S5, the method further comprises: dividing a training set from the labeled data set, obtaining the graph embedding information of the entity nodes in the training set from the graph embedding information of each entity node, and training a classifier on the training set based on this graph embedding information to obtain the trained classifier.
Preferably, in step S5, when the classifier executes the classification task of the entities to be classified based on the entity confidences, a classification threshold is set, and only entities to be classified whose entity confidence exceeds the threshold are classified.
Preferably, the reprocessing with the knowledge graph in step S5 to obtain the final entity classification result comprises:
A. determining the relation types of the initial entity node preliminarily classified by the classifier in the knowledge graph, and searching the knowledge graph for the alternative name of the initial entity node using the entity and relation information stored in a graph database; if the alternative name and the initial entity node are classified into the same class, the alternative name contains Chinese, and the initial entity node does not, deleting the initial entity node;
B. determining the relation types of the initial entity node preliminarily classified by the classifier in the knowledge graph, and searching the knowledge graph for the alternative name of the initial entity node; if the alternative name and the initial entity node are classified into the same class, the alternative name contains no Chinese, and the initial entity node contains Chinese, deleting the alternative name;
C. determining the relation types of the initial entity node preliminarily classified by the classifier in the knowledge graph, and searching the knowledge graph for the alternative name of the initial entity node; if the alternative name and the initial entity node are classified into the same class and both contain Chinese, keeping whichever of the initial entity node and the alternative name has the higher entity confidence;
D. for entities to be classified that have a nesting relationship, keeping the entity with the highest confidence;
E. adding the removed entity names to a query field, which serves as a key of the retained names and is used to query the entities' information.
The application also provides an entity classification system based on a massive knowledge graph and graph embedding, used to implement the above entity classification method and comprising:
the labeled data set acquisition unit, used for acquiring N corpus texts and labeling the entities and entity relations in the corpus texts to obtain a labeled data set;
the knowledge graph construction unit, which loads the named entity recognition model and the relation extraction model and trains and tests them with the labeled data set so as to perform entity recognition and relation extraction on the massive corpus text, then forms the knowledge graph based on the entities and relations;
the graph embedding information acquisition unit loads an introduced graph embedding model, and performs a prediction training task on the associated entity nodes in the knowledge graph by using the graph embedding model to obtain graph embedding information of each entity node;
the entity confidence coefficient calculation unit is used for determining the entities to be classified and calculating the entity confidence coefficient of the entities;
And the entity classification unit executes the classification task of the entity to be classified by using the classifier based on the graph embedding information of each entity node and the entity confidence coefficient of the entity, and performs reprocessing by using the knowledge graph to obtain the final entity classification result.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides an entity classification method and system based on massive knowledge maps and map embedding, which are based on massive corpus texts, introduce a named entity recognition model and a relation extraction model to extract entities and relations and form a knowledge map, generate map embedding information of various entities in the knowledge map by means of the map embedding model, namely, represent one entity by using vectors, realize the classification of entity nodes based on the map embedding information of the entity nodes and the entity confidence of the entities, finally utilize the knowledge map to carry out reprocessing, eliminate the noise of classification task results, have strong universality, improve the accuracy and recall rate, and facilitate the realization of better service classification effect in the field of large data.
Drawings
Fig. 1 is a schematic flowchart of an entity classification method based on a massive knowledge graph and graph embedding according to embodiment 1 of the present invention;
Fig. 2 is a schematic diagram showing a final star knowledge graph constructed by applying the entity classification method based on the mass knowledge graph and graph embedding proposed in embodiment 1 of the present invention;
FIG. 3 is a diagram showing a complete flow chart of embodiment 1 of the present invention;
fig. 4 is a block diagram of an entity classification system based on a mass knowledge graph and graph embedding according to embodiment 3 of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, some parts of the drawings may be omitted, enlarged or reduced, and do not represent actual sizes;
it will be understood by those skilled in the art that certain descriptions of well-known structures in the drawings may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
example 1
This embodiment provides an entity classification method based on a massive knowledge graph and graph embedding; a flowchart is shown in fig. 1, and the method comprises the following steps:
S1, acquiring N corpus texts, and labeling the entities and entity relations in the corpus texts to obtain a labeled data set. Here "N" denotes a sufficiently large number of corpus texts (corresponding to "massive"); after the corpus texts are obtained, they are labeled in turn with preset labels to obtain the labeled data set.
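The embodiment does not fix an annotation format; for Chinese corpora, character-level BIO tagging is a common choice, sketched below (the span format, example sentence, and label names are assumptions for illustration):

```python
def bio_tags(text, spans):
    """Convert character-span annotations (start, end, label) into
    per-character BIO tags; span format and labels are illustrative."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining characters
    return tags

# "Liu Dehua starred in Bomb Disposal Expert"
sentence = "刘德华出演了拆弹专家"
tags = bio_tags(sentence, [(0, 3, "PER"), (6, 10, "WORK")])
```

Each character receives O, B-label, or I-label, which is the form typically consumed by a BERT-based token classification head.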
S2, introducing a named entity recognition model and a relation extraction model, and training and testing the named entity recognition model and the relation extraction model by using a labeling data set so as to recognize entities in the massive corpus text and extract relations; then, forming a knowledge graph on the basis of the entity and the relation;
In this embodiment, on the basis of a BERT Chinese pre-training model, named entity recognition and relation extraction tasks for the corpus text are added downstream of the BERT model, yielding a named entity recognition model and a relation extraction model for entity recognition and relation extraction in the corpus text, respectively; both models are trained on the labeled data set before entity recognition and relation extraction are performed on the massive corpus text.
When the two models perform entity recognition and relation extraction on massive corpus text, the corpus text they face and the knowledge graph formed from the extracted entities and relations are continuously updated, which ensures the richness of cross-domain knowledge information and gives the constructed knowledge graph better universality and coverage.
S3, introducing a graph embedding model, and performing a prediction training task of the associated entity nodes in the knowledge graph by using the graph embedding model to obtain graph embedding information of each entity node;
In this embodiment, DeepWalk is adopted as the graph embedding model; DeepWalk can generate random walks on demand, scales well, and its walk-generation process is efficient and trivially parallelizable.
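DeepWalk's walk generation can be sketched in a few lines; the toy graph and node names below are illustrative, not from the patent:

```python
import random

def random_walks(adj, num_walks, walk_len, seed=42):
    """Generate truncated random walks over an adjacency dict, as DeepWalk
    does before feeding the walks to a skip-gram model."""
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):           # several passes over all nodes
        rng.shuffle(nodes)               # random start order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Illustrative mini knowledge graph as undirected adjacency lists.
graph = {
    "LiuDehua": ["Huazai", "BombDisposalExpert"],
    "Huazai": ["LiuDehua"],
    "BombDisposalExpert": ["LiuDehua"],
}
walks = random_walks(graph, num_walks=2, walk_len=4)
```

In a full pipeline these walk sequences would then be fed to a skip-gram model (for example gensim's Word2Vec) to produce the per-node embedding vectors.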
The process of using the graph embedding model to carry out the prediction training task of the associated entity nodes in the knowledge graph comprises the following steps:
S31, selecting an initial entity node in the knowledge graph, and predicting the termination tail node from the initial entity node by using the graph embedding model in combination with the relation types of the initial entity node in the knowledge graph;
S32, comparing the node type, node name, and inter-node relations of the predicted termination tail node with those of the real termination node to judge whether they are correct; if all are correct, keeping the current parameter weights of the graph embedding model and executing step S33; otherwise, adjusting the parameter weights of the graph embedding model and returning to step S31;
S33, randomly walking from each entity node to adjacent entity nodes in the knowledge graph based on the graph embedding model, and acquiring the local structure of each entity node in the knowledge graph so as to obtain the graph embedding information of each entity node.
In actual implementation this process is the graph embedding training. For example, taking the Liu Dehua node as the initial node, the model predicts a termination node according to a relation type such as "alternative name". Whether the node type, node name, and inter-node relations of the predicted termination node (e.g., Louis Vuitton, of type brand) and of the real node (e.g., Huazai, of type name) are correct is judged one by one, and the weights of the neural network model are adjusted accordingly to obtain more accurate results; when the prediction accuracy stops improving, training stops, and the graph embedding of each node is obtained.
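The text does not specify how the tail-node prediction is trained; one minimal way to realize "predict the termination node from an initial node and a relation type" is a translational link-prediction model in the TransE style, sketched here on a toy graph (TransE itself, the entity names, and all hyperparameters are assumptions, not the patent's stated method):

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["LiuDehua", "Huazai", "BombDisposalExpert", "Beijing"]
relations = ["alias", "starred_in"]
triples = [("LiuDehua", "alias", "Huazai"),
           ("LiuDehua", "starred_in", "BombDisposalExpert")]

dim, lr, margin = 8, 0.05, 1.0
E = {e: rng.normal(scale=0.5, size=dim) for e in entities}   # entity vectors
R = {r: rng.normal(scale=0.5, size=dim) for r in relations}  # relation vectors

def score(h, r, t):
    # Higher is better: negative distance between head + relation and tail.
    return -float(np.linalg.norm(E[h] + R[r] - E[t]))

# Margin-ranking training: push true tails closer than corrupted tails.
for _ in range(300):
    for h, r, t in triples:
        t_neg = entities[rng.integers(len(entities))]
        if t_neg == t:
            continue
        pos = E[h] + R[r] - E[t]
        neg = E[h] + R[r] - E[t_neg]
        loss = margin + np.linalg.norm(pos) - np.linalg.norm(neg)
        if loss > 0:
            gp = pos / (np.linalg.norm(pos) + 1e-9)
            gn = neg / (np.linalg.norm(neg) + 1e-9)
            E[h] -= lr * (gp - gn)
            R[r] -= lr * (gp - gn)
            E[t] += lr * gp
            E[t_neg] -= lr * gn
```

After training, `score` ranks the correct tail above corrupted tails, which mirrors the "compare predicted termination node with the real node, then adjust weights" loop of steps S31 and S32.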
After the graph embedding training is completed, classification is performed for different downstream classification tasks (e.g., industry multi-classification performed for brand entities, and star two classification performed for names of people or not), and step S4 is first performed:
s4, determining entities to be classified, and calculating entity confidence of the entities;
In step S4, after the entities to be classified are determined, the entity confidence with the entity as the initial entity node is calculated as follows.
First, the relation confidence of the entity is calculated as
relation confidence = min(100, volume contribution + knowledge confidence contribution),
where volume contribution = volume weight × (ln(vol) × 10), the volume weight is a preset value, ln is the natural logarithm, and vol is the mention volume of the entity in the knowledge graph data;
knowledge confidence contribution = knowledge confidence weight × knowledge confidence, where the knowledge confidence weight is a preset value of 0.7;
knowledge confidence = index distance, where index distance = 110 − abs(initial node index − termination node index), abs is the absolute value function, and 110 is the algorithm's preset maximum index distance. For example, in the text "Liu Dehua starred in Bomb Disposal Expert", the initial node "Liu Dehua" has text index 0 and the termination node "Bomb Disposal Expert" has text index 6, so the absolute index difference is 6 and the index distance is 110 − 6 = 104.
Then, all relation confidences of the entity are sorted in descending order, the top three are taken, and their average is used as the entity confidence of the entity.
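The confidence computation above can be sketched directly; the volume weight of 1.0 is an assumed preset, since the text only fixes the knowledge confidence weight (0.7) and the maximum index distance (110):

```python
import math

def relation_confidence(vol, start_idx, end_idx,
                        volume_weight=1.0, knowledge_weight=0.7):
    """relation confidence = min(100, volume contribution + knowledge
    contribution). volume_weight=1.0 is an assumed preset value."""
    volume_contrib = volume_weight * (math.log(vol) * 10)
    index_distance = 110 - abs(start_idx - end_idx)   # preset maximum 110
    knowledge_contrib = knowledge_weight * index_distance
    return min(100, volume_contrib + knowledge_contrib)

def entity_confidence(relation_confidences):
    """Average of the three largest relation confidences of the entity."""
    top3 = sorted(relation_confidences, reverse=True)[:3]
    return sum(top3) / len(top3)
```

For the worked example (vol = 1, indices 0 and 6), the volume term vanishes and the result is 0.7 × 104 = 72.8; a very frequently mentioned entity is capped at 100.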
In a specific implementation, what is realized on the basis of the knowledge graph and graph embeddings formed in steps S1-S4 is star classification, used for star image recognition and for recommending matching brands for a star to an enterprise: judging whether a node whose entity type is a person name is a star. After step S4 and before step S5, the method further comprises: dividing a training set from the labeled data set, obtaining the graph embedding information of the entity nodes in the training set from the graph embedding information of each entity node, and training a classifier on the training set based on this graph embedding information. The training set is manually labeled data that teaches the model how to classify; in this task, for example, it labels which person names are stars and which are not, yielding a trained classifier. A classifier learns a classification function, or constructs a classification model, from existing data; the function maps a data record to one of the given classes and can then be applied to the classification task. In short, "classifier" is a general term for methods that classify samples in data mining, including decision trees, logistic regression, naive Bayes, neural networks, and other algorithms.
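The classifier itself is left open (decision tree, logistic regression, naive Bayes, neural network); a minimal nearest-centroid stand-in over graph-embedding vectors, with made-up two-dimensional embeddings, might look like:

```python
class CentroidClassifier:
    """Assign a node the label of the nearest class centroid in
    graph-embedding space; a simplified stand-in for the patent's
    unspecified classifier."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for v, y in zip(vectors, labels):
            acc = sums.setdefault(y, [0.0] * len(v))
            for i, x in enumerate(v):
                acc[i] += x                      # accumulate per class
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [x / counts[y] for x in acc]
                          for y, acc in sums.items()}
        return self

    def predict(self, v):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(v, c))
        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))

# Made-up 2-D "graph embeddings" for star / non-star person-name nodes.
clf = CentroidClassifier().fit(
    [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]],
    ["non-star", "non-star", "star", "star"])
```

In practice the input vectors would be the DeepWalk embeddings of the person-name nodes, and a stronger model (e.g., logistic regression) would typically replace the centroid rule.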
And S5, based on the graph embedding information of each entity node and the entity confidence coefficient of the entity, executing a classification task of the entity to be classified by using a classifier, and reprocessing by using a knowledge graph to obtain a final entity classification result.
Based on the entity confidences, a classification threshold is set when the classifier executes the classification task of the entities to be classified, and only entities whose entity confidence exceeds the threshold are classified; that is, person-name nodes in the knowledge graph whose entity confidence exceeds the threshold are selected and judged to be stars or not. The higher the entity confidence, the more reliable the entity: it appears more often in the large volume of text data and has more relations with other entities.
The process of reprocessing by using the knowledge graph to obtain the final entity classification result comprises the following steps:
A. Determine the relation types of the initial entity node preliminarily classified by the classifier in the knowledge graph, and search the knowledge graph for the alternative name of the initial entity node using the entity and relation information stored in a graph database; if the alternative name and the initial entity node are classified into the same class, the alternative name contains Chinese, and the initial entity node does not, delete the initial entity node.
In this embodiment, all the lookups in the knowledge graph are performed in the database; the entities with their attributes, and the relations between entities, are stored in a graph database such as Neo4j.
B. Determine the relation types of the initial entity node preliminarily classified by the classifier in the knowledge graph, and search the knowledge graph for the alternative name of the initial entity node; if the alternative name and the initial entity node are classified into the same class, the alternative name contains no Chinese, and the initial entity node contains Chinese, delete the alternative name.
C. Determine the relation types of the initial entity node preliminarily classified by the classifier in the knowledge graph, and search the knowledge graph for the alternative name of the initial entity node; if the alternative name and the initial entity node are classified into the same class and both contain Chinese, keep whichever of the initial entity node and the alternative name has the higher entity confidence.
D. For entities to be classified that have a nesting relationship, keep the entity with the highest confidence.
E. Add the removed entity names to a query field, which serves as a key of the retained names and is used to query the entities' information. In this embodiment, for a person name classified as a star (hereinafter A), the following post-processing is performed step by step: taking A as the initial node and "alternative name" as the relation type, A's alternative name B is searched for in the knowledge graph; if B and A are both classified as stars, B contains Chinese, and A does not, A is deleted. Searching in the same way, if B and A are both classified as stars, B contains no Chinese, and A contains Chinese, B is deleted. If B and A are both classified as stars and both contain Chinese, whichever of A and B has the higher confidence is kept. For names with a nesting relationship (such as "Liu Dehua" and "Dehua"), the name with the higher confidence is kept. The removed entity names are added to a query field as keywords of the retained names, used to query the entities' related information, and the final star classification result is then output. A schematic of the constructed final star knowledge graph is shown in fig. 2, and the specific implementation flow of the above process is shown in fig. 3.
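The alternative-name rules A to C can be sketched as a small resolution function; the Unicode range test for "contains Chinese" and the tie-breaking by confidence are straightforward readings of the rules, but the exact implementation is an assumption:

```python
def contains_chinese(s):
    """True if the string contains any CJK unified ideograph."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in s)

def resolve_alias(name, alias, confidence):
    """Apply rules A-C to a (name, alternative-name) pair that the
    classifier placed in the same class; `confidence` maps entity name
    to entity confidence. Returns the entity to keep (the other is
    deleted). Simplified illustrative sketch."""
    name_cn, alias_cn = contains_chinese(name), contains_chinese(alias)
    if alias_cn and not name_cn:
        return alias          # rule A: drop the non-Chinese initial node
    if name_cn and not alias_cn:
        return name           # rule B: drop the non-Chinese alternative name
    # rule C (both Chinese) and the remaining case: keep higher confidence
    return name if confidence[name] >= confidence[alias] else alias
```

The same confidence comparison also covers rule D for nested names such as "刘德华" and "德华".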
Example 2
In this embodiment, in addition to the same process as in Example 1, for the entity recognition and relation extraction in step S2, an entity-relation joint extraction model can be used to extract the entities and relations in the corpus text simultaneously. Specifically, the classic model from the paper "End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures" or the model from "Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme" may be adopted for joint extraction.
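The tagging scheme of the second cited paper can be illustrated with a small encoder: each token in a relation triple receives a composite tag of position (B/I/E/S), relation type, and role (1 = head entity, 2 = tail entity), while all other tokens are tagged "O". This sketch assumes each entity span occurs once in the sentence; names and the example relation type are illustrative, not from the patent.

```python
def tag_joint(tokens, triples):
    """Encode (head, relation, tail) triples as per-token tags:
    tag = position(B/I/E/S) + '-' + relation type + '-' + role.
    `tokens` is the token list; each triple is
    (head_tokens, relation, tail_tokens)."""
    tags = ["O"] * len(tokens)
    for head, rel, tail in triples:
        for span, role in ((head, "1"), (tail, "2")):
            start = tokens.index(span[0])   # first occurrence of the span
            n = len(span)
            for i in range(n):
                pos = ("S" if n == 1 else       # single-token entity
                       "B" if i == 0 else       # begin
                       "E" if i == n - 1 else   # end
                       "I")                     # inside
                tags[start + i] = f"{pos}-{rel}-{role}"
    return tags
```

A sequence model trained on such tags can then recover both the entity boundaries and the relation in one decoding pass.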
Example 3
As shown in fig. 4, this embodiment provides an entity classification system based on a massive knowledge graph and graph embedding, the system being used to implement the entity classification method based on a massive knowledge graph and graph embedding described in Example 1 or Example 2, and comprising:
the labeled data set acquisition unit, used for acquiring N corpus texts and labeling the entities and entity relations in the corpus texts to obtain a labeled data set; here "N" denotes a sufficiently large number of corpus texts, corresponding to "massive" in the method; after the corpus texts are obtained, they are labeled in turn with preset labels to obtain the labeled data set.
The knowledge graph construction unit, which loads the named entity recognition model and the relation extraction model, and trains and tests them with the labeled data set so as to perform entity recognition and relation extraction on the massive corpus text; a knowledge graph is then formed on the basis of the entities and relations;
the graph embedding information acquisition unit, which loads the graph embedding model, and performs the associated-entity-node prediction training task in the knowledge graph with the graph embedding model to obtain the graph embedding information of each entity node;
the entity confidence calculation unit, used for determining the entities to be classified and calculating their entity confidence;
and the entity classification unit, which executes the classification task of the entities to be classified by using the classifier based on the graph embedding information of each entity node and the entity confidence, and performs reprocessing with the knowledge graph to obtain the final entity classification result.
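The units above can be sketched as a minimal pipeline skeleton. This is only an illustration of how the units fit together, with hypothetical class and method names; the trained models are plugged in as callables, and the confidence formula is left as a placeholder (it is specified in claim 6).

```python
class EntityClassificationPipeline:
    """Skeleton of the system units: graph construction, graph
    embedding, confidence calculation, and classification."""

    def __init__(self, ner_model, rel_model, embed_model, classifier):
        self.ner_model = ner_model      # text -> list of entities
        self.rel_model = rel_model      # text -> list of (head, rel, tail)
        self.embed_model = embed_model  # (entities, relations) -> {entity: vector}
        self.classifier = classifier    # (vector, confidence) -> class label

    def build_graph(self, corpus):
        """Knowledge graph construction unit: extract entities and relations."""
        entities = [e for text in corpus for e in self.ner_model(text)]
        relations = [r for text in corpus for r in self.rel_model(text)]
        return entities, relations

    def entity_confidence(self, entity, relations):
        """Entity confidence calculation unit (placeholder value;
        the actual formula is given in claim 6)."""
        return 0.0

    def run(self, corpus):
        entities, relations = self.build_graph(corpus)
        embeddings = self.embed_model(entities, relations)  # graph embedding unit
        return {e: self.classifier(embeddings[e],
                                   self.entity_confidence(e, relations))
                for e in entities}                          # classification unit
```

With stub models substituted for the trained ones, the skeleton runs end to end and returns a label per extracted entity.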
It should be understood that the above embodiments of the present invention are merely examples given to clearly illustrate the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. An entity classification method based on a massive knowledge graph and graph embedding, characterized by comprising the following steps:
s1, acquiring N corpus texts, and labeling entities and entity relations in the corpus texts to obtain a labeled data set;
s2, introducing a named entity recognition model and a relation extraction model, and training and testing the named entity recognition model and the relation extraction model by using a labeled data set so as to recognize entities in the massive corpus text and extract relations; then, forming a knowledge graph on the basis of the entity and the relation;
s3, introducing a graph embedding model, and performing a prediction training task of the associated entity nodes in the knowledge graph by using the graph embedding model to obtain graph embedding information of each entity node;
s4, determining entities to be classified, and calculating entity confidence of the entities;
and S5, based on the graph embedding information of each entity node and the entity confidence of the entity, executing a classification task of the entity to be classified by using a classifier, and performing reprocessing by using a knowledge graph to obtain a final entity classification result.
2. The entity classification method based on a massive knowledge graph and graph embedding according to claim 1, wherein in step S2, based on a BERT Chinese pre-trained model, a named entity recognition task and a relation extraction task for the corpus text are added downstream of the BERT Chinese pre-trained model to obtain a named entity recognition model and a relation extraction model for entity recognition and relation extraction in the corpus text, respectively; when the named entity recognition model and the relation extraction model perform entity recognition and relation extraction on the massive corpus text, the corpus texts they face are continuously updated, and the knowledge graph formed on the basis of the entities and relations is continuously updated accordingly.
3. The entity classification method based on a massive knowledge graph and graph embedding according to claim 1, wherein in step S2, an entity-relation joint extraction model can be used to extract the entities and relations in the corpus text simultaneously by joint extraction.
4. The entity classification method based on a massive knowledge graph and graph embedding according to claim 2, wherein the graph embedding model is DeepWalk, RandomWalk, or LINE.
5. The entity classification method based on the massive knowledge graph and graph embedding as claimed in claim 4, wherein the process of performing the associated entity node prediction training task in the knowledge graph by using the graph embedding model in step S3 is:
S31, selecting an initial entity node in the knowledge graph, and predicting the terminating tail node from the initial entity node by using the graph embedding model in combination with the relationship type of the initial entity node in the knowledge graph;
S32, comparing the node type, node name, and inter-node relationship of the predicted terminating tail node with those of the real terminating node to judge whether the prediction is correct; if all are correct, keeping the current parameter weights of the graph embedding model and executing step S33; otherwise, adjusting the parameter weights of the graph embedding model and returning to step S31;
S33, based on the graph embedding model, randomly walking from each entity node to adjacent entity nodes in the knowledge graph, and acquiring the local structure of each entity node in the knowledge graph to obtain the graph embedding information of each entity node.
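The random-walk sampling of step S33 can be sketched as follows: a DeepWalk-style truncated walk over an adjacency list, producing a corpus of walks that captures each node's local structure (in practice such walks are then fed to a skip-gram model to produce the embedding vectors). The function names and parameters are illustrative, not from the patent.

```python
import random

def random_walk(adj, start, length, rng=random):
    """Sample one truncated random walk from `start` over the
    adjacency list `adj` (node -> list of neighbour nodes)."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = adj.get(walk[-1], [])
        if not neighbours:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbours))
    return walk

def sample_walks(adj, walks_per_node=10, length=5, seed=0):
    """Build a corpus of walks, several per node, capturing each
    entity node's local structure in the graph."""
    rng = random.Random(seed)
    walks = []
    for node in adj:
        for _ in range(walks_per_node):
            walks.append(random_walk(adj, node, length, rng))
    return walks
```

Each walk is a node sequence in which every consecutive pair is an edge of the graph, so co-occurrence within walks reflects graph proximity.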
6. The entity classification method based on a massive knowledge graph and graph embedding according to claim 5, wherein in step S4, after the entity to be classified is determined, the entity confidence with the entity as the initial entity node is calculated, the calculation process comprising:
firstly, calculating the relationship confidence of the entity, expressed as:
relationship confidence = min(100, volume contribution + knowledge confidence contribution),
where volume contribution = volume weight × (ln(vol) × 10), the volume weight is a preset value, ln is the natural logarithm, and vol is the volume of the entity in the knowledge graph data;
knowledge confidence contribution = knowledge confidence weight × [formula given as image FDA0003492916090000021 in the original], where the knowledge confidence weight is a preset value of 0.7;
knowledge confidence: [formula given as image FDA0003492916090000022 in the original];
index distance = 110 − abs(initial node index − end node index), where abs is the absolute value function and 110 is the algorithm's preset maximum index distance;
then, sorting all the relationship confidences of the entity in descending order, taking the top three, and using their average as the entity confidence of the entity.
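The confidence computation of claim 6 can be sketched as follows. Because the knowledge confidence formula appears only as an image in the original, it is taken here as a precomputed input; the default weights and function names are illustrative assumptions, not values from the patent.

```python
import math

def relationship_confidence(vol, knowledge_conf,
                            volume_weight=1.0, knowledge_weight=0.7):
    """relationship confidence = min(100, volume contribution +
    knowledge confidence contribution), with
    volume contribution = volume weight * (ln(vol) * 10)."""
    volume_contribution = volume_weight * (math.log(vol) * 10)
    knowledge_contribution = knowledge_weight * knowledge_conf
    return min(100, volume_contribution + knowledge_contribution)

def entity_confidence(rel_confidences):
    """Entity confidence: average of the three largest
    relationship confidences of the entity."""
    top3 = sorted(rel_confidences, reverse=True)[:3]
    return sum(top3) / len(top3)
```

Note the cap at 100: a very high-volume entity saturates the relationship confidence regardless of the knowledge term.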
7. The method for entity classification based on massive knowledge maps and graph embedding of claim 6, wherein after step S4, before step S5, the method further comprises: and dividing a training set from the labeled data set, obtaining graph embedding information of the entity nodes in the training set according to the graph embedding information of each entity node, and training a classifier by using the training set based on the graph embedding information to obtain the trained classifier.
8. The entity classification method based on a massive knowledge graph and graph embedding according to claim 7, wherein in step S5, based on the entity confidence of the entity, when the classifier is used to execute the classification task of the entities to be classified, a classification threshold is set, and only the entities to be classified whose entity confidence is greater than the classification threshold are classified.
9. The entity classification method based on massive knowledge graphs and graph embedding of claim 8, wherein the step S5 of using the knowledge graphs for reprocessing to obtain the final entity classification result comprises:
A. determining, in the knowledge graph, the relationship type of the initial entity node preliminarily classified by the classifier, searching for the alternative name of the initial entity node by using the entity and relation information stored in the graph database of the knowledge graph, and deleting the initial entity node if the alternative name and the initial entity node are classified into the same type, the alternative name contains Chinese, and the initial entity node does not;
B. determining, in the knowledge graph, the relationship type of the initial entity node preliminarily classified by the classifier, searching the knowledge graph for the alternative name of the initial entity node, and deleting the alternative name if the alternative name and the initial entity node are classified into the same type, the alternative name does not contain Chinese, and the initial entity node does;
C. determining, in the knowledge graph, the relationship type of the initial entity node preliminarily classified by the classifier, searching the knowledge graph for the alternative name of the initial entity node, and if the alternative name and the initial entity node are classified into the same type and both contain Chinese, keeping whichever of the initial entity node and the alternative name has the higher entity confidence;
D. for entities to be classified that have a nesting relationship, keeping the entity with the highest confidence;
E. adding the removed entity names to a query field of the retained name, used as keywords for querying information about the entity.
10. An entity classification system based on a massive knowledge graph and graph embedding, used for implementing the entity classification method based on a massive knowledge graph and graph embedding of claim 1, comprising:
a labeled data set acquisition unit, used for acquiring N corpus texts and labeling the entities and entity relations in the corpus texts to obtain a labeled data set;
a knowledge graph construction unit, which loads the named entity recognition model and the relation extraction model, and trains and tests them with the labeled data set so as to perform entity recognition and relation extraction on the massive corpus text; a knowledge graph is then formed on the basis of the entities and relations;
a graph embedding information acquisition unit, which loads the graph embedding model, and performs the associated-entity-node prediction training task in the knowledge graph with the graph embedding model to obtain the graph embedding information of each entity node;
an entity confidence calculation unit, used for determining the entities to be classified and calculating their entity confidence;
and an entity classification unit, which executes the classification task of the entities to be classified by using the classifier based on the graph embedding information of each entity node and the entity confidence, and performs reprocessing with the knowledge graph to obtain the final entity classification result.
CN202210103224.XA 2022-01-27 2022-01-27 Entity classification method and system based on massive knowledge graph and graph embedding Pending CN114564579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103224.XA CN114564579A (en) 2022-01-27 2022-01-27 Entity classification method and system based on massive knowledge graph and graph embedding


Publications (1)

Publication Number Publication Date
CN114564579A true CN114564579A (en) 2022-05-31

Family

ID=81713651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103224.XA Pending CN114564579A (en) 2022-01-27 2022-01-27 Entity classification method and system based on massive knowledge graph and graph embedding

Country Status (1)

Country Link
CN (1) CN114564579A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992959A (en) * 2023-06-02 2023-11-03 广州数说故事信息科技有限公司 Knowledge graph-based food product creative concept generation and verification method
CN116992959B (en) * 2023-06-02 2024-03-15 广州数说故事信息科技有限公司 Knowledge graph-based food product creative concept generation and verification method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination