CN116595982A - Nested named entity identification method based on dynamic graph convolution - Google Patents
- Publication number: CN116595982A (application CN202310566702.5A)
- Authority: CN (China)
- Prior art keywords: word, sequence, graph, speech, named entity
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/295—Named entity recognition (G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F16/35—Clustering; Classification (information retrieval of unstructured textual data)
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
- G06N3/0442—Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of computer-based language recognition and processing, in particular to nested named entity recognition, and discloses a nested named entity recognition method based on dynamic graph convolution, comprising the following steps: for a natural language text, map and characterize text features using a knowledge representation technique; model a syntactic relation graph with a graph structure according to the part-of-speech dependency information of the text; extract ontology attribute features and semantic-similarity features of the text by dynamic graph convolution; and locate and classify entities with a two-stage recognition strategy. The invention overcomes the insufficient feature extraction and insufficient feature mining of existing sequence-based models, weakens the strict temporal order of information propagation, improves recognition of texts with unusual word order and of low-frequency entities, reduces the missed-detection rate of exact-boundary recognition, and enhances system robustness, making it worthy of popularization and application.
Description
Technical Field
The invention relates to the technical field of computer-based language recognition and processing, and in particular to nested named entity recognition techniques.
Background
Nested named entity recognition is a component of natural language processing tasks such as question answering, information retrieval, and text summarization; it aims to recognize shorter entities nested inside longer ones. About 37% of sentences in a broadcast-news corpus contain nested entities, and roughly 17% of the entities in a biomedical-literature corpus are embedded in another entity, so entity nesting accounts for a non-negligible share of existing corpora. Recognizing nested entities captures finer-grained semantic information and thus better serves downstream natural language applications.
Mainstream named entity recognition research is based on sequence labeling models: a sequence feature model such as a long short-term memory network, combined with a conditional random field, outputs the most probable label for each English token or Chinese character of the input text. Such methods, however, perform poorly on nested entities.
In recent years, dedicated model structures have been proposed for the nested-entity phenomenon. Early rule-based models had domain experts craft entity-structure rules for entity prediction. However, rule-based methods are limited by individual cognitive differences, are strongly domain-dependent, do not scale, and are time- and labor-intensive to build, so their recognition performance is unsatisfactory.
It was later proposed to capture nested entities with dedicated structures such as region graphs and hypergraphs, by treating the entities of a sentence as the best sub-hypergraph of the original complete hypergraph or as spans of a parse tree. The hypergraph structure uses five node types to compactly represent entities of different semantic categories and their boundaries. Moreover, hyperarcs handle nesting very naturally because one hyperarc can connect two or more nodes; together these paths form a unique sub-hypergraph of the original hypergraph expressing all nested entities in the sentence. However, constructing accurate structures for nested entities requires substantial manual effort to avoid spurious structures and structural ambiguity, which is costly and inefficient.
With the development of machine learning and deep learning, deep-learning-based nested entity recognition methods have appeared, such as stacked flat-entity recognition models and span-enumeration approaches. Span enumeration must classify all subsequences, which is computationally expensive, slow at inference, and unsupervised with respect to boundary information. Moreover, existing models score well when trained and tested on standard datasets, but their results on validation and test sets lag behind the training set. In natural-language dialogue scenarios in particular, out-of-vocabulary entities and disordered word order are common; existing models validate poorly on low-frequency and disordered entities: when most entities and nesting patterns in the test set differ from those in the training set, recognition accuracy drops markedly and model robustness is weak.
Nested named entity recognition models based on sequential feature extraction focus only on sequence-context features. They obtain reasonable results through sequence decoding, but they do not exploit interaction information in the syntactic space of the text, such as part of speech and coreference.
In addition, it has been proposed to recast nested entity recognition as a question-answering task: the text is the input, and the positions and categories of its entities are output as answers. Such methods open new directions for nested entity recognition strategies, but leave room for improvement in performance and applicable scenarios.
In summary, there is no nested named entity recognition model capable of fully extracting sequence, text ontology attribute features and semantic features, and the robustness of the existing model is still to be enhanced.
Disclosure of Invention
The invention aims to provide a nested named entity recognition method based on dynamic graph convolution that overcomes the defects of the prior art and improves the accuracy and efficiency of candidate generation and category recognition.
To solve the above technical problems, the invention provides a nested named entity recognition method based on dynamic graph convolution, comprising the following steps:
s1: aiming at natural language texts, mapping and characterizing text features by adopting a knowledge representation technology;
s2: modeling a grammar relation graph by using a graph structure according to part-of-speech dependency information of the text;
s3: extracting attribute characteristics and semantic similarity characteristics of the text body by adopting a dynamic graph convolution mode;
s4: the two-stage recognition strategy is used for locating and classifying entities.
The step S1 includes the steps of:
s11: take each given sequence in the data set as a unit, where the data set is text data and a sequence is a complete sentence ending with a period; represent each word in the sequence as a word matrix of character vectors via a convolutional neural network, apply one conventional convolution layer to the word matrix, and obtain a character-level vector by max pooling;
s12: obtaining word-level vectors by adopting a BERT pre-training word vector table; BERT is an abbreviation of Bidirectional Encoder Representation from Transformers, a pre-trained word embedding model;
s13: concatenate the obtained character-level and word-level vectors and extract context features through a bidirectional long short-term memory network to obtain an initialized vector representation;
s14: feed the word sequence in reverse order into a long short-term memory (LSTM) network to obtain the reverse word-vector representation, and concatenate the forward and reverse encoding results to obtain the output of the word context feature encoding.
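The data flow of steps S13 and S14 can be sketched as follows. This is a minimal NumPy illustration of the forward/backward concatenation only: a toy recurrence with random weights stands in for the trained LSTM gates, and the hidden dimension is an assumption, not a value from the patent.

```python
import numpy as np

def bilstm_like_encode(X, h_dim=4, seed=0):
    """Toy stand-in for steps S13-S14: run a recurrence over the word
    vectors in forward and in reverse order, then concatenate both
    hidden states per word. Random weights replace trained gates."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1] + h_dim, h_dim))

    def run(seq):
        h, out = np.zeros(h_dim), []
        for x in seq:
            h = np.tanh(np.concatenate([x, h]) @ W)  # simple recurrent update
            out.append(h)
        return out

    fwd = run(X)              # forward pass over the word sequence
    bwd = run(X[::-1])[::-1]  # reverse-order pass, re-aligned to word positions
    return np.array([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

H = bilstm_like_encode(np.eye(3))  # 3 words with one-hot toy features
print(H.shape)  # (3, 8): forward 4 dims + backward 4 dims per word
```

Each word thus receives a context vector that sees both its left and its right neighbors, which is the property the method relies on.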
The step S2 includes:
s21: take each word in the sentence sequence as a node in the graph, and construct a sequence edge between each pair of adjacent word nodes in the context according to their order, obtaining the sequence-graph adjacency matrix; the edges are undirected, so information can propagate in both the forward and backward directions;
s22: use the part-of-speech parser in the NLTK library (Natural Language Toolkit, a common natural language processing toolkit) to obtain part-of-speech relations, construct edges between word nodes in high-frequency part-of-speech dependency relations, and assign each edge a dependency-strength weight, obtaining the part-of-speech dependency-graph adjacency matrix; a high-frequency part-of-speech dependency relation is a dependency combination between parts of speech whose statistical frequency reaches a certain level;
the step S3 includes:
s31: perform one to k rounds of graph convolution on the sequence graph and the part-of-speech dependency graph respectively, propagating and updating to aggregate first- to k-order neighbor information; the number of rounds is chosen by experimental effect, and k is a natural number within an empirical range;
s32: use a bisecting K-means clustering algorithm to dynamically sample coreferent nodes and add edges between them, defined as coreference edges; coreferent nodes are nodes that share the same category label or are close in semantic space.
The step S4 includes:
s41: input the node feature vectors produced by the feature-extraction module into a classifier for label decoding, dividing the boundary label of each word node into two classes, entity component and non-entity component;
s42: combine the nodes recognized as entity components by adjacency to obtain candidate spans, then input the normalized feature vector of each span into the category-prediction module for prediction;
s43: input the obtained span representations into the span category-prediction module and predict categories over the normalized input with a Softmax(·) function.
The step S41 classifies the boundary label of each word node into two classes, specifically:
a fuzzy boundary-label strategy divides the boundary label of each word node into entity component and non-entity component, with the calculation formula
P_b = Softmax(MLP(x_final))
where x_final denotes the sequence feature representation obtained by the feature-extraction module, MLP(·) is a multi-layer perceptron, and the final boundary-label classifier uses a Softmax(·) function for classification.
The invention has the following beneficial effects:
1. Compared with the insufficient feature mining of prior models, the proposed feature extraction based on a dynamic graph convolution network flexibly exploits part-of-speech dependency information obtained by statistical analysis of the data set and dynamically generates a coreference relation graph; graph-structured information propagation weakens strict temporal order, improving recognition of texts with unusual word order and of low-frequency entities while improving model robustness.
2. The invention uses the simple and efficient information storage of a graph structure, maps text units into a feature space by spatial mapping, fuses different semantic and syntactic information from multiple graph structures via dynamic graph convolution, and adopts a two-stage recognition strategy, overcoming the high cost of enumeration-based approaches and the boundary blurring of hierarchical models.
3. The invention extracts features on a graph structure, transferring and fusing sequential, syntactic, and semantic feature information topologically with continuous iterative updating, so that relations among text units are fully reflected, multi-granularity text features are fully learned, and the accuracy and efficiency of candidate generation and category recognition are improved.
4. The invention adopts a fuzzy boundary recognition strategy to generate candidate entities, reducing the missed-detection rate of exact-boundary recognition and improving recognition recall.
Drawings
The technical scheme of the invention is further specifically described below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a part-of-speech dependency graph of the present invention.
Fig. 2 is a split-map convolution illustration of the present invention.
Detailed Description
The invention provides a nested named entity recognition method based on dynamic graph convolution, comprising the following steps:
s1: aiming at natural language texts, mapping and characterizing text features by adopting a knowledge representation technology;
s2: modeling a grammar relation graph by using a graph structure according to part-of-speech dependency information of the text;
s3: extracting attribute characteristics and semantic similarity characteristics of the text body by adopting a dynamic graph convolution mode;
s4: the two-stage recognition strategy is used for locating and classifying entities.
Specifically, the relevant corpus is first preprocessed to obtain a distributed representation of the text; the main steps are shown in fig. 1 and detailed as follows:
s11: take each given sequence in the data set as a unit, where the data set is text data and each given sequence is a complete sentence ending with a period; the sequence is defined as X = [x_1, x_2, ..., x_n], where n is the number of words it contains. In the initialization stage the sequence is given a knowledge representation. Specifically, the character-level encoding of each word is first obtained through a convolutional neural network: the network builds a dictionary for the characters, applies one-hot encoding, sets the feature dimension, represents each word as a word matrix of character vectors, applies one conventional convolution layer to the matrix, and obtains the final character-level vector by max pooling;
s12: next, word-level encodings are initialized with BERT pre-trained word vectors. Specifically, for each word in the text sequence, a word-level vector representation x_word = BERT_emb(x) is obtained by looking up the preloaded BERT pre-trained word-vector table;
S13: next, the obtained character-level and word-level vectors are concatenated, and context features are extracted through a bidirectional long short-term memory network to obtain the initialized vector representation. The calculation formulas are:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h⃗_t = o_t ⊙ tanh(c_t)
where each W and b is a parameter to be trained, x_t denotes the t-th word in the sentence, h_{t-1} denotes the hidden state at the previous step, f_t denotes the output of the forget gate, i_t denotes the output of the memory gate, c̃_t denotes the temporary cell state, c_t denotes the current cell state, o_t denotes the output gate, and h⃗_t denotes the forward vector.
S14: the word sequence is fed into an LSTM network in reverse order to obtain the reverse word-vector representation h⃖_t, and the forward and reverse encoding results are concatenated to obtain the output of the word context feature encoding h_t = [h⃗_t ; h⃖_t].
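The character-level half of this encoding (step S11) can be sketched as follows. This is a minimal NumPy illustration: the alphabet, filter width, and filter count are illustrative assumptions, and random convolution weights stand in for trained ones.

```python
import numpy as np

def char_level_vector(word, alphabet="abcdefghijklmnopqrstuvwxyz",
                      filter_width=3, n_filters=8, seed=0):
    """Step S11 sketch: one-hot character matrix -> one conventional
    convolution layer -> max pooling over character positions."""
    rng = np.random.default_rng(seed)
    idx = [alphabet.index(c) for c in word.lower() if c in alphabet]
    mat = np.zeros((len(idx), len(alphabet)))   # word matrix of char vectors
    mat[np.arange(len(idx)), idx] = 1.0
    W = rng.standard_normal((n_filters, filter_width, len(alphabet)))
    windows = [mat[i:i + filter_width]
               for i in range(len(idx) - filter_width + 1)]
    feats = np.array([(W * win).sum(axis=(1, 2)) for win in windows])
    return feats.max(axis=0)                    # max pooling -> char-level vector

vec = char_level_vector("entity")
print(vec.shape)  # (8,)
```

In the full method this character-level vector would be concatenated with the BERT word-level vector before the bidirectional LSTM.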
S21: next, the graph is constructed. Each word x_i in the sentence sequence becomes a node n_i in the graph. First, sequence edges E_s = [e_{1-2}, e_{2-3}, e_{3-4}, ..., e_{(n-1)-n}] are constructed between adjacent word nodes in the context according to their order; the edges are undirected, so information propagates in both the forward and backward directions, yielding the sequence-graph adjacency matrix A_t ∈ R^{n×n};
S22: then the part-of-speech parser in the NLTK library is used to obtain part-of-speech relations, and edges are constructed between word nodes in high-frequency part-of-speech dependency relations, such as the common adjective-noun modification combinations.
Specifically, the high-frequency part-of-speech dependency relations in the GENIA corpus are shown in table 1. Accordingly, edges are constructed between adjectives and nouns according to the part-of-speech dependency frequencies counted over all entities in the corpus, illustrated schematically in fig. 1: circles represent word nodes, node indices correspond one-to-one with the indices marked above the sentence, edges are part-of-speech dependency edges, values on the edges are edge weights, and isolated points are nodes without any dependency in the part-of-speech structure analysis.
TABLE 1 word dependency statistics in GENIA
Considering differences in dependency strength, edge weights are assigned from the statistical frequencies as part-of-speech dependency correlation scores. Edges E_r = [e_{2-3}, e_{2-4}, e_{3-4}] can thus be added according to the part-of-speech dependencies, with corresponding edge weights 0.5, 0.3, and 0.3, yielding the part-of-speech dependency-graph adjacency matrix A_s ∈ R^{n×n}.
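The two adjacency matrices can be built as in the following sketch; the node indices and edge weights mirror the E_r = [e_{2-3}, e_{2-4}, e_{3-4}] example above (converted to 0-based indexing).

```python
import numpy as np

def build_graphs(n, pos_edges):
    """Sequence-graph adjacency A_t: undirected edges between adjacent
    word nodes. Part-of-speech dependency adjacency A_s: weighted
    undirected edges given as (i, j, weight) triples, 0-based."""
    A_t = np.zeros((n, n))
    for i in range(n - 1):                # sequence edges e_{i-(i+1)}
        A_t[i, i + 1] = A_t[i + 1, i] = 1.0
    A_s = np.zeros((n, n))
    for i, j, w in pos_edges:             # POS dependency edges with weights
        A_s[i, j] = A_s[j, i] = w
    return A_t, A_s

# 1-based edges 2-3, 2-4, 3-4 with weights 0.5, 0.3, 0.3
A_t, A_s = build_graphs(5, [(1, 2, 0.5), (1, 3, 0.3), (2, 3, 0.3)])
print(A_s[1, 2], A_t[0, 1])  # 0.5 1.0
```

Both matrices are symmetric, which encodes the undirected (bidirectional) information flow described above.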
S31: one to k rounds of graph convolution are performed on the sequence graph and the part-of-speech dependency graph respectively, propagating and updating to aggregate first- to k-order neighbor information, where k is a natural number empirically ranging from 3 to 6. The graph convolution formulas are shown below. As shown in fig. 2, the upper dashed box represents the sequence graph built from the sequence context and the lower dashed box the part-of-speech dependency graph built from the part-of-speech dependency relations; each graph undergoes one to three rounds of convolution, and the updated feature maps are concatenated to obtain the output feature map on the right;
G_t = GCN(X_t, A_t)
G_s = GCN(X_s, A_s)
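A hedged sketch of the k-round propagation follows. Random weights stand in for the trained GCN parameters, and the symmetric normalization shown is one standard GCN formulation assumed here for illustration, not necessarily the exact variant of the patent.

```python
import numpy as np

def gcn_rounds(X, A, k):
    """k rounds of H <- relu(D^-1/2 (A + I) D^-1/2 H W); each round
    mixes in one further hop of neighbor information."""
    rng = np.random.default_rng(0)
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    P = D_inv_sqrt @ A_hat @ D_inv_sqrt      # normalized adjacency
    H = X
    for _ in range(k):
        W = rng.standard_normal((H.shape[1], H.shape[1]))  # untrained stand-in
        H = np.maximum(P @ H @ W, 0.0)       # one convolution round + ReLU
    return H

X = np.eye(4)  # toy one-hot node features for a 4-word chain graph
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
H = gcn_rounds(X, A, k=3)
print(H.shape)  # (4, 4)
```

After k = 3 rounds each node's representation depends on neighbors up to three hops away, which is the k-order neighbor aggregation described in S31.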
s32: next, a bisecting K-means clustering algorithm dynamically samples coreferent nodes and adds coreference edges. Coreferent nodes are nodes that share the same category label or are close in semantic space. The algorithm automatically clusters coreferent nodes according to the spatial distance of their semantic representations and constructs edges between them.
Specifically, the graph of all nodes is first treated as one cluster, which is then recursively split into two new clusters until the specified number of clusters is reached. First the K-means algorithm is invoked: the data set is divided into two clusters by computing Euclidean distances between feature vectors, and each cluster has its own sum of squared errors, called the parent SSE. Next, each of these clusters is itself 2-means partitioned, and the summed SSE of its two sub-clusters is recorded, called the child SSE. After each trial split, the difference SSE_diff = SSE_parent - SSE_children is recorded; the cluster with the largest SSE difference is split further, while the other clusters stop splitting. The bisecting step repeats until the total number of clusters reaches K. The sampling distance formula and the SSE formula are shown below.
dist(x, y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )
SSE = Σ_{x ∈ C} dist(x, μ_C)^2
where x and y are feature vectors of node samples, n denotes the dimension of the vectors, dist(·) denotes the Euclidean distance between the two vectors, C is a cluster, and μ_C is its centroid.
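The bisecting procedure described above can be sketched in plain NumPy. The inner 2-means and the SSE_parent - SSE_children splitting rule follow the description; initialization details (random center choice) are assumptions of this sketch.

```python
import numpy as np

def kmeans2(X, seed=0, iters=20):
    """Plain 2-means: returns labels and the per-cluster SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)  # squared distances
        labels = d.argmin(1)
        for c in range(2):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    sse = [((X[labels == c] - centers[c]) ** 2).sum() for c in range(2)]
    return labels, sse

def bisecting_kmeans(X, K, seed=0):
    """Repeatedly split the cluster with the largest SSE drop
    (SSE_parent - SSE_children) until K clusters remain."""
    clusters = [np.arange(len(X))]
    while len(clusters) < K:
        best = None
        for ci, idx in enumerate(clusters):
            if len(idx) < 2:
                continue
            parent_sse = ((X[idx] - X[idx].mean(0)) ** 2).sum()
            labels, child_sse = kmeans2(X[idx], seed)
            diff = parent_sse - sum(child_sse)
            if best is None or diff > best[0]:
                best = (diff, ci, idx[labels == 0], idx[labels == 1])
        _, ci, a, b = best
        clusters[ci:ci + 1] = [a, b]     # replace parent with its two children
    return clusters

X = np.array([[0.0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10.1, 0]])
parts = bisecting_kmeans(X, K=3)
print(sorted(len(p) for p in parts))  # typically [2, 2, 2] for these pairs
```

In the model, X would hold the node feature vectors and each resulting cluster would supply the coreference edges of step s32.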
For the vector representations learned at the word nodes, the model weakens the sequential characteristic and enriches generalizable features, while carrying ontology-attribute, syntactic-structure, and semantic-similarity characteristics. Being less affected by word order, the model can still recognize effectively when word order is scrambled, so its classification results over these features are more robust.
S41: the node feature vectors obtained by the feature-extraction module are then input into a classifier for label decoding.
Specifically, the fuzzy boundary-label strategy divides the boundary label of each word node into two classes, entity component (denoted 1) and non-entity component (denoted 0), with the calculation formula shown below.
P_b = Softmax(MLP(x_final))
where x_final denotes the sequence feature representation obtained by the feature-extraction module, MLP(·) is a multi-layer perceptron, and the final boundary-label classifier uses a Softmax(·) function for classification.
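A toy numerical illustration of P_b = Softmax(MLP(x_final)): a single linear layer stands in for the MLP, and all features and weights are hand-picked hypothetical values, not trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(-1, keepdims=True)

def boundary_labels(X_final, W, b):
    """Fuzzy boundary tagging: linear layer (1-layer MLP stand-in) +
    Softmax; label 1 = entity component, 0 = non-entity component."""
    P_b = softmax(X_final @ W + b)            # shape (n_words, 2)
    return P_b.argmax(-1)

# toy features for a 4-word sentence, hypothetical weights
X_final = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 1.5], [2.0, 0.0]])
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
labels = boundary_labels(X_final, W, b=np.zeros(2))
print(labels.tolist())  # [0, 1, 1, 0]
```

Each row of P_b is a probability distribution over {non-entity, entity}, and the argmax gives the binary boundary tag used by the next stage.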
S42: the nodes recognized as entity components are then combined by adjacency into candidate spans, and the normalized feature vector of each span is input into the category-prediction module for prediction.
Specifically, the head and tail vectors of the span are concatenated and passed once through a fully connected layer for normalization, per the formula below.
Span = MLP([x_s : x_{s+n}])
S43: finally, the obtained span representation is input into the span category-prediction module, and category prediction over the normalized input uses Softmax:
P_c = Softmax(Span)
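The span-assembly half of this two-stage decoding, merging adjacent entity-component labels into candidate spans before category prediction, can be sketched as:

```python
def candidate_spans(labels):
    """Merge runs of adjacent words tagged 1 (entity component) into
    candidate spans (start, end), inclusive and 0-based."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i                       # a new run of entity words begins
        elif lab != 1 and start is not None:
            spans.append((start, i - 1))    # the run ended at the previous word
            start = None
    if start is not None:                   # run extends to the last word
        spans.append((start, len(labels) - 1))
    return spans

print(candidate_spans([0, 1, 1, 0, 1]))  # [(1, 2), (4, 4)]
```

Each resulting (start, end) pair is a candidate span whose head/tail vectors are then concatenated, normalized, and classified with Softmax as in S42 and S43.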
The proposed nested named entity recognition technique based on dynamic graph convolution extracts features on a graph structure, transferring and fusing sequential, syntactic, and semantic feature information topologically with continuous iterative updating, so that relations among text units are fully reflected, multi-granularity text features are fully learned, and the accuracy and efficiency of candidate generation and category recognition are improved. It exploits the simple and efficient information storage of the graph structure, maps text units into a feature space by spatial mapping, fuses different semantic and syntactic information from multiple graph structures through dynamic graph convolution, and adopts a two-stage recognition strategy, overcoming the high cost of enumeration-based approaches and the boundary blurring of hierarchical models.
Finally, it should be noted that the above-mentioned embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention, and all such modifications and equivalents are intended to be encompassed in the scope of the claims of the present invention.
Claims (6)
1. A nested named entity recognition method based on dynamic graph convolution, characterized by comprising the following steps:
s1: for a natural language text, mapping and characterizing text features by adopting a knowledge representation technique;
s2: modeling a grammar relation graph with a graph structure according to part-of-speech dependency information of the text;
s3: extracting attribute features and semantic similarity features of the text body in a dynamic graph convolution manner;
s4: locating and classifying entities with a two-stage recognition strategy.
2. The nested named entity recognition method based on dynamic graph convolution according to claim 1, wherein the step S1 comprises the steps of:
s11: taking each given sequence in the data set as a unit, where the data set is text data and a sequence is a complete sentence ending with a period; representing each word in the sequence, through a convolutional neural network, as a word matrix composed of character vectors; then applying a conventional one-layer convolution operation to the word matrix and obtaining a character-level vector by max pooling;
s12: obtaining word-level vectors by adopting a BERT pre-training word vector table;
s13: concatenating the obtained character-level and word-level vectors and extracting context features through a bidirectional long short-term memory network to obtain an initialized vector representation;
s14: inputting the reversed word sequence into a long short-term memory (LSTM) network to obtain a reverse word-vector representation, and concatenating the forward and reverse word encoding results to obtain the output of the word context feature encoding.
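Steps S11 to S13 can be sketched as a minimal NumPy example. The character embeddings, convolution filters, and BERT word vector below are random or zero placeholders rather than trained parameters, and the BiLSTM context encoding of steps S13 to S14 is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, N_FILTERS, KERNEL = 8, 6, 3

def char_level_vector(word: str) -> np.ndarray:
    # Step S11: character matrix -> one convolution layer -> max pooling.
    # Character embeddings and filters are random placeholders here.
    chars = rng.standard_normal((len(word), CHAR_DIM))
    if len(word) < KERNEL:                       # pad short words
        chars = np.vstack([chars, np.zeros((KERNEL - len(word), CHAR_DIM))])
    filters = rng.standard_normal((N_FILTERS, KERNEL, CHAR_DIM))
    windows = np.stack([chars[i:i + KERNEL]
                        for i in range(chars.shape[0] - KERNEL + 1)])
    feature_maps = np.einsum("wkc,fkc->fw", windows, filters)
    return feature_maps.max(axis=1)              # max pooling over positions

def word_representation(word: str, bert_vec: np.ndarray) -> np.ndarray:
    # Steps S12-S13: concatenate the character-level vector with the word
    # vector (a placeholder standing in for the BERT pre-trained lookup).
    return np.concatenate([char_level_vector(word), bert_vec])
```

In the claimed method the concatenated vectors would then pass through the BiLSTM of steps S13 to S14 before reaching the graph modules.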
3. The nested named entity recognition method based on dynamic graph convolution according to claim 2, wherein the step S2 comprises:
s21: taking each word in the sentence sequence as a node in the graph, and constructing a sequence edge between adjacent word nodes in the context according to their sequential relation, thereby obtaining a sequence graph adjacency matrix; the edges are undirected, so information can be propagated in both the forward and backward directions;
s22: obtaining part-of-speech relations with the part-of-speech parser in the NLTK library, building edges between word nodes with high-frequency part-of-speech dependency relations, and assigning the dependency strength to each edge as a weight, thereby obtaining a part-of-speech dependency graph adjacency matrix; a high-frequency part-of-speech dependency relation means that the dependency combination between the parts of speech reaches a certain statistical frequency.
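The two adjacency matrices of steps S21 and S22 can be sketched as follows. The part-of-speech dependency strengths are assumed to be precomputed corpus statistics, and the NLTK tagging step itself is not shown:

```python
import numpy as np

def sequence_graph(n_words: int) -> np.ndarray:
    # Step S21: undirected sequence edges between adjacent word nodes.
    A = np.zeros((n_words, n_words))
    for i in range(n_words - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A

def pos_dependency_graph(tags, strength, threshold=0.3):
    # Step S22 sketch: connect word pairs whose part-of-speech pair has a
    # high-frequency dependency, using the strength as the edge weight.
    # `strength` maps (tag_i, tag_j) -> empirical dependency strength and
    # is assumed to be precomputed from corpus statistics.
    n = len(tags)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            w = strength.get((tags[i], tags[j]), 0.0)
            if i != j and w >= threshold:
                A[i, j] = w
    return A
```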
4. The nested named entity recognition method based on dynamic graph convolution according to claim 3, wherein the step S3 comprises:
s31: performing one to k rounds of graph convolution on the sequence graph and the part-of-speech dependency graph respectively, propagating and updating to obtain first- to k-order neighbor information, where the specific number of propagation rounds is selected according to experimental effect, k is a natural number, and its value range is an empirical value;
s32: dynamically sampling co-referent nodes with a bisecting K-means clustering algorithm and adding edges between them, defined as co-reference edges, where co-referent nodes are nodes that have the same category label or are close in the semantic space.
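The k-round propagation of step S31 can be sketched as a symmetric-normalized graph convolution. Trainable weight matrices are replaced by the identity so only the propagation pattern is shown; the bisecting K-means sampling of step S32 is omitted:

```python
import numpy as np

def graph_convolve(A: np.ndarray, X: np.ndarray, k: int = 2) -> np.ndarray:
    # Step S31: k rounds of normalised propagation, so each node
    # aggregates up to k-order neighbour information.
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))     # D^{-1/2} (A + I) D^{-1/2}
    H = X
    for _ in range(k):
        H = np.maximum(A_norm @ H, 0.0)          # ReLU non-linearity
    return H
```

The same routine would be applied to both the sequence graph and the part-of-speech dependency graph, with the round count k chosen empirically as the claim states.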
5. The nested named entity recognition method based on dynamic graph convolution according to claim 3, wherein the step S4 comprises:
s41: inputting the node feature vectors obtained by the feature extraction module into a classifier for label decoding, the boundary label of each word node being classified into two types, entity constituent and non-entity constituent;
s42: merging the nodes identified as entity constituents according to adjacency to obtain candidate spans, and then performing a normalization operation on the span feature vectors;
s43: the second stage of the two-stage entity recognition model is category prediction: the obtained span feature vectors are input into a span category prediction module, which applies the Softmax(·) function to the normalized input for category prediction.
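The two-stage strategy of steps S41 to S43 can be sketched as follows: binary boundary labels are merged into candidate spans, which are then classified with a Softmax layer. The classifier weights here are placeholders, not the trained parameters of the claimed model:

```python
import numpy as np

def extract_spans(is_entity):
    # Step S42: merge adjacent nodes predicted as entity constituents
    # into candidate spans, given one binary boundary label per word.
    spans, start = [], None
    for i, flag in enumerate(is_entity):
        if flag and start is None:
            start = i
        if not flag and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(is_entity) - 1))
    return spans

def classify_spans(span_vecs: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Step S43: Softmax category prediction over span feature vectors.
    z = span_vecs @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```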
6. The nested named entity recognition method based on dynamic graph convolution according to claim 5, wherein the step S41 classifies the boundary labels of each word node into two classes, specifically:
dividing the boundary label of each word node into entity constituent and non-entity constituent by adopting a fuzzy boundary label strategy, with the calculation formula:
P_b = Softmax(MLP(x_final))
where x_final denotes the sequence feature representation obtained by the feature extraction module, MLP(·) is a multi-layer perceptron, and the final boundary label classifier uses the Softmax(·) function for classification prediction.
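A minimal sketch of the formula P_b = Softmax(MLP(x_final)), using a one-hidden-layer perceptron with placeholder parameters in place of the trained MLP:

```python
import numpy as np

def boundary_probs(x_final, W1, b1, W2, b2):
    # P_b = Softmax(MLP(x_final)): one hidden layer with ReLU, then
    # Softmax over the two boundary labels (entity / non-entity
    # constituent).  All parameters here are placeholders.
    h = np.maximum(x_final @ W1 + b1, 0.0)       # hidden layer
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```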
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310566702.5A CN116595982A (en) | 2023-05-19 | 2023-05-19 | Nested named entity identification method based on dynamic graph convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116595982A true CN116595982A (en) | 2023-08-15 |
Family
ID=87607659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310566702.5A Pending CN116595982A (en) | 2023-05-19 | 2023-05-19 | Nested named entity identification method based on dynamic graph convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116595982A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116757216A (en) * | 2023-08-15 | 2023-09-15 | 之江实验室 | Small sample entity identification method and device based on cluster description and computer equipment |
CN116757216B (en) * | 2023-08-15 | 2023-11-07 | 之江实验室 | Small sample entity identification method and device based on cluster description and computer equipment |
CN118246453A (en) * | 2024-05-20 | 2024-06-25 | 昆明理工大学 | Nested entity recognition model based on graph convolution, construction method thereof and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112560432B (en) | Text emotion analysis method based on graph attention network | |
CN109992629B (en) | Neural network relation extraction method and system fusing entity type constraints | |
CN1677388B (en) | Method and system for translating Input semantic structure into output semantic structure according to fraction | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN113204952B (en) | Multi-intention and semantic slot joint identification method based on cluster pre-analysis | |
CN111046179B (en) | Text classification method for open network question in specific field | |
CN117151220B (en) | Entity link and relationship based extraction industry knowledge base system and method | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN116595982A (en) | Nested named entity identification method based on dynamic graph convolution | |
CN111309918A (en) | Multi-label text classification method based on label relevance | |
CN118093834B (en) | AIGC large model-based language processing question-answering system and method | |
CN114548101A (en) | Event detection method and system based on backtracking sequence generation method | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN114691864A (en) | Text classification model training method and device and text classification method and device | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN114880307A (en) | Structured modeling method for knowledge in open education field | |
CN115081472A (en) | Pulse signal syntax modeling and feature extraction method for radar behavior analysis | |
CN114626378B (en) | Named entity recognition method, named entity recognition device, electronic equipment and computer readable storage medium | |
CN116955579B (en) | Chat reply generation method and device based on keyword knowledge retrieval | |
CN118227790A (en) | Text classification method, system, equipment and medium based on multi-label association | |
CN111666375A (en) | Matching method of text similarity, electronic equipment and computer readable medium | |
CN116595170A (en) | Medical text classification method based on soft prompt | |
CN117216617A (en) | Text classification model training method, device, computer equipment and storage medium | |
CN118057354A (en) | Event detection method based on meta attribute learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||