CN113822026A - Multi-label entity labeling method - Google Patents

Multi-label entity labeling method

Info

Publication number
CN113822026A
CN113822026A
Authority
CN
China
Prior art keywords
entity
label
labeling
model
prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111062720.7A
Other languages
Chinese (zh)
Other versions
CN113822026B (en)
Inventor
张传锋
朱锦雷
井焜
张琨
潘玲玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN202111062720.7A priority Critical patent/CN113822026B/en
Publication of CN113822026A publication Critical patent/CN113822026A/en
Application granted granted Critical
Publication of CN113822026B publication Critical patent/CN113822026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/169 Annotation, e.g. comment data or footnotes (G Physics > G06 Computing > G06F Electric digital data processing > G06F40/00 Handling natural language data > G06F40/10 Text processing > G06F40/166 Editing)
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors (G06F16/00 Information retrieval; Database structures > G06F16/20 Structured data > G06F16/21 Design, administration or maintenance of databases)
    • G06F16/288 Entity relationship models (G06F16/28 Databases characterised by their database models > G06F16/284 Relational databases)
    • G06F16/31 Indexing; Data structures therefor; Storage structures (G06F16/30 Unstructured textual data)
    • G06F16/355 Class or cluster creation or modification (G06F16/35 Clustering; Classification)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D Climate change mitigation technologies in information and communication technologies)

Abstract

The invention provides a multi-label entity labeling method comprising three parts: a multi-label entity labeling workflow, a multi-label entity prediction model based on deep learning, and a feedback model-optimization framework based on online error correction. Compared with existing entity labeling methods, the method can label and extract domain-specific information from tax texts; the automatic labeling model can assign multiple entity labels to the same character string; and the real-time feedback framework provides a feasible scheme for iterative evolution of the model, so that the model improves gradually with each interaction. The method therefore has significant practical value.

Description

Multi-label entity labeling method
Technical Field
The invention relates to the field of natural language processing, in particular to data annotation, and specifically to a multi-label entity labeling method.
Background
Entity labeling is one of the key links in structuring unstructured data: core entities are extracted from unstructured text and stored to form structured knowledge. China's economy is vast, tax-paying subjects and scenarios are diverse, and national and local tax regulations are updated frequently, so a scheme that can read tax law automatically in place of experts is urgently needed. Existing entity labeling methods rely heavily on manual work, assign each entity only one label, and assume that entity contents do not overlap; yet tax-field regulations are voluminous and overlapping entity content is pervasive, so a method capable of multi-label labeling of tax texts is needed.
The patent "Named entity labeling method for military corpora" (publication No. CN111428502A) performs ensemble learning over LSTM, Lattice-LSTM and BERT models via the XGBoost algorithm, obtaining military-field named entities through model prediction plus manual confirmation. The patent "A closed-loop entity extraction method based on automatic sample labeling" (publication No. CN111125378A) proposes a closed-loop workflow through which entity extraction can reduce the difficulty of manual labeling. The patent "Model-assisted data annotation system and annotation method" (publication No. CN110880021A) provides a model-assisted annotation method based on computer-vision recognition, applied mainly to intelligent image annotation to improve annotator efficiency. The patent "Multi-label entity-relationship joint extraction method based on deep neural network and labeling strategy" (publication No. CN109543183A) proposes extracting multi-label relationships with a GRU network, pulling related entity pairs directly from text end to end, but it omits the many isolated entities that have no relationships.
Disclosure of Invention
In order to extract structured information from text and solve the problem of content overlap among entities, the invention provides a multi-label entity labeling method that reduces the burden of manual labeling and improves the efficiency of entity labeling.
To solve this technical problem, the technical scheme adopted by the invention is as follows: a multi-label entity labeling method comprising the following steps:
s01), acquiring text content, and constructing a database based on the text content;
s02), defining, based on the text content, the N most frequently used and most valuable entity categories in the field, where N is a positive integer, and constructing corresponding labels for the N defined entities;
s03), cleaning the text content and performing association sorting, the association sorting being an ordering by time;
s04), passing the cleaned text as a parameter to a deep-learning-based multi-label entity prediction model, which automatically labels the character sequence; this is called pre-labeling. Unlike a single-label entity labeling model, the multi-label entity prediction model can assign more than one entity label to the same character string, pays no attention to entity relationships, and labels an entity even when it has no relationship to any other entity;
s05), manually reviewing the pre-labeling result within a feedback model-optimization framework based on online error correction, and scoring the generalization ability of the multi-label entity prediction model by the difference between the manual-review result and the model's prediction, thereby optimizing the model;
s06), carrying out post-processing on the original labeling result after manual review, thereby extracting entity information and storing the entity information in a database.
Further, the pre-labeling process of the multi-label entity prediction model is as follows:
s41), pre-coding the character-sequence segments to be labeled with a pre-coding module. The pre-coding module comprises an embedding layer and a pre-trained encoder: the embedding layer is the sum of three embedding vectors (token embedding, token-position embedding and token-segment embedding), and the pre-trained encoder adds a segment recurrence mechanism on top of a BERT pre-trained language model. Specifically, after the complete character sequence is split into segments, the encoding vectors of the previous segment's characters are added to the corresponding vectors of the current segment, and the resulting vector sequence is encoded by the BERT encoder. The calculation formula is:
h(i, j) = BERT( e(w(i, j)) + h(i-1, j) )
(1),
where w denotes a token, e(·) its embedding-layer vector, i the index of the i-th segment, j the j-th token in the current segment (the 0th token of each segment is a special character), and h the pre-coding vector output by the BERT pre-trained language model;
s42), after pre-encoding, a text sequence of length L is encoded as a second-order tensor of shape [h0, L], where each column is the context feature vector of one character and h0 is the length of each character's feature vector. Since text content is not a simple linear concatenation of characters, semantic units are taken as nodes and grammatical dependency relations as edges connecting them, forming a generation-relationship graph over all tokens. A graph convolutional network is built on this graph, with the number of convolution kernels set to the number of label categories; it applies a nonlinear transformation to the second-order tensor output by the pre-coding module, yielding K different feature maps, each a matrix, where K is the number of label categories.
Through this step, the pre-coding feature of shape [h0, L] is converted into a third-order tensor of shape [K, H, L], where H and L are the hidden-layer dimension and the sequence length respectively;
s43), processing all feature maps one by one with max pooling: the dimension-compressing action of the pooling layer extracts each character's maximum correlation score in each map, giving a classification feature matrix M of shape [K, L] in which each element is the correlation score between one character and one label, L being the sequence length;
s44), normalizing the matrix M element-wise with a sigmoid function and scaling each element to 0 or 1: a value below 0.5 maps to 0, otherwise to 1. Through these operations the text sequence is converted into a sparse classification matrix. Each column of this matrix is the model's labeling result for the token at the corresponding position; for each token, the nonzero positions in its column are taken out and mapped to the corresponding labels, which constitute the model's automatic labeling of that token.
Further, the process of optimizing the multi-label entity prediction model based on the online error correction feedback model optimization framework is as follows:
s51), deploying the pre-trained multi-label entity prediction model as an online service; given a cleaned document to be labeled, the model labels it automatically, yielding the entities it contains;
s52), the online service returns the model's prediction to the local side, where it is played through local human-computer-interaction voice equipment; after playback, the server and the user hold an online question-and-answer session in which the server asks how satisfied the user is with the automatic labeling, prompts the user to identify wrong labeling results, and then has the user give the correct answers;
s53), the real-time interaction log between server and user is sent to the back end, which performs sentiment analysis on the user's feedback text and generates a score table for the entity labeling results; an empirically chosen threshold is applied, labeled entities scoring below it are discarded and those above it retained, the correct answers identified by the user are added to a small-sample training library, and a final correct answer is produced;
s54), retraining the multi-label entity prediction model on the newly labeled small-sample corpus and updating the automatic labeling model's weight parameters; the manually reviewed labeling result is post-processed, formatted as required by the task-oriented interaction process, and uploaded to a cloud database where it is stored as structured knowledge.
Further, in step S02 two labels are defined for each entity, B-entity and I-entity, where B marks the starting position of the entity and I its interior; a character that belongs to no entity category is labeled O.
Further, in step S03 cleaning means deleting low-relevance content and all symbols other than delimiters, while association sorting means gathering the multiple revisions of the same topic, arranging them in order of promulgation time, and, once entity-label extraction is complete, replacing old entity values with new ones.
Further, the multiple revisions of the same topic are gathered as follows: all documents are arranged front to back by release time, and in two nested loops the name of each document is fuzzily matched against the contents of every other document, yielding the later documents that cite and revise it.
Furthermore, before the multi-label entity prediction model is used for prediction, it is pre-trained on a small amount of cold-start labeling data, which approximately solves for the model's weight parameters.
Furthermore, the method is suitable for extracting the structured information of the tax text.
The invention has the following beneficial effects. It provides a multi-label entity labeling method comprising a multi-label entity labeling workflow, a deep-learning-based multi-label entity prediction model, and an online-error-correction-based feedback model optimization framework. Compared with existing entity labeling methods, the method can label and extract domain-specific information from tax texts; the automatic labeling model can assign multiple entity labels to the same character string; and the real-time feedback framework provides a feasible scheme for iterative evolution of the model, so that the model improves gradually with each interaction. The method therefore has significant practical value.
Drawings
FIG. 1 is a flow chart of the process described in example 1;
FIG. 2 is a flowchart of the operation of a multi-label entity prediction model;
FIG. 3 is a flow chart of the operation of the feedback model optimization framework.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Example 1
The embodiment discloses a multi-label entity labeling method, specifically for labeling tax texts, as shown in FIG. 1, comprising the following steps:
S01), the State Taxation Administration, the Ministry of Finance and other related departments frequently issue new regulations and policies, and a periodic crawler interface keeps the cloud database of tax regulations and policies up to date. A small amount of data is taken from the database and stored locally for later manual labeling, serving as cold-start training data for the model.
S02), tax regulation touches every aspect of the national economy, and the information of interest differs from scenario to scenario. To ensure that the defined entities are general across scenarios, the invention, after reading and analyzing more than one hundred tax texts, summarizes the 14 most frequently used and most valuable entity fields in the tax domain: taxpayer, tax-amount calculation method, taxed content, preferential condition, preferential tax amount, preferential tax ratio, implementation date, expiration date, tax payment place, tax payment voucher, tax payment period, tax payment subject, tax rate, and tax type.
Corresponding labels are constructed from the 14 defined tax entity categories. Each category yields two tags, B-entity and I-entity, e.g. "taxpayer" yields "B-taxpayer" and "I-taxpayer", where B marks the beginning of an entity and I its interior. Finally, a character that belongs to no entity category is labeled "O". This gives 29 labels in total, and each character may carry more than one of them.
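The label-set construction above can be sketched as follows. This is a minimal illustration, assuming English stand-ins for the patent's 14 Chinese category names; only the count (14 × 2 + 1 = 29) and the B/I/O scheme come from the text.

```python
# Sketch of the label-set construction in S02: each of the 14 entity
# categories yields a B- tag and an I- tag, plus the shared outside tag
# "O", giving 14 * 2 + 1 = 29 labels. The English category names below
# are illustrative stand-ins, not the patent's original Chinese terms.
CATEGORIES = [
    "taxpayer", "tax-amount-calculation", "taxed-content",
    "preferential-condition", "preferential-amount", "preferential-ratio",
    "implementation-date", "expiration-date", "payment-place",
    "payment-voucher", "payment-period", "payment-subject",
    "tax-rate", "tax-type",
]

def build_label_set(categories):
    labels = ["O"]                      # outside tag for non-entity characters
    for cat in categories:
        labels.append(f"B-{cat}")       # B marks the entity's first character
        labels.append(f"I-{cat}")       # I marks interior characters
    return labels

LABELS = build_label_set(CATEGORIES)
```

With 14 categories this yields exactly the 29 labels the embodiment describes.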
S03), the obtained tax texts are cleaned: tables, pictures, links and other low-relevance content are removed, together with all symbols other than delimiters such as commas, periods and exclamation marks.
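A minimal sketch of this cleaning step follows. The patent only names the delimiters to keep and the content to drop, so the exact character classes and the URL pattern are assumptions for illustration.

```python
import re

# Sketch of the cleaning step S03: strip link residue, then drop every
# symbol other than word characters, whitespace, and sentence delimiters
# (comma, period, exclamation mark, in ASCII and fullwidth forms).
# The precise character classes are an assumption; the patent names only
# the kept delimiters and the removed content (tables, pictures, links).
KEEP_DELIMS = ",.!，。！"

def clean_text(text):
    text = re.sub(r"https?://\S+", "", text)              # remove link residue
    pattern = rf"[^\w\s{re.escape(KEEP_DELIMS)}]"         # anything not kept
    return re.sub(pattern, "", text)
```

Since Python's `\w` matches Unicode word characters, Chinese text passes through unchanged while stray punctuation is removed.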
To overcome the content contradictions caused by repeated revision of laws and regulations on the same subject, the method also performs one pass of association sorting over all tax texts. Specifically, the multiple revisions of the same subject are gathered, arranged in order of promulgation time, and, after entity labeling and extraction, old entity values are replaced with new ones. Taking the Individual Income Tax Law as an example, the version promulgated in 2011 sets a 0 tax rate on the portion of personal income below 3500, while the 2018 revision sets the 0 rate on the portion below 5000; the invention stores the latest rate as the standard.
In this embodiment, the multiple revisions of the same subject are gathered as follows: all documents are arranged front to back by release time, and in two nested loops the name of each document is fuzzily matched against the contents of every other document, yielding the later documents that cite and revise it.
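The two-loop association step above can be sketched as follows. Plain substring matching stands in for the unspecified fuzzy-matching routine, and the titles and dates are made up for illustration.

```python
# Sketch of the association-sorting step: documents are ordered by release
# date, then each document's title is matched against the body of every
# later document; a hit marks the later document as one that cites and
# revises the earlier one. Substring matching stands in for the patent's
# unspecified fuzzy matching.
def find_revisions(docs):
    """docs: list of (date, title, body) tuples, with sortable dates."""
    docs = sorted(docs, key=lambda d: d[0])      # front to back by release time
    revisions = {}
    for i, (_, title, _) in enumerate(docs):     # outer loop: each document
        for j in range(i + 1, len(docs)):        # inner loop: later documents
            if title in docs[j][2]:              # later body cites this title
                revisions.setdefault(title, []).append(docs[j][1])
    return revisions

docs = [
    ("2011-06", "Individual Income Tax Law", "income below 3500 taxed at 0"),
    ("2018-08", "Amendment", "revises the Individual Income Tax Law: "
                             "income below 5000 taxed at 0"),
]
```

On this toy corpus the 2018 document is identified as a revision of the 2011 law, so its entity values supersede the older ones.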
S04), the cleaned text is passed as a parameter to the deep-learning-based multi-label entity prediction model, whose main function is to label character sequences automatically; since the machine-predicted labels are manually reviewed later, this stage is called the pre-labeling stage. In addition, before the model is used for prediction it must be pre-trained on a small amount of cold-start labeling data, which approximately solves for its weight parameters.
The multi-label entity prediction model of this embodiment can assign several entity labels to the same character string simultaneously. For example, the tax subject "about personal income …" contains the entity "personal income tax", which is a tax type, so the field "personal income tax" needs two labels, "I-tax subject" and "I-tax type", indicating that it is both part of the tax subject and a tax-type entity. Moreover, unlike entity-relationship joint extraction models, the model here pays no attention to entity relationships: joint extraction models cannot extract isolated entities, whereas this model labels an entity even when it has no relationship to any other entity.
S05), the feedback type model optimization framework based on online error correction carries out manual review on the result of the pre-labeling, and the generalization capability of the multi-label entity prediction model is scored according to the difference between the result of the manual review and the prediction result of the multi-label entity prediction model, so that the multi-label entity prediction model is optimized.
S06), carrying out post-processing on the original labeling result after manual review, thereby extracting entity information and storing the entity information in a database.
As shown in FIG. 2, the pre-labeling process of the multi-label entity prediction model is as follows:
S41), the character-sequence segments to be labeled are pre-coded by a pre-coding module comprising an embedding layer and a pre-trained encoder. The model's input carries three kinds of information, token content, token position and token segment, so the embedding layer is the sum of the token embedding, position embedding and segment embedding vectors. The pre-trained encoder adds a segment recurrence mechanism on top of a BERT pre-trained language model: after the complete character sequence is split into segments, the encoding vectors of the previous segment's characters are added to the corresponding vectors of the current segment, and the resulting vector sequence is encoded by the BERT encoder. The calculation formula is:
h(i, j) = BERT( e(w(i, j)) + h(i-1, j) )
(1),
where w denotes a token, e(·) its embedding-layer vector, i the i-th segment, j the j-th token in the current segment (the 0th token of each segment is a special character), and h the pre-coding vector output by the BERT pre-trained language model. By introducing this segment recurrence mechanism on top of BERT, the embodiment extends the length of contextual semantic dependency.
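The segment recurrence described above can be sketched with the BERT encoder replaced by a stub: each segment's input embeddings are summed with the previous segment's output vectors before encoding, so context flows across segment boundaries. Shapes, the segment length, and the `tanh` stand-in encoder are all assumptions for illustration.

```python
import numpy as np

# Sketch of the segment-recurrence mechanism of S41: the previous
# segment's encodings h(i-1, j) are added to the current segment's
# embeddings e(w(i, j)) before encoding. np.tanh stubs the BERT encoder.
def encode_with_recurrence(embeddings, seg_len, encoder):
    """embeddings: [T, h] array; returns [T, h] per-token encodings."""
    T, h = embeddings.shape
    outputs = []
    prev = np.zeros((seg_len, h))            # no context before the first segment
    for start in range(0, T, seg_len):
        seg = embeddings[start:start + seg_len]
        inp = seg + prev[:len(seg)]          # add previous segment's encodings
        out = encoder(inp)                   # encode the summed vectors
        outputs.append(out)
        prev = out
    return np.concatenate(outputs, axis=0)

enc = encode_with_recurrence(np.ones((8, 4)), seg_len=4, encoder=np.tanh)
```

Because the second segment receives the first segment's outputs, identical inputs produce different encodings in different segments, which is exactly the cross-segment dependency the mechanism is meant to add.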
S42), after pre-encoding, a text sequence of length L is encoded as a second-order tensor of shape [h0, L], where each column is the context feature vector of one character and h0 is the length of each character's feature vector. Text content, however, is not a simple linear concatenation of characters. For example, the sentence "Notice of the State Taxation Administration on Several Issues Concerning Business Tax" is not generated character by character in linear order; it has a nonlinear structure in which a noun object such as "notice" is formed first, then a modifying noun such as "business tax", and finally character segments such as "on several issues" are connected to yield the complete text.
In this step, semantic units are taken as nodes and grammatical dependency relations (subject-verb-object, attributive-adverbial-complement, and so on) as edges connecting them, forming a generation-relationship graph over all tokens that reflects the order in which the characters were produced as the text was written. A graph convolutional network is built on this graph, with the number of convolution kernels set to 29, the number of label categories; it applies a nonlinear transformation to the second-order tensor output by the pre-coding module, yielding 29 different feature maps, each a matrix.
Through this step, the pre-coding feature of shape [h0, L] is converted into a third-order tensor of shape [29, H, L], where H and L are the hidden-layer dimension and the sequence length respectively;
S43), all feature maps are processed one by one with max pooling: the dimension-compressing action of the pooling layer extracts each character's maximum correlation score in each map, giving a classification feature matrix M of shape [29, L] in which each element is the correlation score between one character and one label.
For a single sample, the feature-map family has shape [29, H, L]; the max-pooling layer converts it into a [29, L] classification feature matrix M.
S44), the matrix M is normalized element-wise with a sigmoid function and each element is scaled to 0 or 1: a value below 0.5 maps to 0, otherwise to 1. Through these operations the text sequence is converted into a sparse classification matrix whose elements are all 0 or 1. Each column of this matrix is the model's labeling result for the token at the corresponding position; for each token, the nonzero positions in its column are taken out and mapped to the corresponding labels, which constitute the model's automatic labeling of that token.
Each column of the matrix corresponds to one character: an element of value 0 adds the label "O" to that character's label list, and an element of value 1 adds the corresponding category label. Since each sparse vector may contain several 1s, several labels can be obtained for each character at once.
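Steps S43 and S44 together can be sketched as a small decoding routine. This is an assumed illustration, not the patent's implementation: the "O" bookkeeping is collapsed to a single outside label for all-zero columns, and the label names are invented.

```python
import numpy as np

# Sketch of S43-S44: the [K, H, L] feature-map family is max-pooled over
# the hidden dimension H into the [K, L] score matrix M; M is passed
# through a sigmoid, thresholded at 0.5, and each column's nonzero rows
# are mapped to that character's labels (K = 29 in the patent).
def decode_labels(feature_maps, label_names):
    M = feature_maps.max(axis=1)                  # [K, H, L] -> [K, L]
    probs = 1.0 / (1.0 + np.exp(-M))              # element-wise sigmoid
    binary = (probs >= 0.5).astype(int)           # scale each score to 0 or 1
    per_char = []
    for col in binary.T:                          # one column per character
        labels = [label_names[k] for k in np.flatnonzero(col)]
        per_char.append(labels or ["O"])          # all-zero column -> outside
    return per_char
```

A column with several 1s yields several labels for the same character, which is the multi-label behavior the embodiment describes.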
FIG. 3 shows the online-error-correction-based feedback model optimization framework provided by the invention. Its main aim is to combine manual review and model optimization organically within the text labeling process, so that a modest amount of human participation improves both.
Specifically, the process of optimizing the multi-label entity prediction model based on the online error correction feedback model optimization framework is as follows:
S51), before the model goes online, its randomly initialized parameters are pre-trained on a small amount of labeled data as an initial cold start. This labeling data must be produced directly by humans so that high-quality labeled samples are obtained.
The pre-trained multi-label entity prediction model is then deployed as an online service; given a cleaned document to be labeled, the model labels it automatically, yielding the entities the document contains.
S52), the online service returns the model's prediction to the local side, where it is played through local human-computer-interaction voice equipment. After playback, the server and the user hold an online question-and-answer session: the server asks how satisfied the user is with the automatic labeling, prompts the user to identify wrong labeling results, and then has the user give the correct answers.
S53), the real-time interaction log between server and user is sent to the back end, which performs sentiment analysis on the user's feedback text and generates a score table for the entity labeling results. An empirically chosen threshold is applied: labeled entities scoring below it are discarded and those above it retained; the correct answers identified by the user are added to a small-sample training library; and a final correct answer is produced.
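The threshold filtering in S53 can be sketched as below. The sentiment scorer itself is out of scope here and is stubbed by a caller-supplied score table; the threshold value and data shapes are assumptions.

```python
# Sketch of the S53 filtering step: each pre-labeled entity carries a score
# derived from sentiment analysis of the user's feedback; entities below an
# empirical threshold are discarded, those at or above it are kept, and the
# user's corrections are queued as new small-sample training data.
def filter_annotations(scored_entities, corrections, threshold=0.5):
    """scored_entities: dict entity -> score; corrections: user answers."""
    kept = {e: s for e, s in scored_entities.items() if s >= threshold}
    training_library = list(corrections)   # small-sample corpus for retraining
    return kept, training_library
```

The kept entities feed the final answer, while the training library drives the secondary training in S54.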
The newly labeled samples are added automatically to the labeling model's training sample library, and once enough new samples have accumulated the model automatically starts an incremental training run, realizing feedback updating of the model parameters.
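The accumulate-then-retrain behavior just described can be sketched as a small trigger. The batch size and the retraining callback are assumptions; the actual training routine is outside the scope of this sketch.

```python
# Sketch of the incremental-retraining trigger: confirmed samples
# accumulate in a buffer, and once the buffer reaches a configured size
# the (caller-supplied) retraining routine fires and the buffer drains.
class RetrainTrigger:
    def __init__(self, retrain, batch_size=100):
        self.retrain = retrain            # callback: retrain(list_of_samples)
        self.batch_size = batch_size
        self.buffer = []

    def add_sample(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) >= self.batch_size:
            self.retrain(list(self.buffer))   # secondary training on new corpus
            self.buffer.clear()
```

Each retraining run updates the labeling model's weight parameters, realizing the feedback loop described above.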
S54), the manually reviewed labeling result is post-processed, formatted as required by the task-oriented interaction process, and uploaded to a cloud database where it is stored as structured knowledge. This completes the labeling process.
The above is the main workflow and model architecture of the invention. The invention provides a tax-text-oriented multi-label entity labeling method comprising a multi-label entity labeling workflow, a deep-learning-based multi-label entity prediction model, and an online-error-correction-based feedback model optimization framework. Compared with existing entity labeling methods, the method can label and extract domain-specific information from tax texts; the automatic labeling model can assign multiple entity labels to the same character string; and the real-time feedback framework provides a feasible scheme for iterative evolution of the model, so that the model improves gradually with each interaction. The method therefore has significant practical value.
The application scenarios of the invention include, but are not limited to, artificial intelligence fields such as information extraction, document structuring, tax question answering, tax search, entity recognition, and tax dialogue. The detailed embodiments describe how multi-label entity labeling of tax texts is realized in related products. The flowcharts and model block diagrams in the embodiments are used only to explain the principles and processes of the present invention; schemes with similar principles and processes devised by those skilled in the relevant art with reference to this invention should also be considered. The description is provided only for the convenience of those skilled in the art and does not limit the invention; any implementation scheme similar in form to the present invention shall be included within its protection scope.

Claims (8)

1. A multi-label entity labeling method, characterized by comprising the following steps:
S01), acquiring text content and constructing a database based on the text content;
S02), defining, based on the text content, the N most frequently used and most valuable entities in the field, N being a positive integer, and constructing corresponding labels for the N defined entities;
S03), cleaning the text content and performing association sorting, the association sorting being an ordering according to time;
S04), passing the cleaned text as a parameter to a deep-learning-based multi-label entity prediction model, which automatically labels the character sequence; this is called pre-labeling; unlike a single-label entity labeling model, the multi-label entity prediction model can assign more than one entity label to the same character string, does not attend to entity relationships, and labels an entity even if it has no relationship with other entities;
S05), manually reviewing the pre-labeling results through a feedback model optimization framework based on online error correction, and scoring the generalization ability of the multi-label entity prediction model according to the difference between the manual review results and the model's prediction results, so as to optimize the multi-label entity prediction model;
S06), post-processing the original labeling results after manual review, thereby extracting entity information and storing it in the database.
2. The multi-label entity labeling method of claim 1, wherein the pre-labeling process of the multi-label entity prediction model comprises the following steps:
S41), pre-encoding the character sequence segments to be labeled through a pre-encoding module, wherein the pre-trained encoding module comprises an embedding layer and a pre-trained encoder; the embedding layer is obtained by summing three embedding vectors for each vocabulary item: its token embedding, its position embedding, and its segment embedding; the pre-trained encoder adds a segment recurrence mechanism on top of a BERT pre-trained language model: after the complete character sequence is divided into segments, the encoding vector of the character of the previous segment is added to each vector of the current segment, and the resulting vector sequence is encoded by the BERT encoder, according to the formula:
h_{i,j} = BERT(w_{i,j} + h_{i-1,0}) (1),
where w denotes a vocabulary item, i the i-th segment, and j the j-th vocabulary item in the current segment; the 0th vocabulary item of each segment is a special character, and h denotes the pre-encoding vector output by the BERT pre-trained language model;
S42), after pre-encoding, a text sequence of length L is encoded into a second-order tensor of shape [h0, L], in which each column is the context feature vector of one character in the text and h0 is the length of each character feature vector; considering that text content is not a simple linear concatenation of characters, semantic units are taken as nodes and grammatical dependency relations as edges connecting them, forming a relation graph over all vocabulary items; a graph convolutional network is constructed on this graph, with its number of convolution kernels set to the number of label categories, and is used to apply a nonlinear transformation to the second-order tensor output by the pre-encoding module, yielding K different feature maps, each of which is a matrix, K being the number of label categories;
through this step, the pre-encoding feature of shape [h0, L] is converted into a third-order tensor of shape [K, H, L], where H and L denote the hidden-layer dimension and the sequence length, respectively;
S43), processing all feature maps one by one with max pooling; the dimension-compression effect of the pooling layer extracts the maximum relevance score of each character in each map, yielding a classification feature matrix M of shape [K, L], in which each element is the relevance score between one character and one label, L being the sequence length;
S44), normalizing the matrix M element-wise with a sigmoid function and binarizing each element of M to 0 or 1: values below 0.5 are mapped to 0, the rest to 1; through these operations the text sequence is converted into a sparse classification matrix, each column of which is the model's labeling result for the vocabulary item at the corresponding position; for each vocabulary item, the nonzero positions of its labeling result are extracted and mapped to the corresponding labels, and these labels constitute the model's automatic labeling result for that item.
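The pooling-and-binarization decoding of steps S43) and S44) can be sketched in a few lines of numpy. The label names below are hypothetical placeholders, and the [K, H, L] input tensor is assumed to come from the graph convolutional network of step S42); this is an illustration of the decoding logic, not the invention's implementation.

```python
# Sketch of S43-S44: max-pool K feature maps of shape [K, H, L] down to a
# [K, L] score matrix, squash with a sigmoid, binarize at 0.5, and read off
# one or more labels per character position. LABELS is an illustrative set.
import numpy as np

LABELS = ["B-tax_item", "I-tax_item", "B-rate", "I-rate"]  # hypothetical tags

def decode_multilabel(feature_maps: np.ndarray) -> list[list[str]]:
    """feature_maps: [K, H, L] tensor -> per-character lists of labels."""
    m = feature_maps.max(axis=1)          # S43: max pooling over H -> [K, L]
    probs = 1.0 / (1.0 + np.exp(-m))      # S44: element-wise sigmoid
    binary = (probs >= 0.5).astype(int)   # threshold at 0.5 -> sparse matrix
    # Each column is the labeling result for one character; the nonzero rows
    # of a column map to that character's assigned labels.
    return [[LABELS[k] for k in np.nonzero(binary[:, j])[0]]
            for j in range(binary.shape[1])]
```

Because the decision is per element rather than per column (as a softmax would be), a single character can legitimately receive several labels at once, which is the multi-label property the claim describes.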
3. The multi-label entity labeling method of claim 1, wherein the process by which the feedback model optimization framework based on online error correction optimizes the multi-label entity prediction model comprises the following steps:
S51), deploying the previously trained multi-label entity prediction model as an online service; given a cleaned document to be labeled, the multi-label entity prediction model automatically labels the document to obtain the entities it contains;
S52), the online service returns the prediction results of the multi-label entity prediction model to the local side, where they are played through local human-computer voice interaction equipment; after playback, the server conducts an online question-and-answer interaction with the user, actively asking how satisfied the user is with the automatically labeled answers, prompting the user to identify incorrect labeling results, and then to give the correct answers;
S53), the real-time interaction log between the server and the user is transmitted to the background, where sentiment analysis is performed on the user's feedback text; a score table for the entity labeling results is then generated from the sentiment analysis results and a threshold is set empirically: labeled entities scoring below the threshold are discarded and those scoring above it are retained; correct answers identified by the user are added to a small-sample training library, and a final correct answer is generated at the same time;
S54), retraining the multi-label entity prediction model on the newly labeled small-sample corpus and updating the weight parameters of the automatic labeling model; the labeling results are post-processed after manual review, formatted into the data format required by the task-oriented interaction process, and uploaded to a cloud database for storage as structured knowledge.
4. The multi-label entity labeling method of claim 1, wherein in step S02 two labels are defined for each entity, B-entity and I-entity, where B marks the starting position of the entity and I the inside of the entity; a character that belongs to no entity category is labeled O.
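A minimal sketch of this BIO scheme, extended to the multi-label case the method targets (one character may carry B-/I- tags from several entity types at once). The entity names and the span-based input encoding are illustrative assumptions.

```python
# Hypothetical BIO tagger for the multi-label setting of claim 4: each
# character collects a SET of tags, one per entity type covering it.

def bio_tags(length: int, spans: dict[str, tuple[int, int]]) -> list[set[str]]:
    """spans maps an entity type to its (start, end) character span,
    end exclusive. Returns, per character, the set of BIO tags; characters
    covered by no entity get the single tag O."""
    tags = [set() for _ in range(length)]
    for ent, (start, end) in spans.items():
        tags[start].add(f"B-{ent}")          # B marks the entity start
        for i in range(start + 1, end):
            tags[i].add(f"I-{ent}")          # I marks the entity inside
    return [t or {"O"} for t in tags]
```

With overlapping spans, e.g. an "org" span covering characters 0-2 and a "loc" span covering 0-1, character 0 ends up tagged both B-org and B-loc, which a single-label scheme cannot express.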
5. The multi-label entity labeling method of claim 1, wherein in step S03 the cleaning means deleting content with a low degree of relevance and symbols other than separators, and the association sorting means gathering multiple revisions of the same topic and arranging them in order of release time; after entity label extraction is completed, old entity values are replaced with new entity values.
6. The multi-label entity labeling method of claim 5, wherein multiple revisions of the same topic are gathered as follows: all documents are arranged from front to back by release time, and through two nested loops the title of each document is fuzzily matched, one by one, against the contents of all other documents, thereby obtaining the later documents that cite and revise it.
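The two-loop revision-gathering pass can be sketched as follows. Substring containment stands in for the fuzzy matching (a real system might use edit distance or another similarity measure), and the document field names are illustrative assumptions.

```python
# Sketch of claim 6's revision gathering: sort documents by release time,
# then match each title against the body of every later document.
# The "title"/"date"/"text" keys and substring test are assumptions.

def gather_revisions(docs: list[dict]) -> dict[str, list[str]]:
    """docs: [{"title": ..., "date": ..., "text": ...}, ...].
    Returns title -> titles of later documents that cite/revise it."""
    ordered = sorted(docs, key=lambda d: d["date"])   # front to back by time
    revisions = {d["title"]: [] for d in ordered}
    for i, older in enumerate(ordered):               # outer loop: each doc
        for newer in ordered[i + 1:]:                 # inner loop: later docs
            if older["title"] in newer["text"]:       # fuzzy-match stand-in
                revisions[older["title"]].append(newer["title"])
    return revisions
```

Restricting the inner loop to later documents enforces the chronological "front to back" ordering, so a document can only be revised by documents released after it.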
7. The multi-label entity labeling method of claim 2, wherein before the multi-label entity prediction model is used for prediction, it is pre-trained with a small amount of cold-start labeled data, and the weight parameters of the multi-label entity prediction model are solved through this pre-training.
8. The multi-label entity labeling method of claim 1, wherein the method is suitable for extracting structured information from tax texts.
CN202111062720.7A 2021-09-10 2021-09-10 Multi-label entity labeling method Active CN113822026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111062720.7A CN113822026B (en) 2021-09-10 2021-09-10 Multi-label entity labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111062720.7A CN113822026B (en) 2021-09-10 2021-09-10 Multi-label entity labeling method

Publications (2)

Publication Number Publication Date
CN113822026A true CN113822026A (en) 2021-12-21
CN113822026B CN113822026B (en) 2022-07-08

Family

ID=78921897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111062720.7A Active CN113822026B (en) 2021-09-10 2021-09-10 Multi-label entity labeling method

Country Status (1)

Country Link
CN (1) CN113822026B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060835A1 (en) * 2015-08-27 2017-03-02 Xerox Corporation Document-specific gazetteers for named entity recognition
CN109543183A (en) * 2018-11-16 2019-03-29 西安交通大学 Multi-tag entity-relation combined extraction method based on deep neural network and mark strategy
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
US20210149993A1 (en) * 2019-11-15 2021-05-20 Intuit Inc. Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN113239191A (en) * 2021-04-27 2021-08-10 北京妙医佳健康科技集团有限公司 Manually-assisted text labeling method and device based on small sample data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JING LI ET AL.: "A Survey on Deep Learning for Named Entity Recognition", arXiv:1812.09449v3 *
SHAN Yidong et al.: "Multi-label-based Named Entity Recognition in the Military Domain", Computer Science *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297987A (en) * 2022-03-09 2022-04-08 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114297987B (en) * 2022-03-09 2022-07-19 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114580577A (en) * 2022-05-05 2022-06-03 天津大学 Multi-mode-oriented interactive data annotation method and system
CN114580577B (en) * 2022-05-05 2022-09-13 天津大学 Multi-mode-oriented interactive data annotation method and system
CN114861600A (en) * 2022-07-07 2022-08-05 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
CN114861600B (en) * 2022-07-07 2022-12-13 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
US11972214B2 (en) 2022-07-07 2024-04-30 Zhejiang Lab Method and apparatus of NER-oriented chinese clinical text data augmentation
CN115238702A (en) * 2022-09-21 2022-10-25 中科雨辰科技有限公司 Entity library processing method and storage medium
CN115238702B (en) * 2022-09-21 2022-12-06 中科雨辰科技有限公司 Entity library processing method and storage medium
CN116561317A (en) * 2023-05-25 2023-08-08 暨南大学 Personality prediction method, labeling method, system and equipment based on text guidance

Also Published As

Publication number Publication date
CN113822026B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN113822026B (en) Multi-label entity labeling method
CN111160008B (en) Entity relationship joint extraction method and system
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111858944B (en) Entity aspect level emotion analysis method based on attention mechanism
CN108519890A (en) A kind of robustness code abstraction generating method based on from attention mechanism
CN108182295A (en) A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN111694924A (en) Event extraction method and system
CN107766483A (en) The interactive answering method and system of a kind of knowledge based collection of illustrative plates
CN112883175B (en) Meteorological service interaction method and system combining pre-training model and template generation
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN115599899B (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN111553159B (en) Question generation method and system
CN113254675B (en) Knowledge graph construction method based on self-adaptive few-sample relation extraction
CN114580639A (en) Knowledge graph construction method based on automatic extraction and alignment of government affair triples
CN113128232A (en) Named entity recognition method based on ALBERT and multi-word information embedding
CN114648015B (en) Dependency relationship attention model-based aspect-level emotional word recognition method
CN113051904B (en) Link prediction method for small-scale knowledge graph
CN117194682B (en) Method, device and medium for constructing knowledge graph based on power grid related file
CN116562265B (en) Information intelligent analysis method, system and storage medium
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN117033423A (en) SQL generating method for injecting optimal mode item and historical interaction information
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement
CN116341519A (en) Event causal relation extraction method, device and storage medium based on background knowledge
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant