CN111611802A - Multi-field entity identification method - Google Patents
- Publication number
- CN111611802A (application CN202010437407.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- sequence
- label
- data
- labeling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a multi-domain entity recognition method. The patent makes two main innovations: 1. for cross-domain scenarios in which the target domain has no manually labeled data at all, weakly labeled target-domain data are constructed quickly and automatically; 2. partial annotation learning is applied to the cross-domain named entity recognition task. Advantages: even when the target domain has no manually labeled data, the domain-adaptation capability of the source-domain model is effectively improved, and target-domain entity recognition performance is improved while data labeling costs are reduced.
Description
Technical Field
The invention relates to the field of entity recognition, and in particular to a multi-domain entity recognition method.
Background
Named entity recognition refers to identifying entities with specific meanings in text. In recent years, neural network methods have greatly improved performance on named entity recognition tasks. In practical applications, however, when the input text comes from a domain different from that of the training corpus, deep neural network models often generalize poorly.
The main difficulties of cross-domain named entity recognition are: 1) entity names are diverse, and many entities that never appear in the source domain can appear in the target domain; 2) language usage differs from the standard expression of the news domain, and the data distribution of each domain's corpus differs: for example, social media text is heavily colloquial, while medical text contains many technical terms.
Current cross-domain named entity recognition methods fall roughly into two categories: 1) learning domain-independent features within a multi-task learning framework; 2) initializing the target-domain model with parameters trained on the source domain and then continuing training on target-domain data.
1. Cross-domain named entity recognition based on multi-task learning
The model is mainly divided into three parts: 1) word vector representation layer: converts the input words/phrases into continuous vector representations; 2) feature extraction layer: obtains each word's score for each label through a bidirectional long short-term memory network and a linear transformation; 3) prediction layer: predicts the output label sequence for the current input.
To extract features that are domain-independent but task-relevant, this method shares the word vector representation layer and the feature extraction layer between the source-domain and target-domain models. The CRF layer is not shared, since different domains may output different label sets. The model is then trained separately on the manually labeled data of the source domain and of the target domain. Experiments show that joint training with layers shared across the two domains effectively extracts domain-independent features and thus improves target-domain entity recognition.
2. Cross-domain named entity recognition based on parameter initialization
The method comprises the following steps:
1. and training in a source field with large-scale manual labeling data to obtain a model A.
2. Model B has the same model structure, and is initialized using the parameters of model a.
3. And continuing training the model B on the limited manual labeling data of the target field, and fitting the characteristics of the target field.
Experiments prove that the method can effectively improve the entity recognition performance of the target field, and the entity recognition performance of the fine-tuned model B to the target field is obviously superior to that of the model A.
The conventional techniques have the following problems:
1. Manually annotated corpora in the target domain are required. In practice, large-scale high-quality labeled corpora are expensive to obtain. Moreover, there are many specialized subdomains, and each new domain requires labeling a certain amount of data, which is very costly. When the target domain has no labeled data, most existing domain-transfer techniques cannot be applied effectively.

2. Unlabeled target-domain data go unused. Large-scale unlabeled data are cheap to acquire and contain rich semantic information, yet most existing domain-transfer techniques do not exploit them.
Disclosure of Invention
The technical problem addressed by the invention is to provide a multi-domain entity recognition method that, when the target domain has no manually labeled data, automatically generates high-quality weakly labeled target-domain data and models it, thereby improving target-domain named entity recognition performance.
In order to solve this technical problem, the invention provides a multi-domain entity recognition method comprising the following steps: to reduce the transfer difficulty caused by differing data distributions, two methods label the unlabeled target-domain corpus simultaneously; labels on which both methods agree are kept with high confidence, and uncertain positions receive a special label, yielding weakly labeled target-domain data; because the weakly labeled corpus contains uncertain labels, an ordinary CRF layer cannot model it, so partial annotation learning is applied to model the weakly labeled corpus;
automatic labeling:
searching, with an external entity dictionary and a forward maximum matching mechanism, for entities that may appear in the text; labeling successfully matched spans as entities and labeling unmatched characters "O";

training a model on source-domain data and using it directly to label the unlabeled target-domain text, as the result of the second automatic labeling method;

comparing the labeling results of the two methods and keeping labels on which they agree; labeling conflicting positions "U", meaning "Unknown", i.e. the label of that character is uncertain and may be any possible label; the result is the final weakly labeled target-domain corpus;
named entity identification based on local labeling:
the model treats the recognition task as a sequence labeling task: the input is a Chinese character sequence and the output is a label sequence;

in the model, the input character sequence is first encoded by a bidirectional long short-term memory network (BiLSTM) to construct features, which are then combined and fed to a partial CRF layer for label prediction; the whole model has three main parts: 1) word vector representation layer: represents the input character string as continuous vectors via an embedding table; 2) feature extraction layer: obtains each character's score for each label through the BiLSTM and a linear transformation; 3) prediction layer: uses a partial CRF to predict the output label sequence for the current input;

the model has two states, training and prediction; during training, the system computes a label sequence for each input training sentence, which at first certainly differs greatly from the correct label sequence, i.e. the initial model performs poorly; the model then computes a loss from its own prediction and the correct answer and updates the parameters by backpropagation, aiming to minimize the loss; as training progresses, the model predicts the label sequence better and better until performance peaks.
In one embodiment, the labels take the BIOES form, where B-XX denotes the first Chinese character of an entity of category XX, E-XX the last Chinese character, I-XX a middle character, and S-XX a single-character entity of category XX; all other Chinese characters are labeled "O".
In one embodiment, the word vector representation layer converts discrete input Chinese characters into continuous vector representations; a mapping table stores the vector representation of each Chinese character; the vectors may be initialized with random numbers or set to pre-trained word vectors; during training, the table contents are model parameters optimized together with the other parameters; a given sentence C = <c1, c2, ..., cn> is mapped to the vector sequence <x1, x2, ..., xn>.
In one embodiment, the feature extraction layer encodes the input vector sequence with a bidirectional long short-term memory network to obtain feature representations; a unidirectional LSTM encodes only past information, not future information; to take both contexts into account, forward and backward LSTMs encode the sentence simultaneously; for the t-th Chinese character, the forward and backward LSTMs produce hidden representations that are concatenated into the final hidden state h_t; the score P of each label for each character is then computed as:

P = W_mlp · h_t + b_mlp

where W_mlp and b_mlp are model parameters.
In one embodiment, in the prediction layer, the label at some positions of the partially labeled data may take several values, so a sentence may have more than one correct label sequence; for example, if the partial label data of a sentence is ({B}, {B, I, E, O, S}, {B, I, E, O, S}, {O}, {O}, {O}, {O}), there are 5 × 5 = 25 correct label sequences;
given sentence C ═<c1,c2,...,cn>If the corresponding tag sequence y is equal to<y1,y2,...,yn>Then define the sentence score as:
where A is a matrix of recorded transfer scores, Ai,jRepresents the score for a transition from label i to label j; p is the output of the classification layer,indicating that the ith position is marked with a label yiA fraction of (d);
define Y_L as the set of all correct sequences; the score of the set Y_L is defined as:

score(Y_L) = log Σ_{y ∈ Y_L} exp(score(C, y))

where Y_C denotes the set of all possible sequences when the input is C;

the loss function also applies to fully labeled data: when the set Y_L has size 1, there is only one correct sequence, corresponding to the fully labeled case; the model can therefore process fully labeled and partially labeled data simultaneously.
In one embodiment, during training we want to maximize the total probability of the correct sequences; the loss function is therefore defined as:

loss = -log ( Σ_{y ∈ Y_L} exp(score(C, y)) / Σ_{y ∈ Y_C} exp(score(C, y)) )
in one embodiment, the sequence with the highest score is solved as a model prediction result by using a Viterbi algorithm during testing.
Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.
Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.
The beneficial effects of the invention are:

Even when the target domain has no manually labeled data at all, the domain-adaptation capability of the source-domain model is effectively improved, and target-domain entity recognition performance is improved while data labeling costs are reduced.
Drawings
Fig. 1 is a schematic diagram of a domain migration method based on multitask learning in the background of the invention.
FIG. 2 is a partially labeled example diagram of the multi-domain entity identification method of the present invention.
Detailed Description
The present invention is further described below with reference to the figures and specific embodiments so that those skilled in the art can better understand and practice it; the embodiments do not limit the invention.
This patent makes the following two main innovations:

1. For cross-domain scenarios in which the target domain has no manually labeled data at all, weakly labeled target-domain data are constructed quickly and automatically.

2. Partial annotation learning is applied to the cross-domain named entity recognition task.
To reduce the transfer difficulty caused by differing data distributions, two methods label the unlabeled target-domain corpus simultaneously; labels on which both methods agree are kept with high confidence, and uncertain positions receive a special label, yielding weakly labeled target-domain data. Because the weakly labeled corpus contains uncertain labels, an ordinary CRF layer cannot model it, so partial annotation learning is applied.
1. Automatic labeling
1.1 entity dictionary
We use an external entity dictionary with a forward maximum matching mechanism to find entities that may appear in the text. Successfully matched spans are labeled as entities; unmatched characters are labeled "O".
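The dictionary-matching step above can be sketched as follows (a minimal illustration: the function name, the toy dictionary, and the single tag 'E' for matched characters are assumptions; the patent assigns per-category BIOES tags):

```python
def forward_max_match(text, entity_dict, max_len=6):
    """Label text via forward maximum matching against a dictionary.

    Returns one label per character: 'E' for characters inside a
    matched dictionary entity, 'O' otherwise.
    """
    labels = ['O'] * len(text)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest window first, shrinking until a match is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in entity_dict:
                for k in range(i, j):
                    labels[k] = 'E'
                i = j
                matched = True
                break
        if not matched:
            i += 1  # no entity starts here; move on one character
    return labels
```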
1.2, Source Domain model
A model is trained on the source-domain data and used directly to label the unlabeled target-domain text; this is the result of the second automatic labeling method.
1.3, Cross-comparison
Table 1 example of automatic labeling method
The labeling results of the two methods are compared, and labels on which both agree are kept; conflicting positions are labeled "U", meaning "Unknown", i.e. the label of that character is uncertain and may be any possible label. The result is the final weakly labeled target-domain corpus. Table 1 shows each method's labeling results when transferring from the news domain to social media.
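The cross-comparison can be sketched as a per-position merge of the two automatic labelings (function and variable names are illustrative):

```python
def cross_compare(dict_labels, model_labels):
    """Keep labels the two methods agree on; mark conflicts 'U' (Unknown)."""
    return [a if a == b else 'U' for a, b in zip(dict_labels, model_labels)]

# Dictionary labeler and source-domain model disagree at position 1:
merged = cross_compare(['B-PER', 'I-PER', 'O'], ['B-PER', 'E-PER', 'O'])
# merged == ['B-PER', 'U', 'O']
```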
2. Named entity recognition based on local annotation
The model treats the recognition task as a sequence labeling task: the input is a Chinese character sequence and the output is a label sequence. The labels take the BIOES form, where B-XX denotes the first Chinese character of an entity of category XX, E-XX the last Chinese character, I-XX a middle character, and S-XX a single-character entity of category XX; all other Chinese characters are labeled "O".
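As an illustration of the BIOES scheme, entity spans can be converted to tags like so (a sketch; the span format `(start, end_exclusive, type)` is an assumption for the example):

```python
def spans_to_bioes(length, spans):
    """Convert entity spans to BIOES tags; non-entity characters get 'O'."""
    tags = ['O'] * length
    for s, e, t in spans:
        if e - s == 1:
            tags[s] = f'S-{t}'          # single-character entity
        else:
            tags[s] = f'B-{t}'          # first character
            tags[e - 1] = f'E-{t}'      # last character
            for i in range(s + 1, e - 1):
                tags[i] = f'I-{t}'      # middle characters
    return tags
```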
In the model, the input Chinese character sequence is first encoded by a bidirectional long short-term memory network (BiLSTM) to construct features, which are then combined and fed to a partial CRF layer for label prediction. The whole model has three main parts: 1) word vector representation layer: represents the input character string as continuous vectors via an embedding table; 2) feature extraction layer: obtains each character's score for each label through the BiLSTM and a linear transformation; 3) prediction layer: uses a partial CRF to predict the output label sequence for the current input.
Word vector representation layer: converts discrete input Chinese characters into continuous vector representations. We use a mapping table that stores the vector representation of each Chinese character. The vectors may be initialized with random numbers or set to pre-trained word vectors. During training, the table contents are model parameters optimized together with the other parameters. A given sentence C = <c1, c2, ..., cn> is mapped to the vector sequence <x1, x2, ..., xn>.
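A minimal sketch of the embedding lookup, assuming NumPy and a toy vocabulary (the table values here are random stand-ins; in the model they are trained parameters or pre-trained vectors):

```python
import numpy as np

# Illustrative character vocabulary and embedding table.
vocab = {'中': 0, '国': 1, '人': 2}
d = 4                                    # embedding dimension
rng = np.random.default_rng(0)
E = rng.standard_normal((len(vocab), d)) # one row per character

def embed(sentence):
    """Map sentence C = <c1, ..., cn> to the vector sequence <x1, ..., xn>."""
    return np.stack([E[vocab[c]] for c in sentence])

X = embed('中国')
assert X.shape == (2, d)
```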
Feature extraction layer: based on the input vector sequence, we encode with a bidirectional long short-term memory network (LSTM) to obtain feature representations. A unidirectional LSTM encodes only past information, not future information; to take both contexts into account, we apply forward and backward LSTMs to the sentence simultaneously. For the t-th Chinese character, the forward and backward LSTMs produce hidden representations that are concatenated into the final hidden state h_t. The score P of each label for each character is then computed as:

P = W_mlp · h_t + b_mlp

where W_mlp and b_mlp are model parameters.
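The linear scoring step P = W_mlp · h_t + b_mlp can be sketched with NumPy, using random stand-ins for the concatenated BiLSTM hidden states (all sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 5, 8, 9            # sentence length, LSTM hidden size, label count
# Stand-ins for the concatenated forward/backward hidden states h_t:
H = rng.standard_normal((n, 2 * d))
W_mlp = rng.standard_normal((2 * d, L))
b_mlp = np.zeros(L)
P = H @ W_mlp + b_mlp        # P[i, j]: score of label j at character i
assert P.shape == (n, L)
```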
Prediction layer: in partially labeled data, the label at some positions may take several values, so a sentence may have more than one correct label sequence. As shown in FIG. 2, if the partial label data of a sentence is ({B}, {B, I, E, O, S}, {B, I, E, O, S}, {O}, {O}, {O}, {O}), there are 5 × 5 = 25 correct label sequences.
Given a sentence C = <c1, c2, ..., cn> and a corresponding label sequence y = <y1, y2, ..., yn>, the sentence score is defined as:

score(C, y) = Σ_{i=2..n} A[y_{i-1}, y_i] + Σ_{i=1..n} P[i, y_i]

where A is a matrix of transition scores, A[i, j] is the score of transitioning from label i to label j, P is the output of the classification layer, and P[i, y_i] is the score of assigning label y_i to position i.
Define Y_L as the set of all correct sequences; the score of the set Y_L is defined as:

score(Y_L) = log Σ_{y ∈ Y_L} exp(score(C, y))
during the training process, we want to maximize the probability of the sum of all correct sequence scores. Therefore, the loss function is defined as follows:
wherein, YCRepresenting the set of all possible sequences for the case where the input is C.
The loss function still applies to fully labeled data: when the set Y_L has size 1, there is only one correct sequence, which corresponds to the fully labeled case. The model can therefore process fully labeled and partially labeled data simultaneously.
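The partial CRF loss above can be illustrated by brute-force enumeration over all label sequences (workable only for tiny inputs; in practice both sums are computed with dynamic programming; function names are illustrative):

```python
import itertools
import math

def seq_score(P, A, y):
    """Emission + transition score of one label sequence y."""
    s = sum(P[i][y[i]] for i in range(len(y)))
    s += sum(A[y[i - 1]][y[i]] for i in range(1, len(y)))
    return s

def partial_crf_loss(P, A, allowed):
    """loss = -log( sum over Y_L of exp(score) / sum over Y_C of exp(score) ).

    allowed[i] is the set of labels permitted at position i; a 'U'
    position allows every label.
    """
    n, L = len(P), len(P[0])
    z_c = z_l = 0.0
    for y in itertools.product(range(L), repeat=n):
        e = math.exp(seq_score(P, A, y))
        z_c += e                                   # denominator: all sequences
        if all(y[i] in allowed[i] for i in range(n)):
            z_l += e                               # numerator: correct sequences
    return -math.log(z_l / z_c)
```

When every label is allowed at every position, Y_L equals Y_C and the loss is exactly zero; restricting the label sets makes the loss positive, which is what training then minimizes.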
At test time, the Viterbi algorithm is used to find the highest-scoring sequence as the model's prediction.
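Viterbi decoding over the emission scores P and transition scores A can be sketched as follows (a minimal implementation for illustration):

```python
def viterbi(P, A):
    """Return the highest-scoring label sequence for emission scores P
    (n x L, one row per character) and transition scores A (L x L)."""
    n, L = len(P), len(P[0])
    score = list(P[0])            # best score ending in each label so far
    back = []                     # backpointers per position
    for i in range(1, n):
        new_score, ptr = [], []
        for j in range(L):
            # Best previous label k for current label j.
            best_k = max(range(L), key=lambda k: score[k] + A[k][j])
            new_score.append(score[best_k] + A[best_k][j] + P[i][j])
            ptr.append(best_k)
        score = new_score
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(range(L), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```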
The model has two states, training and prediction (prediction is the actual use of the model). During training, the system computes a label sequence for each input training sentence; at first this certainly differs greatly from the correct sequence, i.e. the initial model performs poorly. The model then computes a difference (loss) between its prediction and the correct answer and updates the parameters by backpropagation, aiming to minimize the loss. As training progresses, the model predicts the label sequence better and better until performance peaks (a cyclic, iterative process).
One application scenario of the present invention is described below:
taking the migration of news domain to social media domain as an example, the enumeration steps are as follows:
1. training is carried out on manual marking data in the news field, and a model A is obtained.
2. And simultaneously, marking original texts in the social media by using the model A and the entity dictionary, and performing cross comparison to obtain the weakly marked corpus in the field of the social media.
3. And training by using local labeling learning on the weakly labeled corpus to obtain a model B.
4. The application model B labels the text in the social media field, and the performance is obviously superior to that of the model A.
The above embodiments are merely preferred embodiments given to fully illustrate the invention; the scope of the invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the invention fall within its protection scope, which is defined by the claims.
Claims (10)
1. A multi-domain entity recognition method, characterized by comprising the following steps: to reduce the transfer difficulty caused by differing data distributions, two methods label the unlabeled target-domain corpus simultaneously; labels on which both methods agree are kept with high confidence, and uncertain positions receive a special label, yielding weakly labeled target-domain data; because the weakly labeled corpus contains uncertain labels, an ordinary CRF layer cannot model it, so partial annotation learning is applied to model the weakly labeled corpus;
automatic labeling:
searching, with an external entity dictionary and a forward maximum matching mechanism, for entities that may appear in the text; labeling successfully matched spans as entities and labeling unmatched characters "O";

training a model on source-domain data and using it directly to label the unlabeled target-domain text, as the result of the second automatic labeling method;

comparing the labeling results of the two methods and keeping labels on which they agree; labeling conflicting positions "U", meaning "Unknown", i.e. the label of that character is uncertain and may be any possible label; the result is the final weakly labeled target-domain corpus;
named entity identification based on local labeling:
the model treats the recognition task as a sequence labeling task: the input is a Chinese character sequence and the output is a label sequence;

in the model, the input character sequence is first encoded by a bidirectional long short-term memory network (BiLSTM) to construct features, which are then combined and fed to a partial CRF layer for label prediction; the whole model has three main parts: 1) word vector representation layer: represents the input character string as continuous vectors via an embedding table; 2) feature extraction layer: obtains each character's score for each label through the BiLSTM and a linear transformation; 3) prediction layer: uses a partial CRF to predict the output label sequence for the current input;

the model has two states, training and prediction; during training, the system computes a label sequence for each input training sentence, which at first certainly differs greatly from the correct label sequence, i.e. the initial model performs poorly; the model then computes a loss from its own prediction and the correct answer and updates the parameters by backpropagation, aiming to minimize the loss; as training progresses, the model predicts the label sequence better and better until performance peaks.
2. The multi-domain entity recognition method of claim 1, wherein the labels take the BIOES form, where B-XX denotes the first Chinese character of an entity of category XX, E-XX the last Chinese character, I-XX a middle character, and S-XX a single-character entity of category XX; all other Chinese characters are labeled "O".
3. The multi-domain entity recognition method of claim 1, wherein the word vector representation layer converts discrete input Chinese characters into continuous vector representations using a mapping table that stores the vector representation of each character; the vectors may be initialized with random numbers or set to pre-trained word vectors; during training, the table contents are model parameters optimized together with the other parameters; a given sentence C = <c1, c2, ..., cn> is mapped to the vector sequence <x1, x2, ..., xn>.
4. The multi-domain entity recognition method of claim 1, wherein the feature extraction layer encodes the input vector sequence with a bidirectional long short-term memory network to obtain feature representations; a unidirectional LSTM encodes only past information, not future information; to take both contexts into account, forward and backward LSTMs encode the sentence simultaneously; for the t-th Chinese character, the forward and backward LSTMs produce hidden representations that are concatenated into the final hidden state h_t; the score P of each label for each character is then computed as:

P = W_mlp · h_t + b_mlp

where W_mlp and b_mlp are model parameters.
5. The multi-domain entity recognition method of claim 1, wherein, in the prediction layer, the label at some positions of the partially labeled data may take several values, so a sentence may have more than one correct label sequence; for example, if the partial label data of a sentence is ({B}, {B, I, E, O, S}, {B, I, E, O, S}, {O}, {O}, {O}, {O}), there are 5 × 5 = 25 correct label sequences;

given a sentence C = <c1, c2, ..., cn> and a corresponding label sequence y = <y1, y2, ..., yn>, the sentence score is defined as:

score(C, y) = Σ_{i=2..n} A[y_{i-1}, y_i] + Σ_{i=1..n} P[i, y_i]

where A is a matrix of transition scores, A[i, j] is the score of transitioning from label i to label j, P is the output of the classification layer, and P[i, y_i] is the score of assigning label y_i to position i;
define Y_L as the set of all correct sequences; the score of the set Y_L is:

score(Y_L) = log Σ_{y ∈ Y_L} exp(score(C, y))

where Y_C denotes the set of all possible sequences when the input is C;

the loss function also applies to fully labeled data: when the set Y_L has size 1, there is only one correct sequence, corresponding to the fully labeled case; the model can therefore process fully labeled and partially labeled data simultaneously.
7. The method of claim 1, wherein, at test time, the Viterbi algorithm is used to find the highest-scoring sequence as the model prediction.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010437407.6A CN111611802B (en) | 2020-05-21 | 2020-05-21 | Multi-field entity identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010437407.6A CN111611802B (en) | 2020-05-21 | 2020-05-21 | Multi-field entity identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611802A true CN111611802A (en) | 2020-09-01 |
CN111611802B CN111611802B (en) | 2021-08-31 |
Family
ID=72195877
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010437407.6A Active CN111611802B (en) | 2020-05-21 | 2020-05-21 | Multi-field entity identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111611802B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733911A (en) * | 2020-12-31 | 2021-04-30 | 平安科技(深圳)有限公司 | Entity recognition model training method, device, equipment and storage medium |
CN112989801A (en) * | 2021-05-11 | 2021-06-18 | 华南师范大学 | Sequence labeling method, device and equipment |
CN112989811A (en) * | 2021-03-01 | 2021-06-18 | 哈尔滨工业大学 | BilSTM-CRF-based historical book reading auxiliary system and control method thereof |
CN113392659A (en) * | 2021-06-25 | 2021-09-14 | 携程旅游信息技术(上海)有限公司 | Machine translation method, device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107122416A (en) * | 2017-03-31 | 2017-09-01 | 北京大学 | A Chinese event extraction method |
CN110348018A (en) * | 2019-07-16 | 2019-10-18 | 苏州大学 | Method for simple event extraction using partial learning |
CN110765775A (en) * | 2019-11-01 | 2020-02-07 | 北京邮电大学 | Domain-adaptive method for named entity recognition fusing semantics and label differences |
CN111143571A (en) * | 2018-11-06 | 2020-05-12 | 马上消费金融股份有限公司 | Entity labeling model training method, entity labeling method and device |
Non-Patent Citations (1)
Title |
---|
YAOSHENG YANG et al.: "Distantly Supervised NER with Partial Annotation Learning and Reinforcement Learning", Proceedings of the 27th International Conference on Computational Linguistics * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733911A (en) * | 2020-12-31 | 2021-04-30 | 平安科技(深圳)有限公司 | Entity recognition model training method, device, equipment and storage medium |
WO2022142122A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Method and apparatus for training entity recognition model, and device and storage medium |
CN112733911B (en) * | 2020-12-31 | 2023-05-30 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of entity recognition model |
CN112989811A (en) * | 2021-03-01 | 2021-06-18 | 哈尔滨工业大学 | BiLSTM-CRF-based historical book reading auxiliary system and control method thereof |
CN112989801A (en) * | 2021-05-11 | 2021-06-18 | 华南师范大学 | Sequence labeling method, device and equipment |
CN112989801B (en) * | 2021-05-11 | 2021-08-13 | 华南师范大学 | Sequence labeling method, device and equipment |
CN113392659A (en) * | 2021-06-25 | 2021-09-14 | 携程旅游信息技术(上海)有限公司 | Machine translation method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111611802B (en) | 2021-08-31 |
Similar Documents
Publication | Title |
---|---|
CN111611802B (en) | Multi-field entity identification method |
CN110457675B (en) | Predictive model training method and device, storage medium and computer equipment | |
CN111783462B (en) | Chinese named entity recognition model and method based on double neural network fusion | |
CN109902145B (en) | Attention mechanism-based entity relationship joint extraction method and system | |
CN109800437B (en) | Named entity recognition method based on feature fusion | |
CN110083710B (en) | Word definition generation method based on cyclic neural network and latent variable structure | |
CN111666758B (en) | Chinese word segmentation method, training device and computer readable storage medium | |
CN111046179B (en) | Text classification method for open network question in specific field | |
CN110263325B (en) | Chinese word segmentation system | |
CN112487820B (en) | Chinese medical named entity recognition method | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN113190656B (en) | Chinese named entity extraction method based on multi-annotation frame and fusion features | |
CN113221571B (en) | Entity relation joint extraction method based on entity correlation attention mechanism | |
CN113128203A (en) | Attention mechanism-based relationship extraction method, system, equipment and storage medium | |
RU2712101C2 (en) | Predicting the probability of occurrence of a string using a sequence of vectors |
KR101646461B1 (en) | Method for Korean dependency parsing using deep learning |
CN115238026A (en) | Medical text subject segmentation method and device based on deep learning | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN112016299A (en) | Method and device for generating dependency syntax tree by using neural network executed by computer | |
CN116362242A (en) | Small sample slot value extraction method, device, equipment and storage medium | |
CN116680407A (en) | Knowledge graph construction method and device | |
CN113704466B (en) | Text multi-label classification method and device based on iterative network and electronic equipment | |
CN115600597A (en) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium | |
CN112214994B (en) | Word segmentation method, device and equipment based on multi-level dictionary and readable storage medium | |
CN114519104A (en) | Action label labeling method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||