CN112925886A - Few-sample entity identification method based on field adaptation - Google Patents

Few-sample entity identification method based on field adaptation

Info

Publication number
CN112925886A
CN112925886A (application CN202110266908.7A)
Authority
CN
China
Prior art keywords
pivot
feature
sentence
pivot feature
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110266908.7A
Other languages
Chinese (zh)
Other versions
CN112925886B (en)
Inventor
韩瑞峰
杨红飞
金霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huoshi Creation Technology Co ltd
Original Assignee
Hangzhou Firestone Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Firestone Technology Co ltd filed Critical Hangzhou Firestone Technology Co ltd
Priority to CN202110266908.7A priority Critical patent/CN112925886B/en
Publication of CN112925886A publication Critical patent/CN112925886A/en
Application granted granted Critical
Publication of CN112925886B publication Critical patent/CN112925886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a few-sample entity recognition method based on domain adaptation. First, pivot features are selected: n-grams that appear frequently in both the source and the target domain serve as pivot feature words. A training data set is then built and a set of pivot feature classifiers is trained; the coefficient vectors of all pivot features form a matrix that captures the correlation between non-pivot feature words and pivot feature words. An entity recognition model is trained on the source domain and finally applied to the target domain, where each non-pivot feature word is replaced by the most relevant source-domain non-pivot feature word whose correlation exceeds a threshold. By exploiting the pivot feature words shared between domains, the invention derives a correspondence between the differing non-pivot features of the two domains and thereby maps feature words across domains.

Description

Few-sample entity identification method based on field adaptation
Technical Field
The invention relates to the field of entity recognition, and in particular to a few-sample entity recognition method based on domain adaptation.
Background
In text information extraction applications, the scenarios are numerous and fine-grained, labeled samples are scarce, and the cost of obtaining them is high. Industrial applications face this situation today, and current technology offers no mature solution for scenarios with only a few labeled samples. Skillfully reusing existing labeling resources to transfer what a model has learned to a few-sample scenario is therefore an active research direction.
Existing text information extraction methods require a large number of labeled samples for model training. Although some deep models tend to achieve ever higher accuracy with ever fewer labeled samples, a certain number of labeled samples is still needed to train a usable model, and no work can proceed before those samples are obtained. The development cost is thus shifted to sample labeling, and overall development efficiency remains low.
Applied to entity extraction, the present method can exploit abundant labeling resources in a similar domain to obtain an extraction model with high accuracy without any labeled samples in the target domain.
Disclosure of Invention
The invention aims to provide a few-sample entity recognition method based on domain adaptation. Features shared between the corpora of different domains are used as pivot features to establish a mapping of features across domains, so that a model trained on a source domain with abundant labels performs well on an unlabeled target domain. Applied to entity recognition, the method achieves high accuracy on an unlabeled target domain through transfer learning between similar domains.
The purpose of the invention is achieved by the following technical scheme: a few-sample entity recognition method based on domain adaptation, comprising the following steps:
(1) Selecting pivot features
Count the frequency of each n-gram in the corpora of the source domain and the target domain, and select phrases that appear in both domains with frequency above a threshold as pivot feature words. Each pivot feature word w instantiates the templates "pivot feature word w on the right", "pivot feature word w on the left" and "pivot feature word w in the middle"; each instance represents a pivot feature of one dimension, and together they form the pivot feature set. Here w denotes the pivot feature word in a text sentence.
(2) Establishing a training data set and training the pivot feature classifiers
Using the pivot feature set of step (1) as the training data set, train a group of pivot feature classifiers that predict whether a pivot feature word is present in a sample sentence. Specifically, for each pivot feature i, train a logistic regression model as its pivot feature classifier to judge whether that pivot feature is present among the words of a sample sentence. After training, the logistic regression coefficients form a column vector w_i whose dimension equals the number of non-pivot features; the vectors for all pivot features form the matrix W = [w_1, w_2, …, w_M], where w_M corresponds to the pivot feature of the M-th dimension and M = 3N for N pivot feature words. The j-th component w_ij of w_i represents the correlation between the j-th non-pivot feature word and the i-th pivot feature word; a positive w_ij indicates a positive correlation.
Sample input of the pivot feature classifier: compute the n-grams of all sentences in both domains to form an n-gram list. The feature of each sentence is a binary vector indicating, for every word in the n-gram list, whether it occurs in the sentence as a pivot or non-pivot feature word; the dimensions corresponding to all pivot feature words are then removed to obtain the sentence's feature vector.
Sample output of the pivot feature classifier: the output dimension is 1. Because the n-gram features of the sentence are computed, if a pivot feature word stands immediately to the left or right of some non-pivot feature word in the sentence, the sentence necessarily contains at least the 2-gram formed with that pivot feature word; the sentence sample is then labeled 1, and otherwise 0. At the same time, the corresponding entry in the coefficient vector w_i of that pivot feature word is positive, i.e. the non-pivot feature words in the sentence are positively correlated with the pivot feature word.
(3) Training the entity recognition model on the source domain
Train an entity recognition model on the source-domain corpus. The model takes a sentence as input and outputs a label sequence of the same length as the sentence, marking the category of each character.
(4) Entity recognition on the target domain
Predict on the target domain with the model trained in step (3). Using the correlation information in the matrix W, replace each non-pivot feature word in the target domain with the most relevant source-domain non-pivot feature word whose correlation also exceeds a threshold.
Further, the source domain has both labeled and unlabeled corpora, while the target domain contains only unlabeled corpora.
Further, n in the n-gram is 2.
Further, in step (2), w_i denotes the covariance between the non-pivot features and the pivot feature.
Further, in step (2), for Chinese corpora, the corpora of both domains are word-segmented, all possible segmentation results are combined into a lattice, n-grams are built over all lattice words, and samples are constructed by the procedure of step (1); each sentence then contains the n-grams corresponding to all of its segmentation results.
Further, in step (3), for Chinese corpora the lattice word segmentation is used as input.
Further, in step (4), for Chinese corpora each lattice-segmented word is replaced with the most relevant non-pivot word from the source domain.
Further, in step (4), if a replacement changes the length of the prediction result, the resulting offset is handled.
The invention has the following beneficial effects: the pivot feature words shared between domains are used to derive the correspondence between the differing non-pivot features of the domains, thereby mapping feature words across domains, so that a model trained on a source domain with abundant labels can be used directly on an unlabeled target domain.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a few-sample entity recognition method based on domain adaptation. The invention assumes that a similar domain has relatively abundant labeled and unlabeled corpora: the source domain has both labeled and unlabeled corpora, while the target domain contains only unlabeled corpora. The specific process is as follows:
(1) Selecting pivot features
Count the frequency of each n-gram (n is typically 2) in the corpora of the source domain and the target domain, and select phrases that appear in both domains with frequency above a threshold (e.g. 100, chosen according to the required recognition accuracy) as pivot feature words. Each pivot feature word w instantiates the templates "pivot feature word w on the right", "pivot feature word w on the left" and "pivot feature word w in the middle"; each instance represents a pivot feature of one dimension, and together they form the pivot feature set. Here w denotes the pivot feature word in a text sentence.
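The pivot-selection step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function and variable names are my own, and both corpora are assumed to be pre-tokenized into word lists.

```python
from collections import Counter

def select_pivot_features(source_sents, target_sents, n=2, min_freq=100):
    """Select n-grams frequent in BOTH domains as pivot feature words,
    then instantiate the three positional templates for each of them."""
    def ngram_counts(sents):
        counts = Counter()
        for tokens in sents:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    src, tgt = ngram_counts(source_sents), ngram_counts(target_sents)
    pivots = sorted(g for g in src
                    if g in tgt and src[g] >= min_freq and tgt[g] >= min_freq)
    # Each pivot word w yields three one-dimensional pivot features:
    # "w on the left", "w on the right", "w in the middle" -> M = 3N.
    features = [(g, pos) for g in pivots
                for pos in ("left", "right", "middle")]
    return pivots, features
```

With n = 2 and a frequency threshold of 100 as in the embodiment, N shared bigrams yield M = 3N pivot features.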
(2) Establishing a training data set and training the pivot feature classifiers
Using the pivot feature set of step (1) as the training data set, train a group of pivot feature classifiers that predict whether a pivot feature word is present in a sentence, e.g. whether the sentence contains the pivot feature word w to the right of a given word. Specifically, for each pivot feature i, train a logistic regression model as its pivot feature classifier to judge whether that pivot feature is present among the words of a sample sentence. After training, the logistic regression coefficients form a column vector w_i whose dimension equals the number of non-pivot features; the vectors for all pivot features form the matrix W = [w_1, w_2, …, w_M], where w_M corresponds to the pivot feature of the M-th dimension, M = 3N for N pivot feature words, and w_i denotes the covariance between the non-pivot features and the pivot feature. The j-th component w_ij of w_i represents the correlation between the j-th non-pivot feature word and the i-th pivot feature word; a positive w_ij indicates a positive correlation. If the non-pivot features of two different domains are each positively correlated with multiple identical pivot features, the two non-pivot features are likely to correspond.
Sample input of the pivot feature classifier: compute the n-grams of all sentences in both domains to form an n-gram list. The feature of each sentence is a binary vector indicating, for every word in the n-gram list, whether it occurs in the sentence as a pivot or non-pivot feature word; the dimensions corresponding to all pivot feature words are then removed to obtain the sentence's feature vector.
Sample output of the pivot feature classifier: the output dimension is 1. Because the n-gram features of the sentence are computed, if a pivot feature word stands immediately to the left or right of some non-pivot feature word in the sentence, the sentence necessarily contains at least the 2-gram formed with that pivot feature word; the sentence sample is then labeled 1, and otherwise 0. The relevance of the two words is reflected in this 2-gram occurring in large numbers, and the corresponding entry in the coefficient vector w_i of that pivot feature word is positive, i.e. the non-pivot feature words in the sentence are positively correlated with the pivot feature word.
For Chinese corpora, the corpora of both domains are word-segmented, all possible segmentation results are combined into a lattice, n-grams are built over all lattice words, and samples are constructed by the procedure of step (1); each sentence then contains the n-grams corresponding to all of its segmentation results.
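A minimal sketch of step (2), assuming the binary sentence-feature matrix over the shared n-gram list has already been built. The hand-rolled gradient-descent logistic regression below merely stands in for whatever solver the authors used, and all names are illustrative:

```python
import numpy as np

def train_pivot_classifiers(sent_feats, pivot_idx, nonpivot_idx,
                            lr=0.5, epochs=500):
    """sent_feats: (num_sentences, num_ngrams) binary matrix.
    For each pivot feature i, fit a logistic regression that predicts
    its presence from the non-pivot features only, and stack the
    coefficient vectors w_i into W (one column per pivot feature),
    so that W[j, i] is the learned correlation of non-pivot feature j
    with pivot feature i."""
    X = sent_feats[:, nonpivot_idx].astype(float)   # pivot dims removed
    W = np.zeros((len(nonpivot_idx), len(pivot_idx)))
    for i, p in enumerate(pivot_idx):
        y = sent_feats[:, p].astype(float)          # 1 if pivot present
        if y.min() == y.max():                      # degenerate label: skip
            continue
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):                     # plain gradient descent
            prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
            grad = X.T @ (prob - y) / len(y)
            w -= lr * grad
            b -= lr * (prob - y).mean()
        W[:, i] = w
    return W
```

A non-pivot feature that co-occurs with a pivot feature receives a positive coefficient, matching the sign interpretation of w_ij in the text.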
(3) Training the entity recognition model on the source domain
Train an entity recognition model, such as an LSTM, on the source-domain corpus. The model takes a sentence as input and outputs a label sequence of the same length as the sentence, marking the category of each character. For Chinese corpora, the lattice word segmentation is used as input.
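The embodiment leaves the tagger architecture open ("such as an LSTM"). The sketch below is only a toy dictionary tagger illustrating the stated interface, one label per character in a BIO scheme, not the LSTM model itself; the lexicon and tag names are invented for the example:

```python
def tag_sentence(sentence, entity_lexicon):
    """Toy stand-in for the source-domain entity recognition model:
    input is a sentence, output is a label sequence of the same length,
    marking the category of each character (BIO scheme)."""
    labels = ["O"] * len(sentence)
    for entity, etype in entity_lexicon.items():
        start = sentence.find(entity)
        while start != -1:
            labels[start] = "B-" + etype
            for k in range(start + 1, start + len(entity)):
                labels[k] = "I-" + etype
            start = sentence.find(entity, start + len(entity))
    return labels
```

The contract matters for step (4): because the output has exactly one label per character, replacing a target-domain word with a source-domain word of different length shifts the labels, which is the offset handling the text mentions.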
(4) Entity recognition on the target domain
Predict on the target domain with the model trained in step (3). Using the correlation information in the matrix W, look for a corresponding source-domain non-pivot feature word word_src for each target-domain non-pivot feature word word_dst. Specifically, screen the set word_pivots of pivot feature words that are most correlated with word_dst and whose correlation exceeds a threshold, e.g. 0.7; then find the source-domain word with the highest sum of correlation scores over word_pivots. If the average of those correlation scores also exceeds the threshold (e.g. 0.7), replace word_dst with that source-domain non-pivot feature word word_src; otherwise do not replace it. For example, if the source domain is diabetes medical records and the target domain is cancer medical records, the non-pivot feature word "lung cancer" in the target domain may be replaced with "diabetes" from the source domain. For Chinese corpora, each lattice-segmented word is replaced with the most relevant non-pivot word from the source domain. If a replacement changes the length of the prediction result, the resulting offset is handled.
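The replacement rule of step (4) can be sketched as follows, reading rows of a matrix W laid out as in step (2) (non-pivot features as rows, pivot features as columns); the threshold and all names are illustrative:

```python
import numpy as np

def find_replacement(dst_row, W, source_rows, threshold=0.7):
    """dst_row: row index in W of a target-domain non-pivot feature word.
    Select the pivot features correlated with it above the threshold
    (word_pivots), then pick the source-domain word with the highest
    summed correlation over those pivots; replace only if the average
    correlation also clears the threshold, else keep the original word
    (signalled by returning None)."""
    word_pivots = np.where(W[dst_row] > threshold)[0]
    if len(word_pivots) == 0:
        return None                      # no confident pivots: no replacement
    scores = {s: W[s, word_pivots].sum() for s in source_rows}
    best = max(scores, key=scores.get)
    if scores[best] / len(word_pivots) > threshold:
        return best
    return None
```

A target word and a source word end up matched exactly when they are both strongly correlated with the same shared pivot features, which is the correspondence principle stated in step (2).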
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.

Claims (8)

1. A few-sample entity recognition method based on domain adaptation, characterized by comprising the following steps:
(1) Selecting pivot features
Count the frequency of each n-gram in the corpora of the source domain and the target domain, and select phrases that appear in both domains with frequency above a threshold as pivot feature words. Each pivot feature word w instantiates the templates "pivot feature word w on the right", "pivot feature word w on the left" and "pivot feature word w in the middle"; each instance represents a pivot feature of one dimension, and together they form the pivot feature set. Here w denotes the pivot feature word in a text sentence.
(2) Establishing a training data set and training the pivot feature classifiers
Using the pivot feature set of step (1) as the training data set, train a group of pivot feature classifiers that predict whether a pivot feature word is present in a sample sentence. Specifically, for each pivot feature i, train a logistic regression model as its pivot feature classifier to judge whether that pivot feature is present among the words of a sample sentence. After training, the logistic regression coefficients form a column vector w_i whose dimension equals the number of non-pivot features; the vectors for all pivot features form the matrix W = [w_1, w_2, …, w_M], where w_M corresponds to the pivot feature of the M-th dimension and M = 3N for N pivot feature words. The j-th component w_ij of w_i represents the correlation between the j-th non-pivot feature word and the i-th pivot feature word; a positive w_ij indicates a positive correlation.
Sample input of the pivot feature classifier: compute the n-grams of all sentences in both domains to form an n-gram list. The feature of each sentence is a binary vector indicating, for every word in the n-gram list, whether it occurs in the sentence as a pivot or non-pivot feature word; the dimensions corresponding to all pivot feature words are then removed to obtain the sentence's feature vector.
Sample output of the pivot feature classifier: the output dimension is 1. Because the n-gram features of the sentence are computed, if a pivot feature word stands immediately to the left or right of some non-pivot feature word in the sentence, the sentence necessarily contains at least the 2-gram formed with that pivot feature word; the sentence sample is then labeled 1, and otherwise 0. At the same time, the corresponding entry in the coefficient vector w_i of that pivot feature word is positive, i.e. the non-pivot feature words in the sentence are positively correlated with the pivot feature word.
(3) Training the entity recognition model on the source domain
Train an entity recognition model on the source-domain corpus. The model takes a sentence as input and outputs a label sequence of the same length as the sentence, marking the category of each character.
(4) Entity recognition on the target domain
Predict on the target domain with the model trained in step (3). Using the correlation information in the matrix W, replace each non-pivot feature word in the target domain with the most relevant source-domain non-pivot feature word whose correlation also exceeds a threshold.
2. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein the source domain has both labeled and unlabeled corpora, and the target domain contains only unlabeled corpora.
3. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein n in the n-gram is 2.
4. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein in step (2), w_i denotes the covariance between the non-pivot features and the pivot feature.
5. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein in step (2), for Chinese corpora, the corpora of both domains are word-segmented, all possible segmentation results are combined into a lattice, n-grams are built over all lattice words, and samples are constructed by the procedure of step (1); each sentence then contains the n-grams corresponding to all of its segmentation results.
6. The few-sample entity recognition method based on domain adaptation according to claim 5, wherein in step (3), for Chinese corpora the lattice word segmentation is used as input.
7. The few-sample entity recognition method based on domain adaptation according to claim 5, wherein in step (4), for Chinese corpora each lattice-segmented word is replaced with the most relevant non-pivot word from the source domain.
8. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein in step (4), if a replacement changes the length of the prediction result, the resulting offset is handled.
CN202110266908.7A 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation Active CN112925886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110266908.7A CN112925886B (en) 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110266908.7A CN112925886B (en) 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation

Publications (2)

Publication Number Publication Date
CN112925886A true CN112925886A (en) 2021-06-08
CN112925886B CN112925886B (en) 2022-01-04

Family

ID=76172758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110266908.7A Active CN112925886B (en) 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation

Country Status (1)

Country Link
CN (1) CN112925886B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310718A1 (en) * 2007-06-18 2008-12-18 International Business Machines Corporation Information Extraction in a Natural Language Understanding System
CN106096004A (en) * 2016-06-23 2016-11-09 北京工业大学 A kind of method setting up extensive cross-domain texts emotional orientation analysis framework
CN110489753A (en) * 2019-08-15 2019-11-22 昆明理工大学 Improve the corresponding cross-cutting sensibility classification method of study of neuromechanism of feature selecting
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310718A1 (en) * 2007-06-18 2008-12-18 International Business Machines Corporation Information Extraction in a Natural Language Understanding System
CN106096004A (en) * 2016-06-23 2016-11-09 北京工业大学 A kind of method setting up extensive cross-domain texts emotional orientation analysis framework
CN110489753A (en) * 2019-08-15 2019-11-22 昆明理工大学 Improve the corresponding cross-cutting sensibility classification method of study of neuromechanism of feature selecting
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张雪松, 庄严, 闫飞, 王伟: "Research and progress on category-level object recognition and detection based on transfer learning", Acta Automatica Sinica *

Also Published As

Publication number Publication date
CN112925886B (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN108897989B (en) Biological event extraction method based on candidate event element attention mechanism
CN109800437B (en) Named entity recognition method based on feature fusion
CN108733778B (en) Industry type identification method and device of object
CN112231447B (en) Method and system for extracting Chinese document events
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN114416942A (en) Automatic question-answering method based on deep learning
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112699685B (en) Named entity recognition method based on label-guided word fusion
CN110046356A (en) Label is embedded in the application study in the classification of microblogging text mood multi-tag
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN110569355A (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN112925886B (en) Few-sample entity identification method based on field adaptation
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN115186670A (en) Method and system for identifying domain named entities based on active learning
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN114661900A (en) Text annotation recommendation method, device, equipment and storage medium
CN114510943A (en) Incremental named entity identification method based on pseudo sample playback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310051 7th floor, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Huoshi Creation Technology Co.,Ltd.

Address before: 310051 7th floor, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU FIRESTONE TECHNOLOGY Co.,Ltd.
