CN115935959A - Method for labeling low-resource agglutinative language sequences - Google Patents

Method for labeling low-resource agglutinative language sequences

Info

Publication number
CN115935959A
CN115935959A (application CN202211612122.7A)
Authority
CN
China
Prior art keywords
word
representation
language
character
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211612122.7A
Other languages
Chinese (zh)
Inventor
刘畅
哈里旦木·阿布都克里木
阿布都克力木·阿布力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University Of Finance & Economics
Original Assignee
Xinjiang University Of Finance & Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University Of Finance & Economics filed Critical Xinjiang University Of Finance & Economics
Priority to CN202211612122.7A priority Critical patent/CN115935959A/en
Publication of CN115935959A publication Critical patent/CN115935959A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for labeling low-resource agglutinative language sequences. The method first performs data augmentation on the target-language data set, then preliminarily extracts the multilingual features of the target text, and finally uses the W2NER framework to further extract the multilingual features of the target text and perform prediction, generating a predicted label sequence. The method quickly obtains information such as morphemes, parts of speech, times, places and persons, significantly reduces classification time and lowers labor cost. The invention provides effective technical support for sequence labeling tasks such as morphological segmentation, named entity recognition and part-of-speech tagging of Central Asian low-resource agglutinative languages such as Uyghur, Kazakh and Kirghiz; it can be fully applied to morphological and part-of-speech analysis and information extraction for these languages, serves downstream tasks such as machine translation, question answering and sentiment analysis, and has reference and popularization value for sequence labeling of other low-resource languages in China.

Description

Method for labeling low-resource agglutinative language sequences
Technical Field
The invention relates to a sequence labeling method in the field of natural language processing, and in particular to a sequence labeling method suitable for minority-language information processing.
Background
Sequence labeling is a complex natural language understanding task: the target label sequence of a sentence must be predicted in order to classify words or morphemes, extract textual information, and so on. Sequence labeling methods fall into three main categories, namely rule/dictionary-based, statistics-based, and deep-learning-based methods; existing work focuses mainly on deep learning.
mBERT is the multilingual version of BERT (Bidirectional Encoder Representations from Transformers) and is mainly used for multilingual natural language understanding tasks. mBERT is pre-trained on 104 languages, which are represented in the same semantic space. BERT is based on the encoder portion of the Transformer architecture and is pre-trained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM is a self-supervised method that learns semantic information by replacing original tokens with random tokens or the [MASK] token. The NSP mechanism judges whether two input sentences are adjacent, but given massive pre-training text it has been found to reduce model performance on most understanding tasks. mBERT makes effective use of multilingual features, can process multiple languages simultaneously, and performs well on low-resource language understanding tasks.
XLM-R (Cross-lingual Language Model - RoBERTa) is based mainly on the RoBERTa model; it adopts a larger-scale multilingual corpus, contains more multilingual features, removes the NSP mechanism, and clearly outperforms mBERT on multilingual downstream tasks. However, low-resource agglutinative languages account for only a small proportion of XLM-R's pre-training corpus, so XLM-R's feature extraction for these languages is limited.
CINO (Chinese mINOrity pre-trained language model) performs secondary pre-training on Chinese minority-language corpora (eight languages, including Uyghur and Kazakh) on top of XLM-R, effectively improving the model's comprehension of low-resource agglutinative languages; however, it does not fully extract morphemes or the internal relations of word sequences.
W2NER (unified Named Entity Recognition as Word-Word relation classification) is a framework that unifies multiple named entity recognition task types. It pays closer attention to the adjacency relations among entity words and achieves state-of-the-art results on several English and Chinese named entity recognition benchmarks, but it lacks multilingual features and demands a relatively large amount of data.
In general, existing sequence labeling work on low-resource agglutinative languages rarely considers the relations between characters or between words. Owing to the scarcity of labels, the sequence labeling task is even harder for agglutinative languages with few data resources and complex grammar, such as Uyghur.
Disclosure of Invention
In order to alleviate the data shortage of low-resource agglutinative languages and to fully consider the relations between characters or between words, the invention provides a method suitable for low-resource agglutinative language sequence labeling tasks.
In order to realize the purpose of the invention, the invention specifically adopts the following technical scheme:
a method for labeling low-resource glue word sequences, which is characterized by comprising the following steps:
enhancing the data, and adding training set data of other similar languages of the same task into a training set of the deep learning model;
primarily extracting multilingual features of a target text by utilizing a Chinese mINOrity Language Pre-training Language Model (CINO);
using W 2 NER framework further extracts objectsAnd predicting the multi-language characteristics of the text, acquiring the relation between characters or words, and generating a prediction tag sequence.
Preferably, the Chinese minority pre-trained language model is used only for character representation or word representation, without adjusting the model parameters, and preliminarily extracting the multilingual features with the Chinese minority pre-trained language model comprises:
segmenting the target text with the SentencePiece toolkit, representing each input sentence by a number of tokens, mapping each token to a real-valued vector according to the generated dictionary, and then obtaining a preliminary multilingual character representation or word representation for each input sentence X = {x_1, x_2, ..., x_N} through a max-pooling mechanism, where x_i (i = 1, 2, ..., N) denotes the i-th character or word and N is the number of characters or words in the input sentence.
Preferably, using the W2NER framework to understand the relations between characters or between words and to generate the predicted label sequence comprises:
acquiring context information from the input sentence with a Bidirectional Long Short-Term Memory (Bi-LSTM) network to obtain the final multilingual character representation or word representation;
passing the character or word representation into a convolution layer, generating a character-pair or word-pair representation through Conditional Layer Normalization (CLN), a BERT-style grid representation, a Multi-Layer Perceptron (MLP) and multi-granularity dilated convolution, and further generating and refining the character-pair or word-pair grid representation for the subsequent classification of relations between characters or between words;
adopting a joint prediction layer composed of a biaffine prediction layer and an MLP prediction layer to score relations from different perspectives and to jointly reason over the relations contained in all character pairs or word pairs;
treating the relations between characters or between words as a directed graph, decoding with the NNW (Next-Neighboring-Word) and THW (Tail-Head-Word) mechanisms to obtain prediction probabilities, and finally obtaining the predicted label sequence from these probabilities.
Preferably, the conditional layer normalization expands the character or word representation from 2 dimensions to 3 dimensions, yielding a character-pair or word-pair grid representation.
Preferably, the BERT-style grid representation concatenates the character-pair or word-pair grid representation information, the character-pair or word-pair position information and the grid region information, in the manner of a BERT input representation, to obtain a position-region-aware representation of the grid; a multi-layer perceptron then reduces the dimensionality to enrich the character-pair or word-pair grid representation.
Preferably, the multi-granularity dilated convolution captures interactions between characters or words at different distances.
Preferably, the input of the biaffine prediction layer comes from the bidirectional long short-term memory network, and its output is a character-pair or word-pair relation score computed by a classifier; the input of the MLP prediction layer comes from the character-pair or word-pair grid representation of the convolution layer, and the relation score is computed by a multi-layer perceptron; the outputs of the biaffine and MLP prediction layers are summed and fed into a Softmax function to compute the final character-pair or word-pair relation score.
The invention provides a method for labeling low-resource agglutinative language sequences. First, the target-language data set is augmented, providing more multilingual features; additional semantic information is learned through the similarity between languages, which alleviates the shortage of labeled data in the target language and eases the training of the downstream W2NER framework, whose parameters are numerous and hard to train. Then, CINO performs a preliminary extraction of multilingual features from the target text to obtain more accurate character or word representations of the target language and the augmentation languages; CINO is used only for character or word representation and its parameters are not adjusted, which reduces the computational requirement and can alleviate negative transfer among languages (the so-called curse of multilinguality) and the forgetting of low-resource languages. Finally, the W2NER framework further extracts and predicts the multilingual features of the target text, deeply mining the complex relations between characters or between words of the low-resource agglutinative language, and generates a predicted label sequence. The three parts cooperate and build on each other layer by layer, converting massive unstructured low-resource agglutinative language text into structured data, effectively classifying the characters, words and phrases of the target text, and quickly obtaining information such as morphemes, parts of speech, times, places and persons, which reduces classification time and labor cost. The invention provides effective technical support for sequence labeling tasks such as morphological segmentation, named entity recognition and part-of-speech tagging of Central Asian low-resource agglutinative languages such as Uyghur, Kazakh and Kirghiz; it can be fully applied to morphological and part-of-speech analysis and information extraction for these languages, serves downstream tasks such as machine translation, question answering and sentiment analysis, and has reference and popularization value for sequence labeling of other low-resource languages in China.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of the method for labeling low-resource agglutinative language sequences according to the first embodiment of the present invention.
FIG. 2 is a schematic diagram of the deep learning model of the method for labeling low-resource agglutinative language sequences according to the second embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples. It should be understood that the specific examples described herein are for illustration and description only and are not intended to limit the invention.
Example one
The present embodiment provides a method for labeling low-resource agglutinative language sequences, the main flow of which is shown in FIG. 1. The method comprises:
performing data augmentation: adding training-set data of the same task in other similar languages to the training set of the deep learning model;
preliminarily extracting the multilingual features of the target text using the Chinese minority pre-trained language model CINO;
using the W2NER framework to further extract and predict the multilingual features of the target text, obtaining the relations between characters or between words and generating a predicted label sequence.
The above steps are further described in detail below:
1. Data augmentation
Training-set data of the same task in other similar languages is added to the training set of the deep learning model, expanding the model's data set and realizing data augmentation. Under low-resource conditions, augmentation provides more multilingual features; the model learns additional semantic information through the similarity between languages, which alleviates the shortage of labeled data in the target language and thereby eases the training of the downstream W2NER framework, whose parameters are numerous and hard to train.
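By way of illustration only, this augmentation step can be sketched as a simple concatenation of training sets; the CoNLL-style file format and the file names below are assumptions of this sketch, not requirements of the invention:

```python
from pathlib import Path

def load_conll(path):
    """Read a CoNLL-style file: one 'token<TAB>label' pair per line,
    with blank lines separating sentences."""
    sentences, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        else:
            token, label = line.split("\t")
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

# Target language (Uyghur) plus related-language data for the same task.
# The concrete file names are hypothetical.
train = load_conll("ug_train.conll")                # target language
for aug in ["kk_train.conll", "tr_train.conll"]:    # e.g. Kazakh, Turkish
    train += load_conll(aug)                        # simple concatenation
```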
2. Preliminary extraction of multilingual features with CINO
The target text is segmented with the SentencePiece toolkit; each input sentence is represented by a number of tokens, and each token is mapped to a real-valued vector according to the generated dictionary. Each input sentence X = {x_1, x_2, ..., x_N} then yields a preliminary multilingual character representation or word representation through a max-pooling mechanism, where x_i (i = 1, 2, ..., N) denotes the i-th character or word and N is the number of characters or words in the sentence.
Compared with other pre-trained language models, CINO contains more general semantic information about low-resource agglutinative languages, so more accurate character or word representations of the target language and the augmentation languages are easy to obtain. Unlike the pre-train/fine-tune paradigm common in the low-resource language field, the invention uses CINO only for character or word representation and does not adjust the model parameters; this reduces the computational requirement and can alleviate negative transfer among languages (the curse of multilinguality) and the forgetting of low-resource languages.
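A minimal sketch of this frozen feature-extraction step follows; the Hugging Face checkpoint name hfl/cino-large-v2 and the word-to-subword alignment via word_ids() are assumptions of the sketch, not prescribed by the invention:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/cino-large-v2")  # SentencePiece-based
model = AutoModel.from_pretrained("hfl/cino-large-v2").eval()   # frozen: no fine-tuning

@torch.no_grad()  # parameters are not adjusted
def word_representations(words):
    """Max-pool CINO subword vectors into one vector per word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]        # (num_subwords, dim)
    reps = []
    for i in range(len(words)):
        # indices of the subword tokens belonging to word i
        idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        reps.append(hidden[idx].max(dim=0).values)    # max pooling
    return torch.stack(reps)                          # (N, dim)
```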
3. The W2NER framework is divided into an encoding layer, a convolution layer and a joint prediction layer. It further extracts features from, and makes predictions over, the multilingual character or word representations obtained from CINO, deeply mining the complex relations between characters or between words of the low-resource agglutinative language and generating a predicted label sequence.
In this embodiment characters and words are processed in the same way; the following description takes words as the example.
Using the W2NER framework to further extract and predict the multilingual features of the target text, obtain the relations between words and generate the predicted label sequence specifically comprises the following steps:
and 3.1, acquiring context information from the input sentence by adopting Bi-LSTM to obtain a final word expression. The word representation may be expressed as
Figure BDA0003997691820000041
Wherein h is i Is x i Is expressed by the word(s) in->
Figure BDA0003997691820000042
Is a set of real numbers, d h Is h i Dimension (d) of (a).
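A minimal PyTorch sketch of this encoding step (the input dimension and hidden size are illustrative assumptions):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Bi-LSTM over the CINO word representations."""
    def __init__(self, in_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)  # forward + backward context

    def forward(self, x):            # x: (batch, N, in_dim)
        h, _ = self.lstm(x)          # h: (batch, N, 2*hidden) = final word reps
        return h
```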
3.2. The word representations are passed into the convolution layer, where the word-pair grid representation is refined through CLN, the BERT-style grid representation, an MLP and multi-granularity dilated convolution, for the subsequent classification of relations between words.
CLN expands the word representation from 2 dimensions to 3 dimensions, producing the 3-dimensional word-pair grid representation S ∈ ℝ^(N×N×d_h). Element S_ij of this tensor is the representation of the word pair (x_i, x_j) and is computed as shown in Equation 1:

S_ij = CLN(h_i, h_j) = γ_ij ⊙ ((h_j − μ) / σ) + λ_ij    (Equation 1)

where γ_ij and λ_ij are the gain parameter and the bias of the layer normalization (both generated from the condition h_i), μ and σ denote the mean and standard deviation over the elements of h_j, and ⊙ denotes the Hadamard product.
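A sketch of conditional layer normalization consistent with Equation 1; generating γ and λ from the conditioning vector h_i with linear layers is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class CLN(nn.Module):
    """Conditional Layer Normalization: normalize h_j, with gain/bias
    predicted from the conditioning vector h_i (Equation 1)."""
    def __init__(self, dim):
        super().__init__()
        self.gain = nn.Linear(dim, dim)   # gamma_ij = W_g h_i + b_g
        self.bias = nn.Linear(dim, dim)   # lambda_ij = W_b h_i + b_b

    def forward(self, h):                 # h: (N, d_h)
        hi = h.unsqueeze(1)               # condition,  (N, 1, d_h)
        hj = h.unsqueeze(0)               # normalized, (1, N, d_h)
        mu = hj.mean(-1, keepdim=True)
        sigma = hj.std(-1, keepdim=True)
        # broadcasting yields the word-pair grid S of shape (N, N, d_h)
        return self.gain(hi) * (hj - mu) / (sigma + 1e-6) + self.bias(hi)
```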
The BERT-style grid representation concatenates the word-pair grid representation information S, the word-pair position (distance) information I^d and the grid region information I^t, in the manner of a BERT input representation, to obtain a position-region-aware representation of the grid; a multi-layer perceptron MLP then reduces the dimensionality, enriching the word-pair grid representation, as shown in Equation 2:

G = MLP([S, I^d, I^t])    (Equation 2)
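A sketch consistent with Equation 2; the embedding sizes and the construction of the distance and region indices are assumptions:

```python
import torch
import torch.nn as nn

class GridEnrich(nn.Module):
    """Concatenate the grid reps with distance and region embeddings,
    then reduce the dimensionality with an MLP (Equation 2)."""
    def __init__(self, d_h, d_emb=20, d_out=128, max_dist=512):
        super().__init__()
        self.dist = nn.Embedding(2 * max_dist, d_emb)  # relative position i - j
        self.region = nn.Embedding(3, d_emb)           # lower / diagonal / upper
        self.mlp = nn.Sequential(nn.Linear(d_h + 2 * d_emb, d_out), nn.GELU())
        self.max_dist = max_dist

    def forward(self, S):                              # S: (N, N, d_h)
        N = S.size(0)
        idx = torch.arange(N)
        d = (idx[:, None] - idx[None, :]).clamp(-self.max_dist,
                                                self.max_dist - 1) + self.max_dist
        t = torch.sign(idx[:, None] - idx[None, :]) + 1  # region id in {0, 1, 2}
        G = torch.cat([S, self.dist(d), self.region(t)], dim=-1)
        return self.mlp(G)                             # (N, N, d_out)
```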
The multi-granularity dilated convolution models the relations between words at different distances, as shown in Equation 3:

R_l = GELU(DC_l(G))    (Equation 3)

where l is the dilation rate (l = 1, 2, 3), GELU is the GELU activation function, and DC_l is the dilated convolution with rate l. The final word-pair grid representation is the concatenation of the three outputs, as shown in Equation 4:

R = [R_1, R_2, R_3] ∈ ℝ^(N×N×3d_G)    (Equation 4)

where d_G is the dimension of G.
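A sketch of the multi-granularity dilated convolution of Equations 3-4 (the channel sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultiGranularityDC(nn.Module):
    """Parallel 2-D convolutions with dilation rates 1, 2, 3 over the
    word-pair grid; the outputs are concatenated (Equations 3-4)."""
    def __init__(self, d_in=128, d_out=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(d_in, d_out, kernel_size=3, dilation=l, padding=l)
            for l in (1, 2, 3))  # padding=l keeps the grid size unchanged

    def forward(self, G):                    # G: (N, N, d_in)
        x = G.permute(2, 0, 1).unsqueeze(0)  # (1, d_in, N, N)
        R = [torch.nn.functional.gelu(c(x)) for c in self.convs]
        return torch.cat(R, dim=1)[0].permute(1, 2, 0)  # (N, N, 3*d_out)
```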
3.3. The joint prediction layer is divided into a biaffine prediction layer and an MLP prediction layer; it scores relations from different perspectives and jointly reasons over the relations among all word pairs. The biaffine branch mainly acts as a residual connection, taking its input from the Bi-LSTM. The relation score of the word pair (x_i, x_j) is then computed by the classifier, as shown in Equations 5-7:

m_i = MLP(h_i)    (Equation 5)
n_j = MLP(h_j)    (Equation 6)
y′_ij = m_i^T A n_j + B [m_i; n_j] + b    (Equation 7)

where A, B and b are trainable parameters, and m_i and n_j are the representations of the target word and of the other words in the same sentence.
The MLP prediction layer takes the word-pair grid representation R_ij from the convolution layer as input and computes a relation score with an MLP, as shown in Equation 8:

y″_ij = MLP(R_ij)    (Equation 8)

Finally, the outputs of the biaffine and MLP prediction layers are summed and fed into a Softmax function to compute the final relation score of the word pair (x_i, x_j), as shown in Equation 9:

y_ij = Softmax(y′_ij + y″_ij)    (Equation 9)
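A sketch of the joint prediction layer of Equations 5-9 (the dimensions and the number of relation types are assumptions):

```python
import torch
import torch.nn as nn

class JointPredictor(nn.Module):
    """Biaffine scores from the Bi-LSTM states plus MLP scores from the
    word-pair grid, summed and softmax-normalized (Equations 5-9)."""
    def __init__(self, d_h=512, d_biaf=128, d_grid=192, n_rel=4):
        super().__init__()
        self.mlp_m = nn.Sequential(nn.Linear(d_h, d_biaf), nn.GELU())
        self.mlp_n = nn.Sequential(nn.Linear(d_h, d_biaf), nn.GELU())
        self.A = nn.Parameter(torch.randn(n_rel, d_biaf, d_biaf) * 0.01)
        self.B = nn.Linear(2 * d_biaf, n_rel)          # includes the bias b
        self.mlp_grid = nn.Linear(d_grid, n_rel)

    def forward(self, h, R):              # h: (N, d_h), R: (N, N, d_grid)
        m, n = self.mlp_m(h), self.mlp_n(h)            # Equations 5-6
        biaf = torch.einsum("id,rde,je->ijr", m, self.A, n)  # m_i^T A n_j
        N = h.size(0)
        pair = torch.cat([m.unsqueeze(1).expand(N, N, -1),
                          n.unsqueeze(0).expand(N, N, -1)], dim=-1)
        y1 = biaf + self.B(pair)          # Equation 7
        y2 = self.mlp_grid(R)             # Equation 8
        return torch.softmax(y1 + y2, dim=-1)          # Equation 9
```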
3.4. The relations between words are regarded as a directed word graph; decoding with the NNW and THW mechanisms yields prediction probabilities, from which the predicted label sequence is finally obtained.
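As a rough illustration of the decoding idea only (the relation-id encoding below is an assumption, and full W2NER decoding also handles branching paths): NNW links chain the words of one segment together, and a THW link from the tail back to the head closes the segment and carries its label:

```python
def decode(y, id2label, NNW=1):
    """y[i][j]: predicted relation id for the word pair (i, j).
    THW relations are assumed to be encoded as ids >= 2, one per label."""
    spans = []
    n = len(y)
    for head in range(n):
        chain = [head]
        while True:
            tail = chain[-1]
            if y[tail][head] >= 2:                  # THW: tail points back to head
                spans.append((head, tail, id2label[y[tail][head]]))
            nxt = [j for j in range(tail + 1, n) if y[tail][j] == NNW]
            if not nxt:
                break
            chain.append(nxt[0])                    # follow one NNW link
    return spans
```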
3.5. To bring the output closer to the target label sequence of each sentence, only a log-likelihood loss function is set during training of the deep learning model, and the parameters are adjusted accordingly, as shown in Equation 10:

L = −(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{r∈R} ŷ^r_ij log y^r_ij    (Equation 10)

where ŷ^r_ij is the gold label indicating whether the word pair (x_i, x_j) holds the predefined relation r, R denotes the set of predefined relations, and y^r_ij is the predicted probability that (x_i, x_j) holds relation r.
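A sketch of this loss over the predicted relation grid (the shapes follow the sketches above):

```python
import torch

def grid_nll(probs, gold):
    """Negative log-likelihood over the N x N relation grid (Equation 10).
    probs: (N, N, n_rel) softmax output; gold: (N, N) long tensor of relation ids."""
    n = probs.size(0)
    logp = torch.log(probs.clamp_min(1e-12))
    return -logp.gather(-1, gold.unsqueeze(-1)).sum() / (n * n)
```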
4. Steps 2 and 3 are the common steps for training the model and for testing or actually applying it. In the training stage, the model undergoes multiple rounds of training and parameter adjustment on the augmented training set using the loss function of step 3.5; each round yields a model with different parameters, the optimal model is selected on the validation set of the target-language data set, and the model is finally evaluated on the test set of the target-language data set, or used to predict on target-language text in practical applications.
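A minimal sketch of this train/select/evaluate loop; the evaluate helper and the epoch count are assumptions:

```python
import copy

def train_select(model, optimizer, train_loader, dev_loader, epochs=20):
    """Multiple training rounds; keep the parameters that score best
    on the validation set of the target language."""
    best_score, best_state = float("-inf"), None
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)          # assumed to return the Equation-10 loss
            loss.backward()
            optimizer.step()
        score = evaluate(model, dev_loader)  # hypothetical F1/accuracy helper
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```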
Example two
This embodiment provides a method for labeling low-resource agglutinative language sequences. Taking the Uyghur morphological segmentation, named entity recognition and part-of-speech tagging tasks as examples, and with reference to FIG. 2, the three specific parts (data augmentation, CINO multilingual feature extraction, and generation of the predicted label sequence with the W2NER framework) are described in detail.
1. Data set
The Uyghur data sets used in the experiments come from the morphological segmentation data set THUUMS (whose text mainly comes from the Tianshan network), the named entity recognition data set WikiAnn, and the part-of-speech tagging data set Universal Dependencies; the data set of each task is divided into a training set, a validation set and a test set. THUUMS is labeled in units of characters with morpheme boundaries as the research target; the labels are b(egin), m(iddle), e(nd) and s(ingle), meaning the beginning, middle and ending characters of a morpheme and a single character that forms a morpheme by itself. WikiAnn is labeled in units of words with named entity extraction as the research target; the labels are O (none), LOC (location), PER (person) and ORG (organization). Universal Dependencies is labeled in units of words to distinguish parts of speech as the research target; the labels are NOUN (common noun), PUNCT (punctuation), ADP (adposition), NUM (numeral), SYM (symbol), SCONJ (subordinating conjunction), ADJ (adjective), PART (particle), DET (determiner), CCONJ (coordinating conjunction), PROPN (proper noun), PRON (pronoun), X (other), ADV (adverb), INTJ (interjection), VERB (verb) and AUX (auxiliary). The training and validation sets are used in the training stage of the deep learning model, in which parameters are adjusted automatically and the optimal version is selected; the test set is used to test the performance of the model (in actual application it can be replaced by any target text in the same language).
2. Data augmentation
Training-set data of the same tasks in other similar languages is added to the training set of the deep learning model. The augmentation data come from WikiAnn (Kazakh, Turkish, Azerbaijani, Kirghiz and Uzbek) and Universal Dependencies (Kazakh and Turkish). The Uyghur morphological segmentation task has sufficient data, so no augmentation is applied to that task. Augmentation provides more multilingual features under low-resource conditions; the model learns additional semantic information through the similarity between languages, alleviating the shortage of labeled data in the target language.
3. Features are extracted with the multilingual feature extractor, comprising the steps:
A1: segment the target text with the SentencePiece toolkit to obtain tokens;
A2: map each token to a real-valued vector according to the generated dictionary;
A3: obtain the preliminary multilingual word representations through a max-pooling mechanism.
4. The W2NER framework further extracts the word-sequence features; it mainly comprises an encoding layer, a convolution layer and a joint prediction layer.
In this embodiment characters and words are processed in the same way; the following description takes words as the example.
Encoding layer: the multilingual word representations obtained in step A3 are fed into the Bi-LSTM, and the final word representations are obtained from the context information of each sentence.
The convolution layer comprises the steps:
C1: expand the word representation from 2 dimensions to 3 dimensions through CLN to obtain the word-pair grid representation;
C2: through the BERT-style grid representation, concatenate the word-pair grid representation information, the word-pair position information and the grid region information in the manner of a BERT input representation, obtaining a position-region-aware representation of the grid;
C3: reduce the dimensionality with a multi-layer perceptron, enriching the word-pair grid representation;
C4: capture the interactions between words at different distances through multi-granularity dilated convolution.
Joint prediction layer: a joint prediction layer composed of a biaffine prediction layer and an MLP prediction layer scores relations from different perspectives and jointly reasons over the relations contained in all word pairs; the relations between words are regarded as a directed graph, decoding with the NNW and THW mechanisms yields prediction probabilities, and the predicted label sequence is finally obtained from these probabilities. This specifically comprises the steps:
D1: the biaffine predictor mainly acts as a residual connection; its input comes from the Bi-LSTM and its output is the word-pair relation score computed by the classifier;
D2: the MLP predictor computes the word-pair relation score from another perspective; its input is the word-pair grid representation from the convolution layer, and the score is computed by a multi-layer perceptron;
D3: the results of D1 and D2 are summed and fed into a Softmax function to compute the final word-pair relation score;
D4: decoding;
D5: the predicted label sequence is obtained on the basis of the deep abstract representation.
Step D4 comprises:
E1: the NNW mechanism computes the relations between entity words;
E2: the THW mechanism identifies the boundaries of each entity;
E3: the relations between words are regarded as a directed word graph, prediction is made through the NNW and THW mechanisms, and the predicted label sequence is finally obtained from the prediction probabilities.
5. In the training stage the model undergoes multiple rounds of training and parameter adjustment on the augmented training set with the log-likelihood loss function; each round yields a model with different parameters, the optimal model is selected on the validation set of the Uyghur data set, and the model is finally evaluated on the test set of the Uyghur data set.
For the method of this embodiment, the ablation experiment results are shown in Table 1, the example prediction effect in Table 2, and the example prediction results in Table 3.
TABLE 1 (ablation experiment results; reproduced only as an image in the original publication)
TABLE 2

Model           | Morphological segmentation (macro-F1) | Named entity recognition (micro-F1) | Part-of-speech tagging (accuracy)
mBERT-uncased   | 97.57                                  | 66.67                               | 83.85
XLM-R-Large     | 97.21                                  | 67.75                               | 88.81
CINO-Large-v2   | 97.53                                  | 71.15                               | 89.49
W2NER           | 97.57                                  | 44.44                               | 57.31
The invention   | 98.10                                  | 79.11                               | 91.00
TABLE 3 (example prediction results; reproduced only as an image in the original publication)
It can be seen that the method for labeling low-resource agglutinative language sequences of this embodiment makes full use of the multilingual features and of the relations between characters or between words, and its prediction performance is clearly superior to that of the existing models. Removing any component causes a significant drop in performance.

Claims (7)

1. A method for labeling low-resource agglutinative language sequences, characterized by comprising the following steps:
performing data augmentation: adding training-set data of the same task in other similar languages to the training set of the deep learning model;
preliminarily extracting the multilingual features of the target text using the Chinese minority pre-trained language model;
using the W2NER framework to further extract and predict the multilingual features of the target text, obtaining the relations between characters or between words and generating a predicted label sequence.
2. The method for labeling low-resource agglutinative language sequences according to claim 1, wherein the Chinese minority pre-trained language model is used only for character representation or word representation without adjusting the model parameters, and preliminarily extracting the multilingual features with the Chinese minority pre-trained language model comprises:
segmenting the target text with the SentencePiece toolkit, representing each input sentence by a number of tokens, mapping each token to a real-valued vector according to the generated dictionary, and then obtaining a preliminary multilingual character representation or word representation for each input sentence X = {x_1, x_2, ..., x_N} through a max-pooling mechanism, where x_i (i = 1, 2, ..., N) denotes the i-th character or word and N is the number of characters or words in the input sentence.
3. The method for labeling low-resource agglutinative language sequences according to claim 1, wherein using the W2NER framework to understand the relations between characters or between words and to generate the predicted label sequence comprises:
acquiring context information from the input sentence with a bidirectional long short-term memory network to obtain the final multilingual character representation or word representation;
passing the character or word representation into a convolution layer, generating a character-pair or word-pair representation through conditional layer normalization, a BERT-style grid representation, a multi-layer perceptron and multi-granularity dilated convolution, and further generating and refining the character-pair or word-pair grid representation for the subsequent classification of relations between characters or between words;
adopting a joint prediction layer composed of a biaffine prediction layer and an MLP prediction layer to score relations from different perspectives and to jointly reason over the relations contained in all character pairs or word pairs;
treating the relations between characters or between words as a directed graph, decoding with the NNW and THW mechanisms to obtain prediction probabilities, and finally obtaining the predicted label sequence from these probabilities.
4. The method for labeling low-resource agglutinative language sequences according to claim 3, wherein the conditional layer normalization expands the character or word representation from 2 dimensions to 3 dimensions, yielding a character-pair or word-pair grid representation.
5. The method for labeling low-resource agglutinative language sequences according to claim 3, wherein the BERT-style grid representation concatenates the character-pair or word-pair grid representation information, the character-pair or word-pair position information and the grid region information as the BERT input representation to obtain a position-region-aware representation of the grid, and a multi-layer perceptron then reduces the dimensionality to enrich the character-pair or word-pair grid representation.
6. The method for labeling low-resource agglutinative language sequences according to claim 3, wherein the multi-granularity dilated convolution captures interactions between characters or words at different distances.
7. The method for labeling low-resource agglutinative language sequences according to claim 3, wherein the input of the biaffine prediction layer comes from the bidirectional long short-term memory network and its output is a character-pair or word-pair relation score computed by a classifier; the input of the MLP prediction layer comes from the character-pair or word-pair grid representation of the convolution layer, and the relation score is computed by a multi-layer perceptron; the outputs of the biaffine and MLP prediction layers are summed and fed into a Softmax function to compute the final character-pair or word-pair relation score.
CN202211612122.7A 2022-12-14 2022-12-14 Method for labeling low-resource agglutinative language sequences Pending CN115935959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211612122.7A CN115935959A (en) Method for labeling low-resource agglutinative language sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211612122.7A CN115935959A (en) Method for labeling low-resource agglutinative language sequences

Publications (1)

Publication Number Publication Date
CN115935959A true CN115935959A (en) 2023-04-07

Family

ID=86551807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211612122.7A Pending CN115935959A (en) Method for labeling low-resource agglutinative language sequences

Country Status (1)

Country Link
CN (1) CN115935959A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738345A (en) * 2023-08-15 2023-09-12 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116738345B (en) * 2023-08-15 2024-03-01 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116977436A (en) * 2023-09-21 2023-10-31 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics
CN116977436B (en) * 2023-09-21 2023-12-05 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination