CN113901791B - Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Info

Publication number: CN113901791B
Authority: CN (China)
Prior art keywords: words, mixup, data, synonyms, synonym
Legal status: Active
Application number: CN202111078682.4A
Other languages: Chinese (zh)
Other versions: CN113901791A
Inventors: 线岩团 (Xian Yantuan), 高凡雅 (Gao Fanya), 余正涛 (Yu Zhengtao), 相艳 (Xiang Yan)
Assignee (original and current): Kunming University of Science and Technology


Classifications

    • G06F40/211 Natural language analysis; Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/242 Lexical tools; Dictionaries
    • G06F40/247 Lexical tools; Thesauruses; Synonyms
    • G06N3/044 Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a dependency parsing method that fuses multiple data augmentation strategies under low-resource conditions, and belongs to the field of natural language processing. The method comprises the following steps: constructing same-part-of-speech synonym dictionaries for Thai, Vietnamese, and English; expanding the training data by synonym substitution on the small-scale UD (Universal Dependencies treebanks) data sets of the three languages using the synonym dictionaries; and, at different stages of model training, applying several mixup data augmentation strategies that interpolate original words in the training data with their synonyms to generate virtual new words for subsequent training. The invention provides multiple data augmentation strategies for the problem of low-resource dependency parsing. Synonym replacement effectively expands the training data and alleviates the unknown-word problem, while the mixup strategies effectively mitigate model overfitting and improve the generalization ability of the model.

Description

Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions
Technical Field
The invention relates to a dependency parsing method that fuses multiple data augmentation strategies under low-resource conditions, and belongs to the field of natural language processing.
Background
In natural language processing, dependency parsing aims to identify the syntactic dependency relations between the words of a sentence. Dependency syntax provides syntactic features for tasks such as information extraction, question answering, and machine translation, improving the performance of those models.
Existing dependency parsing methods have devoted considerable research to feature encoding, dependency-relation scoring, and decoding, and have effectively improved parsing quality. Under low-resource conditions, however, existing models and methods struggle to produce good results. The problem is particularly evident for low-resource languages such as Thai and Vietnamese: the lack of corpora causes severe unknown-word and model-overfitting problems. Taking the Vietnamese UD (Universal Dependencies) data set as an example, the unknown-word proportion of the test set is 51.7%. We observe that under low-resource conditions the model overfits easily, producing a large gap between training accuracy and test accuracy.
Disclosure of Invention
The invention provides a dependency parsing method that fuses multi-strategy data augmentation under low-resource conditions. It targets dependency parsing for Thai, Vietnamese, small-scale English, and similar low-resource settings, and addresses the poor parsing results caused by scarce corpora, an excessively high unknown-word proportion, and model overfitting.
The technical scheme of the invention is as follows. The dependency parsing method fusing multi-strategy data augmentation under low-resource conditions comprises the following specific steps:
Step1, process the dependency parsing data obtained from the Thai, Vietnamese, and small-scale English corpora of the UD data set, obtain synonym information for the words of the three languages from the BabelNet website, and construct a same-part-of-speech synonym dictionary from this information.
Step2, according to the constructed synonym dictionary, augment the corpora in the UD data set by direct synonym replacement, obtaining expanded dependency-parsing training data for the three languages.
Step3, obtain the same-part-of-speech synonyms of the words in the training data from the constructed synonym dictionary; apply several mixup data augmentation modes at different positions in the biaffine model (after the Embedding stage, after the BiLSTM stage, and after the MLP layer) to interpolate the words in the training data with their synonyms, generating virtual new words; and use the virtual new words for training and for scoring by the scorer.
The specific steps of Step1 are as follows:
Step1.1, collect the words in the training data of the three languages (Thai, Vietnamese, and English) and, for each word, obtain the corresponding synonym information, including synonyms and their parts of speech, from the BabelNet website.
Step1.2, filter and screen the synonym information as follows: (1) group the obtained synonyms by part of speech, ensuring that in the subsequent dictionary-based replacement the substituted word keeps the part of speech of the original; (2) delete from the synonym dictionary those entries whose meanings are not sufficiently close to the original word.
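A minimal sketch of this filtering step, assuming the raw synonym records have already been fetched from BabelNet; the (word, pos, synonym, synonym_pos) record layout and the function name are illustrative assumptions, and the manual screening of near-synonyms in (2) is assumed to happen upstream:

```python
# Sketch of Step1.2(1): keep only same-part-of-speech synonyms.
from collections import defaultdict

def build_synonym_dict(records):
    """records: iterable of (word, pos, synonym, synonym_pos) tuples."""
    syn_dict = defaultdict(set)
    for word, pos, synonym, syn_pos in records:
        if synonym == word:
            continue          # a word cannot replace itself
        if syn_pos != pos:
            continue          # (1) enforce identical part of speech
        syn_dict[(word, pos)].add(synonym)
    return {key: sorted(vals) for key, vals in syn_dict.items()}

records = [("happy", "ADJ", "glad", "ADJ"),
           ("happy", "ADJ", "luck", "NOUN")]   # dropped: POS differs
print(build_synonym_dict(records))             # {('happy', 'ADJ'): ['glad']}
```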
The specific steps of Step2 are as follows:
Step2.1, traverse the words in the training data and match them against the synonym dictionary.
Step2.2, screen the matched synonyms by part of speech and directly replace the original words with them; if several words in one sentence have synonyms, replace all of them. The augmented training data is then used as a new data set for training.
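The replacement pass could look like the following sketch, assuming each training sentence is a list of (form, POS) tokens whose dependency heads and labels are stored alongside and copied unchanged; `augment_sentence` and the first-synonym choice are illustrative, not the patent's exact procedure:

```python
# Sketch of Step2: direct synonym replacement over one sentence.
def augment_sentence(tokens, syn_dict):
    """tokens: list of (form, pos). Returns a synonym-substituted copy,
    or None if the sentence has no replaceable word. The dependency tree
    is reused as-is, since only surface forms change."""
    out, changed = [], False
    for form, pos in tokens:
        synonyms = syn_dict.get((form, pos))
        if synonyms:
            out.append((synonyms[0], pos))   # every matched word is replaced
            changed = True
        else:
            out.append((form, pos))
    return out if changed else None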
As a preferred scheme of the invention, the specific steps of Step3 are as follows:
Step3.1, the concatenation of the word vector and the part-of-speech tag vector is used as the model input. The input vector of the original word is

x_i = e(w_i) ⊕ e(t_i)

and the input vector of the corresponding synonym is

x_i′ = e(d_i) ⊕ e(t_i)

where e(w_i) and e(d_i) are the original-word vector and the synonym vector respectively, and e(t_i) is the part-of-speech tag vector.
Step3.2, applying mixup after the Embedding stage (E_mixup).
Step3.2.1, after the Embedding process the original word yields x_orig and the synonym yields x_syn, and new training data is obtained through the mixup process:

x̃ = λ·w_1 + (1 - λ)·w_2

where w_1 and w_2 denote x_orig and x_syn respectively, and x̃ is the newly obtained virtual training input. λ is a parameter drawn from the Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞), λ ∈ [0, 1].

Step3.2.2, for words without synonyms, both w_1 and w_2 use x_orig, so the training data is generated through the same mixup process.
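A minimal PyTorch sketch of this Embedding-stage mixup under the definitions above; the tensor shapes, the `has_syn` mask, and the default α are assumptions made for illustration:

```python
# Sketch of Step3.2: mixup of original and synonym embeddings (E_mixup).
import torch
from torch.distributions import Beta

def e_mixup(x_orig, x_syn, has_syn, alpha=0.5):
    """x_orig, x_syn: (batch, seq_len, dim) concatenated word+POS embeddings.
    has_syn: (batch, seq_len) bool mask of positions that have a synonym."""
    lam = Beta(alpha, alpha).sample()          # lambda ~ Beta(alpha, alpha)
    # Step3.2.2: words without a synonym interpolate with themselves,
    # so they pass through the same mixup process unchanged.
    x_syn = torch.where(has_syn.unsqueeze(-1), x_syn, x_orig)
    return lam * x_orig + (1.0 - lam) * x_syn  # virtual training input
```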
Step3.2.3, the new training data x̃ is passed through the BiLSTM to obtain the contextual feature r_i, so that each input element can attend to its context.

Step3.2.4, r_i is passed through two different multi-layer perceptrons (MLPs) for dimension reduction, yielding the features h_i^(arc-dep) and h_i^(arc-head):

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i)
Step3.2.5, h^(arc-dep) and h^(arc-head) are fed into the biaffine scorer to obtain the score matrix:

S^(arc) = H^(arc-head) · U^(1) · (H^(arc-dep))ᵀ + H^(arc-head) · u^(2)

where each matrix H stacks the feature vectors h re-encoded by the MLPs, and S^(arc) is the score matrix.
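The pipeline of Step3.2.3 to Step3.2.5 can be sketched as follows, following the deep biaffine attention formulation of the Dozat et al. paper cited below; layer sizes and module names are illustrative:

```python
# Sketch of the BiLSTM -> MLP -> biaffine arc-scoring pipeline.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, in_dim=100, lstm_dim=200, arc_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, lstm_dim, batch_first=True,
                              bidirectional=True)
        self.mlp_dep = nn.Sequential(nn.Linear(2 * lstm_dim, arc_dim), nn.ReLU())
        self.mlp_head = nn.Sequential(nn.Linear(2 * lstm_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(arc_dim, arc_dim))   # bilinear term
        self.u = nn.Parameter(torch.zeros(arc_dim))            # head bias term

    def forward(self, x):                  # x: (batch, seq_len, in_dim)
        r, _ = self.bilstm(x)              # contextual feature r_i (Step3.2.3)
        h_dep = self.mlp_dep(r)            # h^(arc-dep)  (Step3.2.4)
        h_head = self.mlp_head(r)          # h^(arc-head) (Step3.2.4)
        # S[b, i, j]: score of word j being the head of word i (Step3.2.5)
        return (torch.einsum("bid,de,bje->bij", h_dep, self.U, h_head)
                + (h_head @ self.u).unsqueeze(1))

scores = BiaffineArcScorer()(torch.randn(2, 7, 100))   # shape (2, 7, 7)
```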
Step3.3, applying mixup after the BiLSTM stage (B_mixup).
Step3.3.1, the original word and its synonym are both passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual feature r_i for x_orig and r_i′ for x_syn. The two then pass through the mixup process to obtain new training features:

r̃_i = λ_i·w_1 + (1 - λ_i)·w_2

where w_1 and w_2 denote r_i and r_i′ respectively, and r̃_i is the newly obtained virtual feature. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.3.2, for words without synonyms, both w_1 and w_2 use r_i, so the features are generated through the same mixup process.
Step3.3.3, with the new features r̃_i obtained, the subsequent processing is the same as Step3.2.4 and Step3.2.5.
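A sketch of B_mixup under the same assumed shapes as the E_mixup sketch above; the differences are that the interpolation acts on the BiLSTM outputs and that each feature pair draws its own λ_i:

```python
# Sketch of Step3.3: mixup of contextual features (B_mixup).
import torch
from torch.distributions import Beta

def b_mixup(r, r_syn, has_syn, alpha=0.5):
    """r, r_syn: (batch, seq_len, dim) BiLSTM outputs for the original
    sentence and its synonym-substituted copy."""
    seq_len = r.size(1)
    lam = Beta(alpha, alpha).sample((seq_len, 1))         # one lambda_i per pair
    r_syn = torch.where(has_syn.unsqueeze(-1), r_syn, r)  # Step3.3.2 fallback
    return lam * r + (1.0 - lam) * r_syn
```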
Step3.4, applying mixup after the MLP stage (M_mixup).
Step3.4.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual features r_i and r_i′. Both are then passed through the two dimension-reducing MLPs: the original word yields the features h_i^(arc-dep) and h_i^(arc-head), and the synonym yields h_i′^(arc-dep) and h_i′^(arc-head). Finally, mixup at this stage produces the new features that enter the scorer:

h̃_i^(arc-dep) = λ_i·h_i^(arc-dep) + (1 - λ_i)·h_i′^(arc-dep)
h̃_i^(arc-head) = λ_i·h_i^(arc-head) + (1 - λ_i)·h_i′^(arc-head)

where the newly obtained h̃_i^(arc-dep) and h̃_i^(arc-head) are the new virtual features. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.4.2, for words without synonyms, h_i′^(arc-dep) is kept equal to h_i^(arc-dep) and h_i′^(arc-head) is kept equal to h_i^(arc-head).
Step3.4.3, the new training data is obtained and scored in the same way as Step3.2.5.
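A sketch of M_mixup with the same assumed shapes; a single λ_i is shared by the arc-dep and arc-head features of a given word/synonym pair, keeping the two interpolated views consistent:

```python
# Sketch of Step3.4: mixup of MLP features (M_mixup).
import torch
from torch.distributions import Beta

def m_mixup(h_dep, h_dep_s, h_head, h_head_s, has_syn, alpha=0.5):
    """h_*: (batch, seq_len, arc_dim) MLP outputs for the original word (h)
    and its synonym (h_*_s); returns the virtual features fed to the scorer."""
    lam = Beta(alpha, alpha).sample((h_dep.size(1), 1))   # lambda_i per pair
    mask = has_syn.unsqueeze(-1)
    h_dep_s = torch.where(mask, h_dep_s, h_dep)           # Step3.4.2 fallback
    h_head_s = torch.where(mask, h_head_s, h_head)
    return (lam * h_dep + (1.0 - lam) * h_dep_s,
            lam * h_head + (1.0 - lam) * h_head_s)
```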
Step3.5, a further mixup strategy is pairwise combination: any two of E_mixup, B_mixup, and M_mixup are applied together, each in the mode described above; a sketch of this wiring follows below.
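One way the pairwise combination could be wired into a single forward pass, reusing the e_mixup, b_mixup, and m_mixup sketches above; the flag names and the caller-supplied `embed`, `bilstm`, `mlps`, and `score` callables are assumptions, not interfaces defined by the patent:

```python
# Sketch of Step3.5: enabling any two of E_mixup, B_mixup, M_mixup together.
def forward_with_mixup(words, syns, has_syn, embed, bilstm, mlps, score,
                       use_e=False, use_b=False, use_m=False, alpha=0.5):
    x, x_s = embed(words), embed(syns)            # Embedding stage
    if use_e:
        x = x_s = e_mixup(x, x_s, has_syn, alpha)
    r, r_s = bilstm(x), bilstm(x_s)               # BiLSTM stage
    if use_b:
        r = r_s = b_mixup(r, r_s, has_syn, alpha)
    (hd, hh), (hd_s, hh_s) = mlps(r), mlps(r_s)   # two dimension-reducing MLPs
    if use_m:
        hd, hh = m_mixup(hd, hd_s, hh, hh_s, has_syn, alpha)
    return score(hd, hh)                          # biaffine score matrix

# e.g. the E_mixup + B_mixup pairing: use_e=True, use_b=True, use_m=False
```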
The invention has the beneficial effects that:
(1) The data augmentation method based on same-part-of-speech synonym-dictionary replacement effectively alleviates the problems of a high unknown-word proportion and scarce data, and expands the training data.
(2) The several mixup data augmentation strategies effectively alleviate model overfitting and improve the generalization ability of the model.
Drawings
FIG. 1 shows the baseline model used by the invention;
FIG. 2 shows the structure of the direct synonym-replacement method;
FIG. 3 is a flow chart of the low-resource dependency parsing method based on the several mixup data augmentation modes.
Detailed Description
Example 1: as shown in FIG. 1, FIG. 2, and FIG. 3, the dependency parsing method fusing multi-strategy data augmentation under low-resource conditions comprises the following specific steps:
Step1, process the dependency parsing data obtained from the Thai, Vietnamese, and small-scale English corpora of the UD data set, obtain synonym information for the words of the three languages from the BabelNet website, and construct a same-part-of-speech synonym dictionary from this information.
Step1.1, collect the words in the training data of the three languages (Thai, Vietnamese, and English) and, for each word, obtain the corresponding synonym information, including synonyms and their parts of speech, from the BabelNet website.
Step1.2, filter and screen the synonym information as follows: (1) group the obtained synonyms by part of speech, ensuring that in the subsequent dictionary-based replacement the substituted word keeps the part of speech of the original; (2) delete from the synonym dictionary those entries whose meanings are not sufficiently close to the original word.
Step2, according to the constructed synonym dictionary, augment the corpora in the UD data set by direct synonym replacement, obtaining expanded dependency-parsing training data for the three languages.
Step2.1, traverse the words in the training data and match them against the synonym dictionary.
Step2.2, screen the matched synonyms by part of speech and directly replace the original words with them; if several words in one sentence have synonyms, replace all of them. The augmented training data is then used as a new data set for training. The distribution of the data before and after expansion is shown in Table 1:
TABLE 1 Thai, Vietnamese and English dependency parsing data information
(Table 1 is published as an image in the original document; its contents are not reproduced here.)
Step3, obtain the same-part-of-speech synonyms of the words in the training data from the constructed synonym dictionary; apply several mixup data augmentation modes at different positions in the biaffine model (after the Embedding stage, after the BiLSTM stage, and after the MLP layer) to interpolate the words in the training data with their synonyms, generating virtual new words; and use the virtual new words for training and for scoring by the scorer.
Step3.1, the concatenation of the word vector and the part-of-speech tag vector is used as the model input. The input vector of the original word is

x_i = e(w_i) ⊕ e(t_i)

and the input vector of the corresponding synonym is

x_i′ = e(d_i) ⊕ e(t_i)

where e(w_i) and e(d_i) are the original-word vector and the synonym vector respectively, and e(t_i) is the part-of-speech tag vector.
Step3.2, applying mixup after the Embedding stage (E_mixup).
Step3.2.1, after the Embedding process the original word yields x_orig and the synonym yields x_syn, and new training data is obtained through the mixup process:

x̃ = λ·w_1 + (1 - λ)·w_2

where w_1 and w_2 denote x_orig and x_syn respectively, and x̃ is the newly obtained virtual training input. λ is a parameter drawn from the Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞), λ ∈ [0, 1].

Step3.2.2, for words without synonyms, both w_1 and w_2 use x_orig, so the training data is generated through the same mixup process.
Step3.2.3, the new training data x̃ is passed through the BiLSTM to obtain the contextual feature r_i, so that each input element can attend to its context.

Step3.2.4, r_i is passed through two different multi-layer perceptrons (MLPs) for dimension reduction, yielding the features h_i^(arc-dep) and h_i^(arc-head):

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i)
Step3.2.5, h^(arc-dep) and h^(arc-head) are fed into the biaffine scorer to obtain the score matrix:

S^(arc) = H^(arc-head) · U^(1) · (H^(arc-dep))ᵀ + H^(arc-head) · u^(2)

where each matrix H stacks the feature vectors h re-encoded by the MLPs, and S^(arc) is the score matrix.
Step3.3, applying mixup after the BiLSTM stage (B_mixup).
Step3.3.1, the original word and its synonym are both passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual feature r_i for x_orig and r_i′ for x_syn. The two then pass through the mixup process to obtain new training features:

r̃_i = λ_i·w_1 + (1 - λ_i)·w_2

where w_1 and w_2 denote r_i and r_i′ respectively, and r̃_i is the newly obtained virtual feature. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.3.2, for words without synonyms, both w_1 and w_2 use r_i, so the features are generated through the same mixup process.
Step3.3.3, with the new features r̃_i obtained, the subsequent processing is the same as Step3.2.4 and Step3.2.5.
Step3.4, applying mixup after the MLP stage (M_mixup).
Step3.4.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual features r_i and r_i′. Both are then passed through the two dimension-reducing MLPs: the original word yields the features h_i^(arc-dep) and h_i^(arc-head), and the synonym yields h_i′^(arc-dep) and h_i′^(arc-head). Finally, mixup at this stage produces the new features that enter the scorer:

h̃_i^(arc-dep) = λ_i·h_i^(arc-dep) + (1 - λ_i)·h_i′^(arc-dep)
h̃_i^(arc-head) = λ_i·h_i^(arc-head) + (1 - λ_i)·h_i′^(arc-head)

where the newly obtained h̃_i^(arc-dep) and h̃_i^(arc-head) are the new virtual features. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.4.2, for words without synonyms, h_i′^(arc-dep) is kept equal to h_i^(arc-dep) and h_i′^(arc-head) is kept equal to h_i^(arc-head).
Step3.4.3, the new training data is obtained and scored in the same way as Step3.2.5.
Step3.5, a further mixup strategy is pairwise combination: any two of E_mixup, B_mixup, and M_mixup are applied together, each in the mode described above.
To illustrate the effect of the invention, two groups of comparative experiments were set up. The first group verifies the effectiveness of the direct synonym-replacement method, and the second group verifies the effectiveness of the several mixup data augmentation methods.
(1) Effectiveness of the direct synonym-replacement method
The baseline model was first trained directly on the small-scale data set; its input was then replaced with the synonym-substituted, expanded version of the data set, and the results were compared. The experimental results are shown in Table 2.
TABLE 2 Results of direct synonym replacement in different languages
(Table 2 is published as an image in the original document; its contents are not reproduced here.)
As Table 2 shows, for Thai the direct synonym replacement of the method improves results on both the validation set and the test set: more than 3 points on the validation set and about 2 points on the test set. For English, the method improves by about 1 point on both the validation and test sets. Direct replacement based on a same-part-of-speech synonym dictionary is thus a comparatively simple and effective data augmentation mode, and it can effectively alleviate the unknown-word problem.
(2) Effectiveness of the several mixup data augmentation methods
The mixup data augmentation methods were applied at different training stages of the baseline model. The experimental results are shown in Table 3:
TABLE 3 Experimental results for different languages and different mixup strategies
(Table 3 is published as an image in the original document; its contents are not reproduced here.)
As Table 3 shows, the different mixup strategies bring performance improvements of different degrees across languages. For Thai, single-stage mixup outperforms the combination strategies on the validation set, while the combination strategies are better on the test set. For Vietnamese and English, the combination strategies perform comparably to single-stage mixup. This indirectly reflects that mixup can smooth the model and effectively improve generalization, but applying several mixup methods does not necessarily stack their gains. The invention also tried applying mixup at all three stages of the model simultaneously, and the effect did not improve further, which confirms the same point. Overall, on both the validation and test sets the improvement on UAS is larger than on LAS, indicating that this data augmentation mode tends to help arc prediction. Comparing across languages, the synonym dictionaries differ in size and quality and the expanded corpora differ in scale, so the observed gains differ as well.
The experimental data demonstrate that the synonym-dictionary replacement method and the several mixup data augmentation methods can effectively alleviate the unknown-word problem, expand the training data, mitigate model overfitting, smooth the model, and improve its generalization ability. The experiments show that the proposed method achieves a clear improvement over the baseline model. For the dependency parsing task under low-resource conditions, the proposed fusion of multi-strategy data augmentation is effective.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A dependency parsing method fusing multi-strategy data augmentation under low-resource conditions, characterized by comprising the following specific steps:
Step1, processing the obtained dependency parsing data, obtaining synonym information for words in several different languages, and constructing a same-part-of-speech synonym dictionary from the synonym information;
Step2, according to the constructed synonym dictionary, augmenting the corpora in the data set by direct synonym replacement to obtain expanded dependency-parsing training data for the several different languages;
Step3, obtaining the same-part-of-speech synonyms of the words in the training data from the constructed synonym dictionary, applying several mixup data augmentation modes at different positions of a biaffine model, after the Embedding stage, the BiLSTM stage, or the MLP stage, to interpolate the words in the training data with their synonyms and generate virtual new words, and using the virtual new words for training and for scoring by a scorer;
Step3 comprises the following steps:
the concatenation of the word vector and the part-of-speech tag vector is adopted as the model input; the input vector of the original word is

x_i = e(w_i) ⊕ e(t_i)

and the input vector of the corresponding synonym is

x_i′ = e(d_i) ⊕ e(t_i)

where e(w_i) and e(d_i) are the original-word vector and the synonym vector respectively, and e(t_i) is the part-of-speech tag vector;
applying mixup after the Embedding stage;
applying mixup after the BiLSTM stage;
applying mixup after the MLP stage;
the specific steps for applying mixup after the Embedding stage are as follows:
Step3.2.1, after the Embedding process, x_orig and x_syn are obtained for the original word and its synonym, and the two pass through the mixup process to obtain new training data:

x̃ = λ·w_1 + (1 - λ)·w_2

where w_1 and w_2 denote x_orig and x_syn respectively, x̃ is the newly obtained virtual training input, and λ is a parameter drawn from the Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞), λ ∈ [0, 1];
Step3.2.2, for words without synonyms, both w_1 and w_2 use x_orig, so the training data is generated through the same mixup process;
Step3.2.3, the new training data x̃ is passed through the BiLSTM to obtain the contextual feature r_i, enabling each input element to attend to its context;
Step3.2.4, r_i is passed through two different multi-layer perceptrons (MLPs) for dimension reduction, yielding the features h_i^(arc-dep) and h_i^(arc-head):

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i);
Step3.2.5, h^(arc-dep) and h^(arc-head) are fed into the biaffine scorer to obtain the score matrix:

S^(arc) = H^(arc-head) · U^(1) · (H^(arc-dep))ᵀ + H^(arc-head) · u^(2)

where each matrix H stacks the feature vectors h re-encoded by the MLPs, and S^(arc) is the score matrix;
the specific steps for applying mixup after the BiLSTM stage are as follows:
Step3.3.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual feature r_i for x_orig and r_i′ for x_syn; the two then pass through the mixup process to obtain new training features:

r̃_i = λ_i·w_1 + (1 - λ_i)·w_2

where w_1 and w_2 denote r_i and r_i′ respectively, r̃_i is the newly obtained virtual feature, and λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], each pair of features participating in mixup being assigned its own λ_i;
Step3.3.2, for words without synonyms, both w_1 and w_2 use r_i, so the features are generated through the same mixup process;
Step3.3.3, with the new features r̃_i obtained, the subsequent processing is the same as Step3.2.4 and Step3.2.5;
the specific steps for applying mixup after the MLP stage are as follows:
Step3.4.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual features r_i and r_i′; both are then passed through the two dimension-reducing MLPs, the original word yielding the features h_i^(arc-dep) and h_i^(arc-head) and the synonym yielding h_i′^(arc-dep) and h_i′^(arc-head); finally, mixup at this stage produces the new features that enter the scorer:

h̃_i^(arc-dep) = λ_i·h_i^(arc-dep) + (1 - λ_i)·h_i′^(arc-dep)
h̃_i^(arc-head) = λ_i·h_i^(arc-head) + (1 - λ_i)·h_i′^(arc-head)

where the newly obtained h̃_i^(arc-dep) and h̃_i^(arc-head) are the new virtual features, and λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], each pair of features participating in mixup being assigned its own λ_i;
Step3.4.2, for words without synonyms, h_i′^(arc-dep) is kept equal to h_i^(arc-dep) and h_i′^(arc-head) is kept equal to h_i^(arc-head);
Step3.4.3, the new training data is obtained and scored in the same way as Step3.2.5.
2. The dependency parsing method fusing multi-strategy data augmentation under low-resource conditions according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, collecting the words in the training data of the three languages, Thai, Vietnamese, and English, and obtaining for each word the corresponding synonym information, including synonyms and their parts of speech, from the BabelNet website;
Step1.2, filtering and screening the synonym information as follows: (1) grouping the obtained synonyms by part of speech, ensuring that in the subsequent dictionary-based replacement the substituted word keeps the part of speech of the original; (2) deleting from the synonym dictionary those entries whose meanings are not sufficiently close to the original word.
3. The dependency parsing method fusing multi-strategy data augmentation under low-resource conditions according to claim 1, characterized in that the specific steps of Step2 are as follows:
Step2.1, traversing the words in the training data and matching them against the synonym dictionary;
Step2.2, screening the matched synonyms by part of speech, directly replacing the original words with them, replacing several words when several words in one sentence have synonyms, and training with the expanded training data as a new data set.
CN202111078682.4A 2021-09-15 2021-09-15 Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions Active CN113901791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111078682.4A CN113901791B (en) Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111078682.4A CN113901791B (en) Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Publications (2)

Publication Number Publication Date
CN113901791A (en) 2022-01-07
CN113901791B (en) 2022-09-23

Family

ID=79028490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111078682.4A Active CN113901791B (en) Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Country Status (1)

Country Link
CN (1) CN113901791B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114611487B (en) * 2022-03-10 2022-12-13 昆明理工大学 Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737974A (en) * 2020-08-18 2020-10-02 北京擎盾信息科技有限公司 Semantic abstract representation method and device for statement
CN112069799A (en) * 2020-09-14 2020-12-11 深圳前海微众银行股份有限公司 Dependency syntax based data enhancement method, apparatus and readable storage medium
CN112232024A (en) * 2020-10-13 2021-01-15 苏州大学 Dependency syntax analysis model training method and device based on multi-labeled data
CN112765956A (en) * 2021-01-22 2021-05-07 大连民族大学 Dependency syntax analysis method based on multi-task learning and application
CN112860781A (en) * 2021-02-05 2021-05-28 陈永朝 Mining and displaying method combining vocabulary collocation extraction and semantic classification
CN112699665A (en) * 2021-03-25 2021-04-23 北京智源人工智能研究院 Triple extraction method and device of safety report text and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dozat, Timothy, et al. "Deep Biaffine Attention for Neural Dependency Parsing." arXiv, 2017-03-10, pp. 1-8. *
Li Ying, et al. "Research on Converting Vietnamese Phrase-Structure Trees to Dependency Trees" (越南语短语树到依存树的转换研究). Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2016-04-19, pp. 1-9. *

Also Published As

Publication number Publication date
CN113901791A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN110532554B (en) Chinese abstract generation method, system and storage medium
Zhao et al. Semi-supervised text simplification with back-translation and asymmetric denoising autoencoders
Hansen et al. The Copenhagen Team Participation in the Check-Worthiness Task of the Competition of Automatic Identification and Verification of Claims in Political Debates of the CLEF-2018 CheckThat! Lab.
WO2022088570A1 (en) Method and apparatus for post-editing of translation, electronic device, and storage medium
Vakili Tahami et al. Distilling knowledge for fast retrieval-based chat-bots
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN111310470A (en) Chinese named entity recognition method fusing word and word features
CN112926337B (en) End-to-end aspect level emotion analysis method combined with reconstructed syntax information
Fei et al. CQG: A simple and effective controlled generation framework for multi-hop question generation
CN113901791B (en) Enhanced dependency syntax analysis method for fusing multi-strategy data under low-resource condition
CN115048936A (en) Method for extracting aspect-level emotion triple fused with part-of-speech information
CN115114940A (en) Machine translation style migration method and system based on curriculum pre-training
CN113204978A (en) Machine translation enhancement training method and system
CN107992479A (en) Word rank Chinese Text Chunking method based on transfer method
CN117218503A (en) Cross-Han language news text summarization method integrating image information
Cai et al. Revisiting pivot-based paraphrase generation: Language is not the only optional pivot
Gain et al. Low resource chat translation: A benchmark for Hindi–English language pair
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism
Guo et al. The HW-TSC’s Simultaneous Speech-to-Text Translation system for IWSLT 2023 evaluation
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN115374258A (en) Knowledge base query method and system combining semantic understanding with question template
Mi et al. Recurrent neural network based loanwords identification in Uyghur
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
Zhang et al. Language-agnostic and language-aware multilingual natural language understanding for large-scale intelligent voice assistant application

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant