CN113901791B - Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Info

Publication number: CN113901791B
Authority: CN (China)
Prior art keywords: words, mixup, data, synonyms, synonym
Legal status: Active
Application number: CN202111078682.4A
Other languages: Chinese (zh)
Other versions: CN113901791A
Inventors: 线岩团 (Xian Yantuan), 高凡雅 (Gao Fanya), 余正涛 (Yu Zhengtao), 相艳 (Xiang Yan)
Assignee (original and current): Kunming University of Science and Technology


Classifications

    • G06F40/211 Natural language analysis; Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/242 Lexical tools; Dictionaries
    • G06F40/247 Lexical tools; Thesauruses; Synonyms
    • G06N3/044 Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a dependency parsing method that fuses multiple data augmentation strategies under low-resource conditions, and belongs to the field of natural language processing. The method comprises the following steps: constructing same-part-of-speech synonym dictionaries for Thai, Vietnamese, and English; expanding the training data by synonym substitution on the small-scale UD (Universal Dependencies treebanks) data sets of the three languages using the synonym dictionaries; and, at different stages of model training, applying several mixup data augmentation strategies that interpolate original words in the training data with their synonyms to generate virtual new words for subsequent training. The invention provides multiple data augmentation strategies for the problem of low-resource dependency parsing. Synonym replacement effectively expands the training data and alleviates the unknown-word problem, while the mixup strategies effectively mitigate model overfitting and improve the generalization ability of the model.

Description

Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions
Technical Field
The invention relates to a dependency parsing method that fuses multiple data augmentation strategies under low-resource conditions, and belongs to the field of natural language processing.
Background
In natural language processing, dependency parsing aims to identify the syntactic dependency relations between the words of a sentence. Dependency syntax provides syntactic features for tasks such as information extraction, question answering, and machine translation, improving the performance of those models.
Existing dependency parsing methods have devoted considerable research to feature encoding, dependency-relation scoring, and decoding, and have effectively improved parsing quality. Under low-resource conditions, however, existing models and methods struggle to produce good results. The problem is particularly evident for low-resource languages such as Thai and Vietnamese: the lack of corpora causes severe unknown-word and model-overfitting problems. Taking the Vietnamese UD (Universal Dependencies) data set as an example, the unknown-word proportion of the test set is 51.7%. We observe that under low-resource conditions the model overfits easily, producing a large gap between training accuracy and test accuracy.
Disclosure of Invention
The invention provides a dependency parsing method that fuses multi-strategy data augmentation under low-resource conditions. It targets dependency parsing for Thai, Vietnamese, small-scale English, and similar low-resource settings, and addresses the poor parsing results caused by scarce corpora, an excessively high unknown-word proportion, and model overfitting.
The technical scheme of the invention is as follows. The dependency parsing method fusing multi-strategy data augmentation under low-resource conditions comprises the following specific steps:
Step1, process the dependency parsing data obtained from the Thai, Vietnamese, and small-scale English corpora of the UD data set, obtain synonym information for the words of the three languages from the BabelNet website, and construct a same-part-of-speech synonym dictionary from this information.
Step2, according to the constructed synonym dictionary, augment the corpora in the UD data set by direct synonym replacement, obtaining expanded dependency-parsing training data for the three languages.
Step3, obtain the same-part-of-speech synonyms of the words in the training data from the constructed synonym dictionary; apply several mixup data augmentation modes at different positions in the biaffine model (after the Embedding stage, after the BiLSTM stage, and after the MLP layer) to interpolate the words in the training data with their synonyms, generating virtual new words; and use the virtual new words for training and for scoring by the scorer.
The specific steps of Step1 are as follows:
Step1.1, collect the words in the training data of the three languages (Thai, Vietnamese, and English) and, for each word, obtain the corresponding synonym information, including synonyms and their parts of speech, from the BabelNet website.
Step1.2, filter and screen the synonym information as follows: (1) group the obtained synonyms by part of speech, ensuring that in the subsequent dictionary-based replacement the substituted word keeps the part of speech of the original; (2) delete from the synonym dictionary those entries whose meanings are not sufficiently close to the original word.
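A minimal sketch of this filtering step, assuming the raw synonym records have already been fetched from BabelNet; the (word, pos, synonym, synonym_pos) record layout and the function name are illustrative assumptions, and the manual screening of near-synonyms in (2) is assumed to happen upstream:

```python
# Sketch of Step1.2(1): keep only same-part-of-speech synonyms.
from collections import defaultdict

def build_synonym_dict(records):
    """records: iterable of (word, pos, synonym, synonym_pos) tuples."""
    syn_dict = defaultdict(set)
    for word, pos, synonym, syn_pos in records:
        if synonym == word:
            continue          # a word cannot replace itself
        if syn_pos != pos:
            continue          # (1) enforce identical part of speech
        syn_dict[(word, pos)].add(synonym)
    return {key: sorted(vals) for key, vals in syn_dict.items()}

records = [("happy", "ADJ", "glad", "ADJ"),
           ("happy", "ADJ", "luck", "NOUN")]   # dropped: POS differs
print(build_synonym_dict(records))             # {('happy', 'ADJ'): ['glad']}
```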
The specific steps of Step2 are as follows:
Step2.1, traverse the words in the training data and match them against the synonym dictionary.
Step2.2, screen the matched synonyms by part of speech and directly replace the original words with them; if several words in one sentence have synonyms, replace all of them. The augmented training data is then used as a new data set for training.
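The replacement pass could look like the following sketch, assuming each training sentence is a list of (form, POS) tokens whose dependency heads and labels are stored alongside and copied unchanged; `augment_sentence` and the first-synonym choice are illustrative, not the patent's exact procedure:

```python
# Sketch of Step2: direct synonym replacement over one sentence.
def augment_sentence(tokens, syn_dict):
    """tokens: list of (form, pos). Returns a synonym-substituted copy,
    or None if the sentence has no replaceable word. The dependency tree
    is reused as-is, since only surface forms change."""
    out, changed = [], False
    for form, pos in tokens:
        synonyms = syn_dict.get((form, pos))
        if synonyms:
            out.append((synonyms[0], pos))   # every matched word is replaced
            changed = True
        else:
            out.append((form, pos))
    return out if changed else None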
As a preferred scheme of the invention, the specific steps of Step3 are as follows:
Step3.1, the concatenation of the word vector and the part-of-speech tag vector is used as the model input. The input vector of the original word is

x_i = e(w_i) ⊕ e(t_i)

and the input vector of the corresponding synonym is

x_i′ = e(d_i) ⊕ e(t_i)

where e(w_i) and e(d_i) are the original-word vector and the synonym vector respectively, and e(t_i) is the part-of-speech tag vector.
Step3.2, applying mixup after the Embedding stage (E_mixup).
Step3.2.1, after the Embedding process the original word yields x_orig and the synonym yields x_syn, and new training data is obtained through the mixup process:

x̃ = λ·w_1 + (1 - λ)·w_2

where w_1 and w_2 denote x_orig and x_syn respectively, and x̃ is the newly obtained virtual training input. λ is a parameter drawn from the Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞), λ ∈ [0, 1].

Step3.2.2, for words without synonyms, both w_1 and w_2 use x_orig, so the training data is generated through the same mixup process.
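A minimal PyTorch sketch of this Embedding-stage mixup under the definitions above; the tensor shapes, the `has_syn` mask, and the default α are assumptions made for illustration:

```python
# Sketch of Step3.2: mixup of original and synonym embeddings (E_mixup).
import torch
from torch.distributions import Beta

def e_mixup(x_orig, x_syn, has_syn, alpha=0.5):
    """x_orig, x_syn: (batch, seq_len, dim) concatenated word+POS embeddings.
    has_syn: (batch, seq_len) bool mask of positions that have a synonym."""
    lam = Beta(alpha, alpha).sample()          # lambda ~ Beta(alpha, alpha)
    # Step3.2.2: words without a synonym interpolate with themselves,
    # so they pass through the same mixup process unchanged.
    x_syn = torch.where(has_syn.unsqueeze(-1), x_syn, x_orig)
    return lam * x_orig + (1.0 - lam) * x_syn  # virtual training input
```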
Step3.2.3, the new training data x̃ is passed through the BiLSTM to obtain the contextual feature r_i, so that each input element can attend to its context.

Step3.2.4, r_i is passed through two different multi-layer perceptrons (MLPs) for dimension reduction, yielding the features h_i^(arc-dep) and h_i^(arc-head):

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i)
Step3.2.5, h^(arc-dep) and h^(arc-head) are fed into the biaffine scorer to obtain the score matrix:

S^(arc) = H^(arc-head) · U^(1) · (H^(arc-dep))ᵀ + H^(arc-head) · u^(2)

where each matrix H stacks the feature vectors h re-encoded by the MLPs, and S^(arc) is the score matrix.
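The pipeline of Step3.2.3 to Step3.2.5 can be sketched as follows, following the deep biaffine attention formulation of the Dozat et al. paper cited below; layer sizes and module names are illustrative:

```python
# Sketch of the BiLSTM -> MLP -> biaffine arc-scoring pipeline.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, in_dim=100, lstm_dim=200, arc_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, lstm_dim, batch_first=True,
                              bidirectional=True)
        self.mlp_dep = nn.Sequential(nn.Linear(2 * lstm_dim, arc_dim), nn.ReLU())
        self.mlp_head = nn.Sequential(nn.Linear(2 * lstm_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(arc_dim, arc_dim))   # bilinear term
        self.u = nn.Parameter(torch.zeros(arc_dim))            # head bias term

    def forward(self, x):                  # x: (batch, seq_len, in_dim)
        r, _ = self.bilstm(x)              # contextual feature r_i (Step3.2.3)
        h_dep = self.mlp_dep(r)            # h^(arc-dep)  (Step3.2.4)
        h_head = self.mlp_head(r)          # h^(arc-head) (Step3.2.4)
        # S[b, i, j]: score of word j being the head of word i (Step3.2.5)
        return (torch.einsum("bid,de,bje->bij", h_dep, self.U, h_head)
                + (h_head @ self.u).unsqueeze(1))

scores = BiaffineArcScorer()(torch.randn(2, 7, 100))   # shape (2, 7, 7)
```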
Step3.3, applying mixup after the BiLSTM stage (B_mixup).
Step3.3.1, the original word and its synonym are both passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual feature r_i for x_orig and r_i′ for x_syn. The two then pass through the mixup process to obtain new training features:

r̃_i = λ_i·w_1 + (1 - λ_i)·w_2

where w_1 and w_2 denote r_i and r_i′ respectively, and r̃_i is the newly obtained virtual feature. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.3.2, for words without synonyms, both w_1 and w_2 use r_i, so the features are generated through the same mixup process.
Step3.3.3, with the new features r̃_i obtained, the subsequent processing is the same as Step3.2.4 and Step3.2.5.
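A sketch of B_mixup under the same assumed shapes as the E_mixup sketch above; the differences are that the interpolation acts on the BiLSTM outputs and that each feature pair draws its own λ_i:

```python
# Sketch of Step3.3: mixup of contextual features (B_mixup).
import torch
from torch.distributions import Beta

def b_mixup(r, r_syn, has_syn, alpha=0.5):
    """r, r_syn: (batch, seq_len, dim) BiLSTM outputs for the original
    sentence and its synonym-substituted copy."""
    seq_len = r.size(1)
    lam = Beta(alpha, alpha).sample((seq_len, 1))         # one lambda_i per pair
    r_syn = torch.where(has_syn.unsqueeze(-1), r_syn, r)  # Step3.3.2 fallback
    return lam * r + (1.0 - lam) * r_syn
```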
Step3.4, applying mixup after the MLP stage (M_mixup).
Step3.4.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual features r_i and r_i′. Both are then passed through the two dimension-reducing MLPs: the original word yields the features h_i^(arc-dep) and h_i^(arc-head), and the synonym yields h_i′^(arc-dep) and h_i′^(arc-head). Finally, mixup at this stage produces the new features that enter the scorer:

h̃_i^(arc-dep) = λ_i·h_i^(arc-dep) + (1 - λ_i)·h_i′^(arc-dep)
h̃_i^(arc-head) = λ_i·h_i^(arc-head) + (1 - λ_i)·h_i′^(arc-head)

where the newly obtained h̃_i^(arc-dep) and h̃_i^(arc-head) are the new virtual features. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.4.2, for words without synonyms, h_i′^(arc-dep) is kept equal to h_i^(arc-dep) and h_i′^(arc-head) is kept equal to h_i^(arc-head).
Step3.4.3, the new training data is obtained and scored in the same way as Step3.2.5.
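A sketch of M_mixup with the same assumed shapes; a single λ_i is shared by the arc-dep and arc-head features of a given word/synonym pair, keeping the two interpolated views consistent:

```python
# Sketch of Step3.4: mixup of MLP features (M_mixup).
import torch
from torch.distributions import Beta

def m_mixup(h_dep, h_dep_s, h_head, h_head_s, has_syn, alpha=0.5):
    """h_*: (batch, seq_len, arc_dim) MLP outputs for the original word (h)
    and its synonym (h_*_s); returns the virtual features fed to the scorer."""
    lam = Beta(alpha, alpha).sample((h_dep.size(1), 1))   # lambda_i per pair
    mask = has_syn.unsqueeze(-1)
    h_dep_s = torch.where(mask, h_dep_s, h_dep)           # Step3.4.2 fallback
    h_head_s = torch.where(mask, h_head_s, h_head)
    return (lam * h_dep + (1.0 - lam) * h_dep_s,
            lam * h_head + (1.0 - lam) * h_head_s)
```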
Step3.5, a further mixup strategy is pairwise combination: any two of E_mixup, B_mixup, and M_mixup are applied together, each in the mode described above; a sketch of this wiring follows below.
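One way the pairwise combination could be wired into a single forward pass, reusing the e_mixup, b_mixup, and m_mixup sketches above; the flag names and the caller-supplied `embed`, `bilstm`, `mlps`, and `score` callables are assumptions, not interfaces defined by the patent:

```python
# Sketch of Step3.5: enabling any two of E_mixup, B_mixup, M_mixup together.
def forward_with_mixup(words, syns, has_syn, embed, bilstm, mlps, score,
                       use_e=False, use_b=False, use_m=False, alpha=0.5):
    x, x_s = embed(words), embed(syns)            # Embedding stage
    if use_e:
        x = x_s = e_mixup(x, x_s, has_syn, alpha)
    r, r_s = bilstm(x), bilstm(x_s)               # BiLSTM stage
    if use_b:
        r = r_s = b_mixup(r, r_s, has_syn, alpha)
    (hd, hh), (hd_s, hh_s) = mlps(r), mlps(r_s)   # two dimension-reducing MLPs
    if use_m:
        hd, hh = m_mixup(hd, hd_s, hh, hh_s, has_syn, alpha)
    return score(hd, hh)                          # biaffine score matrix

# e.g. the E_mixup + B_mixup pairing: use_e=True, use_b=True, use_m=False
```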
The invention has the beneficial effects that:
(1) The data augmentation method based on same-part-of-speech synonym-dictionary replacement effectively alleviates the problems of a high unknown-word proportion and scarce data, and expands the training data.
(2) The several mixup data augmentation strategies effectively alleviate model overfitting and improve the generalization ability of the model.
Drawings
FIG. 1 shows the baseline model used by the invention;
FIG. 2 shows the structure of the direct synonym-replacement method;
FIG. 3 is a flow chart of the low-resource dependency parsing method based on the several mixup data augmentation modes.
Detailed Description
Example 1: as shown in FIG. 1, FIG. 2, and FIG. 3, the dependency parsing method fusing multi-strategy data augmentation under low-resource conditions comprises the following specific steps:
Step1, process the dependency parsing data obtained from the Thai, Vietnamese, and small-scale English corpora of the UD data set, obtain synonym information for the words of the three languages from the BabelNet website, and construct a same-part-of-speech synonym dictionary from this information.
Step1.1, collect the words in the training data of the three languages (Thai, Vietnamese, and English) and, for each word, obtain the corresponding synonym information, including synonyms and their parts of speech, from the BabelNet website.
Step1.2, filter and screen the synonym information as follows: (1) group the obtained synonyms by part of speech, ensuring that in the subsequent dictionary-based replacement the substituted word keeps the part of speech of the original; (2) delete from the synonym dictionary those entries whose meanings are not sufficiently close to the original word.
Step2, according to the constructed synonym dictionary, augment the corpora in the UD data set by direct synonym replacement, obtaining expanded dependency-parsing training data for the three languages.
Step2.1, traverse the words in the training data and match them against the synonym dictionary.
Step2.2, screen the matched synonyms by part of speech and directly replace the original words with them; if several words in one sentence have synonyms, replace all of them. The augmented training data is then used as a new data set for training. The distribution of the data before and after expansion is shown in Table 1:
TABLE 1 Thai, Vietnamese and English dependency parsing data information
(Table 1 is published as an image in the original document; its contents are not reproduced here.)
Step3, obtain the same-part-of-speech synonyms of the words in the training data from the constructed synonym dictionary; apply several mixup data augmentation modes at different positions in the biaffine model (after the Embedding stage, after the BiLSTM stage, and after the MLP layer) to interpolate the words in the training data with their synonyms, generating virtual new words; and use the virtual new words for training and for scoring by the scorer.
Step3.1, the concatenation of the word vector and the part-of-speech tag vector is used as the model input. The input vector of the original word is

x_i = e(w_i) ⊕ e(t_i)

and the input vector of the corresponding synonym is

x_i′ = e(d_i) ⊕ e(t_i)

where e(w_i) and e(d_i) are the original-word vector and the synonym vector respectively, and e(t_i) is the part-of-speech tag vector.
Step3.2, applying mixup after the Embedding stage (E_mixup).
Step3.2.1, after the Embedding process the original word yields x_orig and the synonym yields x_syn, and new training data is obtained through the mixup process:

x̃ = λ·w_1 + (1 - λ)·w_2

where w_1 and w_2 denote x_orig and x_syn respectively, and x̃ is the newly obtained virtual training input. λ is a parameter drawn from the Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞), λ ∈ [0, 1].

Step3.2.2, for words without synonyms, both w_1 and w_2 use x_orig, so the training data is generated through the same mixup process.
Step3.2.3, the new training data x̃ is passed through the BiLSTM to obtain the contextual feature r_i, so that each input element can attend to its context.

Step3.2.4, r_i is passed through two different multi-layer perceptrons (MLPs) for dimension reduction, yielding the features h_i^(arc-dep) and h_i^(arc-head):

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i)
Step3.2.5, h^(arc-dep) and h^(arc-head) are fed into the biaffine scorer to obtain the score matrix:

S^(arc) = H^(arc-head) · U^(1) · (H^(arc-dep))ᵀ + H^(arc-head) · u^(2)

where each matrix H stacks the feature vectors h re-encoded by the MLPs, and S^(arc) is the score matrix.
Step3.3, applying mixup after the BiLSTM stage (B_mixup).
Step3.3.1, the original word and its synonym are both passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual feature r_i for x_orig and r_i′ for x_syn. The two then pass through the mixup process to obtain new training features:

r̃_i = λ_i·w_1 + (1 - λ_i)·w_2

where w_1 and w_2 denote r_i and r_i′ respectively, and r̃_i is the newly obtained virtual feature. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.3.2, for words without synonyms, both w_1 and w_2 use r_i, so the features are generated through the same mixup process.
Step3.3.3, with the new features r̃_i obtained, the subsequent processing is the same as Step3.2.4 and Step3.2.5.
Step3.4, applying mixup after the MLP stage (M_mixup).
Step3.4.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual features r_i and r_i′. Both are then passed through the two dimension-reducing MLPs: the original word yields the features h_i^(arc-dep) and h_i^(arc-head), and the synonym yields h_i′^(arc-dep) and h_i′^(arc-head). Finally, mixup at this stage produces the new features that enter the scorer:

h̃_i^(arc-dep) = λ_i·h_i^(arc-dep) + (1 - λ_i)·h_i′^(arc-dep)
h̃_i^(arc-head) = λ_i·h_i^(arc-head) + (1 - λ_i)·h_i′^(arc-head)

where the newly obtained h̃_i^(arc-dep) and h̃_i^(arc-head) are the new virtual features. λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], and each pair of features participating in mixup is assigned its own λ_i.
Step3.4.2, for words without synonyms, h_i′^(arc-dep) is kept equal to h_i^(arc-dep) and h_i′^(arc-head) is kept equal to h_i^(arc-head).
Step3.4.3, the new training data is obtained and scored in the same way as Step3.2.5.
Step3.5, a further mixup strategy is pairwise combination: any two of E_mixup, B_mixup, and M_mixup are applied together, each in the mode described above.
To illustrate the effect of the invention, two groups of comparative experiments were set up. The first group verifies the effectiveness of the direct synonym-replacement method, and the second group verifies the effectiveness of the several mixup data augmentation methods.
(1) Effectiveness of the direct synonym-replacement method
The baseline model was first trained directly on the small-scale data set; its input was then replaced with the synonym-substituted, expanded version of the data set, and the results were compared. The experimental results are shown in Table 2.
TABLE 2 Results of direct synonym replacement in different languages
(Table 2 is published as an image in the original document; its contents are not reproduced here.)
As Table 2 shows, for Thai the direct synonym replacement of the method improves results on both the validation set and the test set: more than 3 points on the validation set and about 2 points on the test set. For English, the method improves by about 1 point on both the validation and test sets. Direct replacement based on a same-part-of-speech synonym dictionary is thus a comparatively simple and effective data augmentation mode, and it can effectively alleviate the unknown-word problem.
(2) Effectiveness of the several mixup data augmentation methods
The mixup data augmentation methods were applied at different training stages of the baseline model. The experimental results are shown in Table 3:
TABLE 3 Experimental results for different languages and different mixup strategies
(Table 3 is published as an image in the original document; its contents are not reproduced here.)
As Table 3 shows, the different mixup strategies bring performance improvements of different degrees across languages. For Thai, single-stage mixup outperforms the combination strategies on the validation set, while the combination strategies are better on the test set. For Vietnamese and English, the combination strategies perform comparably to single-stage mixup. This indirectly reflects that mixup can smooth the model and effectively improve generalization, but applying several mixup methods does not necessarily stack their gains. The invention also tried applying mixup at all three stages of the model simultaneously, and the effect did not improve further, which confirms the same point. Overall, on both the validation and test sets the improvement on UAS is larger than on LAS, indicating that this data augmentation mode tends to help arc prediction. Comparing across languages, the synonym dictionaries differ in size and quality and the expanded corpora differ in scale, so the observed gains differ as well.
The experimental data demonstrate that the synonym-dictionary replacement method and the several mixup data augmentation methods can effectively alleviate the unknown-word problem, expand the training data, mitigate model overfitting, smooth the model, and improve its generalization ability. The experiments show that the proposed method achieves a clear improvement over the baseline model. For the dependency parsing task under low-resource conditions, the proposed fusion of multi-strategy data augmentation is effective.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A dependency parsing method fusing multi-strategy data augmentation under low-resource conditions, characterized by comprising the following specific steps:
Step1, processing the obtained dependency parsing data, obtaining synonym information for words in several different languages, and constructing a same-part-of-speech synonym dictionary from the synonym information;
Step2, according to the constructed synonym dictionary, augmenting the corpora in the data set by direct synonym replacement to obtain expanded dependency-parsing training data for the several different languages;
Step3, obtaining the same-part-of-speech synonyms of the words in the training data from the constructed synonym dictionary, applying several mixup data augmentation modes at different positions of a biaffine model, after the Embedding stage, the BiLSTM stage, or the MLP stage, to interpolate the words in the training data with their synonyms and generate virtual new words, and using the virtual new words for training and for scoring by a scorer;
Step3 comprises the following steps:
the concatenation of the word vector and the part-of-speech tag vector is adopted as the model input; the input vector of the original word is

x_i = e(w_i) ⊕ e(t_i)

and the input vector of the corresponding synonym is

x_i′ = e(d_i) ⊕ e(t_i)

where e(w_i) and e(d_i) are the original-word vector and the synonym vector respectively, and e(t_i) is the part-of-speech tag vector;
applying mixup after the Embedding stage;
applying mixup after the BiLSTM stage;
applying mixup after the MLP stage;
the specific steps for applying mixup after the Embedding stage are as follows:
Step3.2.1, after the Embedding process, x_orig and x_syn are obtained for the original word and its synonym, and the two pass through the mixup process to obtain new training data:

x̃ = λ·w_1 + (1 - λ)·w_2

where w_1 and w_2 denote x_orig and x_syn respectively, x̃ is the newly obtained virtual training input, and λ is a parameter drawn from the Beta distribution, i.e., λ ~ Beta(α, α), α ∈ (0, ∞), λ ∈ [0, 1];
Step3.2.2, for words without synonyms, both w_1 and w_2 use x_orig, so the training data is generated through the same mixup process;
Step3.2.3, the new training data x̃ is passed through the BiLSTM to obtain the contextual feature r_i, enabling each input element to attend to its context;
Step3.2.4, r_i is passed through two different multi-layer perceptrons (MLPs) for dimension reduction, yielding the features h_i^(arc-dep) and h_i^(arc-head):

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i);
Step3.2.5, h^(arc-dep) and h^(arc-head) are fed into the biaffine scorer to obtain the score matrix:

S^(arc) = H^(arc-head) · U^(1) · (H^(arc-dep))ᵀ + H^(arc-head) · u^(2)

where each matrix H stacks the feature vectors h re-encoded by the MLPs, and S^(arc) is the score matrix;
the specific steps for applying mixup after the BiLSTM stage are as follows:
Step3.3.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual feature r_i for x_orig and r_i′ for x_syn; the two then pass through the mixup process to obtain new training features:

r̃_i = λ_i·w_1 + (1 - λ_i)·w_2

where w_1 and w_2 denote r_i and r_i′ respectively, r̃_i is the newly obtained virtual feature, and λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], each pair of features participating in mixup being assigned its own λ_i;
Step3.3.2, for words without synonyms, both w_1 and w_2 use r_i, so the features are generated through the same mixup process;
Step3.3.3, with the new features r̃_i obtained, the subsequent processing is the same as Step3.2.4 and Step3.2.5;
the specific steps for applying mixup after the MLP stage are as follows:
Step3.4.1, the original word and its synonym are passed through the Embedding process, giving x_orig and x_syn, and then through the BiLSTM stage, giving the contextual features r_i and r_i′; both are then passed through the two dimension-reducing MLPs, the original word yielding the features h_i^(arc-dep) and h_i^(arc-head) and the synonym yielding h_i′^(arc-dep) and h_i′^(arc-head); finally, mixup at this stage produces the new features that enter the scorer:

h̃_i^(arc-dep) = λ_i·h_i^(arc-dep) + (1 - λ_i)·h_i′^(arc-dep)
h̃_i^(arc-head) = λ_i·h_i^(arc-head) + (1 - λ_i)·h_i′^(arc-head)

where the newly obtained h̃_i^(arc-dep) and h̃_i^(arc-head) are the new virtual features, and λ_i follows the Beta distribution, i.e., λ_i ~ Beta(α, α), α ∈ (0, ∞), λ_i ∈ [0, 1], each pair of features participating in mixup being assigned its own λ_i;
Step3.4.2, for words without synonyms, h_i′^(arc-dep) is kept equal to h_i^(arc-dep) and h_i′^(arc-head) is kept equal to h_i^(arc-head);
Step3.4.3, the new training data is obtained and scored in the same way as Step3.2.5.
2. The dependency parsing method fusing multi-strategy data augmentation under low-resource conditions according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, collecting the words in the training data of the three languages, Thai, Vietnamese, and English, and obtaining for each word the corresponding synonym information, including synonyms and their parts of speech, from the BabelNet website;
Step1.2, filtering and screening the synonym information as follows: (1) grouping the obtained synonyms by part of speech, ensuring that in the subsequent dictionary-based replacement the substituted word keeps the part of speech of the original; (2) deleting from the synonym dictionary those entries whose meanings are not sufficiently close to the original word.
3. The dependency parsing method fusing multi-strategy data augmentation under low-resource conditions according to claim 1, characterized in that the specific steps of Step2 are as follows:
Step2.1, traversing the words in the training data and matching them against the synonym dictionary;
Step2.2, screening the matched synonyms by part of speech, directly replacing the original words with them, replacing several words when several words in one sentence have synonyms, and training with the expanded training data as a new data set.
CN202111078682.4A 2021-09-15 2021-09-15 Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions Active CN113901791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111078682.4A CN113901791B (en) Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111078682.4A CN113901791B (en) Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Publications (2)

Publication Number Publication Date
CN113901791A (en) 2022-01-07
CN113901791B (en) 2022-09-23

Family

ID=79028490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111078682.4A Active CN113901791B (en) Dependency parsing method fusing multi-strategy data augmentation under low-resource conditions

Country Status (1)

Country Link
CN (1) CN113901791B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114611487B (en) * 2022-03-10 2022-12-13 昆明理工大学 Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737974A (en) * 2020-08-18 2020-10-02 北京擎盾信息科技有限公司 Semantic abstract representation method and device for statement
CN112069799A (en) * 2020-09-14 2020-12-11 深圳前海微众银行股份有限公司 Dependency syntax based data enhancement method, apparatus and readable storage medium
CN112232024A (en) * 2020-10-13 2021-01-15 苏州大学 Dependency syntax analysis model training method and device based on multi-labeled data
CN112765956A (en) * 2021-01-22 2021-05-07 大连民族大学 Dependency syntax analysis method based on multi-task learning and application
CN112860781A (en) * 2021-02-05 2021-05-28 陈永朝 Mining and displaying method combining vocabulary collocation extraction and semantic classification
CN112699665A (en) * 2021-03-25 2021-04-23 北京智源人工智能研究院 Triple extraction method and device of safety report text and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dozat, Timothy, et al. "Deep Biaffine Attention for Neural Dependency Parsing." arXiv, 2017-03-10, pp. 1-8. *
Li Ying, et al. "Research on Converting Vietnamese Phrase-Structure Trees to Dependency Trees" (越南语短语树到依存树的转换研究). Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2016-04-19, pp. 1-9. *

Also Published As

Publication number Publication date
CN113901791A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN110532554B (en) Chinese abstract generation method, system and storage medium
Zhao et al. Semi-supervised text simplification with back-translation and asymmetric denoising autoencoders
Hansen et al. The Copenhagen Team Participation in the Check-Worthiness Task of the Competition of Automatic Identification and Verification of Claims in Political Debates of the CLEF-2018 CheckThat! Lab.
WO2022088570A1 (en) Method and apparatus for post-editing of translation, electronic device, and storage medium
Vakili Tahami et al. Distilling knowledge for fast retrieval-based chat-bots
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN111310470A (en) Chinese named entity recognition method fusing word and word features
CN112926337B (en) End-to-end aspect level emotion analysis method combined with reconstructed syntax information
Fei et al. CQG: A simple and effective controlled generation framework for multi-hop question generation
CN113901791B (en) Enhanced dependency syntax analysis method for fusing multi-strategy data under low-resource condition
CN115048936A (en) Method for extracting aspect-level emotion triple fused with part-of-speech information
CN115114940A (en) Machine translation style migration method and system based on curriculum pre-training
CN113204978A (en) Machine translation enhancement training method and system
CN107992479A (en) Word rank Chinese Text Chunking method based on transfer method
CN117218503A (en) Cross-Han language news text summarization method integrating image information
Cai et al. Revisiting pivot-based paraphrase generation: Language is not the only optional pivot
Gain et al. Low resource chat translation: A benchmark for Hindi–English language pair
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism
Guo et al. The HW-TSC’s Simultaneous Speech-to-Text Translation system for IWSLT 2023 evaluation
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN115374258A (en) Knowledge base query method and system combining semantic understanding with question template
Mi et al. Recurrent neural network based loanwords identification in Uyghur
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
Zhang et al. Language-agnostic and language-aware multilingual natural language understanding for large-scale intelligent voice assistant application

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant