CN113901831A - Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention - Google Patents
Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention
- Publication number
- CN113901831A (application CN202111082587.1A)
- Authority
- CN
- China
- Prior art keywords
- language
- semantic
- attention
- parallel
- cross
- Prior art date: 2021-09-15
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/30: Semantic analysis (G06F40/00 Handling natural language data)
- G06F18/22: Matching criteria, e.g. proximity measures (G06F18/20 Analysing; Pattern recognition)
- G06F18/24133: Distances to prototypes (classification based on distances to training or reference patterns)
- G06F18/24137: Distances to cluster centroids
- G06F18/2414: Smoothing the distance, e.g. radial basis function networks [RBFN]
- G06F18/2415: Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/253: Fusion techniques of extracted features
- G06N3/044: Recurrent networks, e.g. Hopfield networks (G06N3/04 Architecture, e.g. interconnection topology)
- G06N3/045: Combinations of networks
- G06N3/047: Probabilistic or stochastic networks
- G06N3/08: Learning methods
Abstract
The invention relates to a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, and belongs to the field of natural language processing. The method comprises: constructing a comparable corpus data set; obtaining bilingual representations of the source language and the target language with a pre-training language model, and then achieving cross-language semantic alignment of the feature spaces with a bidirectional interaction attention mechanism; and finally judging the relationship of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs according to deep semantic consistency. Experimental results show that the proposed method can effectively identify semantically consistent bilingual parallel sentences in noisy data, and the extracted bilingual parallel sentences provide support for subsequent machine translation.
Description
Technical Field
The invention relates to a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, and belongs to the field of natural language processing.
Background
The performance of neural machine translation depends on large amounts of high-quality parallel data. For mainstream language pairs such as German-English and French-English, resource-rich parallel corpora support academic research, and the translation quality for these languages has improved remarkably in recent years, approaching that of human translation. However, for many non-mainstream language pairs, machine translation performance is severely limited by the lack of large-scale, high-quality parallel sentence pairs.
Parallel sentence pair extraction matches sentences of two languages based on semantic similarity. The task aims to find parallel sentences in a comparable corpus to serve as training data for natural language processing tasks such as neural machine translation, and its core is to align the bilingual semantic spaces in a cross-language space so that semantic consistency can be judged. Traditional semantic space alignment methods use different neural network structures to learn vector representations of cross-language sentences and judge sentence similarity in a shared vector space. For web data containing a large amount of noise, the limited amount and domain coverage of training data make it difficult to generate good semantic representations, which harms semantic alignment. A cross-language pre-training model, by contrast, can align bilingual semantics well in a common semantic space. The invention therefore uses a pre-training language model as prior knowledge and performs semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism. Bilingual parallel sentences with consistent deep semantics are extracted from the comparable corpus to expand the bilingual parallel corpus, thereby alleviating the lack of training data for low-resource languages.
Disclosure of Invention
The invention provides a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, which extracts bilingual parallel sentences with consistent deep semantics from a comparable corpus to expand the bilingual parallel corpus, thereby alleviating the lack of training data for low-resource language pairs and improving the prediction of parallel sentence pairs.
The technical scheme of the invention is as follows: a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention comprises the following specific steps:
Step 1: collecting and constructing Chinese-Vietnamese parallel data through web crawler technology, obtaining non-parallel data by negative sampling, and manually annotating the data set to obtain a Chinese-Vietnamese comparable corpus data set, wherein the main sources of the Chinese-Vietnamese parallel data include Wikipedia, bilingual news websites, and movie subtitles.
Step 2: encoding the source language and the target language respectively through a pre-training language model to obtain semantic representations; then performing spatial semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism; and finally judging the relationship of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs in a noisy environment based on bilingual representation consistency.
As a further scheme of the present invention, the Step1 specifically comprises the following steps:
Step 1.1: acquiring the Chinese-Vietnamese parallel data through web crawler technology; data sources include Wikipedia, bilingual news websites, and movie subtitles;
Step 1.2: cleaning and aligning the crawled data and using it as positive samples for model training; to keep the number of samples balanced, constructing a corresponding negative sample for each positive sample through negative sampling, each pair of parallel sentences being randomly sampled to generate a negative sample; the comparable corpus consisting of the positive and negative samples is used to train the model;
Step 1.3: manually annotating to obtain the Chinese-Vietnamese comparable corpus data set; the final data consists of 2n triplets (S_i, T_j, y_i), where n is the number of parallel sentences and also the number of non-parallel sentences, S_i denotes a source-language sentence, T_j denotes a target-language sentence, and y_i is the label indicating whether S_i and T_j are translations of each other; the label is set to 1 when the source-language sentence and the target-language sentence are parallel sentences and to 0 when they are non-parallel sentences.
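For illustration only (not part of the claimed subject matter), a minimal sketch of the corpus construction in Steps 1.2-1.3 is given below; the function name, data layout, and sampling details are assumptions rather than specifics taken from the patent.

```python
import random

def build_comparable_corpus(parallel_pairs, seed=42):
    """Build 2n (source, target, label) triplets from n crawled parallel pairs:
    each aligned pair becomes a positive sample (label 1), and a randomly
    mismatched target sentence yields the corresponding negative sample (label 0)."""
    random.seed(seed)
    targets = [tgt for _, tgt in parallel_pairs]
    triplets = []
    for src, tgt in parallel_pairs:
        triplets.append((src, tgt, 1))            # positive sample
        j = random.randrange(len(targets))
        while targets[j] == tgt:                  # avoid picking the true translation
            j = random.randrange(len(targets))
        triplets.append((src, targets[j], 0))     # negative sample
    random.shuffle(triplets)
    return triplets
```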
As a further scheme of the present invention, in Step 2, encoding the source language and the target language respectively through the pre-training language model to obtain semantic representations comprises the following steps:
The source language and the target language are encoded separately through a cross-language text semantic coding layer. A source-language sentence of length l_s is denoted S_i = {x_1, x_2, ..., x_{l_s}}, i ∈ M, and a target-language sentence of length l_t is denoted T_j = {y_1, y_2, ..., y_{l_t}}, j ∈ N, where M and N are the total numbers of sentences. Using pre-trained multilingual BERT as the bilingual encoder, the source and target languages are represented by the semantic coding layer as:
v_s = mBERT(S_i) (1)
v_t = mBERT(T_j) (2)
where v_s ∈ R^(l_s×d) is the vector representation of the encoded source language, v_t ∈ R^(l_t×d) is the vector representation of the encoded target language, and d is the word vector dimension of the words in the source- and target-language sentences; the generated vectors are used as the input of the next cross-language semantic alignment layer.
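For illustration, the semantic coding layer of equations (1)-(2) could be realized with the publicly available multilingual BERT checkpoint as sketched below; the checkpoint name, maximum length, and example sentences are assumptions made for the sketch, not requirements of the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

def encode(sentence: str) -> torch.Tensor:
    """Return token-level semantic representations of shape (sentence_length, d), d = 768."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)

v_s = encode("他们正在讨论机器翻译的训练数据。")                   # source-language sentence
v_t = encode("Họ đang thảo luận về dữ liệu huấn luyện dịch máy.")  # target-language sentence
```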
As a further aspect of the present invention, in Step2, performing spatial semantic alignment on the obtained semantic representations by using a bidirectional interaction attention mechanism includes:
Constrained by the linguistic characteristics of different languages, the word orders of the two languages do not correspond exactly. Since the self-attention mechanism directly computes the semantic correlation between word pairs without being affected by word position, the invention adopts an improved bidirectional interaction attention mechanism on this basis to capture the semantic interaction between cross-language texts, maps the source language and the target language into a common semantic space for spatial semantic alignment, and uses multiple parallel attention heads so that the model attends to semantic information at different levels;
After the semantic representations of the source language and the target language are obtained, for simplicity v_s is used below as a general notation for the semantic representation of a source-language sentence and v_t for that of a target-language sentence; the attention calculation in the source-to-target direction is shown in equations (3)-(5):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (3)
v′_s = MultiHead(v_s, v_t, v_t) = Concat(head_1, ..., head_h)W^O (4)
head_i = Attention(v_s W_i^1, v_t W_i^2, v_t W_i^3) (5)
where i indexes the attention heads; W_i^1, W_i^2, W_i^3 are the parameter matrices of the i-th head; h is the number of attention heads, set to 8 in the experiments, with each head of dimension 64; W^O is the parameter matrix that linearly projects the concatenated outputs of all heads; and v′_s is the source-language encoding output by the cross-language semantic alignment layer;
The attention calculation in the target-to-source direction is shown in equations (6)-(8):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (6)
v′_t = MultiHead(v_t, v_s, v_s) = Concat(head_1, ..., head_h)W^O (7)
head_i = Attention(v_t W_i^1, v_s W_i^2, v_s W_i^3) (8)
where i indexes the attention heads; W_i^1, W_i^2, W_i^3 are the parameter matrices of the i-th head; h is the number of attention heads, set to 8 in the experiments, with each head of dimension 64; W^O is the parameter matrix that linearly projects the concatenated outputs of all heads; v′_t is the target-language encoding output by the cross-language semantic alignment layer; and d_k is the model vector dimension.
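For illustration, the bidirectional interaction attention of equations (3)-(8) can be sketched with two standard multi-head cross-attention modules as below; the class and argument names are hypothetical, and the embedding dimension follows the mBERT output (768) rather than the 8×64 head layout reported in the experiments, which would additionally require a projection.

```python
import torch
import torch.nn as nn

class BidirectionalInteractionAttention(nn.Module):
    """Cross-attends source to target and target to source (a sketch)."""

    def __init__(self, d_model: int = 768, num_heads: int = 8):
        super().__init__()
        self.s2t = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.t2s = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, v_s: torch.Tensor, v_t: torch.Tensor):
        # Eq. (4)-(5): queries from the source, keys and values from the target.
        v_s_aligned, _ = self.s2t(query=v_s, key=v_t, value=v_t)
        # Eq. (7)-(8): queries from the target, keys and values from the source.
        v_t_aligned, _ = self.t2s(query=v_t, key=v_s, value=v_s)
        return v_s_aligned, v_t_aligned

# Usage: v_s, v_t are (batch, length, 768) tensors from the encoder.
# v_s_aligned, v_t_aligned = BidirectionalInteractionAttention()(v_s, v_t)
```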
As a further aspect of the present invention, in Step 2, extracting parallel sentence pairs in a noisy environment based on bilingual representation consistency comprises:
The cross-language semantic fusion layer compares the similarity of the global representation and the aligned representation of the semantic vectors from multiple perspectives. Three fusion strategies are designed: the original semantic information and the aligned semantic information are concatenated, subtracted element-wise, and multiplied element-wise, and all vector matrices are projected into the same space to obtain the final output of the semantic fusion layer. The source-language output V_s of the semantic fusion layer is computed as follows:
Vs1=G1([vs;v′s]) (9)
Vs2=G2([vs;vs-v′s]) (10)
Vs3=G3([vs;vs⊙v′s]) (11)
Vs=G([Vs1;Vs2;Vs3]) (12)
where v′_s is the source-language encoding output by the cross-language semantic alignment layer, v_s is the general notation for the semantic representation of a source-language sentence, G_1, G_2, G_3, and G are single-layer feed-forward neural networks with independent parameters, and ⊙ denotes element-wise multiplication; the element-wise subtraction measures the difference between the feature vectors, while the element-wise multiplication highlights their similarity. The target-language output V_t of the semantic fusion layer is computed in the same way as the source-language output V_s, so the formulas are omitted;
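A sketch of the multi-view fusion of equations (9)-(12) follows, for illustration only; the use of ReLU activations and the exact layer sizes are assumptions, since the patent only specifies single-layer feed-forward networks G_1, G_2, G_3, and G.

```python
import torch
import torch.nn as nn

class SemanticFusionLayer(nn.Module):
    """Concatenation, difference, and element-wise product views, then a joint projection."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.g1 = nn.Linear(2 * d_model, d_model)  # Eq. (9): [v; v']
        self.g2 = nn.Linear(2 * d_model, d_model)  # Eq. (10): [v; v - v']
        self.g3 = nn.Linear(2 * d_model, d_model)  # Eq. (11): [v; v ⊙ v']
        self.g = nn.Linear(3 * d_model, d_model)   # Eq. (12): joint projection
        self.act = nn.ReLU()

    def forward(self, v: torch.Tensor, v_aligned: torch.Tensor) -> torch.Tensor:
        f1 = self.act(self.g1(torch.cat([v, v_aligned], dim=-1)))
        f2 = self.act(self.g2(torch.cat([v, v - v_aligned], dim=-1)))
        f3 = self.act(self.g3(torch.cat([v, v * v_aligned], dim=-1)))
        return self.act(self.g(torch.cat([f1, f2, f3], dim=-1)))
```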
The output of the semantic fusion layer is compressed by max pooling to obtain vector representations O_s and O_t of the source language and the target language, which are used as the input of the semantic prediction layer to obtain the semantic relevance probability distribution of the cross-language sentence pair:
ŷ = softmax(H([O_s; O_t]))
where H is a multi-layer feed-forward neural network and ŷ is the probability score over all classes; according to the probability distribution ŷ, the model distinguishes whether the input source-language and target-language sentence pair is parallel. The cross-entropy loss function is:
L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]
where y_i is the true label and ŷ_i is the prediction; after the prediction for a cross-language sentence pair is obtained, the model is trained by minimizing the cross-entropy between the prediction and the true label as the loss function.
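The prediction layer and training objective can be sketched as follows, purely for illustration; how O_s and O_t are combined before the feed-forward network H, and the hidden size, are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class SemanticPredictionLayer(nn.Module):
    """Max pooling over tokens, then a multi-layer feed-forward classifier H."""

    def __init__(self, d_model: int = 768, hidden: int = 256):
        super().__init__()
        self.h = nn.Sequential(
            nn.Linear(2 * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # two classes: parallel / non-parallel
        )

    def forward(self, fused_s: torch.Tensor, fused_t: torch.Tensor) -> torch.Tensor:
        o_s = fused_s.max(dim=1).values   # feature compression by max pooling
        o_t = fused_t.max(dim=1).values
        return self.h(torch.cat([o_s, o_t], dim=-1))   # class logits

# Training (sketch): logits = layer(fused_s, fused_t)
# loss = nn.CrossEntropyLoss()(logits, labels)   # labels: 1 = parallel, 0 = non-parallel
```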
The invention has the beneficial effects that:
(1) For the problems of semantic representation and deep semantic alignment with noisy data, a cross-language text semantic matching method based on a pre-training language model and bidirectional interaction attention is used, and parallel sentence pairs are extracted in a noisy environment based on bilingual semantic representation consistency.
(2) The bilingual semantic representations of cross-language sentence pairs are obtained with the strong semantic representation capability of the pre-training language model, which generates better contextualized word representations for cross-language sentences.
(3) The bilingual semantic interaction is learned with a bidirectional interaction attention mechanism to achieve semantic alignment in a common semantic space; the feature representations of cross-language sentence pairs are compared from multiple perspectives to further measure semantic consistency and improve prediction.
Drawings
Fig. 1 is a schematic structural diagram of the parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention according to the present invention.
Detailed Description
Example 1: as shown in Fig. 1, the parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention specifically comprises the following steps:
Step 1: collecting and constructing Chinese-Vietnamese parallel data through web crawler technology, obtaining non-parallel data by negative sampling, and manually annotating the data set to obtain a Chinese-Vietnamese comparable corpus data set, wherein the main sources of the Chinese-Vietnamese parallel data include Wikipedia, bilingual news websites, and movie subtitles.
As a further scheme of the present invention, the Step1 specifically comprises the following steps:
Step 1.1: acquiring the Chinese-Vietnamese parallel data through web crawler technology; data sources include Wikipedia, bilingual news websites, and movie subtitles;
Step 1.2: cleaning and aligning the crawled data and using it as positive samples for model training; to keep the number of samples balanced, constructing a corresponding negative sample for each positive sample through negative sampling, each pair of parallel sentences being randomly sampled to generate a negative sample; the comparable corpus consisting of the positive and negative samples is used to train the model;
Step 1.3: manually annotating to obtain the Chinese-Vietnamese comparable corpus data set; the final data consists of 2n triplets (S_i, T_j, y_i), where n is the number of parallel sentences and also the number of non-parallel sentences, S_i denotes a source-language sentence, T_j denotes a target-language sentence, and y_i is the label indicating whether S_i and T_j are translations of each other; the label is set to 1 when the source-language sentence and the target-language sentence are parallel sentences and to 0 when they are non-parallel sentences. The scale of the experimental corpus is shown in Table 1:
TABLE 1 statistical information of the experimental data
Step 2: encoding the source language and the target language respectively through a pre-training language model to obtain semantic representations; then performing spatial semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism; and finally judging the relationship of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs in a noisy environment based on bilingual representation consistency.
As a further scheme of the present invention, in Step 2, encoding the source language and the target language respectively through the pre-training language model to obtain semantic representations comprises the following steps:
Step 2.1: the source language and the target language are encoded separately through a cross-language text semantic coding layer. A source-language sentence of length l_s is denoted S_i = {x_1, x_2, ..., x_{l_s}}, i ∈ M, and a target-language sentence of length l_t is denoted T_j = {y_1, y_2, ..., y_{l_t}}, j ∈ N, where M and N are the total numbers of sentences. Using pre-trained multilingual BERT as the bilingual encoder, the source and target languages are represented by the semantic coding layer as:
v_s = mBERT(S_i) (1)
v_t = mBERT(T_j) (2)
where v_s ∈ R^(l_s×d) is the vector representation of the encoded source language, v_t ∈ R^(l_t×d) is the vector representation of the encoded target language, and d is the word vector dimension of the words in the source- and target-language sentences; the generated vectors are used as the input of the next cross-language semantic alignment layer.
As a further aspect of the present invention, in Step2, performing spatial semantic alignment on the obtained semantic representations by using a bidirectional interaction attention mechanism includes:
Step 2.2: constrained by the linguistic characteristics of different languages, the word orders of the two languages do not correspond exactly. Since the self-attention mechanism directly computes the semantic correlation between word pairs without being affected by word position, the invention adopts an improved bidirectional interaction attention mechanism on this basis to capture the semantic interaction between cross-language texts, maps the source language and the target language into a common semantic space for spatial semantic alignment, and uses multiple parallel attention heads so that the model attends to semantic information at different levels;
After the semantic representations of the source language and the target language are obtained, for simplicity v_s is used below as a general notation for the semantic representation of a source-language sentence and v_t for that of a target-language sentence; the attention calculation in the source-to-target direction is shown in equations (3)-(5):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (3)
v′_s = MultiHead(v_s, v_t, v_t) = Concat(head_1, ..., head_h)W^O (4)
head_i = Attention(v_s W_i^1, v_t W_i^2, v_t W_i^3) (5)
where i indexes the attention heads; W_i^1, W_i^2, W_i^3 are the parameter matrices of the i-th head; h is the number of attention heads, set to 8 in the experiments, with each head of dimension 64; W^O is the parameter matrix that linearly projects the concatenated outputs of all heads; and v′_s is the source-language encoding output by the cross-language semantic alignment layer;
The attention calculation in the target-to-source direction is shown in equations (6)-(8):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (6)
v′_t = MultiHead(v_t, v_s, v_s) = Concat(head_1, ..., head_h)W^O (7)
head_i = Attention(v_t W_i^1, v_s W_i^2, v_s W_i^3) (8)
where i indexes the attention heads; W_i^1, W_i^2, W_i^3 are the parameter matrices of the i-th head; h is the number of attention heads, set to 8 in the experiments, with each head of dimension 64; W^O is the parameter matrix that linearly projects the concatenated outputs of all heads; v′_t is the target-language encoding output by the cross-language semantic alignment layer; and d_k is the model vector dimension.
As a further aspect of the present invention, in Step 2, extracting parallel sentence pairs in a noisy environment based on bilingual representation consistency comprises:
Step 2.3: the cross-language semantic fusion layer compares the global representation and the aligned representation of the semantic vectors from multiple perspectives. Three fusion strategies are designed: the original semantic information and the aligned semantic information are concatenated, subtracted element-wise, and multiplied element-wise, and all vector matrices are projected into the same space to obtain the final output of the semantic fusion layer. The source-language output V_s of the semantic fusion layer is computed as follows:
Vs1=G1([vs;v′s]) (9)
Vs2=G2([vs;vs-v′s]) (10)
Vs3=G3([vs;vs⊙v′s]) (11)
Vs=G([Vs1;Vs2;Vs3]) (12)
where v′_s is the source-language encoding output by the cross-language semantic alignment layer, v_s is the general notation for the semantic representation of a source-language sentence, G_1, G_2, G_3, and G are single-layer feed-forward neural networks with independent parameters, and ⊙ denotes element-wise multiplication; the element-wise subtraction measures the difference between the feature vectors, while the element-wise multiplication highlights their similarity. The target-language output V_t of the semantic fusion layer is computed in the same way as the source-language output V_s, so the formulas are omitted;
Step 2.4: the output of the semantic fusion layer is compressed by max pooling to obtain vector representations O_s and O_t of the source language and the target language, which are used as the input of the semantic prediction layer to obtain the semantic relevance probability distribution of the cross-language sentence pair:
ŷ = softmax(H([O_s; O_t]))
where H is a multi-layer feed-forward neural network and ŷ is the probability score over all classes; according to the probability distribution ŷ, the model distinguishes whether the input source-language and target-language sentence pair is parallel. The cross-entropy loss function is:
L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]
where y_i is the true label and ŷ_i is the prediction; after the prediction for a cross-language sentence pair is obtained, the model is trained by minimizing the cross-entropy between the prediction and the true result as the loss function.
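Putting the pieces together, an end-to-end sketch of the pipeline in Fig. 1 (encoding, bidirectional interaction attention, multi-view fusion, pooling, prediction) might look like the following, for illustration only; all class, argument, and checkpoint names are assumptions, the fusion parameters are shared between the two languages in this sketch, and the dimensions follow multilingual BERT rather than the exact head sizes reported in the experiments.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ParallelSentencePairModel(nn.Module):
    """Illustrative end-to-end model: mBERT -> bidirectional cross-attention ->
    three-view fusion -> max pooling -> two-class prediction."""

    def __init__(self, d_model: int = 768, num_heads: int = 8, hidden: int = 256):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.s2t = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.t2s = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.g = nn.ModuleList([nn.Linear(2 * d_model, d_model) for _ in range(3)])
        self.proj = nn.Linear(3 * d_model, d_model)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def _fuse(self, v, v_al):
        views = [torch.cat([v, v_al], -1),
                 torch.cat([v, v - v_al], -1),
                 torch.cat([v, v * v_al], -1)]           # Eq. (9)-(11)
        f = torch.cat([torch.relu(g(x)) for g, x in zip(self.g, views)], -1)
        return torch.relu(self.proj(f))                   # Eq. (12)

    def forward(self, src_inputs: dict, tgt_inputs: dict) -> torch.Tensor:
        v_s = self.encoder(**src_inputs).last_hidden_state        # Eq. (1)
        v_t = self.encoder(**tgt_inputs).last_hidden_state        # Eq. (2)
        v_s_al, _ = self.s2t(v_s, v_t, v_t)                        # Eq. (3)-(5)
        v_t_al, _ = self.t2s(v_t, v_s, v_s)                        # Eq. (6)-(8)
        o_s = self._fuse(v_s, v_s_al).max(dim=1).values            # max pooling
        o_t = self._fuse(v_t, v_t_al).max(dim=1).values
        return self.classifier(torch.cat([o_s, o_t], -1))          # class logits

# Training step (sketch):
# loss = nn.CrossEntropyLoss()(model(src_batch, tgt_batch), labels)
```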
To illustrate the effect of the present invention, three groups of comparative experiments were set up. The first group verifies the effectiveness of the pre-training language model, the second group verifies the effectiveness of the model at different data scales, and the third group verifies the effectiveness of each module of the model.
(1) Performance enhancement verification based on pre-training language model
In the baseline models, the pre-training language model is replaced with other neural network structures, such as a recurrent neural network, a convolutional neural network, and a Transformer, to learn the semantic representations of cross-language sentences; for each comparison model, the parameters that give its best performance are used, with all other conditions kept consistent. The experimental results are shown in Table 2.
TABLE 2 comparison of Performance of baseline models
Analysis of Table 2 shows that the performance of the pre-trained model is significantly better than that of the other models. The main reason is that data scale plays a crucial role in neural network performance: the pre-training language model learns word representations from massive data and produces more contextualized vector representations, and therefore better semantic representations of the input text. The results in Table 2 also show that bidirectional network structures outperform unidirectional ones, because a bidirectional network better characterizes the contextual information of the input text. In addition, the F1 value of the Transformer-based model is 1.92% lower than that of the RNN-based model; this is attributed to the complex network structure having too many parameters for the limited training data, so the model is insufficiently trained and does not fit well.
(2) Comparative experiment of different data scales
To further demonstrate the influence of data scale on model performance, 200,000 additional training samples are added to the data set of the invention and each model is retrained; the experimental results in Table 3 show the performance of each model on the larger-scale data set.
TABLE 3 results of model experiments on larger data scale
Analysis of Table 3 shows that with larger-scale training data the performance of every method improves to varying degrees. The Transformer-based model and other structurally complex methods gradually surpass structurally simple models such as the recurrent neural network. The main reason is that a structurally complex model can better learn the contextual information of the text and produce better feature representations, but its larger number of parameters makes it more sensitive to the scale of the training data, so it struggles to reach an ideal effect with small-scale data. By contrast, the performance of the proposed method improves the least from the additional data, because it already achieves comparable performance at the smaller data scale. This demonstrates that the method based on the pre-training language model achieves good results with a smaller amount of data.
(3) Effectiveness experiments on the different modules of the model
To analyze the specific contribution of each module in the model of the invention, ablation experiments were performed on the overall model. In the comparison, the complete model is simplified or altered to obtain several variant models for the different components. The effectiveness of the different modules is analyzed from the experimental results in Table 4.
TABLE 4 ablation model test results
Analysis of Table 4 shows that the method of the invention is superior to the other variant models in accuracy, recall, and F1 value. Compared with the EAFP model of the invention (the full model), the variant with the semantic alignment layer and the semantic fusion layer removed loses 12.28% in accuracy and 8.98% in F1 value; the reason is that, with only the original semantic representations, deep semantic features are not well distinguished, and the large drop in accuracy indicates that more samples are classified as negative when the original semantic features differ greatly. When the semantic alignment layer and the semantic fusion layer are added, the performance of the model improves markedly. The F1 value of the EAP model drops by 7.98%, because the semantic fusion layer compares the local representation and the aligned representation of the semantics from multiple perspectives, which better distinguishes the similarities and differences of semantic features and ensures accurate results. Compared with EAFP-A and EAFP-B, the F1 value of the model of the invention is higher by 1.79% and 2.19%, respectively, showing that the bidirectional interaction attention mechanism better learns the interaction information between cross-language sentences. The overall results show that every module of the model plays an important role in its performance, and removing any one of them degrades it.
The experimental data demonstrate that the cross-language text semantic matching method based on a pre-training language model and bidirectional interaction attention can extract high-quality, semantically consistent parallel sentence pairs even though web resources contain large amounts of noisy data. With limited resources, the pre-training language model generates contextualized semantic representations for cross-language sentence pairs, bidirectional interaction attention aligns them semantically in a common semantic space, and finally the relationship of the cross-language sentence pair is judged. Bilingual parallel sentences with consistent deep semantics are extracted from the comparable corpus to expand the bilingual parallel corpus, thereby alleviating the lack of training data for low-resource languages. The experimental results show that the method outperforms the other models; for the parallel sentence pair extraction task, the method based on a pre-training language model and bidirectional interaction attention effectively improves extraction performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (5)
1. The parallel sentence pair extraction method based on the pre-training language model and the bidirectional interaction attention is characterized by comprising the following steps of:
the method comprises the following specific steps:
Step 1: collecting and constructing Chinese-Vietnamese parallel data through web crawler technology, obtaining non-parallel data by negative sampling, and manually labeling the data set to obtain a Chinese-Vietnamese comparable corpus data set;
Step 2: encoding the source language and the target language respectively through a pre-training language model to obtain semantic representations; then performing spatial semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism; and finally judging the relationship of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs in a noisy environment based on bilingual representation consistency.
2. The parallel sentence pair extraction method based on the pre-training language model and bidirectional interaction attention of claim 1, wherein Step 1 specifically comprises:
Step 1.1: acquiring the Chinese-Vietnamese parallel data through web crawler technology;
Step 1.2: cleaning and aligning the crawled data and using it as positive samples for model training; to keep the number of samples balanced, constructing a corresponding negative sample for each positive sample through negative sampling, each pair of parallel sentences being randomly sampled to generate a negative sample; and training the model with the comparable corpus consisting of the positive and negative samples;
Step 1.3: manually annotating to obtain the Chinese-Vietnamese comparable corpus data set; the final data consists of 2n triplets (S_i, T_j, y_i), where n is the number of parallel sentences and also the number of non-parallel sentences, S_i denotes a source-language sentence, T_j denotes a target-language sentence, and y_i is the label indicating whether S_i and T_j are translations of each other; the label is set to 1 when the source-language sentence and the target-language sentence are parallel sentences and to 0 when they are non-parallel sentences.
3. The parallel sentence pair extraction method based on the pre-training language model and bidirectional interaction attention of claim 1, wherein in Step 2, encoding the source language and the target language respectively through the pre-training language model to obtain semantic representations comprises:
encoding the source language and the target language separately through a cross-language text semantic coding layer, a source-language sentence of length l_s being denoted S_i = {x_1, x_2, ..., x_{l_s}}, i ∈ M, and a target-language sentence of length l_t being denoted T_j = {y_1, y_2, ..., y_{l_t}}, j ∈ N, where M and N are the total numbers of sentences; using pre-trained multilingual BERT as the bilingual encoder, the source and target languages are represented by the semantic coding layer as:
v_s = mBERT(S_i) (1)
v_t = mBERT(T_j) (2)
where v_s ∈ R^(l_s×d), i ∈ M, is the vector representation of the encoded source language, v_t ∈ R^(l_t×d), j ∈ N, is the vector representation of the encoded target language, and d is the word vector dimension of the words in the source- and target-language sentences; the generated vectors are used as the input of the next cross-language semantic alignment layer.
4. The parallel sentence pair extraction method based on the pre-training language model and bidirectional interaction attention of claim 1, wherein in Step 2, performing spatial semantic alignment on the acquired semantic representations with the bidirectional interaction attention mechanism comprises:
capturing semantic interaction relation between cross-language texts by adopting an improved bidirectional interaction attention mechanism, mapping a source language and a target language to a public semantic space for carrying out spatial semantic alignment, and enabling a model to focus on semantic information of different layers by using a plurality of attention heads in parallel;
after the semantic representations of the source language and the target language are obtained, v_s is used below as a general notation for the semantic representation of a source-language sentence and v_t for that of a target-language sentence; the attention calculation in the source-to-target direction is shown in equations (3)-(5):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (3)
v′_s = MultiHead(v_s, v_t, v_t) = Concat(head_1, ..., head_h)W^O (4)
head_i = Attention(v_s W_i^1, v_t W_i^2, v_t W_i^3) (5)
where i indexes the attention heads; W_i^1, W_i^2, W_i^3 are the parameter matrices of the i-th head; h is the number of attention heads, set to 8 in the experiments, with each head of dimension 64; W^O is the parameter matrix that linearly projects the concatenated outputs of all heads; and v′_s is the source-language encoding output by the cross-language semantic alignment layer;
the attention calculation process from the target language to the source language is shown in equations (6-8):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (6)
v′_t = MultiHead(v_t, v_s, v_s) = Concat(head_1, ..., head_h)W^O (7)
head_i = Attention(v_t W_i^1, v_s W_i^2, v_s W_i^3) (8)
where i indexes the attention heads; W_i^1, W_i^2, W_i^3 are the parameter matrices of the i-th head; h is the number of attention heads, set to 8 in the experiments, with each head of dimension 64; W^O is the parameter matrix that linearly projects the concatenated outputs of all heads; v′_t is the target-language encoding output by the cross-language semantic alignment layer; and d_k is the model vector dimension.
5. The parallel sentence pair extraction method based on the pre-training language model and bidirectional interaction attention of claim 1, wherein in Step 2, judging the relationship of cross-language sentence pairs based on the semantic representations after multi-view feature fusion and extracting parallel sentence pairs in a noisy environment comprises:
comparing, in a cross-language semantic fusion layer, the similarity of the global representation and the aligned representation of the semantic vectors from multiple perspectives; designing three fusion strategies in which the original semantic information and the aligned semantic information are concatenated, subtracted element-wise, and multiplied element-wise; and projecting all vector matrices into the same space to obtain the final output of the semantic fusion layer, the source-language output V_s of the semantic fusion layer being computed as follows:
Vs1=G1([vs;v′s]) (9)
Vs2=G2([vs;vs-v′s]) (10)
Vs3=G3([vs;vs⊙v′s]) (11)
Vs=G([Vs1;Vs2;Vs3]) (12)
where v′_s is the source-language encoding output by the cross-language semantic alignment layer, v_s is the general notation for the semantic representation of a source-language sentence, G_1, G_2, G_3, and G are single-layer feed-forward neural networks with independent parameters, and ⊙ denotes element-wise multiplication; the element-wise subtraction measures the difference between the feature vectors and the element-wise multiplication highlights their similarity; the target-language output V_t of the semantic fusion layer is computed in the same way as the source-language output V_s, so the formulas are omitted;
compressing the output of the semantic fusion layer by max pooling to obtain vector representations O_s and O_t of the source language and the target language, which are used as the input of the semantic prediction layer to obtain the semantic relevance probability distribution of the cross-language sentence pair:
ŷ = softmax(H([O_s; O_t]))
where H is a multi-layer feed-forward neural network and ŷ is the probability score over all classes; according to the probability distribution ŷ, whether the input source-language and target-language sentence pair is parallel is distinguished; the cross-entropy loss function is:
L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]
where y_i is the true label and ŷ_i is the prediction; after the prediction for the cross-language sentence pair is obtained, the model is trained by minimizing the cross-entropy between the prediction and the true result as the loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111082587.1A CN113901831B (en) | 2021-09-15 | 2021-09-15 | Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111082587.1A CN113901831B (en) | 2021-09-15 | 2021-09-15 | Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113901831A true CN113901831A (en) | 2022-01-07 |
CN113901831B CN113901831B (en) | 2024-04-26 |
Family
ID=79028408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111082587.1A Active CN113901831B (en) | 2021-09-15 | 2021-09-15 | Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901831B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114996459A (en) * | 2022-03-18 | 2022-09-02 | 星宙数智科技(珠海)有限公司 | Parallel corpus classification method and device, computer equipment and storage medium |
CN115017884A (en) * | 2022-01-20 | 2022-09-06 | 昆明理工大学 | Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement |
CN116910627A (en) * | 2023-09-11 | 2023-10-20 | 四川大学 | Method, system and storage medium for improving efficiency of electric drive system of electric automobile |
WO2024164580A1 (en) * | 2023-02-09 | 2024-08-15 | 中南大学 | Method for calculating chinese-english semantic similarity of paragraph-level texts |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931903A (en) * | 2020-07-09 | 2020-11-13 | 北京邮电大学 | Network alignment method based on double-layer graph attention neural network |
CN112287695A (en) * | 2020-09-18 | 2021-01-29 | 昆明理工大学 | Cross-language bilingual pre-training and Bi-LSTM-based Chinese-character-cross parallel sentence pair extraction method |
CN112287688A (en) * | 2020-09-17 | 2021-01-29 | 昆明理工大学 | English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features |
CN112329474A (en) * | 2020-11-02 | 2021-02-05 | 山东师范大学 | Attention-fused aspect-level user comment text emotion analysis method and system |
CN112560502A (en) * | 2020-12-28 | 2021-03-26 | 桂林电子科技大学 | Semantic similarity matching method and device and storage medium |
US20210166446A1 (en) * | 2019-11-28 | 2021-06-03 | Shanghai United Imaging Intelligence Co., Ltd. | Systems and methods for image reconstruction |
- 2021-09-15: CN application CN202111082587.1A filed (granted as CN113901831B, status Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210166446A1 (en) * | 2019-11-28 | 2021-06-03 | Shanghai United Imaging Intelligence Co., Ltd. | Systems and methods for image reconstruction |
CN111931903A (en) * | 2020-07-09 | 2020-11-13 | 北京邮电大学 | Network alignment method based on double-layer graph attention neural network |
CN112287688A (en) * | 2020-09-17 | 2021-01-29 | 昆明理工大学 | English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features |
CN112287695A (en) * | 2020-09-18 | 2021-01-29 | 昆明理工大学 | Cross-language bilingual pre-training and Bi-LSTM-based Chinese-character-cross parallel sentence pair extraction method |
CN112329474A (en) * | 2020-11-02 | 2021-02-05 | 山东师范大学 | Attention-fused aspect-level user comment text emotion analysis method and system |
CN112560502A (en) * | 2020-12-28 | 2021-03-26 | 桂林电子科技大学 | Semantic similarity matching method and device and storage medium |
Non-Patent Citations (3)
Title |
---|
JAY HEO et al.: "Cost-effective interactive attention learning with neural attention processes", Proceedings of the 37th International Conference on Machine Learning, 31 December 2020 (2020-12-31), pages 4228-4238 *
张乐乐 et al.: "Parallel sentence pair extraction method based on pre-trained language model and interactive attention" (基于预训练语言模型及交互注意力的平行句对抽取方法), 通信技术 (Communications Technology), vol. 55, no. 4, 20 April 2022 (2022-04-20), pages 443-452 *
毛焱颖: "Long text sentiment classification method based on attention-based two-layer LSTM" (基于注意力双层LSTM的长文本情感分类方法), 重庆电子工程职业学院学报 (Journal of Chongqing College of Electronic Engineering), vol. 28, no. 02, 20 April 2019 (2019-04-20), pages 118-125 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115017884A (en) * | 2022-01-20 | 2022-09-06 | 昆明理工大学 | Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement |
CN115017884B (en) * | 2022-01-20 | 2024-04-26 | 昆明理工大学 | Text parallel sentence pair extraction method based on graphic multi-mode gating enhancement |
CN114996459A (en) * | 2022-03-18 | 2022-09-02 | 星宙数智科技(珠海)有限公司 | Parallel corpus classification method and device, computer equipment and storage medium |
WO2024164580A1 (en) * | 2023-02-09 | 2024-08-15 | 中南大学 | Method for calculating chinese-english semantic similarity of paragraph-level texts |
CN116910627A (en) * | 2023-09-11 | 2023-10-20 | 四川大学 | Method, system and storage medium for improving efficiency of electric drive system of electric automobile |
CN116910627B (en) * | 2023-09-11 | 2023-11-17 | 四川大学 | Method, system and storage medium for improving efficiency of electric drive system of electric automobile |
Also Published As
Publication number | Publication date |
---|---|
CN113901831B (en) | 2024-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115033670B (en) | Cross-modal image-text retrieval method with multi-granularity feature fusion | |
CN111144448B (en) | Video barrage emotion analysis method based on multi-scale attention convolution coding network | |
CN113901831B (en) | Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention | |
Palangi et al. | Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval | |
CN109992669B (en) | Keyword question-answering method based on language model and reinforcement learning | |
CN117371456B (en) | Multi-mode irony detection method and system based on feature fusion | |
CN114595306B (en) | Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling | |
Tian et al. | An attempt towards interpretable audio-visual video captioning | |
CN114238649A (en) | Common sense concept enhanced language model pre-training method | |
CN116796251A (en) | Poor website classification method, system and equipment based on image-text multi-mode | |
CN116029305A (en) | Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning | |
Lu et al. | Self‐supervised domain adaptation for cross‐domain fault diagnosis | |
CN114595700A (en) | Zero-pronoun and chapter information fused Hanyue neural machine translation method | |
CN112749566B (en) | Semantic matching method and device for English writing assistance | |
CN117312577A (en) | Traffic event knowledge graph construction method based on multi-layer semantic graph convolutional neural network | |
CN113901172B (en) | Case-related microblog evaluation object extraction method based on keyword structural coding | |
CN113157855B (en) | Text summarization method and system fusing semantic and context information | |
Zeng et al. | An Explainable Multi-view Semantic Fusion Model for Multimodal Fake News Detection | |
Khaing | Attention-based deep learning model for image captioning: a comparative study | |
Li et al. | Tst-gan: A legal document generation model based on text style transfer | |
Mi et al. | VVA: Video Values Analysis | |
Yu et al. | Semantic extraction for sentence representation via reinforcement learning | |
Zhou et al. | Multimodal emotion recognition based on multilevel acoustic and textual information | |
Wang | An Automatic Error Detection Method for Engineering English Translation Based on the Deep Learning Model | |
CN113127599B (en) | Question-answering position detection method and device of hierarchical alignment structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |