CN113901831A

CN113901831A - Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention

Info

Publication number: CN113901831A
Application number: CN202111082587.1A
Authority: CN
Inventors: 余正涛; 张乐乐; 郭军军
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2021-09-15
Filing date: 2021-09-15
Publication date: 2022-01-07
Anticipated expiration: 2041-09-15
Also published as: CN113901831B

Abstract

The invention relates to a pre-training language model and a parallel sentence pair extraction method for bidirectional interaction attention, belonging to the field of natural language processing. The invention comprises the following steps: constructing a comparable corpus data set; obtaining bilingual representations of a source language and a target language respectively by using a pre-training language model, and then realizing cross-language feature space semantic alignment based on a bidirectional interaction attention mechanism; and finally, judging the relation of cross-language sentence pairs based on the semantic representation after multi-view feature fusion, and extracting parallel sentence pairs according to deep semantic consistency. Experimental results show that the method provided by the invention can effectively identify the bilingual parallel sentences with consistent semantics under the background of data containing noise, and the extracted bilingual parallel sentences provide support for subsequent machine translation.

Description

Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention

Technical Field

The invention relates to a pre-training language model and a parallel sentence pair extraction method for bidirectional interaction attention, belonging to the field of natural language processing.

Background

The performance of neural machine translation depends on a large amount of high-quality parallel data support, a parallel corpus with rich resources is used for supporting academic research facing mainstream language pairs (De-English, Fa-English) and the like, the mainstream language machine translation performance is remarkably improved recently, and the effect of manual translation is approached. However, for a large number of non-mainstream language pairs, the performance of machine translation is severely limited because it does not have large-scale high-quality parallel sentence pair resources.

The parallel sentence pair extraction realizes the matching of two languages based on semantic similarity, the task aims to find out the parallel sentences from a comparable corpus as training data of natural language processing tasks such as neural machine translation and the like, and the core lies in realizing the alignment of bilingual semantic space in a cross-language space so as to judge the semantic consistency. The traditional semantic space alignment method is to use different neural network structures to learn vector representation of cross-language sentences, and judge similarity between sentences in a shared vector space. When the problem that network resource data with a large amount of noise exists is solved, good semantic representation is difficult to generate due to the quantity of training data and the coverage field, and the semantic alignment effect is influenced. The cross-language pre-training model can well realize the alignment of bilingual semantics on a public semantic space. The invention uses a pre-training language model as prior knowledge to carry out semantic alignment on the obtained semantic representation based on a bidirectional interaction attention mechanism. The method and the device realize the extraction of bilingual parallel sentences with consistent deep semantics from the comparable corpus to expand the bilingual parallel corpus, thereby relieving the problem of lack of training data of the resource-deficient languages.

Disclosure of Invention

The invention provides a parallel sentence pair extraction method based on a pre-training language model and bidirectional interactive attention, which is used for extracting bilingual parallel sentences with consistent deep semantics from a comparable corpus to expand the bilingual parallel corpus, so that the problem that the language pairs lack training data due to lack of resources is solved, and the prediction effect of the parallel sentence pairs is improved.

The technical scheme of the invention is as follows: a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention comprises the following specific steps:

step1, collecting and constructing the middle-crossing parallel data through a web crawler technology, obtaining non-parallel data by utilizing negative sampling, and manually marking the data set to obtain a Chinese-crossing comparable corpus data set, wherein the main sources of the Chinese-crossing parallel data comprise Wikipedia, bilingual news websites, movie subtitles and the like.

Step2, coding a source language and a target language respectively through a pre-training language model to obtain semantic representations; and then, carrying out space semantic alignment on the obtained semantic representations by using a bidirectional interaction attention mechanism, finally realizing the relation judgment of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and realizing the extraction of parallel sentence pairs in a noise environment based on bilingual representation consistency.

As a further scheme of the present invention, the Step1 specifically comprises the following steps:

step1.1, acquiring the more parallel data through a web crawler technology; data sources include wikipedia, bilingual news websites, movie subtitles, and the like;

step1.2, cleaning and aligning the crawled data, using the crawled data as positive samples for model training, constructing a corresponding negative sample for each positive sample through negative sampling in order to keep the balance of the number of samples, randomly sampling each pair of parallel sentences to generate negative samples, and using a comparable corpus consisting of the positive and negative samples to train the model;

step1.3, manually marking to obtain a Chinese-Yue comparable corpus data set; the final data consists of 2n triplets

n is the number of parallel sentences and non-parallel sentences;

representing the sentence in the source language and,

representing a sentence in the target language, y_iIs to represent

And

and the label of the translation relation is set to be 1 when the source language sentence and the target language sentence are parallel sentences, and is set to be 0 when the source language sentence and the target language sentence are non-parallel sentences.

As a further scheme of the present invention, in Step2, the encoding the source language and the target language respectively by the pre-training language model to obtain semantic representations includes the following steps:

respectively coding a source language and a target language through a cross-language text semantic coding layer, wherein the length of the cross-language text semantic coding layer is l_sIs expressed as S_i＝{x₁，x₂...x_lsIs e.g. M and has a length of l_tIs denoted as T_j＝{y₁，y₂...y_ltJ belongs to N, M and N represent the total number of sentences; using pre-trained multilingual BERT as a bilingual encoder, the source and target languages are represented by a semantic coding layer as:

wherein,

a vector representation representing the source language after encoding,

and d represents the word vector dimension of the words in the sentences of the source language and the target language, and the generated vector is used as the input of the next cross-language semantic alignment layer.

As a further aspect of the present invention, in Step2, performing spatial semantic alignment on the obtained semantic representations by using a bidirectional interaction attention mechanism includes:

the word order of two languages does not correspond exactly, subject to the linguistic characteristics of the different languages. Considering that the self-attention mechanism is not influenced by the positions of words to directly calculate the semantic correlation relationship between word pairs, the invention adopts an improved two-way interaction attention mechanism to capture the semantic interaction relationship between cross-language texts on the basis, maps a source language and a target language to a public semantic space for carrying out space semantic alignment, and uses a plurality of parallel attention heads to enable a model to focus on semantic information of different layers;

deriving semantic representations of source and target languages

Yang (Yang)

Thereafter, for simplicity of the formula, the following formula uses v_sAs a general representation of semantic representations of source language sentences, v_tA generic representation that is a semantic representation of the target language; the attention calculation process for the source language to target language direction is shown in equations (3) - (5):

v′_s＝MultiHead(v_s，v_t，v_t)＝Concat(head₁，...，head_h)W° (4)

Where head_i＝Attention(v_sW_i ¹，v_tW_i ²，u_tW_i ³) (5)

wherein i represents the ith attention head, W_i ¹，W_i ²，W_i ³Representing a parameter matrix corresponding to the ith head, wherein h in the formula represents the number of the attention heads, the number of the attention heads is set to be 8 in an experiment, the dimension of each attention head is 64, and Wo is the final splicing of all the attention headsThe results are a parameter matrix, v ', of the linear projection'_sOutputting a source language coding vector through a cross-language semantic alignment layer;

the attention calculation process from the target language to the source language is shown in equations (6-8):

v′_t＝MultiHead(v_t，v_s，v_s)＝Concat(head₁，...，head_h)W° (7)

Where head_i＝Attention(v_tW_i ¹，v_sW_i ²，v_sW_i ³) (8)

wherein i represents the ith attention head, W_i ¹，W_i ²，W_i ³Representing a parameter matrix corresponding to the ith head, wherein h in the formula represents the number of the attention heads and is set to be 8 in an experiment, the dimension of each attention head is 64, and Wo is a parameter matrix v 'for finally splicing all the attention head results to make a linear projection'_tFor the output of target language code vectors through a cross-language semantic alignment layer, d_kRepresenting the model vector dimensions.

As a further aspect of the present invention, in Step2, the method for extracting parallel sentence pairs in a noise environment based on bilingual token consistency includes:

comparing the similarity of global representation and alignment representation of semantic vectors from multiple perspectives by a cross-language semantic fusion layer, designing three different fusion strategies, splicing, bitwise subtracting and bitwise multiplying original semantic information and aligned semantic information, projecting all vector matrixes to the same space to obtain the output of the final semantic fusion layer, and outputting a result V by a source language through the semantic fusion layer_sThe calculation process is as follows:

V_s1＝G₁([v_s；v′_s]) (9)

V_s2＝G₂([v_s；v_s-v′_s]) (10)

V_s3＝G₃([v_s；v_s⊙v′_s]) (11)

V_s＝G([V_s1；V_s2；V_s3]) (12)

wherein, v'_sFor the output of source language code vectors through a cross-language semantic alignment layer, v_sGeneral representation of semantic representations of source language sentences, G₁、G₂、G₃And G respectively represent single-layer feedforward neural networks with independent parameters, which represent multiplication of corresponding elements, the difference between feature vectors is measured by subtraction between the two, the multiplication is used to highlight the similarity between the two, and the target language outputs a result V through semantic fusion_tThe calculation and the source language output a result V through a semantic fusion layer_sThe formula is omitted because the two are consistent;

the output of the semantic fusion layer is subjected to feature compression through maximum pooling operation to obtain vector representations O of the source language and the target language_s，O_tAs input to the semantic prediction layer, a semantic relevance probability distribution of the cross-language sentence pair is obtained:

where H represents a multi-layer feed-forward neural network,

probability scores representing all classes according to a probability distribution

Distinguishing whether the input source language sentence and target language sentence pair is parallel sentences or not; the cross entropy loss function is:

wherein y is_iRepresents a real label and is a label of a real label,

representing the prediction result, and training the model by taking the cross-meaning entropy of the minimized prediction result and the true buying result as a loss function after obtaining the prediction result of the cross-language sentence pair.

The invention has the beneficial effects that:

(1) aiming at semantic representation and deep semantic alignment problems containing noise data, a cross-language text semantic matching method based on pre-training language model bidirectional interactive attention is used, and extraction of parallel sentence pairs in a noise environment is achieved based on bilingual semantic representation consistency.

(2) The bilingual semantic representation of the cross-language sentence pair is obtained by means of the powerful semantic representation capability of the pre-training language model, and a better semantic representation of the context words is generated for the cross-language sentence.

(3) The bilingual semantic interaction relation is learned by using a bidirectional interaction attention mechanism, semantic alignment in a public semantic space is realized, the characteristic representation of cross-language sentence pairs is compared from multiple visual angles, the semantic consistency is further measured, and the prediction effect is improved.

Drawings

Fig. 1 is a schematic structural diagram of a pre-training language model and a bidirectional interactive attention parallel sentence pair extraction method according to the present invention.

Detailed Description

Example 1: as shown in fig. 1, based on a pre-training language model and a bidirectional interactive attention parallel sentence pair extraction method, the method specifically includes the following steps:

n is the number of parallel sentences and non-parallel sentences;

representing the sentence in the source language and,

representing a sentence in the target language, y_iIs to represent

And

and the label of the translation relation is set to be 1 when the source language sentence and the target language sentence are parallel sentences, and is set to be 0 when the source language sentence and the target language sentence are non-parallel sentences. The experimental corpus scale is shown in table 1:

TABLE 1 statistical information of the experimental data

step2.1, respectively coding a source language and a target language through a cross-language text semantic coding layer, wherein the length is l_sIs expressed as S_i＝{x₁，x₂...x_lsIs e.g. M and has a length of l_tIs denoted as T_j＝{y₁，y₂...y_ltJ belongs to N, M and N represent the total number of sentences; using pre-trained multilingual BERT as a bilingual encoder, the source and target languages are represented by a semantic coding layer as:

wherein,

a vector representation of the encoded source language is formulated,

step2.2, influenced by the language characteristics of different languages, the word sequences of the two languages do not completely correspond. Considering that the self-attention mechanism is not influenced by the positions of words to directly calculate the semantic correlation relationship between word pairs, the invention adopts an improved two-way interaction attention mechanism to capture the semantic interaction relationship between cross-language texts on the basis, maps a source language and a target language to a public semantic space for carrying out space semantic alignment, and uses a plurality of parallel attention heads to enable a model to focus on semantic information of different layers;

deriving semantic representations of source and target languages

And

v′_s＝MultiHead(v_s，v_t，v_t)＝Concat(head₁，...，head_h)W° (4)

Where head_i＝Attention(v_sW_i ¹，v_tW_i ²，v_t ³) (5)

wherein i represents the ith attention head, W_i ¹，W_i ²，W_i ³Representing a parameter matrix corresponding to the ith head, wherein h in the formula represents the number of the attention heads, the number of the attention heads is set to be 8 in an experiment, the dimension of each attention head is 64, and Wo is the final splicing of all attention headsAttention is paid to the parameter matrix v 'of the linear projection of the head result'_sOutputting a source language coding vector through a cross-language semantic alignment layer;

v′_t＝MultiHead(v_t，v_s，v_s)＝Concat(head₁，...，head_h)W° (7)

Where head_i＝Attention(v_tW_i ¹，v_sW_i ²，v_sW_i ³) (8)

wherein i represents the ith attention head, W_i ¹，W_i ²，W_i ³Representing a parameter matrix corresponding to the ith head, wherein h in the formula represents the number of the attention heads and is set to be 8 in an experiment, the dimension of each attention head is 64, W DEG is a parameter matrix which is used for splicing all the attention head results finally and is used for linear projection, v'_tFor the output of target language code vectors through a cross-language semantic alignment layer, d_kRepresenting the model vector dimensions.

step2.3, comparing the global representation and the alignment representation of the semantic vector from multiple perspectives by the cross-language semantic fusion layer, designing three different fusion strategies, splicing, bit-wise subtracting and bit-wise multiplying the original semantic information and the aligned semantic information, projecting all vector matrixes to the same space to obtain the final output of the semantic fusion layer, and outputting a result V by the source language through the semantic fusion layer_sThe calculation process is as follows:

V_s1＝G₁([v_s；v′_s]) (9)

V_s2＝G₂([v_s；v_s-v′_s]) (10)

V_s3＝G₃([v_s；v_s⊙v′_s]) (11)

V_s＝G([V_s1；V_s2；V_s3]) (12)

step2.4, performing feature compression on the output of the semantic fusion layer through maximum pooling operation to obtain vector representations O of the source language and the target language_s，O_tAs input to the semantic prediction layer, a semantic relevance probability distribution of the cross-language sentence pair is obtained:

where H represents a multi-layer feed-forward neural network,

wherein y is_iRepresents a real label and is a label of a real label,

representing the prediction result, and training the model by taking the cross entropy of the minimized prediction result and the real result as a loss function after obtaining the prediction result of the cross-language sentence pair.

To illustrate the effect of the present invention, 3 sets of comparative experiments were set up. The first group of experiments verifies the effectiveness of the pre-training language model, the second group of experiments verifies the effectiveness of the model in different data scales, and the third group of experiments verifies the effectiveness of each module of the model.

(1) Performance enhancement verification based on pre-training language model

And (3) replacing the pre-training language model with other neural network structures such as a cyclic neural network, a convolutional neural network, a transformer and the like in the reference model to learn the semantic representation of the cross-language sentence, and using the parameters for obtaining the optimal performance in the comparison model on the premise of consistent other conditions. The results of the experiment are shown in table 2.

TABLE 2 comparison of Performance of baseline models

Analysis of table 2 shows that the performance using the pre-trained model is significantly better than that of the other models. The main reason is that the data scale plays a crucial role in the performance of the neural network, and the pre-training language model uses massive data to represent the more contextualized vector representation for word representation learning, so as to generate a better semantic representation for the input text. From the experimental results of table 2, the bidirectional network structure is superior in performance to the model of the unidirectional network structure because the bidirectional network can better characterize the context information of the input text. The results in table 2 show that the value of the model F1 based on the transformer is reduced by 1.92% compared with the model based on the RNN, which is considered to be because the model parameters are too large due to the complex network structure, the scale of the training data is limited, the model is not sufficiently trained, and the model does not achieve the fitting.

(2) Comparative experiment of different data scales

In order to further prove the influence of the data scale on the model performance, 20 ten thousand of training data are added into the data set of the invention to retrain each model, and the experimental result of table 3 shows the performance of each model under a larger-scale data set.

TABLE 3 results of model experiments on larger data scale

Analysis of table 3 reveals that the performance of each method is improved to varying degrees with larger scale training data. The performance of models based on transformer and other structurally complex methods is gradually superior to that of models with simple structures such as a recurrent neural network. The main reason is that the model with a complex structure can better learn the context information of the text to generate better feature representation, and because the reason that the number of parameters is large is more sensitive to the scale of the training data, the ideal effect is difficult to achieve under the condition of small-scale data. Compared with the method, the performance of the method provided by the invention is improved at the least, and the performance equivalent to the performance can be achieved under the low data scale. The method based on the pre-training language model is proved to achieve ideal effects on smaller data volume.

(3) Model different module validity experiment

To analyze the specific contributions of the different modules in the model of the invention, a comparative ablation experiment was performed on the overall model. In the comparison process, the complete model is simplified or altered to get several different variant models for different components. The effectiveness of the different modules was analyzed according to the experimental results of table 4.

TABLE 4 ablation model test results

Analysis of Table 4 reveals that the method of the present invention is superior to other variant models in terms of accuracy, recall, and F1 values. Compared with the EAFP model of the invention, the accuracy of the model is reduced by 12.28%, and the F1 value is reduced by 8.98%. The analysis reason is that after the semantic alignment layer and the semantic fusion layer are removed, the deep semantic features are not well distinguished only by using the original semantic representation, and the great reduction of the accuracy rate shows that the number of samples classified as negative is increased when the original semantic features have larger difference. When the semantic alignment layer and the semantic fusion layer are added, the performance of the model is remarkably improved; the F1 value of the EAP model is reduced by 7.98%, because the semantic fusion layer compares the local representation and the alignment representation of the semantics from multiple perspectives, the similarity and the difference of semantic features can be better distinguished, and the accuracy of the result is ensured; compared with the experimental results of EAFP-A and EAFP-B, the F1 value is reduced by 1.79% and 2.19% compared with the model of the invention. It is shown that interaction information between cross-language sentences can be better learned based on a two-way interaction attention mechanism. The overall experimental results show that all modules of the model play an important role in improving the performance, and if any one module is lacked, the performance of the model is reduced.

The experimental data prove that the cross-language text semantic matching method based on the pre-training language model and the bidirectional interaction attention can realize the extraction of the parallel sentence pairs with high-quality semantic consistency on the premise that a large amount of noise data exist in network resources. And generating a language-contextualized semantic representation for the cross-language sentence pairs by using a pre-training language model under the condition of limited resources, performing semantic alignment on the cross-language sentence pairs in a public semantic space by using bidirectional interaction attention, and finally obtaining the relationship judgment of the cross-language sentence pairs. The method and the device realize the extraction of bilingual parallel sentences with consistent deep semantics from the comparable corpus to expand the bilingual parallel corpus, thereby relieving the problem of lack of training data of the resource-deficient languages. The experimental result shows that the effect of the invention is superior to that of other models. Aiming at the parallel sentence pair extraction task, the method based on the pre-training language model and the bidirectional interaction attention is effective in improving the extraction performance of the parallel sentence pair.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. The parallel sentence pair extraction method based on the pre-training language model and the bidirectional interaction attention is characterized by comprising the following steps of:

the method comprises the following specific steps:

step1, collecting and constructing the more parallel data through a web crawler technology, obtaining non-parallel data by utilizing negative sampling, and manually labeling a data set to obtain a more Chinese-to-more comparable corpus data set;

2. The parallel sentence pair extraction method based on the pre-trained language model and the two-way interactive attention of claim 1, wherein: the specific steps of Step1 are as follows:

step1.1, acquiring the more parallel data through a web crawler technology;

step1.3, manually marking to obtain a Chinese-Yue comparable corpus data set; the final data is composed of 2nTernary composition

n is the number of parallel sentences and non-parallel sentences;

representing the sentence in the source language and,

representing a sentence in the target language, y_iIs to represent

And

3. The parallel sentence pair extraction method based on the pre-trained language model and the two-way interactive attention of claim 1, wherein: in Step2, the encoding of the source language and the target language respectively through the pre-training language model to obtain semantic representations comprises the following steps:

respectively coding a source language and a target language through a cross-language text semantic coding layer, wherein the length of the cross-language text semantic coding layer is l_sIs expressed as S_i＝{x₁，x₂...x_lsIs e.g. M and has a length of l_tIs denoted as T_j＝{y₁，y₂...y_ttJ belongs to N, M and N represent the total number of sentences; using pre-trained multilingual BERT as a bilingual encoder, the source and target languages are represented by a semantic coding layer as:

wherein,

i e M represents the vector representation of the source language after encoding,

j e N represents the vector representation of the coded target language, d represents the word vector dimension of words in sentences of the source language and the target language, and the generated vector is used as the input of the next cross-language semantic alignment layer.

4. The parallel sentence pair extraction method based on the pre-trained language model and the two-way interactive attention of claim 1, wherein: in Step2, performing spatial semantic alignment on the acquired semantic representations by using a bidirectional interaction attention mechanism includes:

capturing semantic interaction relation between cross-language texts by adopting an improved bidirectional interaction attention mechanism, mapping a source language and a target language to a public semantic space for carrying out spatial semantic alignment, and enabling a model to focus on semantic information of different layers by using a plurality of attention heads in parallel;

deriving semantic representations of source and target languages

And

thereafter, for simplicity of the formula, the following formula uses v_sAs a general representation of semantic representations of source language sentences, v_tA generic representation that is a semantic representation of the target language; the attention calculation process from the source language to the target language is shown in formulas (3) - (5)：

v′_s＝MultiHead(v_s，v_t，v_t)＝Concat(head₁，...，head_h)W^o (4)

Where i represents the ith attention head,

representing a parameter matrix corresponding to the ith head, wherein h in the formula represents the number of the attention heads and is set to be 8 in an experiment, the dimension of each attention head is 64, and W is^oParameter matrix v 'for linear projection for final stitching of all attention head results'_sOutputting a source language coding vector through a cross-language semantic alignment layer;

v′_t＝MultiHead(v_t，v_s，v_s)＝Concat(head₁，...，head_h)W^o (7)

where i represents the ith attention head,

represents the firsti parameter matrixes corresponding to the heads, h in the formula represents the number of the attention heads, the number of the attention heads is set to be 8 in an experiment, the dimension of each attention head is 64, and W is^oParameter matrix v 'for linear projection for final stitching of all attention head results'_tFor the output of target language code vectors through a cross-language semantic alignment layer, d_kRepresenting the model vector dimensions.

5. The parallel sentence pair extraction method based on the pre-trained language model and the two-way interactive attention of claim 1, wherein: in Step2, the relationship judgment of the cross-language sentence pairs is realized based on the semantic representation after the multi-view feature fusion, and the extraction of the parallel sentence pairs in the noise environment includes:

V_s1＝G₁([v_s；v′_s]) (9)

V_s2＝G₂([v_s；v_s-v′_s]) (10)

V_s3＝G₃([v_s；v_s⊙v′_s]) (11)

V_s＝G([V_s1；V_s2；V_s3]) (12)

wherein, v'_sFor the output of source language code vectors through a cross-language semantic alignment layer, v_sGeneral representation of semantic representations of source language sentences, G₁、G₂、G₃And G each represents a single layer feedforward neural network with independent parameters, corresponding to multiplication of corresponding elements, the difference between feature vectors being subtracted from each otherCalculating quantity, multiplying operation is used for highlighting similarity of the two, and the target language outputs a result V through semantic fusion_tThe calculation and the source language output a result V through a semantic fusion layer_sThe formula is omitted because the two are consistent;

where H represents a multi-layer feed-forward neural network,

wherein y is_iRepresents a real label and is a label of a real label,