CN113901831A - Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention - Google Patents

Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention

Info

Publication number
CN113901831A
Authority
CN
China
Prior art keywords
language
semantic
attention
parallel
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111082587.1A
Other languages
Chinese (zh)
Other versions
CN113901831B (en)
Inventor
余正涛
张乐乐
郭军军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111082587.1A priority Critical patent/CN113901831B/en
Publication of CN113901831A publication Critical patent/CN113901831A/en
Application granted granted Critical
Publication of CN113901831B publication Critical patent/CN113901831B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, and belongs to the field of natural language processing. The method comprises the following steps: constructing a comparable corpus data set; obtaining bilingual representations of the source language and the target language with a pre-trained language model, and then aligning the cross-language feature spaces semantically with a bidirectional interaction attention mechanism; and finally judging the relation of cross-language sentence pairs based on the semantic representations obtained after multi-view feature fusion, and extracting parallel sentence pairs according to deep semantic consistency. Experimental results show that the proposed method can effectively identify semantically consistent bilingual parallel sentences in noisy data, and the extracted bilingual parallel sentences provide support for subsequent machine translation.

Description

Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention
Technical Field
The invention relates to a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, and belongs to the field of natural language processing.
Background
The performance of neural machine translation depends on large amounts of high-quality parallel data. For mainstream language pairs (e.g., German-English, French-English), resource-rich parallel corpora support academic research, and machine translation performance for these languages has improved markedly in recent years, approaching the quality of human translation. However, for many non-mainstream language pairs, machine translation performance is severely limited because no large-scale, high-quality parallel sentence pair resources are available.
Parallel sentence pair extraction matches sentences of two languages based on semantic similarity. The task aims to find parallel sentences in a comparable corpus to serve as training data for natural language processing tasks such as neural machine translation; its core is to align the bilingual semantic spaces in a cross-language space so that semantic consistency can be judged. Traditional semantic space alignment methods use different neural network structures to learn vector representations of cross-language sentences and judge the similarity between sentences in a shared vector space. When dealing with web data that contains a large amount of noise, the limited quantity and domain coverage of the training data make it difficult to produce good semantic representations, which harms the semantic alignment. A cross-language pre-trained model, in contrast, can align bilingual semantics well in a common semantic space. The invention uses a pre-trained language model as prior knowledge and performs semantic alignment of the obtained semantic representations with a bidirectional interaction attention mechanism. Bilingual parallel sentences with consistent deep semantics are extracted from the comparable corpus to expand the bilingual parallel corpus, thereby alleviating the lack of training data for low-resource languages.
Disclosure of Invention
The invention provides a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, which extracts bilingual parallel sentences with consistent deep semantics from a comparable corpus to expand the bilingual parallel corpus, thereby alleviating the lack of training data for low-resource language pairs and improving the prediction of parallel sentence pairs.
The technical scheme of the invention is as follows: a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, comprising the following specific steps:
Step1, collecting Chinese-Vietnamese parallel data through web crawling, obtaining non-parallel data by negative sampling, and manually labeling the data set to obtain a Chinese-Vietnamese comparable corpus data set; the main sources of the Chinese-Vietnamese parallel data include Wikipedia, bilingual news websites, movie subtitles and the like.
Step2, encoding the source language and the target language separately with a pre-trained language model to obtain semantic representations; then performing spatial semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism; and finally judging the relation of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs in a noisy environment based on the consistency of the bilingual representations.
As a further scheme of the present invention, Step1 comprises the following specific steps:
Step1.1, acquiring Chinese-Vietnamese parallel data through web crawling; data sources include Wikipedia, bilingual news websites, movie subtitles, and the like;
Step1.2, cleaning and aligning the crawled data and using it as positive samples for model training; to keep the number of samples balanced, a corresponding negative sample is constructed for each positive sample by negative sampling, i.e., each pair of parallel sentences is randomly re-paired to generate a negative sample, and the comparable corpus consisting of the positive and negative samples is used to train the model;
Step1.3, manually labeling to obtain the Chinese-Vietnamese comparable corpus data set; the final data set consists of 2n triplets {(S_i, T_i, y_i)}, i = 1, ..., 2n, where n is the number of parallel sentence pairs and also the number of non-parallel sentence pairs, S_i denotes a source-language sentence, T_i denotes a target-language sentence, and y_i is the label indicating whether S_i and T_i are translations of each other; the label is set to 1 when the source-language sentence and the target-language sentence are parallel, and to 0 when they are non-parallel.
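By way of illustration, a minimal sketch of the corpus construction in Step1.2-Step1.3 is given below in Python. It assumes the crawled and cleaned parallel pairs are already available as two aligned lists; the function name, the fixed random seed, and the final shuffle are illustrative assumptions, while the 1:1 pairing of positive and negative samples follows the description above.

```python
import random

def build_comparable_corpus(src_sents, tgt_sents, seed=42):
    """Builds 2n labeled triplets (S_i, T_i, y_i) from n crawled parallel pairs."""
    assert len(src_sents) == len(tgt_sents) and len(src_sents) > 1
    rng = random.Random(seed)
    triples = []
    for i, (s, t) in enumerate(zip(src_sents, tgt_sents)):
        triples.append((s, t, 1))                      # positive sample: parallel pair
        j = rng.randrange(len(tgt_sents) - 1)          # pick a different target sentence
        if j >= i:
            j += 1
        triples.append((s, tgt_sents[j], 0))           # negative sample: non-parallel pair
    rng.shuffle(triples)
    return triples
```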
As a further scheme of the present invention, in Step2, encoding the source language and the target language separately with the pre-trained language model to obtain semantic representations comprises the following steps:
The source language and the target language are encoded separately by a cross-language text semantic encoding layer. A source-language sentence of length l_s is denoted S_i = {x_1, x_2, ..., x_{l_s}}, i ∈ M, and a target-language sentence of length l_t is denoted T_j = {y_1, y_2, ..., y_{l_t}}, j ∈ N, where M and N denote the total numbers of sentences. Using pre-trained multilingual BERT as the bilingual encoder, the source and target languages are represented by the semantic encoding layer as:

v_s^i = mBERT(S_i), i ∈ M   (1)

v_t^j = mBERT(T_j), j ∈ N   (2)

where v_s^i ∈ R^{l_s×d} denotes the encoded vector representation of the source language, v_t^j ∈ R^{l_t×d} denotes the encoded vector representation of the target language, d denotes the word-vector dimension of the words in the source- and target-language sentences, and the generated vectors serve as the input to the subsequent cross-language semantic alignment layer.
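One possible realization of the semantic encoding layer is sketched below in Python using the Hugging Face transformers implementation of multilingual BERT; the patent does not prescribe a particular library, so the model name "bert-base-multilingual-cased", the helper function, and the example sentences are assumptions for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Multilingual BERT as the shared bilingual encoder (cf. equations (1)-(2)).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode(sentences):
    """Returns token-level representations of shape (batch, sequence_length, d)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state  # v_s or v_t: one d-dimensional vector per token

v_s = encode(["他们正在讨论机器翻译。"])              # source-language sentence
v_t = encode(["Họ đang thảo luận về dịch máy."])      # target-language sentence
```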
As a further aspect of the present invention, in Step2, performing spatial semantic alignment of the obtained semantic representations with a bidirectional interaction attention mechanism comprises:
Owing to the linguistic characteristics of different languages, the word order of the two languages does not correspond exactly. Since the self-attention mechanism computes the semantic correlation between word pairs directly, unaffected by word position, the invention builds on it with an improved bidirectional interaction attention mechanism to capture the semantic interactions between cross-language texts, maps the source language and the target language to a common semantic space for spatial semantic alignment, and uses several parallel attention heads so that the model attends to semantic information at different levels;
After the semantic representations v_s^i and v_t^j of the source and target languages are obtained, for simplicity the following formulas use v_s as the general notation for the semantic representation of a source-language sentence and v_t for that of a target-language sentence. The attention calculation in the source-to-target direction is shown in equations (3)-(5):

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (3)

v'_s = MultiHead(v_s, v_t, v_t) = Concat(head_1, ..., head_h) W^o   (4)

head_i = Attention(v_s W_i^1, v_t W_i^2, v_t W_i^3)   (5)

where i denotes the i-th attention head and W_i^1, W_i^2, W_i^3 denote the parameter matrices of the i-th head; h denotes the number of attention heads, set to 8 in the experiments with each head having dimension 64; W^o is the parameter matrix of the linear projection applied to the concatenation of all attention-head outputs; and v'_s is the source-language encoding vector output by the cross-language semantic alignment layer;

the attention calculation in the target-to-source direction is shown in equations (6)-(8):

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (6)

v'_t = MultiHead(v_t, v_s, v_s) = Concat(head_1, ..., head_h) W^o   (7)

head_i = Attention(v_t W_i^1, v_s W_i^2, v_s W_i^3)   (8)

where i denotes the i-th attention head and W_i^1, W_i^2, W_i^3 denote the parameter matrices of the i-th head; h denotes the number of attention heads, set to 8 in the experiments with each head having dimension 64; W^o is the parameter matrix of the linear projection applied to the concatenation of all attention-head outputs; v'_t is the target-language encoding vector output by the cross-language semantic alignment layer; and d_k denotes the model vector dimension.
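The bidirectional interaction attention of equations (3)-(8) can be sketched with the standard multi-head attention module of PyTorch, as below; the class name is an assumption, and the model dimension of 512 is chosen only so that 8 heads of dimension 64 match the head configuration stated above (it presumes the encoder output has first been projected to that size).

```python
import torch.nn as nn

class BidirectionalInteractionAttention(nn.Module):
    """Cross-attention in both directions: v'_s (source-to-target) and v'_t (target-to-source)."""
    def __init__(self, d_model=512, num_heads=8):   # 8 heads x 64 dims, as in the experiments
        super().__init__()
        self.s2t = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.t2s = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, v_s, v_t):
        v_s_aligned, _ = self.s2t(v_s, v_t, v_t)   # query = v_s, key = value = v_t  (eqs. (4)-(5))
        v_t_aligned, _ = self.t2s(v_t, v_s, v_s)   # query = v_t, key = value = v_s  (eqs. (7)-(8))
        return v_s_aligned, v_t_aligned
```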
As a further aspect of the present invention, in Step2, extracting parallel sentence pairs in a noisy environment based on the consistency of the bilingual representations comprises:
The cross-language semantic fusion layer compares the global representation and the aligned representation of the semantic vectors from multiple perspectives. Three fusion strategies are designed: the original semantic information and the aligned semantic information are concatenated, subtracted element-wise, and multiplied element-wise, and all resulting vector matrices are projected into the same space to obtain the output of the semantic fusion layer. The source-language output V_s of the semantic fusion layer is computed as follows:

V_s1 = G_1([v_s; v'_s])   (9)

V_s2 = G_2([v_s; v_s - v'_s])   (10)

V_s3 = G_3([v_s; v_s ⊙ v'_s])   (11)

V_s = G([V_s1; V_s2; V_s3])   (12)

where v'_s is the source-language encoding vector output by the cross-language semantic alignment layer, v_s is the general notation for the semantic representation of a source-language sentence, G_1, G_2, G_3 and G each denote a single-layer feed-forward neural network with independent parameters, and ⊙ denotes element-wise multiplication; the subtraction measures the difference between the feature vectors, while the multiplication highlights their similarity. The target-language output V_t of the semantic fusion layer is computed in the same way as the source-language output V_s, so its formulas are omitted;
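A sketch of the three fusion strategies of equations (9)-(12) follows in Python; the class name and the ReLU activation inside the single-layer feed-forward networks G_1, G_2, G_3 and G are assumptions, while the concatenation, subtraction, and element-wise multiplication views follow the description above.

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Multi-view fusion of the original representation v and the aligned representation v'."""
    def __init__(self, d_model=512):
        super().__init__()
        self.g1 = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())  # [v; v']
        self.g2 = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())  # [v; v - v']
        self.g3 = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())  # [v; v ⊙ v']
        self.g  = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.ReLU())  # final projection G

    def forward(self, v, v_aligned):
        f1 = self.g1(torch.cat([v, v_aligned], dim=-1))
        f2 = self.g2(torch.cat([v, v - v_aligned], dim=-1))       # difference view
        f3 = self.g3(torch.cat([v, v * v_aligned], dim=-1))       # element-wise product view
        return self.g(torch.cat([f1, f2, f3], dim=-1))            # V_s or V_t
```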
The output of the semantic fusion layer is compressed by max-pooling to obtain the vector representations O_s and O_t of the source and target languages, which serve as input to the semantic prediction layer, yielding the semantic relevance probability distribution of the cross-language sentence pair:

ŷ = softmax(H([O_s; O_t]))   (13)

where H denotes a multi-layer feed-forward neural network and ŷ denotes the probability scores of all classes; according to the probability distribution ŷ, the model determines whether the input source-language and target-language sentence pair is parallel; the cross-entropy loss function is:

L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]   (14)

where y_i denotes the true label and ŷ_i denotes the predicted result; after the prediction for a cross-language sentence pair is obtained, the model is trained by minimizing the cross-entropy between the prediction and the true label as the loss function.
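The prediction layer and the loss of equations (13)-(14) can be sketched as follows in Python; the hidden size of the multi-layer feed-forward network H and the class name are assumptions, while the max-pooling, the concatenation of O_s and O_t, and the cross-entropy objective follow the description above.

```python
import torch
import torch.nn as nn

class SemanticPrediction(nn.Module):
    """Max-pools the fused representations and classifies the pair as parallel / non-parallel."""
    def __init__(self, d_model=512, hidden=256, num_classes=2):
        super().__init__()
        self.h = nn.Sequential(                      # the multi-layer feed-forward network H
            nn.Linear(2 * d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, fused_s, fused_t, labels=None):
        o_s = fused_s.max(dim=1).values              # O_s: max-pooling over source tokens
        o_t = fused_t.max(dim=1).values              # O_t: max-pooling over target tokens
        logits = self.h(torch.cat([o_s, o_t], dim=-1))
        probs = logits.softmax(dim=-1)               # equation (13)
        loss = self.loss_fn(logits, labels) if labels is not None else None  # equation (14)
        return probs, loss
```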
The invention has the following beneficial effects:
(1) For the problems of semantic representation and deep semantic alignment on noisy data, a cross-language text semantic matching method based on a pre-training language model and bidirectional interaction attention is used, and parallel sentence pairs are extracted in a noisy environment based on the consistency of the bilingual semantic representations.
(2) The bilingual semantic representations of cross-language sentence pairs are obtained with the powerful representation capability of the pre-trained language model, producing better contextualized word representations for the cross-language sentences.
(3) The bilingual semantic interactions are learned with a bidirectional interaction attention mechanism to achieve semantic alignment in a common semantic space; the feature representations of the cross-language sentence pairs are compared from multiple perspectives to further measure semantic consistency and improve prediction.
Drawings
Fig. 1 is a schematic structural diagram of the parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention according to the present invention.
Detailed Description
Example 1: as shown in fig. 1, a parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention specifically comprises the following steps:
Step1, collecting Chinese-Vietnamese parallel data through web crawling, obtaining non-parallel data by negative sampling, and manually labeling the data set to obtain a Chinese-Vietnamese comparable corpus data set; the main sources of the Chinese-Vietnamese parallel data include Wikipedia, bilingual news websites, movie subtitles and the like.
As a further scheme of the present invention, Step1 comprises the following specific steps:
Step1.1, acquiring Chinese-Vietnamese parallel data through web crawling; data sources include Wikipedia, bilingual news websites, movie subtitles, and the like;
Step1.2, cleaning and aligning the crawled data and using it as positive samples for model training; to keep the number of samples balanced, a corresponding negative sample is constructed for each positive sample by negative sampling, i.e., each pair of parallel sentences is randomly re-paired to generate a negative sample, and the comparable corpus consisting of the positive and negative samples is used to train the model;
Step1.3, manually labeling to obtain the Chinese-Vietnamese comparable corpus data set; the final data set consists of 2n triplets {(S_i, T_i, y_i)}, i = 1, ..., 2n, where n is the number of parallel sentence pairs and also the number of non-parallel sentence pairs, S_i denotes a source-language sentence, T_i denotes a target-language sentence, and y_i is the label indicating whether S_i and T_i are translations of each other; the label is set to 1 when the source-language sentence and the target-language sentence are parallel, and to 0 when they are non-parallel. The scale of the experimental corpus is shown in Table 1:
TABLE 1 statistical information of the experimental data
Step2, encoding the source language and the target language separately with a pre-trained language model to obtain semantic representations; then performing spatial semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism; and finally judging the relation of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs in a noisy environment based on the consistency of the bilingual representations.
As a further scheme of the present invention, in Step2, encoding the source language and the target language separately with the pre-trained language model to obtain semantic representations comprises the following steps:
Step2.1, the source language and the target language are encoded separately by a cross-language text semantic encoding layer. A source-language sentence of length l_s is denoted S_i = {x_1, x_2, ..., x_{l_s}}, i ∈ M, and a target-language sentence of length l_t is denoted T_j = {y_1, y_2, ..., y_{l_t}}, j ∈ N, where M and N denote the total numbers of sentences. Using pre-trained multilingual BERT as the bilingual encoder, the source and target languages are represented by the semantic encoding layer as:

v_s^i = mBERT(S_i), i ∈ M   (1)

v_t^j = mBERT(T_j), j ∈ N   (2)

where v_s^i ∈ R^{l_s×d} denotes the encoded vector representation of the source language, v_t^j ∈ R^{l_t×d} denotes the encoded vector representation of the target language, d denotes the word-vector dimension of the words in the source- and target-language sentences, and the generated vectors serve as the input to the subsequent cross-language semantic alignment layer.
As a further aspect of the present invention, in Step2, performing spatial semantic alignment of the obtained semantic representations with a bidirectional interaction attention mechanism comprises:
Step2.2, owing to the linguistic characteristics of different languages, the word order of the two languages does not correspond exactly. Since the self-attention mechanism computes the semantic correlation between word pairs directly, unaffected by word position, the invention builds on it with an improved bidirectional interaction attention mechanism to capture the semantic interactions between cross-language texts, maps the source language and the target language to a common semantic space for spatial semantic alignment, and uses several parallel attention heads so that the model attends to semantic information at different levels;
After the semantic representations v_s^i and v_t^j of the source and target languages are obtained, for simplicity the following formulas use v_s as the general notation for the semantic representation of a source-language sentence and v_t for that of a target-language sentence. The attention calculation in the source-to-target direction is shown in equations (3)-(5):

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (3)

v'_s = MultiHead(v_s, v_t, v_t) = Concat(head_1, ..., head_h) W^o   (4)

head_i = Attention(v_s W_i^1, v_t W_i^2, v_t W_i^3)   (5)

where i denotes the i-th attention head and W_i^1, W_i^2, W_i^3 denote the parameter matrices of the i-th head; h denotes the number of attention heads, set to 8 in the experiments with each head having dimension 64; W^o is the parameter matrix of the linear projection applied to the concatenation of all attention-head outputs; and v'_s is the source-language encoding vector output by the cross-language semantic alignment layer;

the attention calculation in the target-to-source direction is shown in equations (6)-(8):

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (6)

v'_t = MultiHead(v_t, v_s, v_s) = Concat(head_1, ..., head_h) W^o   (7)

head_i = Attention(v_t W_i^1, v_s W_i^2, v_s W_i^3)   (8)

where i denotes the i-th attention head and W_i^1, W_i^2, W_i^3 denote the parameter matrices of the i-th head; h denotes the number of attention heads, set to 8 in the experiments with each head having dimension 64; W^o is the parameter matrix of the linear projection applied to the concatenation of all attention-head outputs; v'_t is the target-language encoding vector output by the cross-language semantic alignment layer; and d_k denotes the model vector dimension.
As a further aspect of the present invention, in Step2, extracting parallel sentence pairs in a noisy environment based on the consistency of the bilingual representations comprises:
Step2.3, the cross-language semantic fusion layer compares the global representation and the aligned representation of the semantic vectors from multiple perspectives. Three fusion strategies are designed: the original semantic information and the aligned semantic information are concatenated, subtracted element-wise, and multiplied element-wise, and all resulting vector matrices are projected into the same space to obtain the output of the semantic fusion layer. The source-language output V_s of the semantic fusion layer is computed as follows:

V_s1 = G_1([v_s; v'_s])   (9)

V_s2 = G_2([v_s; v_s - v'_s])   (10)

V_s3 = G_3([v_s; v_s ⊙ v'_s])   (11)

V_s = G([V_s1; V_s2; V_s3])   (12)

where v'_s is the source-language encoding vector output by the cross-language semantic alignment layer, v_s is the general notation for the semantic representation of a source-language sentence, G_1, G_2, G_3 and G each denote a single-layer feed-forward neural network with independent parameters, and ⊙ denotes element-wise multiplication; the subtraction measures the difference between the feature vectors, while the multiplication highlights their similarity. The target-language output V_t of the semantic fusion layer is computed in the same way as the source-language output V_s, so its formulas are omitted;
Step2.4, the output of the semantic fusion layer is compressed by max-pooling to obtain the vector representations O_s and O_t of the source and target languages, which serve as input to the semantic prediction layer, yielding the semantic relevance probability distribution of the cross-language sentence pair:

ŷ = softmax(H([O_s; O_t]))   (13)

where H denotes a multi-layer feed-forward neural network and ŷ denotes the probability scores of all classes; according to the probability distribution ŷ, the model determines whether the input source-language and target-language sentence pair is parallel; the cross-entropy loss function is:

L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]   (14)

where y_i denotes the true label and ŷ_i denotes the predicted result; after the prediction for a cross-language sentence pair is obtained, the model is trained by minimizing the cross-entropy between the prediction and the true label as the loss function.
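To show how Step2.1-Step2.4 fit together, the following end-to-end sketch in Python wires the illustrative classes sketched earlier in the disclosure into one model; the projection from the mBERT dimension (768) to the assumed model dimension (512) and the sharing of a single fusion module for both languages are assumptions for illustration, not statements of the invention.

```python
import torch.nn as nn

class ParallelSentencePairModel(nn.Module):
    """Alignment -> fusion -> prediction, on top of the mBERT token representations."""
    def __init__(self, d_bert=768, d_model=512):
        super().__init__()
        self.project = nn.Linear(d_bert, d_model)            # map encoder output to model size
        self.align = BidirectionalInteractionAttention(d_model)
        self.fuse = SemanticFusion(d_model)                   # reused for source and target
        self.predict = SemanticPrediction(d_model)

    def forward(self, v_s, v_t, labels=None):
        v_s, v_t = self.project(v_s), self.project(v_t)
        v_s_aligned, v_t_aligned = self.align(v_s, v_t)       # cross-language semantic alignment
        fused_s = self.fuse(v_s, v_s_aligned)                 # multi-view feature fusion
        fused_t = self.fuse(v_t, v_t_aligned)
        return self.predict(fused_s, fused_t, labels)         # probabilities and training loss
```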
To illustrate the effect of the present invention, 3 sets of comparative experiments were set up. The first group of experiments verifies the effectiveness of the pre-training language model, the second group of experiments verifies the effectiveness of the model in different data scales, and the third group of experiments verifies the effectiveness of each module of the model.
(1) Performance enhancement verification based on pre-training language model
In the baseline models, the pre-trained language model is replaced with other neural network structures, such as a recurrent neural network, a convolutional neural network, and a Transformer, to learn the semantic representations of the cross-language sentences; with all other conditions kept the same, the parameters that gave the best performance are used for each comparison model. The experimental results are shown in Table 2.
TABLE 2 comparison of Performance of baseline models
Analysis of Table 2 shows that the performance with the pre-trained model is significantly better than that of the other models. The main reason is that the data scale plays a crucial role in neural network performance: the pre-trained language model learns word representations from massive data and yields more contextualized vector representations, so it produces better semantic representations of the input text. The results in Table 2 also show that bidirectional network structures outperform unidirectional ones, because a bidirectional network characterizes the context of the input text better. Table 2 further shows that the F1 value of the Transformer-based model is 1.92% lower than that of the RNN-based model; this is attributed to its complex network structure and large number of parameters, which, given the limited scale of the training data, leave the model insufficiently trained and under-fitted.
(2) Comparative experiment of different data scales
To further demonstrate the influence of the data scale on model performance, 200,000 additional training instances are added to the data set of the invention and each model is retrained; the experimental results in Table 3 show the performance of each model on the larger data set.
TABLE 3 results of model experiments on larger data scale
Analysis of Table 3 shows that, with larger-scale training data, the performance of every method improves to varying degrees. The performance of the Transformer-based and other structurally complex methods gradually surpasses that of simpler structures such as the recurrent neural network. The main reason is that a structurally complex model can learn the context of the text better and generate better feature representations, but its large number of parameters makes it more sensitive to the scale of the training data, so it is difficult to reach an ideal effect with small-scale data. The performance of the proposed method improves the least, because it already achieves comparable performance at the lower data scale; this demonstrates that the method based on the pre-trained language model achieves an ideal effect with a smaller amount of data.
(3) Effectiveness experiment for the different modules of the model
To analyze the specific contributions of the different modules in the model of the invention, ablation experiments were performed on the overall model. In the comparison, the complete model is simplified or altered to obtain several variant models corresponding to different components. The effectiveness of the different modules is analyzed according to the experimental results in Table 4.
TABLE 4 ablation model test results
Analysis of Table 4 shows that the method of the invention is superior to the other variant models in accuracy, recall, and F1 value. Compared with the model of the invention, the accuracy of the EAFP variant is 12.28% lower and its F1 value 8.98% lower. The reason is that after the semantic alignment layer and the semantic fusion layer are removed, the original semantic representations alone cannot distinguish deep semantic features well; the large drop in accuracy indicates that more samples are classified as negative when the original semantic features differ greatly. When the semantic alignment layer and the semantic fusion layer are added, the performance of the model improves markedly. The F1 value of the EAP variant is 7.98% lower, because the semantic fusion layer compares the local representation and the aligned representation of the semantics from multiple perspectives and can therefore better distinguish the similarities and differences of the semantic features, ensuring the accuracy of the result. Compared with the model of the invention, the F1 values of EAFP-A and EAFP-B are 1.79% and 2.19% lower, respectively, which shows that the interaction information between cross-language sentences is learned better with the bidirectional interaction attention mechanism. The overall results show that every module of the model plays an important role in improving performance, and removing any one of them degrades the model.
The above experimental data demonstrate that the cross-language text semantic matching method based on a pre-training language model and bidirectional interaction attention can extract high-quality, semantically consistent parallel sentence pairs even when the web resources contain a large amount of noisy data. With limited resources, the pre-trained language model generates contextualized semantic representations for the cross-language sentence pairs, bidirectional interaction attention aligns them semantically in a common semantic space, and the relation of the cross-language sentence pairs is finally judged. Bilingual parallel sentences with consistent deep semantics are thus extracted from the comparable corpus to expand the bilingual parallel corpus, alleviating the lack of training data for low-resource languages. The experimental results show that the invention outperforms the other models; for the parallel sentence pair extraction task, the method based on the pre-training language model and bidirectional interaction attention effectively improves extraction performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention, characterized by comprising the following specific steps:
Step1, collecting Chinese-Vietnamese parallel data through web crawling, obtaining non-parallel data by negative sampling, and manually labeling the data set to obtain a Chinese-Vietnamese comparable corpus data set;
Step2, encoding the source language and the target language separately with a pre-trained language model to obtain semantic representations; then performing spatial semantic alignment on the obtained semantic representations with a bidirectional interaction attention mechanism; and finally judging the relation of cross-language sentence pairs based on the semantic representations after multi-view feature fusion, and extracting parallel sentence pairs in a noisy environment based on the consistency of the bilingual representations.
2. The parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention according to claim 1, characterized in that Step1 comprises the following specific steps:
Step1.1, acquiring Chinese-Vietnamese parallel data through web crawling;
Step1.2, cleaning and aligning the crawled data and using it as positive samples for model training; to keep the number of samples balanced, a corresponding negative sample is constructed for each positive sample by negative sampling, i.e., each pair of parallel sentences is randomly re-paired to generate a negative sample, and the comparable corpus consisting of the positive and negative samples is used to train the model;
Step1.3, manually labeling to obtain the Chinese-Vietnamese comparable corpus data set; the final data set consists of 2n triplets {(S_i, T_i, y_i)}, i = 1, ..., 2n, where n is the number of parallel sentence pairs and also the number of non-parallel sentence pairs, S_i denotes a source-language sentence, T_i denotes a target-language sentence, and y_i is the label indicating whether S_i and T_i are translations of each other; the label is set to 1 when the source-language sentence and the target-language sentence are parallel, and to 0 when they are non-parallel.
3. The parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention according to claim 1, characterized in that in Step2, encoding the source language and the target language separately with the pre-trained language model to obtain semantic representations comprises the following steps:
the source language and the target language are encoded separately by a cross-language text semantic encoding layer, a source-language sentence of length l_s being denoted S_i = {x_1, x_2, ..., x_{l_s}}, i ∈ M, and a target-language sentence of length l_t being denoted T_j = {y_1, y_2, ..., y_{l_t}}, j ∈ N, where M and N denote the total numbers of sentences; using pre-trained multilingual BERT as the bilingual encoder, the source and target languages are represented by the semantic encoding layer as:

v_s^i = mBERT(S_i), i ∈ M   (1)

v_t^j = mBERT(T_j), j ∈ N   (2)

where v_s^i ∈ R^{l_s×d}, i ∈ M denotes the encoded vector representation of the source language, v_t^j ∈ R^{l_t×d}, j ∈ N denotes the encoded vector representation of the target language, d denotes the word-vector dimension of the words in the source- and target-language sentences, and the generated vectors serve as the input to the subsequent cross-language semantic alignment layer.
4. The parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention according to claim 1, characterized in that in Step2, performing spatial semantic alignment of the obtained semantic representations with a bidirectional interaction attention mechanism comprises:
capturing the semantic interactions between cross-language texts with an improved bidirectional interaction attention mechanism, mapping the source language and the target language to a common semantic space for spatial semantic alignment, and using several parallel attention heads so that the model attends to semantic information at different levels;
after the semantic representations v_s^i and v_t^j of the source and target languages are obtained, for simplicity the following formulas use v_s as the general notation for the semantic representation of a source-language sentence and v_t for that of a target-language sentence; the attention calculation in the source-to-target direction is shown in equations (3)-(5):

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (3)

v'_s = MultiHead(v_s, v_t, v_t) = Concat(head_1, ..., head_h) W^o   (4)

head_i = Attention(v_s W_i^1, v_t W_i^2, v_t W_i^3)   (5)

where i denotes the i-th attention head and W_i^1, W_i^2, W_i^3 denote the parameter matrices of the i-th head; h denotes the number of attention heads, set to 8 in the experiments with each head having dimension 64; W^o is the parameter matrix of the linear projection applied to the concatenation of all attention-head outputs; and v'_s is the source-language encoding vector output by the cross-language semantic alignment layer;

the attention calculation in the target-to-source direction is shown in equations (6)-(8):

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (6)

v'_t = MultiHead(v_t, v_s, v_s) = Concat(head_1, ..., head_h) W^o   (7)

head_i = Attention(v_t W_i^1, v_s W_i^2, v_s W_i^3)   (8)

where i denotes the i-th attention head and W_i^1, W_i^2, W_i^3 denote the parameter matrices of the i-th head; h denotes the number of attention heads, set to 8 in the experiments with each head having dimension 64; W^o is the parameter matrix of the linear projection applied to the concatenation of all attention-head outputs; v'_t is the target-language encoding vector output by the cross-language semantic alignment layer; and d_k denotes the model vector dimension.
5. The parallel sentence pair extraction method based on a pre-training language model and bidirectional interaction attention according to claim 1, characterized in that in Step2, judging the relation of cross-language sentence pairs based on the semantic representations after multi-view feature fusion and extracting parallel sentence pairs in a noisy environment comprises:
the cross-language semantic fusion layer compares the global representation and the aligned representation of the semantic vectors from multiple perspectives; three fusion strategies are designed, in which the original semantic information and the aligned semantic information are concatenated, subtracted element-wise, and multiplied element-wise, and all resulting vector matrices are projected into the same space to obtain the output of the semantic fusion layer; the source-language output V_s of the semantic fusion layer is computed as follows:

V_s1 = G_1([v_s; v'_s])   (9)

V_s2 = G_2([v_s; v_s - v'_s])   (10)

V_s3 = G_3([v_s; v_s ⊙ v'_s])   (11)

V_s = G([V_s1; V_s2; V_s3])   (12)

where v'_s is the source-language encoding vector output by the cross-language semantic alignment layer, v_s is the general notation for the semantic representation of a source-language sentence, G_1, G_2, G_3 and G each denote a single-layer feed-forward neural network with independent parameters, and ⊙ denotes element-wise multiplication; the subtraction measures the difference between the feature vectors, while the multiplication highlights their similarity; the target-language output V_t of the semantic fusion layer is computed in the same way as the source-language output V_s, so its formulas are omitted;
the output of the semantic fusion layer is compressed by max-pooling to obtain the vector representations O_s and O_t of the source and target languages, which serve as input to the semantic prediction layer, yielding the semantic relevance probability distribution of the cross-language sentence pair:

ŷ = softmax(H([O_s; O_t]))   (13)

where H denotes a multi-layer feed-forward neural network and ŷ denotes the probability scores of all classes; according to the probability distribution ŷ, the model determines whether the input source-language and target-language sentence pair is parallel; the cross-entropy loss function is:

L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]   (14)

where y_i denotes the true label and ŷ_i denotes the predicted result; after the prediction for a cross-language sentence pair is obtained, the model is trained by minimizing the cross-entropy between the prediction and the true label as the loss function.
CN202111082587.1A 2021-09-15 2021-09-15 Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention Active CN113901831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111082587.1A CN113901831B (en) 2021-09-15 2021-09-15 Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111082587.1A CN113901831B (en) 2021-09-15 2021-09-15 Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention

Publications (2)

Publication Number Publication Date
CN113901831A true CN113901831A (en) 2022-01-07
CN113901831B CN113901831B (en) 2024-04-26

Family

ID=79028408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111082587.1A Active CN113901831B (en) 2021-09-15 2021-09-15 Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention

Country Status (1)

Country Link
CN (1) CN113901831B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996459A (en) * 2022-03-18 2022-09-02 星宙数智科技(珠海)有限公司 Parallel corpus classification method and device, computer equipment and storage medium
CN115017884A (en) * 2022-01-20 2022-09-06 昆明理工大学 Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement
CN116910627A (en) * 2023-09-11 2023-10-20 四川大学 Method, system and storage medium for improving efficiency of electric drive system of electric automobile
WO2024164580A1 (en) * 2023-02-09 2024-08-15 中南大学 Method for calculating chinese-english semantic similarity of paragraph-level texts

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931903A (en) * 2020-07-09 2020-11-13 北京邮电大学 Network alignment method based on double-layer graph attention neural network
CN112287695A (en) * 2020-09-18 2021-01-29 昆明理工大学 Cross-language bilingual pre-training and Bi-LSTM-based Chinese-character-cross parallel sentence pair extraction method
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN112329474A (en) * 2020-11-02 2021-02-05 山东师范大学 Attention-fused aspect-level user comment text emotion analysis method and system
CN112560502A (en) * 2020-12-28 2021-03-26 桂林电子科技大学 Semantic similarity matching method and device and storage medium
US20210166446A1 (en) * 2019-11-28 2021-06-03 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image reconstruction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210166446A1 (en) * 2019-11-28 2021-06-03 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image reconstruction
CN111931903A (en) * 2020-07-09 2020-11-13 北京邮电大学 Network alignment method based on double-layer graph attention neural network
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN112287695A (en) * 2020-09-18 2021-01-29 昆明理工大学 Cross-language bilingual pre-training and Bi-LSTM-based Chinese-character-cross parallel sentence pair extraction method
CN112329474A (en) * 2020-11-02 2021-02-05 山东师范大学 Attention-fused aspect-level user comment text emotion analysis method and system
CN112560502A (en) * 2020-12-28 2021-03-26 桂林电子科技大学 Semantic similarity matching method and device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAY HEO等: "cost-effective interactive attention learning with neural attention processes", PROCEEDINGS OF THE 37TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 31 December 2020 (2020-12-31), pages 4228 - 4238 *
张乐乐等: "基于预训练语言模型及交互注意力的平行句对抽取方法", 通信技术, vol. 55, no. 4, 20 April 2022 (2022-04-20), pages 443 - 452 *
毛焱颖;: "基于注意力双层LSTM的长文本情感分类方法", 重庆电子工程职业学院学报, vol. 28, no. 02, 20 April 2019 (2019-04-20), pages 118 - 125 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017884A (en) * 2022-01-20 2022-09-06 昆明理工大学 Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement
CN115017884B (en) * 2022-01-20 2024-04-26 昆明理工大学 Text parallel sentence pair extraction method based on graphic multi-mode gating enhancement
CN114996459A (en) * 2022-03-18 2022-09-02 星宙数智科技(珠海)有限公司 Parallel corpus classification method and device, computer equipment and storage medium
WO2024164580A1 (en) * 2023-02-09 2024-08-15 中南大学 Method for calculating chinese-english semantic similarity of paragraph-level texts
CN116910627A (en) * 2023-09-11 2023-10-20 四川大学 Method, system and storage medium for improving efficiency of electric drive system of electric automobile
CN116910627B (en) * 2023-09-11 2023-11-17 四川大学 Method, system and storage medium for improving efficiency of electric drive system of electric automobile

Also Published As

Publication number Publication date
CN113901831B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN115033670B (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN111144448B (en) Video barrage emotion analysis method based on multi-scale attention convolution coding network
CN113901831B (en) Parallel sentence pair extraction method based on pre-training language model and bidirectional interaction attention
Palangi et al. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN117371456B (en) Multi-mode irony detection method and system based on feature fusion
CN114595306B (en) Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
Tian et al. An attempt towards interpretable audio-visual video captioning
CN114238649A (en) Common sense concept enhanced language model pre-training method
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
Lu et al. Self‐supervised domain adaptation for cross‐domain fault diagnosis
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN112749566B (en) Semantic matching method and device for English writing assistance
CN117312577A (en) Traffic event knowledge graph construction method based on multi-layer semantic graph convolutional neural network
CN113901172B (en) Case-related microblog evaluation object extraction method based on keyword structural coding
CN113157855B (en) Text summarization method and system fusing semantic and context information
Zeng et al. An Explainable Multi-view Semantic Fusion Model for Multimodal Fake News Detection
Khaing Attention-based deep learning model for image captioning: a comparative study
Li et al. Tst-gan: A legal document generation model based on text style transfer
Mi et al. VVA: Video Values Analysis
Yu et al. Semantic extraction for sentence representation via reinforcement learning
Zhou et al. Multimodal emotion recognition based on multilevel acoustic and textual information
Wang An Automatic Error Detection Method for Engineering English Translation Based on the Deep Learning Model
CN113127599B (en) Question-answering position detection method and device of hierarchical alignment structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant