CN113836192B - Parallel corpus mining method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN113836192B (application CN202110930495.8A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- target
- sentences
- vector
- similarity score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Fuzzy Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for mining parallel corpora, which comprises the following steps: encoding a source sentence and each target sentence respectively based on a multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors into a vector space; calculating a similarity score corresponding to each target sentence; selecting, based on a Top-K algorithm, K target sentences whose similarity scores meet a preset condition from all the target sentences, and forming candidate sentence pairs from each selected target sentence and the source sentence; regularizing the similarity scores corresponding to the candidate sentence pairs, and updating those similarity scores based on the obtained regularization result; and classifying all candidate sentence pairs based on a pre-training language model to obtain a classification probability corresponding to each candidate sentence pair, and taking a candidate sentence pair as a parallel sentence pair if its classification probability is greater than a preset threshold value.
Description
Technical Field
The invention relates to the technical field of neural machine translation, in particular to a method and a device for mining parallel corpora, computer equipment and a storage medium.
Background
With the development of deep learning technology, neural machine translation based on the encoder-decoder framework has become the new generation of machine translation technology; compared with other machine translation methods, neural machine translation models achieve a substantial improvement in translation quality.
However, training a neural machine translation model requires a large amount of parallel corpora in order to achieve better translation performance than other machine translation methods. Parallel corpora refer to texts written in different languages that have a "translation relationship" with each other. Therefore, for language pairs that lack parallel corpus resources, the neural machine translation method has insufficient resources for model training, which limits translation performance.
At present, a large number of weakly aligned bilingual articles and comparable corpora can easily be obtained on the Internet. Aligning the parallel sentences in these corpora through a parallel corpus mining method, and thereby collecting large quantities of parallel corpus resources, is the most direct and effective way to improve the translation performance of a neural machine translation model.
Traditional parallel corpus mining methods are based on linguistic features and bilingual dictionary information, such as sentence length, number of punctuation marks, and word alignment. However, these features must be defined and extracted manually by linguistic experts and often involve a great deal of expert domain knowledge; the system cannot learn and extract the features automatically, and the mining process is subjective, so the accuracy of the mined parallel corpus is unreliable.
Existing parallel corpus mining methods also include similarity measures based on multilingual sentence embeddings and cosine similarity. However, for sentence pairs that are equally parallel, the cosine similarity scores are not uniform across language pairs, which makes it difficult to extract parallel sentences with a single threshold, so the accuracy and recall of the parallel corpus mining system are low.
Therefore, existing methods suffer from low accuracy in parallel corpus mining.
Disclosure of Invention
The embodiment of the invention provides a method and a device for mining parallel corpora, computer equipment and a storage medium, which are used to improve the accuracy of parallel corpus mining.
A method for mining parallel corpora comprises the following steps:
respectively coding a source sentence and each target sentence based on a multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors to a vector space corresponding to the source sentence, wherein the source sentence is a sentence corresponding to a source language, and the target sentence is a sentence corresponding to a target language;
for each target sentence, calculating the similarity between a target coding vector corresponding to the target sentence and a vector corresponding to a source sentence in the vector space to obtain a similarity score corresponding to the target sentence;
selecting K target sentences of which the similarity scores meet preset conditions from all the target sentences based on a Top-K algorithm, and forming candidate sentence pairs by combining each selected target sentence and the source sentence respectively, wherein K is a preset threshold value of the candidate sentence pairs;
regularization processing is carried out on the corresponding similarity scores of the candidate sentence pairs, and the similarity scores of the candidate sentence pairs are updated based on the obtained regularization processing result;
and classifying all the candidate sentence pairs based on a pre-training language model to obtain the classification probability corresponding to the candidate sentence pairs, and if the classification probability is greater than a preset threshold value, taking the candidate sentence pairs as parallel sentences.
A parallel corpus mining device, comprising:
a coding module, which is used for coding a source sentence and each target sentence respectively based on a multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors to a vector space corresponding to the source sentence, wherein the source sentence is a sentence corresponding to a source language, and the target sentence is a sentence corresponding to a target language;
the similarity calculation module is used for calculating the similarity between a target coding vector corresponding to the target sentence and a vector corresponding to the source sentence in the vector space aiming at each target sentence to obtain a similarity score corresponding to the target sentence;
the candidate sentence pair selection module is used for selecting K target sentences of which the similarity scores meet preset conditions from all the target sentences based on a Top-K algorithm, and respectively combining each selected target sentence with the source sentence to form a candidate sentence pair, wherein K is a preset threshold value of the candidate sentence pair;
the regularization module is used for regularizing the similarity scores corresponding to the candidate sentence pairs and updating the similarity scores corresponding to the candidate sentence pairs based on the obtained regularization processing result;
and the classification module is used for classifying all the candidate sentence pairs based on a pre-training language model to obtain the classification probability corresponding to the candidate sentence pairs, and if the classification probability is greater than a preset threshold value, the candidate sentence pairs are used as parallel sentences.
A computer device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the steps of the mining method of the parallel corpus when executing the computer program.
A computer-readable storage medium, which stores a computer program, which, when executed by a processor, implements the steps of the method for mining parallel corpuses described above.
The parallel corpus mining method, the device, the computer equipment and the storage medium in the embodiment of the invention respectively encode the source sentence and each target sentence based on the multi-language translation model to obtain the vector corresponding to the source sentence and the target coding vector corresponding to each target sentence, and map the target coding vectors to the vector space corresponding to the source sentence. And calculating the similarity between the target coding vector corresponding to the target sentence and the vector corresponding to the source sentence in the vector space aiming at each target sentence to obtain the similarity score corresponding to the target sentence. Based on the Top-K algorithm, K target sentences with similarity scores meeting preset conditions are selected from all the target sentences, and each selected target sentence and the source sentence form a candidate sentence pair. And carrying out regularization processing on the corresponding similarity scores of the candidate sentence pairs, and updating the similarity scores of the candidate sentence pairs based on the obtained regularization processing result. And classifying all candidate sentence pairs based on the pre-training language model to obtain the classification probability corresponding to the candidate sentence pairs, and taking the candidate sentence pairs as parallel sentences if the classification probability is greater than a preset threshold value. Through the steps, sentences of different languages can be mapped to the same shared vector space by the multi-language translation model, meanwhile, the target coding vector in the vector space can be used for parallel sentence mining, mining can be achieved without expert experience, and the similarity score is further regularized by top-k softmax, so that the problem that the threshold value is unstable when cosine similarity is used for measuring the sentence similarity is solved, and the accuracy and the recall rate of a parallel sentence mining system are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram illustrating an application environment of a method for mining parallel corpora according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for mining parallel corpora according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for parallel corpus mining according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for mining the parallel corpus provided by the present application can be applied to the application environment as shown in fig. 1, wherein the computer device communicates with the server through a network. The computer device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, among others. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a method for mining parallel corpora is provided. The method is described below by taking its application to the server in fig. 1 as an example, and includes the following steps S101 to S105:
S101, respectively coding a source sentence and each target sentence based on a multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors to a vector space corresponding to the source sentence, wherein the source sentence is a sentence corresponding to a source language, and the target sentence is a sentence corresponding to a target language.
In step S101, the multilingual translation model is a model for translating sentences in one or more languages into other languages, for example, translating a Chinese sentence into a corresponding English, Portuguese, or Japanese sentence.
The vector space may be used to store target code vectors corresponding to different languages of each source sentence and/or different target code vectors corresponding to the same language, and all the target code vectors may be shared.
Through the encoder of the multilingual translation model, sentences in different languages can be mapped into the same shared vector space. The model is thus able to understand the internal semantics of sentences in different languages without the additional expert prior knowledge required by traditional methods, so parallel corpus sentences can be mined more efficiently and accurately.
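As an illustration only, the following sketch shows the idea of encoding a source sentence and candidate target sentences into one shared vector space with an off-the-shelf multilingual sentence encoder; the library and model name are assumptions for demonstration, not the Transformer encoder trained in the embodiments described below.

```python
# Illustrative sketch: a multilingual encoder maps sentences in different
# languages into one shared vector space. The model name is an assumption.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

source_sentence = "今天天气很好"                      # source-language sentence
target_sentences = ["The weather is nice today",      # candidate target sentences
                    "I bought a new laptop",
                    "It is a beautiful day"]

h_s = encoder.encode(source_sentence)                 # vector of the source sentence
H_t = np.asarray(encoder.encode(target_sentences))    # target coding vectors, shape (N, d)
print(h_s.shape, H_t.shape)
```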
S102, calculating the similarity between a target coding vector corresponding to the target sentence and a vector corresponding to the source sentence in the vector space aiming at each target sentence to obtain a similarity score corresponding to the target sentence.
In step S102, the similarity calculation method includes, but is not limited to, cosine similarity and Euclidean-distance similarity.
Preferably, a cosine similarity calculation method is employed.
The similarity score corresponding to the target sentence is calculated according to the following formula (1):

φ(h_s, h_t) = (h_s · h_t) / (‖h_s‖ ‖h_t‖)   (1)

where h_s is the vector of the source sentence, h_t is the target coding vector, and φ(h_s, h_t) is the similarity score between the source sentence vector h_s and the target coding vector h_t.
Calculating a similarity score for each target sentence facilitates the analysis and processing of each target sentence, which in turn allows parallel corpus sentences to be mined with high accuracy.
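A minimal numpy sketch of the similarity computation in formula (1); the vectors below are illustrative stand-ins for the encoder outputs.

```python
import numpy as np

def cosine_similarity(h_s: np.ndarray, h_t: np.ndarray) -> float:
    """Similarity score phi(h_s, h_t) from formula (1)."""
    return float(np.dot(h_s, h_t) / (np.linalg.norm(h_s) * np.linalg.norm(h_t)))

h_s = np.array([0.2, 0.7, 0.1])                       # source sentence vector (illustrative)
targets = {"t1": np.array([0.19, 0.72, 0.05]),        # target coding vectors (illustrative)
           "t2": np.array([0.90, 0.05, 0.40])}

scores = {name: cosine_similarity(h_s, h_t) for name, h_t in targets.items()}
print(scores)   # "t1" scores close to 1.0, "t2" much lower
```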
S103, based on a Top-K algorithm, selecting K target sentences of which the similarity scores meet preset conditions from all the target sentences, and forming candidate sentence pairs by combining each selected target sentence and the source sentence respectively, wherein K is a preset threshold value of the candidate sentence pairs.
In step S103, implementations of the Top-K algorithm include, but are not limited to, the full-sort Top-K algorithm, the partial-sort Top-K algorithm, and the min-heap Top-K algorithm.
The preset condition includes, but is not limited to, the similarity score exceeding a certain preset value, or the similarity score falling within a certain preset range; for example, the preset condition may be that the similarity score exceeds a preset value of 0.7, or that the similarity score lies within a preset range such as (0.4, 0.6).
The candidate sentence pairs are obtained by pairing the target sentences that meet the preset condition with the source sentence, forming a set of sentence pairs. The candidate sentence pairs are used for mining parallel corpus sentences.
Target sentences meeting the preset condition are selected through the Top-K algorithm and combined with the source sentence to form candidate sentence pairs. Because these candidate pairs have high similarity, the accuracy of parallel sentence pair mining can be effectively improved.
S104, regularizing the similarity scores corresponding to the candidate sentence pairs, and updating the similarity scores of the candidate sentence pairs based on the obtained regularization result.
In step S104, the regularization is a process for preventing overfitting in parallel corpus mining. It should be understood that overfitting refers to the phenomenon of performing well on the training set but poorly on the test set, i.e., poor generalization.
Regularization methods include, but are not limited to, L2 regularization and softmax regularization.
By regularizing the similarity scores, the scores can be normalized and re-assigned to the candidate sentence pairs so that their value range is unified. The similarity of sentences can then be judged with a single unified threshold, which avoids the problem that parallel corpus sentences are difficult to obtain with the same threshold and improves the accuracy and recall of the parallel corpus mining process.
S105, classifying all candidate sentence pairs based on the pre-training language model to obtain the classification probability corresponding to each candidate sentence pair, and taking a candidate sentence pair as a parallel sentence if its classification probability is greater than a preset threshold value.
In step S105, the pre-training language model refers to a model used to identify the probability that a candidate sentence pair is parallel.
The preset threshold value is a probability value used to decide whether a sentence pair is parallel. For example, when the preset threshold is 0.7, a candidate sentence pair whose classification probability output by the pre-training language model is greater than 0.7 is regarded as a parallel sentence pair.
The candidate sentence pair is input into the pre-training language model, which classifies the pair on the basis of existing parallel corpora to obtain its classification probability; this classification probability measures how likely the candidate sentence pair is to be a parallel sentence pair.
Through the pre-training language model, a parallel-sentence probability judgment can be made for the candidate sentence pairs whose similarity scores have been unified, and candidate pairs whose probability exceeds the preset threshold are taken as parallel sentences. Parallel corpus sentences can thus be obtained with a single unified threshold, which improves the accuracy and recall of the parallel corpus mining process.
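As an illustration only, the sketch below shows how candidate sentence pairs could be scored by a fine-tuned sentence-pair classifier and filtered by the preset threshold. The Hugging Face transformers API is used here as an example; the checkpoint name "my-parallel-classifier" and the label layout (class 1 = "parallel") are assumptions, not details prescribed by the embodiment.

```python
# Illustrative sketch: scoring candidate sentence pairs with a sequence-pair
# classifier and keeping pairs above the preset threshold.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("my-parallel-classifier")      # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained("my-parallel-classifier")
model.eval()

def parallel_probability(src: str, tgt: str) -> float:
    """Classification probability that (src, tgt) is a parallel sentence pair."""
    inputs = tokenizer(src, tgt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()   # assumed: class 1 = "parallel"

THRESHOLD = 0.7                                          # preset threshold from the example above
candidate_pairs = [("今天天气很好", "The weather is nice today")]
parallel_sentences = [(s, t) for s, t in candidate_pairs
                      if parallel_probability(s, t) > THRESHOLD]
```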
In the parallel corpus mining method of this embodiment, the source sentence and each target sentence are encoded based on the multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and the target coding vectors are mapped into a vector space corresponding to the source sentence. Sentences in different languages can thus be mapped into the same shared vector space by the multilingual translation model, and the target coding vectors in this vector space can be used for parallel sentence mining without expert experience. For each target sentence, the similarity between its target coding vector and the source sentence vector in the vector space is calculated to obtain its similarity score. Based on the Top-K algorithm, K target sentences whose similarity scores meet the preset condition are selected from all target sentences, and each selected target sentence forms a candidate sentence pair with the source sentence. The similarity scores of the candidate sentence pairs are regularized, and the scores are updated based on the regularization result. All candidate sentence pairs are then classified by the pre-training language model to obtain their classification probabilities, and a candidate sentence pair whose classification probability is greater than the preset threshold is taken as a parallel sentence. The Top-K softmax regularization of the similarity scores solves the problem of an unstable threshold when cosine similarity is used to measure sentence similarity, and improves the accuracy and recall of the parallel sentence mining system.
In some optional implementations of this embodiment, in step S101, the training method of the multilingual translation model includes the following steps A to H:
A. Acquiring a training sentence and a target language embedded representation, and inputting the training sentence and the target language embedded representation into an initial multilingual translation model, wherein the target language embedded representation refers to the word embedding of the target language.
B. Encoding the training sentence with an encoder based on the Transformer algorithm to obtain an encoding vector.
C. Pooling the encoding vector based on a preset pooling mode to obtain a pooled vector.
D. Connecting the encoding vector and the pooled vector to obtain a connection vector.
E. Decoding the connection vector and the target language embedded representation with a decoder based on the Transformer algorithm to obtain a decoding vector.
F. Performing loss calculation on the decoding vector to obtain a loss value.
G. If the loss value exceeds a preset loss value, returning to the step of acquiring a training sentence and a target language embedded representation and inputting them into the initial multilingual translation model, and continuing execution.
H. If the loss value does not exceed the preset loss value, obtaining the multilingual translation model.
In step A, the target language embedded representation refers to the word embedding of the target language. It should be understood that word embedding is a method of converting words in a text into numeric vectors, so the target language embedded representation can be understood as a target vector corresponding to the target language.
The training sentence refers to a translated sentence.
For step B, the Transformer architecture is composed solely of attention mechanisms and feed-forward neural networks. A trainable neural network can be built by stacking Transformer layers and has good parallelism. The self-attention mechanism computes attention from the queries, keys, and values of the attended objects, and the feed-forward neural network is a unidirectional multi-layer structure.
When the training sentence is encoded, each word in the training sentence undergoes self-attention calculation in a multi-layer encoder of the Transformer and is passed through a feed-forward neural network to the next encoder layer, where self-attention is computed again, until the final encoder layer is finished, yielding the encoding vector.
For step C above, the preset pooling mode includes, but is not limited to, max pooling and average pooling.
Preferably, max pooling is applied to the encoding vector to obtain the pooled vector.
For step D above, the connection vector is obtained according to the following formula (2):

O_pool = O ⊕ h_pool   (2)

where O_pool is the connection vector, O is the encoding vector, h_pool is the pooled vector, and ⊕ is the matrix row concatenation operator.
For step E above, the decoding vector is obtained according to the following formula (3), which takes the standard Transformer attention form:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V, with K = O_pool · W_k and V = O_pool · W_v   (3)

where Q is the linear transformation of the target language embedded representation, and W_k ∈ R^(d×d_k) and W_v ∈ R^(d×d_v) are the weight mapping matrices in self-attention.
For step F, the above loss calculation method includes, but is not limited to, cross entropy loss function and mean square error loss function.
Preferably, the present embodiment employs a cross entropy loss function.
The Transformer-based multilingual translation model trained through the above steps uses self-attention calculation and feed-forward neural networks to translate sentences in different languages, which facilitates multilingual translation; at the same time, the encoder of the model can map sentences in different languages into the same shared vector space, ensuring that sentences in different languages share that space.
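A minimal PyTorch sketch of training steps A to H under stated assumptions: the dimensions, vocabulary, optimizer, stopping value, and placeholder batches below are illustrative, and the teacher-forcing shift of the decoder input is omitted for brevity.

```python
# Illustrative sketch of training steps A-H (assumes PyTorch); sizes and data are placeholders.
import torch
import torch.nn as nn

d_model, vocab_size, n_layers = 512, 32000, 6

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), n_layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), n_layers)
out_proj = nn.Linear(d_model, vocab_size)

params = (list(embed.parameters()) + list(encoder.parameters())
          + list(decoder.parameters()) + list(out_proj.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
PRESET_LOSS = 0.5                                    # illustrative preset loss value

def train_step(src_ids, tgt_ids):
    """One pass over steps B-F: encode, pool, connect, decode, compute loss."""
    O = encoder(embed(src_ids))                      # B: encoding vector, (batch, len, d)
    h_pool = O.max(dim=1, keepdim=True).values       # C: max pooling over tokens
    O_pool = torch.cat([O, h_pool], dim=1)           # D: connection vector, formula (2)
    dec = decoder(embed(tgt_ids), memory=O_pool)     # E: decode with target embedding
    logits = out_proj(dec)
    loss = loss_fn(logits.reshape(-1, vocab_size),   # F: cross-entropy loss
                   tgt_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

src_ids = torch.randint(0, vocab_size, (8, 20))      # placeholder token ids
tgt_ids = torch.randint(0, vocab_size, (8, 20))
for step in range(1000):                             # G/H: iterate until the loss no longer
    if train_step(src_ids, tgt_ids) <= PRESET_LOSS:  # exceeds the preset loss value
        break
```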
In some optional implementations of this embodiment, step S101 includes the following steps S100 to S500:
S100, inputting the source sentence and the target sentences into a multi-language translation model.
S200, extracting the characteristics of the source sentence to obtain a first vector corresponding to the source sentence.
S300, extracting the characteristics of each target sentence to obtain a second vector corresponding to each target sentence.
S400, coding the first vector and each second vector to obtain a target coding vector corresponding to each second vector.
S500, mapping all target coding vectors to a vector space corresponding to the source sentence.
In step S200, the first vector is a vector corresponding to the source sentence.
And performing self-attention calculation on each word of the source sentence through a Transformer-based encoder in a multilingual translation model to obtain a first vector corresponding to the source sentence.
In step S300, the second vector is a vector corresponding to the target sentence.
And performing self-attention calculation on each word of the target sentence through a Transformer-based encoder in a multilingual translation model to obtain a second vector corresponding to the target sentence.
Through the multilingual translation model, sentences in different languages are translated, which facilitates multilingual translation; meanwhile, the encoder of the model can map sentences in different languages into the same shared vector space, achieving the sharing of sentences across different languages.
In some optional implementations of this embodiment, step S103 includes the following steps S301 to S306:
S301, based on a minimum heap Top-K algorithm, randomly selecting similarity scores of K target sentences from all target sentences, and establishing a minimum heap, wherein the minimum heap comprises a heap top, the heap top is the minimum similarity score of the K target sentences, and unselected target sentences serve as residual target sentences.
S302, selecting the similarity score of any one of the remaining target sentences as a contrast similarity score, and comparing the contrast similarity score with the similarity score at the top of the heap.
S303, if the contrast similarity score is not larger than the similarity score at the top of the heap, updating the remaining target sentences.
S304, if the contrast similarity score is larger than the similarity score of the heap top, the contrast similarity score is used as the similarity score of the new heap top, and the rest target sentences are updated.
S305, when the remaining target sentences are not selected completely, returning to select the similarity score of any one of the remaining target sentences as a contrast similarity score, and continuing to execute the step of comparing the contrast similarity score with the similarity score at the top of the heap.
S306, when the selection of the residual target sentences is finished, the target sentences corresponding to all the similarity scores contained in the minimum heap and the source sentences form candidate sentence pairs.
In step S301, the min-heap Top-K algorithm refers to an algorithm that selects the top K items from a set of data by using a min-heap.
The heap top is the position in the min-heap that holds the minimum value. It should be appreciated that, in this embodiment, the heap top holds the minimum similarity score among the K selected target sentences.
In step S304, if the contrast similarity score is also greater than similarity scores in the min-heap other than the heap top, the min-heap is re-adjusted so that, after the new heap-top similarity score is added, the heap still satisfies the min-heap property.
Through the min-heap Top-K algorithm, the Top-K similarity scores meeting the conditions can be located and selected quickly and effectively, improving the speed of data selection.
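A short sketch of the min-heap Top-K selection in steps S301 to S306, using Python's heapq module (itself a min-heap); the variable names and sample scores are illustrative.

```python
# Sketch of steps S301-S306: keep the K highest-scoring targets in a min-heap,
# then pair each surviving target sentence with the source sentence.
import heapq

def top_k_candidates(source, scored_targets, k):
    """scored_targets: list of (similarity_score, target_sentence) tuples."""
    heap = scored_targets[:k]             # S301: seed the heap with K scores
    heapq.heapify(heap)                   # heap[0] is the heap top (minimum score)
    for score, sentence in scored_targets[k:]:
        if score > heap[0][0]:            # S304: larger than the heap-top score
            heapq.heapreplace(heap, (score, sentence))
        # S303: otherwise the heap is unchanged and the next target is examined
    # S306: pair every target left in the heap with the source sentence
    return [(source, sentence, score) for score, sentence in heap]

scored = [(0.91, "t1"), (0.12, "t2"), (0.77, "t3"), (0.45, "t4"), (0.88, "t5")]
print(top_k_candidates("source sentence", scored, k=3))
```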
In some optional implementations of this embodiment, step S104 includes the following steps S401 to S402:
S401, performing softmax regularization processing on the similarity scores corresponding to all candidate sentence pairs to obtain regularized similarities.
S402, according to the regularized similarity, the similarity scores corresponding to all candidate sentence pairs are updated.
In step S401, the softmax regularization processing is a processing method for preventing data overfitting based on a softmax algorithm.
For the step S402, the updating refers to re-assigning the similarity scores of the candidate sentence pairs according to the regularized similarity.
The above updating method includes, but is not limited to, weight updating. For example, the regularization similarity is used as a weight, the similarity scores corresponding to all candidate sentence pairs are multiplied by the regularization similarity, and the result obtained by the multiplication is used as the updated similarity score of the candidate sentence pair.
Through the softmax regularization processing, the similarity scores of the Top-K target sentences are redistributed, and the overall accuracy of the model is effectively improved.
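A small numpy sketch of the softmax regularization and the weight-based score update described above; the scores are illustrative.

```python
# Sketch of steps S401-S402: softmax-regularize the Top-K similarity scores and
# use the regularized similarities as weights to update each candidate pair's score.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())                  # subtract the max for numerical stability
    return e / e.sum()

topk_scores = np.array([0.91, 0.88, 0.77])   # similarity scores of the Top-K candidate pairs
regularized = softmax(topk_scores)           # S401: regularized similarity
updated_scores = topk_scores * regularized   # S402: weight-style update from the example above
print(regularized, updated_scores)
```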
In some optional implementations of this embodiment, in step S105, the training method for pre-training the language model includes the following steps S501 to S502:
S501, obtaining positive samples and negative samples, wherein the positive and negative samples include label vectors, the positive samples are parallel sentences among the target sentences, and the negative samples are non-parallel sentences selected from the target sentences in descending order of similarity, their number being equal to the preset number of negative samples.
S502, training the initialized pre-training language model based on the positive sample and the negative sample to obtain a pre-training model.
With respect to step S502 above, it should be understood that both the positive and negative examples above are target sentences.
The positive samples include, but are not limited to, the target sentence determined to be a parallel sentence through the steps S101 to S105, and the parallel sentence in the existing parallel corpus.
The negative samples include, but are not limited to, non-parallel sentences among the target sentences whose similarity scores are lower than a preset threshold, with their number equal to the preset number of negative samples.
By inputting the positive and negative samples into the pre-training language model for training, the pre-training language model can be fine-tuned so that its accuracy is closer to the accuracy observed in experiments, which helps reduce overfitting.
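As a sketch of how the training samples of step S501 could be assembled: positives are known parallel pairs, and negatives are the highest-similarity non-parallel pairs up to a preset count. The data and field layout are illustrative assumptions.

```python
# Sketch of step S501: building positive and negative samples for fine-tuning the
# pre-training language model; label 1 = parallel, 0 = non-parallel.
NUM_NEGATIVES = 2                                     # preset number of negative samples

mined = [                                             # (source, target, similarity, is_parallel)
    ("s1", "t1", 0.95, True),
    ("s1", "t2", 0.81, False),
    ("s1", "t3", 0.74, False),
    ("s1", "t4", 0.40, False),
]

positives = [(s, t, 1) for s, t, _, parallel in mined if parallel]
non_parallel = sorted((m for m in mined if not m[3]), key=lambda m: m[2], reverse=True)
negatives = [(s, t, 0) for s, t, _, _ in non_parallel[:NUM_NEGATIVES]]

train_samples = positives + negatives
print(train_samples)
```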
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a device for mining parallel corpora is provided, and this parallel corpus mining device corresponds one-to-one to the parallel corpus mining method in the above embodiment. As shown in fig. 3, the apparatus for mining parallel corpora includes an encoding module 11, a similarity calculation module 12, a candidate sentence pair extraction module 13, a regularization module 14 and a classification module 15. The detailed description of each functional module is as follows:
and the encoding module 11 is configured to encode the source sentence and each target sentence based on a multilingual translation model, to obtain a vector corresponding to the source sentence and a target encoding vector corresponding to each target sentence, and to map the target encoding vector to a vector space corresponding to the source sentence, where the source sentence is a sentence corresponding to the source language and the target sentence is a sentence corresponding to the target language.
And a similarity calculation module 12, configured to calculate, for each target sentence, a similarity between a target coding vector corresponding to the target sentence and a vector corresponding to the source sentence in the vector space, so as to obtain a similarity score corresponding to the target sentence.
And a candidate sentence pair selecting module 13, configured to select, based on a Top-K algorithm, K target sentences, of which similarity scores meet a preset condition, from all the target sentences, and respectively combine each selected target sentence with the source sentence to form a candidate sentence pair, where K is a preset threshold of the candidate sentence pair.
And the regularization module 14 is configured to perform regularization processing on the similarity scores corresponding to the candidate sentence pairs, and update the similarity scores corresponding to the candidate sentence pairs based on an obtained regularization processing result.
And the classification module 15 is configured to classify all candidate sentence pairs based on the pre-training language model to obtain a classification probability corresponding to the candidate sentence pair, and if the classification probability is greater than a preset threshold, take the candidate sentence pair as a parallel sentence.
Optionally, in the multilingual translation model of the encoding module 11, the apparatus for mining parallel corpuses further includes:
and the data acquisition module is used for acquiring the training sentences and the target language embedded representation, and inputting the training sentences and the target language embedded representation into the initial multilingual translation model, wherein the target language embedded representation refers to a word embedding mode of a target language.
And the training sentence coding module is used for coding the training sentences based on a Transformer algorithm encoder to obtain coding vectors.
And the pooling module is used for pooling the coding vectors based on a preset pooling mode to obtain pooled vectors.
And the connection module is used for connecting the coding vector and the pooling vector to obtain a connection vector.
And the decoding module is used for decoding the connection vector and the target language embedded representation based on a decoder of a Transformer algorithm to obtain a decoding vector.
And the loss calculation module is used for performing loss calculation on the decoding vector to obtain a loss value.
And the first loss module is used for returning to obtain the training sentences and the target language embedded representation and inputting the training sentences and the target language embedded representation into the initial multilingual translation model to continue executing the steps if the loss value exceeds the preset loss value.
And the second loss module is used for obtaining the multilingual translation model if the loss value does not exceed the preset loss value.
In this embodiment, the encoding module 11 further includes:
and the input unit is used for inputting the source sentence and the target sentences into the multilingual translation model.
And the first vector acquisition unit is used for extracting the characteristics of the source sentence to obtain a first vector corresponding to the source sentence.
And the second vector acquisition unit is used for extracting the characteristics of each target sentence to obtain a second vector corresponding to each target sentence.
And the coding unit is used for coding the first vector and each second vector to obtain a target coding vector corresponding to each second vector.
And the mapping unit is used for mapping all the target coding vectors to the vector space corresponding to the source sentence.
In this embodiment, the candidate sentence pair extracting module 13 further includes:
and the minimum heap establishing unit is used for randomly selecting the similarity scores of the K target sentences from all the target sentences based on a minimum heap Top-K algorithm to establish a minimum heap, wherein the minimum heap comprises a heap Top which is the minimum similarity score in the K target sentences, and the unselected target sentences serve as the residual target sentences.
And the selecting unit is used for selecting the similarity score of any one of the rest target sentences as a contrast similarity score and comparing the contrast similarity score with the similarity score of the heap top.
A first updating unit, configured to update the remaining target sentence if the contrast similarity score is not greater than the similarity score at the top of the heap.
And the second updating unit is used for taking the contrast similarity score as a new similarity score of the heap top and updating the residual target sentences if the contrast similarity score is larger than the similarity score of the heap top.
And the first selection unit is used for returning and selecting the similarity score of any one of the remaining target sentences as a contrast similarity score when the remaining target sentences are not selected completely, and continuously executing the step of comparing the contrast similarity score with the similarity score at the top of the heap.
And the second selection unit is used for forming a candidate sentence pair by the target sentences corresponding to all the similarity scores contained in the minimum heap and the source sentence when the selection of the residual target sentences is finished.
In this embodiment, the regularization module 14 further comprises:
and the regularization unit is used for performing softmax regularization processing on the corresponding similarity scores of all the candidate sentences to obtain regularized similarity.
And the updating unit is used for updating the similarity scores corresponding to all candidate sentence pairs according to the regularized similarity.
Optionally, the pre-trained language model in the classification module 15 further includes:
and the positive and negative sample acquisition unit is used for acquiring a positive sample and a negative sample, wherein the positive sample and the negative sample comprise label vectors, the positive sample is a parallel sentence in the target sentence, and the negative sample is a non-parallel sentence in the target sentence, the number of which is equal to the preset negative sample, and the non-parallel sentences are selected according to the sequence of similarity from large to small.
And the training unit is used for training the initialized pre-training language model based on the positive sample and the negative sample to obtain the pre-training model.
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
For the specific definition of the parallel corpus mining device, reference may be made to the above definition of the parallel corpus mining method, which is not described herein again. All or part of the modules in the parallel corpus mining device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data related to the mining method of the parallel corpora. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of mining parallel corpora.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the method for mining parallel corpuses in the above embodiments are implemented, for example, steps S101 to S105 shown in fig. 2 and other extensions of the method and related steps are extended. Alternatively, the processor implements the functions of the modules/units of the parallel corpus mining device in the above embodiment, for example, the functions of the modules 11 to 15 shown in fig. 3, when executing the computer program. To avoid repetition, further description is omitted here.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like which is the control center for the computer device and which connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.
The memory may be integrated in the processor or may be provided separately from the processor.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program is executed by a processor to implement the steps of the mining method of parallel corpus in the above-mentioned embodiment, such as the steps S101 to S105 shown in fig. 2 and the extensions of other extensions and related steps of the method. Alternatively, the computer program may be executed by a processor to implement the functions of the modules/units of the parallel corpus mining device in the above embodiment, for example, the functions of the modules 11 to 15 shown in fig. 3. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (9)
1. A method for mining parallel corpora is characterized by comprising the following steps:
respectively coding a source sentence and each target sentence based on a multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors to a vector space corresponding to the source sentence, wherein the source sentence is a sentence corresponding to a source language, and the target sentence is a sentence corresponding to a target language;
for each target sentence, calculating the similarity between a target coding vector corresponding to the target sentence and a vector corresponding to a source sentence in the vector space to obtain a similarity score corresponding to the target sentence;
selecting K target sentences of which the similarity scores meet preset conditions from all the target sentences based on a Top-K algorithm, and forming candidate sentence pairs by combining each selected target sentence and the source sentence respectively, wherein K is a preset threshold value of the candidate sentence pairs;
regularization processing is carried out on the corresponding similarity scores of the candidate sentence pairs, and the similarity scores of the candidate sentence pairs are updated based on the obtained regularization processing result;
classifying all the candidate sentence pairs based on a pre-training language model to obtain the classification probability corresponding to the candidate sentence pairs, and if the classification probability is greater than a preset threshold value, taking the candidate sentence pairs as parallel sentences;
the method comprises the following steps of selecting K target sentences with similarity scores meeting preset conditions from all target sentences based on a Top-K algorithm, and forming candidate sentence pairs by combining each selected target sentence and the source sentence respectively, wherein the step comprises the following steps:
based on a minimum heap Top-K algorithm, randomly selecting similarity scores of K target sentences from all target sentences, and establishing a minimum heap, wherein the minimum heap comprises a heap Top which is the minimum similarity score of the K target sentences, and unselected target sentences serve as residual target sentences;
selecting the similarity score of any one of the remaining target sentences as a contrast similarity score, and comparing the contrast similarity score with the similarity score of the heap top;
if the contrast similarity score is not greater than the similarity score at the top of the heap, updating the remaining target sentences;
if the contrast similarity score is larger than the similarity score of the heap top, taking the contrast similarity score as a new similarity score of the heap top, and updating the remaining target sentences;
when the remaining target sentences are not selected completely, returning to select the similarity score of any one of the remaining target sentences as a contrast similarity score, and continuing to execute the step of comparing the contrast similarity score with the similarity score at the top of the heap;
and when the selection of the residual target sentences is finished, the target sentences corresponding to all the similarity scores contained in the minimum heap and the source sentences form candidate sentence pairs.
2. The method of claim 1, wherein the encoding of the source sentence and each target sentence based on the multilingual translation model separately obtains a vector corresponding to the source sentence and a target encoding vector corresponding to each target sentence, and before mapping the target encoding vectors into a vector space corresponding to the source sentence, the method further comprises:
acquiring a training sentence and a target language embedded representation, and inputting the training sentence and the target language embedded representation into an initial multilingual translation model, wherein the target language embedded representation refers to a word embedding mode of a target language;
the encoder based on the Transformer algorithm is used for encoding the training sentence to obtain an encoding vector;
based on a preset pooling mode, pooling the coding vectors to obtain pooled vectors;
connecting the coding vector and the pooling vector to obtain a connecting vector;
decoding the connection vector and the target language embedded representation by a decoder based on a Transformer algorithm to obtain a decoding vector;
performing loss calculation on the decoding vector to obtain a loss value;
if the loss value exceeds a preset loss value, returning to the step of acquiring a training sentence and a target language embedded representation and inputting the training sentence and the target language embedded representation into an initial multilingual translation model, and continuing execution;
and if the loss value does not exceed a preset loss value, obtaining the multilingual translation model.
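A minimal PyTorch sketch of the training procedure in claim 2, written under several assumptions: mean pooling stands in for the "preset pooling mode", cross-entropy stands in for the unspecified loss calculation, causal masking in the decoder is omitted for brevity, and all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class MultilingualTranslationModel(nn.Module):
    """Sketch: Transformer encoder -> pooling -> concatenation -> Transformer decoder."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc = self.encoder(self.embed(src_ids))          # coding vectors, shape (B, S, d)
        pooled = enc.mean(dim=1, keepdim=True)           # pooling vector, shape (B, 1, d)
        memory = torch.cat([enc, pooled], dim=1)         # connection vector, shape (B, S+1, d)
        dec = self.decoder(self.embed(tgt_ids), memory)  # decoding vectors, shape (B, T, d)
        return self.out(dec)                             # logits over the target vocabulary

# One hypothetical training step; the threshold check mirrors the claimed stopping rule.
model = MultilingualTranslationModel(vocab_size=32000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

src_ids = torch.randint(0, 32000, (2, 16))  # toy training sentences (token ids)
tgt_ids = torch.randint(0, 32000, (2, 18))  # toy target-language token ids

loss = criterion(model(src_ids, tgt_ids).reshape(-1, 32000), tgt_ids.reshape(-1))
loss.backward()
optimizer.step()

preset_loss = 0.5                   # assumed threshold for the loss value
done = loss.item() <= preset_loss   # stop when the loss no longer exceeds the preset value
```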
3. The method of claim 2, wherein the step of encoding the source sentence and each target sentence separately based on the multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors into a vector space corresponding to the source sentence comprises:
inputting the source sentence and a plurality of target sentences into the multilingual translation model;
extracting the characteristics of the source sentence to obtain a first vector corresponding to the source sentence;
extracting features of each target sentence to obtain a second vector corresponding to each target sentence;
coding the first vector and each second vector to obtain a target coding vector corresponding to each second vector;
and mapping all the target coding vectors to the vector space corresponding to the source sentence.
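The encoding and mapping steps of claim 3 can be illustrated with the following standalone sketch. The small encoder stands in for the trained multilingual translation model, the linear projection is only an assumed form of the mapping into the source-sentence vector space, and the cosine similarity at the end is included just to connect to the similarity-score step; none of these details are specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in components; in practice these would come from the trained multilingual translation model.
embed = nn.Embedding(32000, 512)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=2)
projection = nn.Linear(512, 512)  # assumed mapping into the source-sentence vector space

@torch.no_grad()
def sentence_vector(token_ids):
    """Feature extraction, encoding and mean pooling: one fixed-size vector per sentence."""
    return encoder(embed(token_ids)).mean(dim=1)

source_ids = torch.randint(0, 32000, (1, 12))   # toy source sentence (token ids)
target_ids = torch.randint(0, 32000, (5, 15))   # toy batch of target sentences

source_vec = sentence_vector(source_ids)                 # first vector, shape (1, 512)
target_vecs = projection(sentence_vector(target_ids))    # mapped target coding vectors, shape (5, 512)

# Similarity score between the source vector and each mapped target coding vector.
scores = F.cosine_similarity(source_vec, target_vecs, dim=-1)   # shape (5,)
```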
4. The method according to claim 1, wherein the step of regularizing the similarity scores corresponding to the candidate sentence pairs and updating the similarity scores corresponding to the candidate sentence pairs based on the regularization result comprises:
performing softmax regularization processing on the similarity scores corresponding to all the candidate sentence pairs to obtain regularized similarities;
and updating the similarity scores corresponding to all the candidate sentence pairs according to the regularized similarities.
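The regularization step of claim 4 amounts to a softmax over the candidate-pair scores; a minimal sketch follows (the function name is illustrative, and no temperature scaling is applied):

```python
import torch

def regularize_scores(similarity_scores):
    """Softmax-regularize the candidate-pair similarity scores and return the updated scores."""
    scores = torch.tensor(similarity_scores, dtype=torch.float32)
    return torch.softmax(scores, dim=0).tolist()

# e.g. regularize_scores([0.91, 0.87, 0.42]) -> three positive scores that sum to 1.0
```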
5. The method according to claim 1, wherein all the candidate sentence pairs are classified based on the pre-training language model to obtain the classification probability corresponding to each candidate sentence pair, a candidate sentence pair being taken as a parallel sentence if its classification probability is greater than the preset threshold, and the training method of the pre-training language model comprises:
acquiring positive samples and negative samples, wherein the positive samples and the negative samples comprise tag vectors, the positive samples are parallel sentences among the target sentences, and the negative samples are non-parallel sentences among the target sentences, selected in descending order of similarity until a preset number of negative samples is reached;
and training the initialized pre-training language model based on the positive samples and the negative samples to obtain the pre-training language model.
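One possible realization of the training described in claim 5 is to fine-tune a pre-trained language model as a sentence-pair classifier with Hugging Face transformers. The checkpoint name, toy data, label encoding and hyperparameters below are assumptions, not taken from the patent.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name is an assumption; any multilingual pre-trained language model
# with a sequence-classification head could be substituted.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# Positive samples: parallel sentence pairs (label 1).
# Negative samples: the highest-similarity non-parallel target sentences, taken in
# descending order of similarity until the preset negative-sample count is reached (label 0).
pairs = [("How are you?", "你好吗？", 1),
         ("How are you?", "今天天气不错。", 0)]

sources = [p[0] for p in pairs]
targets = [p[1] for p in pairs]
labels = torch.tensor([p[2] for p in pairs])

batch = tokenizer(sources, targets, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**batch, labels=labels)  # cross-entropy loss over the pair labels
outputs.loss.backward()
optimizer.step()

# At mining time, the softmax of the logits gives the classification probability;
# candidate pairs whose probability exceeds the preset threshold are kept as parallel sentences.
probabilities = torch.softmax(outputs.logits, dim=-1)[:, 1]
```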
6. A parallel corpus mining device, comprising:
the coding module is used for coding a source sentence and each target sentence respectively based on a multilingual translation model to obtain a vector corresponding to the source sentence and a target coding vector corresponding to each target sentence, and mapping the target coding vectors to a vector space corresponding to the source sentence, wherein the source sentence is a sentence corresponding to a source language, and the target sentence is a sentence corresponding to a target language;
the similarity calculation module is used for calculating the similarity between a target coding vector corresponding to the target sentence and a vector corresponding to the source sentence in the vector space aiming at each target sentence to obtain a similarity score corresponding to the target sentence;
the candidate sentence pair selection module is used for selecting K target sentences of which the similarity scores meet preset conditions from all the target sentences based on a Top-K algorithm, and respectively combining each selected target sentence with the source sentence to form a candidate sentence pair, wherein K is a preset threshold value of the candidate sentence pair;
the regularization module is used for regularizing the similarity scores corresponding to the candidate sentence pairs and updating the similarity scores corresponding to the candidate sentence pairs based on the obtained regularization processing result;
the classification module is used for classifying all the candidate sentence pairs based on a pre-training language model to obtain the classification probability corresponding to the candidate sentence pairs, and if the classification probability is greater than a preset threshold value, the candidate sentence pairs are used as parallel sentences;
wherein the candidate sentence pair selection module comprises:
a minimum heap establishing unit, configured to arbitrarily select similarity scores of K target sentences from all target sentences based on a minimum heap Top-K algorithm, and establish a minimum heap, where the minimum heap includes a heap Top, the heap Top is a minimum similarity score in the K target sentences, and unselected target sentences serve as remaining target sentences;
a selecting unit, configured to select a similarity score of any one of the remaining target sentences as a comparison similarity score, and compare the comparison similarity score with the similarity score at the top of the heap;
a first updating unit, configured to update the remaining target sentences if the comparison similarity score is not greater than the similarity score at the top of the heap;
a second updating unit, configured to, if the comparison similarity score is greater than the similarity score at the top of the heap, take the comparison similarity score as the new similarity score at the top of the heap, and update the remaining target sentences;
a first selecting unit, configured to, when not all of the remaining target sentences have been selected, return to the step of selecting the similarity score of any one of the remaining target sentences as a comparison similarity score and continue comparing the comparison similarity score with the similarity score at the top of the heap;
and a second selecting unit, configured to, when the selection of the remaining target sentences is finished, form candidate sentence pairs from the target sentences corresponding to all the similarity scores contained in the minimum heap and the source sentence.
7. The apparatus for parallel corpus mining according to claim 6, wherein said apparatus further comprises:
the data acquisition module is used for acquiring a training sentence and a target language embedded representation, and inputting the training sentence and the target language embedded representation into an initial multilingual translation model, wherein the target language embedded representation refers to a word embedding mode of a target language;
the training sentence coding module is used for coding the training sentence based on an encoder of a Transformer algorithm to obtain a coding vector;
the pooling module is used for pooling the coding vectors based on a preset pooling mode to obtain pooled vectors;
the connection module is used for connecting the coding vector and the pooling vector to obtain a connection vector;
the decoding module is used for decoding the connection vector and the target language embedded representation based on a decoder of a Transformer algorithm to obtain a decoding vector;
the loss calculation module is used for performing loss calculation on the decoding vector to obtain a loss value;
the first loss module is used for returning to obtain a training sentence and a target language embedded representation and inputting the training sentence and the target language embedded representation into an initial multilingual translation model to continue executing the steps if the loss value exceeds a preset loss value;
and the second loss module is used for obtaining the multilingual translation model if the loss value does not exceed a preset loss value.
8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for mining parallel corpora according to any one of claims 1 to 5 when executing the computer program.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the parallel corpus mining method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110930495.8A CN113836192B (en) | 2021-08-13 | 2021-08-13 | Parallel corpus mining method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110930495.8A CN113836192B (en) | 2021-08-13 | 2021-08-13 | Parallel corpus mining method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113836192A CN113836192A (en) | 2021-12-24 |
CN113836192B (en) | 2022-05-03 |
Family
ID=78960544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110930495.8A Active CN113836192B (en) | 2021-08-13 | 2021-08-13 | Parallel corpus mining method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113836192B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114996459A (en) * | 2022-03-18 | 2022-09-02 | 星宙数智科技(珠海)有限公司 | Parallel corpus classification method and device, computer equipment and storage medium |
CN115828931B (en) * | 2023-02-09 | 2023-05-02 | 中南大学 | Chinese and English semantic similarity calculation method for paragraph level text |
CN116911374B (en) * | 2023-09-13 | 2024-01-09 | 腾讯科技(深圳)有限公司 | Text challenge sample generation method and device, electronic equipment and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304390B (en) * | 2017-12-15 | 2020-10-16 | 腾讯科技(深圳)有限公司 | Translation model-based training method, training device, translation method and storage medium |
CN109271643A (en) * | 2018-08-08 | 2019-01-25 | 北京捷通华声科技股份有限公司 | A kind of training method of translation model, interpretation method and device |
CN111401079A (en) * | 2018-12-14 | 2020-07-10 | 波音公司 | Training method and device of neural network machine translation model and storage medium |
CN111460838B (en) * | 2020-04-23 | 2023-09-22 | 腾讯科技(深圳)有限公司 | Pre-training method, device and storage medium of intelligent translation model |
CN112380837B (en) * | 2020-11-13 | 2023-12-22 | 平安科技(深圳)有限公司 | Similar sentence matching method, device, equipment and medium based on translation model |
CN112446224A (en) * | 2020-12-07 | 2021-03-05 | 北京彩云环太平洋科技有限公司 | Parallel corpus processing method, device and equipment and computer readable storage medium |
- 2021-08-13 CN CN202110930495.8A patent/CN113836192B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113836192A (en) | 2021-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113836192B (en) | Parallel corpus mining method and device, computer equipment and storage medium | |
CN110263349B (en) | Corpus evaluation model training method and device, storage medium and computer equipment | |
CN111581229B (en) | SQL statement generation method and device, computer equipment and storage medium | |
WO2021042503A1 (en) | Information classification extraction method, apparatus, computer device and storage medium | |
US11983493B2 (en) | Data processing method and pronoun resolution neural network training method | |
WO2023160472A1 (en) | Model training method and related device | |
CN114139551A (en) | Method and device for training intention recognition model and method and device for recognizing intention | |
CN112380837B (en) | Similar sentence matching method, device, equipment and medium based on translation model | |
CN110598210B (en) | Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium | |
CN113536795B (en) | Method, system, electronic device and storage medium for entity relation extraction | |
CN110008482B (en) | Text processing method and device, computer readable storage medium and computer equipment | |
CN112052329A (en) | Text abstract generation method and device, computer equipment and readable storage medium | |
CN112101042A (en) | Text emotion recognition method and device, terminal device and storage medium | |
CN115759119B (en) | Financial text emotion analysis method, system, medium and equipment | |
WO2023116572A1 (en) | Word or sentence generation method and related device | |
CN113886550A (en) | Question-answer matching method, device, equipment and storage medium based on attention mechanism | |
CN115495553A (en) | Query text ordering method and device, computer equipment and storage medium | |
WO2023134085A1 (en) | Question answer prediction method and prediction apparatus, electronic device, and storage medium | |
CN114881035A (en) | Method, device, equipment and storage medium for augmenting training data | |
CN113609873A (en) | Translation model training method, device and medium | |
CN112579739A (en) | Reading understanding method based on ELMo embedding and gating self-attention mechanism | |
CN112926334A (en) | Method and device for determining word expression vector and electronic equipment | |
CN116579327A (en) | Text error correction model training method, text error correction method, device and storage medium | |
CN116956954A (en) | Text translation method, device, electronic equipment and storage medium | |
CN114048753B (en) | Word sense recognition model training, word sense judging method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CP03 | Change of name, title or address | Address after: 519000 5-404, floor 5, Yunxi Valley Digital Industrial Park, No. 168, Youxing Road, Xiangzhou District, Zhuhai City, Guangdong Province (block B, Meixi Commercial Plaza) (centralized office area); Patentee after: Shenyi information technology (Zhuhai) Co.,Ltd.; Address before: 519031 room 409, building 18, Hengqin Macao Youth Entrepreneurship Valley, No. 1889, Huandao East Road, Hengqin new area, Zhuhai, Guangdong; Patentee before: Shenyi information technology (Hengqin) Co.,Ltd. |