CN111597824B - Training method and device for language translation model - Google Patents


Info

Publication number: CN111597824B
Authority: CN (China)
Prior art keywords: corpus, source, target, language, translation
Legal status: Active (granted). (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202010307663.3A
Other languages: Chinese (zh)
Other versions: CN111597824A (en)
Inventor: 陈巍华 (Chen Weihua)
Current Assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Application filed 2020-04-17 by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd; priority to CN202010307663.3A
Publication of CN111597824A: 2020-08-28
Application granted; publication of CN111597824B: 2023-05-26

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a training method and device for a language translation model. The method comprises the following steps: step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other; step A2, training an initial source language training model on a source corpus P1 to obtain a target source language training model M2; step A3, obtaining a source corpus S3 and a corresponding target corpus T2 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1; and step A4, obtaining a language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2. With this technical scheme, the source corpus can be expanded to obtain resource-rich, mutually translated source and target corpora, and in turn a language translation model with high translation precision, accuracy and quality.

Description

Training method and device for language translation model
Technical Field
The invention relates to the technical field of translation, and in particular to a training method and device for a language translation model.
Background
At present, most mainstream data enhancement algorithms for the translation task expand the corpus either by introducing noise (word insertion, deletion, reordering, and the like) or by using the Back-Translation method to generate pseudo parallel bilingual data from a large amount of target-language monolingual text, and then train a language translation model on the resulting bilingual data. However, whichever way the corpus data is obtained, the translation accuracy and quality of the trained language translation model are not high enough.
Disclosure of Invention
The embodiment of the invention provides a training method and device for a language translation model. The technical scheme is as follows:
according to a first aspect of an embodiment of the present invention, there is provided a training method of a language translation model, including:
step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
step A2, training an initial source language training model by utilizing a source corpus P1 to obtain a target source language training model M2;
step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
In one embodiment, the step A3 includes:
expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening of the candidate source corpus S2 and the target corpus T1 using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
In one embodiment, the method further comprises:
re-using the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1 respectively, and executing the step A2 and the step A3 to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the step A4 comprises the following steps:
and obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
In one embodiment, the method further comprises:
acquiring a preset number of target monolingual sentences;
translating the target monolingual sentences by using the language translation model to obtain a corresponding source corpus P2;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
According to a second aspect of an embodiment of the present invention, there is provided a training apparatus for a language translation model, including:
the first processing module is used for acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
the training module is used for training the initial source language training model by utilizing the source corpus P1 to obtain a target source language training model M2;
the first obtaining module is configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
the second obtaining module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, and the target corpus T2.
In one embodiment, the first acquisition module includes:
the expansion sub-module is used for expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and the screening sub-module is used for screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening submodule is specifically configured to:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
In one embodiment, the apparatus further comprises:
the second processing module is configured to re-use the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1, and execute the training module and the first obtaining module, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second acquisition module includes:
the obtaining sub-module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3, and the source corpus S4.
In one embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring a preset number of target monolingual sentences;
the translation module is used for translating the target monolingual sentences by utilizing the language translation model to obtain a corresponding source corpus P2;
and the training module is used for retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
an initial binary classification translation network M1 can be constructed from the low-resource (namely, small in quantity and/or simple) source corpus S1 and target corpus T1; the initial source language training model can then be trained on the source corpus P1 to obtain a more accurate target source language training model M2; the source corpus S1 can further be expanded by using the binary classification translation network M1 and the target source language training model M2, yielding a resource-rich, mutually translated source corpus S3 and its corresponding target corpus T2; and a language translation model with higher translation precision, accuracy and quality can then be obtained from the enriched corpora (namely the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2).
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart illustrating a method of training a language translation model, according to an exemplary embodiment.
FIG. 2 is a block diagram illustrating a training apparatus for a language translation model, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
In order to solve the above technical problems, an embodiment of the present invention provides a training method for a language translation model. The method may be used in a training program, system or device for the language translation model, and the execution subject of the method may be a terminal or a server. As shown in fig. 1, the method includes steps A1 to A4:
step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other. That the source corpus S1 and the target corpus T1 are bilingual corpora means that S1 and T1 express the same content, only in different languages. The source corpus S1 is a low-resource corpus, meaning that the number of sentences in S1 is smaller than a preset number; the invention is therefore essentially a training method for a low-resource translation model. The binary classification translation network M1 may be any network with the capability of judging whether any source sentence and target sentence are mutual translations, such as a binary classification convolutional neural network.
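For concreteness, the following is a minimal sketch of one way such a binary classification network M1 could be realized, assuming a multilingual cross-encoder fine-tuned for pair classification; the checkpoint name and scoring interface are illustrative assumptions, since the description does not prescribe a specific architecture.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# M1: cross-encoder that scores whether (source, target) are mutual translations.
# Label 1 = "mutual translations"; label 0 = "not translations".
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
m1 = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

def pair_probability(src: str, tgt: str) -> float:
    """Probability under M1 that src and tgt are translations of each other."""
    inputs = tokenizer(src, tgt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = m1(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

Training data for such a classifier can be drawn from (S1, T1): aligned pairs serve as positives, and randomly mismatched pairs as negatives.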
Step A2, training an initial source language training model on a source corpus P1 to obtain a target source language training model M2. The initial source language training model may be an open-source language training model; the target source language training model M2 is obtained by training a pretrained model (such as BERT, Bidirectional Encoder Representations from Transformers) on the source corpus P1. BERT itself is only a network; the model M2 trained from it has a cloze (fill-in-the-blank) capability and can therefore be used to expand the corpus. In addition, both the initial source language training model and the target source language training model have the function of enriching the corpus, that is, turning a given source corpus into a larger number of corpora.
The source corpus P1 may be a monolingual corpus or a bilingual corpus, where a monolingual corpus is one that is currently expressed in only one language, and a bilingual corpus is one that is currently expressed in multiple languages.
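As a hedged illustration of this step, the sketch below continues masked-language-model training of an open pretrained checkpoint on P1; the checkpoint, file name and hyperparameters are assumptions for illustration only.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed source language
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# P1 assumed to be one sentence per line in a plain-text file.
p1 = load_dataset("text", data_files={"train": "p1_source_corpus.txt"})["train"]
p1 = p1.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="m2", num_train_epochs=1),
    train_dataset=p1,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the resulting M2 has the cloze ability used below for corpus expansion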
Step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
Step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2. The language translation model is a model with a (mutual) language translation function, for example, a model capable of translating Chinese into a language such as English or Russian.
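One plausible reading of step A4, sketched below, is fine-tuning a standard encoder-decoder translation model on the union of the original pairs (S1, T1) and the generated pairs (S3, T2); the checkpoint, the zh-en direction and the unbatched training loop are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")  # assumed direction
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# S1, T1, S3, T2 are assumed to be parallel lists of sentences.
train_pairs = list(zip(S1, T1)) + list(zip(S3, T2))

model.train()
for src, tgt in train_pairs:
    batch = tok(src, text_target=tgt, return_tensors="pt", truncation=True)
    loss = model(**batch).loss   # cross-entropy against the target sentence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()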
An initial binary classification translation network M1 can be constructed from the low-resource (namely, small in quantity and/or simple) source corpus S1 and target corpus T1; the initial source language training model can then be trained on the source corpus P1 to obtain a more accurate target source language training model M2; the source corpus S1 can further be expanded by using the binary classification translation network M1 and the target source language training model M2, yielding a resource-rich, mutually translated source corpus S3 and its corresponding target corpus T2; and a language translation model with higher translation precision, accuracy and quality can then be obtained from the enriched corpora (namely the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2).
In one embodiment, the step A3 includes:
expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
Because the number of sentences in the source corpus S1 is small, a language translation model trained only on S1 would have low accuracy. The target source language training model M2 can therefore be used to expand the source corpus S1 into a larger number of candidate source corpora S2, and the binary classification translation network M1 can further screen the candidate source corpus S2 against the target corpus T1, yielding a source corpus S3 and a target corpus T2 that are bilingual corpora with high probability; combining the source corpus S3 and the target corpus T2 improves the translation accuracy and quality of the language translation model.
In one embodiment, the screening of the candidate source corpus S2 and the target corpus T1 using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
acquiring a preset probability threshold for corpora being mutual translations (namely, bilingual corpora);
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
During screening, the preset corpus probability threshold is applied to filter out candidate source and target pairs with a low probability of being bilingual corpora, ensuring that the screened-out source corpus S3 and target corpus T2 have a high probability of being mutual translations.
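The screening can be read as a simple threshold filter over M1's scores, as in the following sketch; pair_probability() is the M1 scorer sketched under step A1, and the threshold value 0.9 is an assumed example (the description only requires that a threshold be preset).

THRESHOLD = 0.9  # preset probability threshold for "mutual translations" (assumed value)

def screen_pairs(candidate_s2, target_t1):
    """Keep only candidate pairs that M1 scores above the preset threshold."""
    s3, t2 = [], []
    for src, tgt in zip(candidate_s2, target_t1):
        if pair_probability(src, tgt) > THRESHOLD:
            s3.append(src)
            t2.append(tgt)
    return s3, t2  # source corpus S3 and target corpus T2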
In one embodiment, the method further comprises:
re-using the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1 respectively, and executing the step A2 and the step A3 to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the step A4 comprises the following steps:
and obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
By re-using the source corpus S3 and its corresponding target corpus T2 as the target corpus T1 and the source corpus S1 respectively, and re-executing steps A2 and A3, more mutually translated source and target corpora can be obtained, namely a target corpus T3 and a source corpus S4 that forms a bilingual corpus with T3. The larger set of bilingual pairs, namely (S1, T1), (S3, T2) and (S4, T3), can then be used to obtain a language translation model with higher translation accuracy and quality.
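Schematically, this direction swap amounts to re-running the same augmentation routine with the two sides exchanged, as in the sketch below; train_mlm() and expand() stand for the step A2 and step A3 procedures sketched above, screen_pairs() is the filter sketched earlier, and all names and file paths are illustrative.

def augment(source_side, target_side, mono_corpus_path):
    m2 = train_mlm(mono_corpus_path)              # step A2 (sketched above)
    candidates = expand(source_side, m2)          # step A3: cloze-based expansion
    return screen_pairs(candidates, target_side)  # step A3: M1 screening

S3, T2 = augment(S1, T1, "p1_source_corpus.txt")    # forward direction
T3, S4 = augment(T2, S3, "target_side_corpus.txt")  # swapped direction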
In one embodiment, the method further comprises:
acquiring a preset number of target monolingual sentences, where a target monolingual sentence is one that is currently expressed in only one language (the target language);
translating the target monolingual sentences by using the language translation model to obtain a corresponding source corpus P2;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
By acquiring a preset number of target monolingual sentences, translating them with the language translation model to obtain a source corpus P2 that is their mutual translation (the Back-Translation approach), and then retraining the language translation model on the source corpus P2 and the target monolingual sentences, the translation precision and quality of the language translation model are further improved.
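A minimal sketch of this Back-Translation stage follows, assuming a reverse (target-to-source) translation model and tokenizer are available for generation; the generation settings are illustrative.

def back_translate(target_monolingual, reverse_model, reverse_tok):
    """Translate target monolingual sentences back into pseudo source sentences P2."""
    p2 = []
    for sent in target_monolingual:
        batch = reverse_tok(sent, return_tensors="pt", truncation=True)
        out = reverse_model.generate(**batch, max_new_tokens=128)
        p2.append(reverse_tok.decode(out[0], skip_special_tokens=True))
    return p2  # retrain the translation model on the pairs (P2, target_monolingual)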
In summary, the invention expands a low-resource bilingual corpus into a larger amount of high-quality bilingual data for building the translation model, which improves the model's translation capability; combined with the higher-quality pseudo bilingual data obtained by Back-Translation, the translation capability of the low-resource translation model is further improved.
The technical scheme of the invention will be further described in detail as follows:
step 1: modeling a source corpus S1 and a target corpus T1 in low-resource bilingual corpus, and training a two-classification network M1, wherein the network has the capability of judging whether any source sentence and target sentence are mutually translated.
Step 2: For the source corpus (monolingual or bilingual), an open-source pretrained language model is used, or a large amount of source-language text is used for training, to obtain a pretrained model M2 (such as BERT, Bidirectional Encoder Representations from Transformers).
Step 3: Corpus expansion is performed on the source corpus S1 using M2: tokens at random positions of a sentence are masked, or mask tokens are inserted at some positions, and M2 predicts the words at the mask positions, yielding the candidate source corpus S2.
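A hedged sketch of this cloze-style expansion follows, assuming M2 is exposed through a fill-mask interface; the checkpoint name, the character-level masking and the number of variants per sentence are illustrative assumptions.

import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")  # stands in for M2

def expand_sentence(sentence: str, n_variants: int = 3):
    """Mask a random position and let M2 predict it, yielding variant sentences."""
    chars = list(sentence)  # character-level masking, suitable for Chinese text
    variants = []
    for _ in range(n_variants):
        i = random.randrange(len(chars))
        masked = "".join(chars[:i]) + fill.tokenizer.mask_token + "".join(chars[i + 1:])
        top = fill(masked, top_k=1)[0]       # best cloze completion
        variants.append(top["sequence"])     # full sentence with the mask filled in
    return variants  # candidate sentences for S2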
Step 4: The generated candidate source corpus S2 and the corresponding target corpus T1 are passed through the binary classification network M1, and the data in S2 whose probability of being mutual translations with the target corpus exceeds a certain threshold are screened out, yielding a pseudo source corpus S3 and the corresponding target corpus T1.
Step 5: The roles of the source corpus and the target corpus are swapped, and steps 2 to 4 are repeated to obtain a pseudo target corpus T3 and the source corpus S1 corresponding to the target corpus T3.
Step 6: The original bilingual corpus (S1, T1) and the pseudo bilingual corpora (S3, T1) and (S1, T3) are used as training corpora to train the translation model, obtaining a bilingual translation model M3.
Step 7: A large number of target monolingual sentences are passed through M3 by means of Back-Translation to obtain high-quality pseudo bilingual pairs (S4, T4); the obtained data are used to train the translation model, improving the low-resource translation performance.
Finally, it should be noted that the above embodiments may be freely combined by those skilled in the art according to actual needs.
Corresponding to the training method of the language translation model provided by the embodiment of the present invention, the embodiment of the present invention further provides a training device of the language translation model, as shown in fig. 2, where the device includes:
the first processing module 201 is configured to obtain a source corpus S1 and a target corpus T1 that are bilingual corpora, and construct a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, where the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
the training module 202 is configured to train the initial source language training model by using the source corpus P1 to obtain a target source language training model M2;
the first obtaining module 203 is configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
the second obtaining module 204 is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, and the target corpus T2.
In one embodiment, the first acquisition module includes:
the expansion sub-module is used for expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and the screening sub-module is used for screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening submodule is specifically configured to:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
In one embodiment, the apparatus further comprises:
the second processing module is configured to re-use the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1, and execute the training module and the first obtaining module, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second acquisition module includes:
the obtaining sub-module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3, and the source corpus S4.
In one embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring a preset number of target monolingual sentences;
the translation module is used for translating the target monolingual sentences by utilizing the language translation model to obtain a corresponding source corpus P2;
and the training module is used for retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (2)

1. A method for training a language translation model, comprising:
step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
step A2, training an initial source language training model by utilizing a source corpus P1 to obtain a target source language training model M2;
step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2;
the step A3 comprises the following steps:
expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2;
the screening of the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1 to screen out the source corpus S3 and the target corpus T2 by utilizing the binary classification translation network M1 and the corpus probability threshold;
further comprises: re-using the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1 respectively, and executing the step A2 and the step A3 to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the step A4 comprises the following steps:
obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4;
further comprises:
acquiring a preset number of target monolingual sentences;
translating the target monolingual sentences by using the language translation model to obtain a corresponding source corpus P2;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
2. A training device for a language translation model, comprising:
the first processing module is used for acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
the training module is used for training the initial source language training model by utilizing the source corpus P1 to obtain a target source language training model M2;
the first obtaining module is configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
the second obtaining module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2;
the first acquisition module includes:
the expansion sub-module is used for expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
the screening submodule is used for screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2;
the screening submodule is specifically used for:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1 to screen out the source corpus S3 and the target corpus T2 by utilizing the binary classification translation network M1 and the corpus probability threshold;
the apparatus further comprises:
the second processing module is configured to re-use the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1, and execute the training module and the first obtaining module, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second acquisition module includes:
the obtaining submodule is used for obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4;
the apparatus further comprises:
the third acquisition module is used for acquiring a preset number of target monolingual sentences;
the translation module is used for translating the target monolingual sentences by utilizing the language translation model to obtain a corresponding source corpus P2;
and the training module is used for retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
CN202010307663.3A, filed 2020-04-17 (priority date 2020-04-17): Training method and device for language translation model. Active; granted as CN111597824B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010307663.3A | 2020-04-17 | 2020-04-17 | Training method and device for language translation model (granted as CN111597824B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010307663.3A | 2020-04-17 | 2020-04-17 | Training method and device for language translation model (granted as CN111597824B)

Publications (2)

Publication Number | Publication Date
CN111597824A (en) | 2020-08-28
CN111597824B (en) | 2023-05-26

Family

Family ID: 72190412

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010307663.3A (Active) | Training method and device for language translation model | 2020-04-17 | 2020-04-17

Country Status (1)

Country Link
CN (1) CN111597824B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389303A (en) * 2015-10-27 2016-03-09 北京信息科技大学 (Beijing Information Science and Technology University) Automatic heterogeneous corpus fusion method
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 (University of Electronic Science and Technology of China) A neural machine translation method for rare (low-resource) languages

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2054817A4 (en) * 2006-08-18 2009-10-21 Ca Nat Research Council Means and method for training a statistical machine translation system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389303A (en) * 2015-10-27 2016-03-09 北京信息科技大学 (Beijing Information Science and Technology University) Automatic heterogeneous corpus fusion method
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 (University of Electronic Science and Technology of China) A neural machine translation method for rare (low-resource) languages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lü Xueqiang; Wu Yongxu; Zhou Qiang; Liu Yin. Research on heterogeneous corpus fusion. Journal of Chinese Information Processing, 2016, (05), full text. *
Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin. Domain adaptation of translation models based on semantic distribution similarity. Journal of Shandong University (Natural Science), 2016, (07), full text. *

Also Published As

Publication Number | Publication Date
CN111597824A (en) | 2020-08-28

Similar Documents

Publication Publication Date Title
CN110134968B (en) Poem generation method, device, equipment and storage medium based on deep learning
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
US11669695B2 (en) Translation method, learning method, and non-transitory computer-readable storage medium for storing translation program to translate a named entity based on an attention score using neural network
US20190129695A1 (en) Programming by voice
KR20140049150A (en) Automatic translation postprocessing system based on user participating
Hassani BLARK for multi-dialect languages: towards the Kurdish BLARK
Werlen et al. Self-attentive residual decoder for neural machine translation
CN111144140A (en) Zero-learning-based Chinese and Tai bilingual corpus generation method and device
Çolakoğlu et al. Normalizing non-canonical Turkish texts using machine translation approaches
JP2016164707A (en) Automatic translation device and translation model learning device
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
CN111597824B (en) Training method and device for language translation model
KR20210035721A (en) Machine translation method using multi-language corpus and system implementing using the same
CN109657244B (en) English long sentence automatic segmentation method and system
Nanayakkara et al. Context aware back-transliteration from english to sinhala
Muaidi Levenberg-Marquardt learning neural network for part-of-speech tagging of Arabic sentences
CN109446537B (en) Translation evaluation method and device for machine translation
CN113705251A (en) Training method of machine translation model, language translation method and equipment
CN112528680A (en) Corpus expansion method and system
US11664010B2 (en) Natural language domain corpus data set creation based on enhanced root utterances
CN113270088B (en) Text processing method, data processing method, voice processing method, data processing device, voice processing device and electronic equipment
CN116452906B (en) Railway wagon fault picture generation method based on text description
Araújo et al. From VLibras to OpenSigns: Towards an Open Platform for Machine Translation of Spoken Languages into Sign Languages
CN113392657A (en) Training sample enhancement method and device, computer equipment and storage medium
Bounaas et al. Effects of pre-editing operations on audiovisual translation using TRADOS: an experimental analysis of Saudi students’ translations

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant