CN111597824B - Training method and device for language translation model - Google Patents


Info

Publication number: CN111597824B
Authority: CN (China)
Prior art keywords: corpus, source, target, language, translation
Legal status: Active (granted). (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202010307663.3A
Other languages: Chinese (zh)
Other versions: CN111597824A (en)
Inventor: 陈巍华 (Chen Weihua)
Current Assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Application filed 2020-04-17 by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd; priority to CN202010307663.3A
Publication of CN111597824A: 2020-08-28
Application granted; publication of CN111597824B: 2023-05-26

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a training method and device for a language translation model. The method comprises the following steps: step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other; step A2, training an initial source language training model on a source corpus P1 to obtain a target source language training model M2; step A3, obtaining a source corpus S3 and a corresponding target corpus T2 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1; and step A4, obtaining a language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2. With this technical scheme, the source corpus can be expanded to obtain resource-rich, mutually translated source and target corpora, and in turn a language translation model with high translation precision, accuracy and quality.

Description

Training method and device for language translation model
Technical Field
The invention relates to the technical field of translation, and in particular to a training method and device for a language translation model.
Background
At present, most mainstream data enhancement algorithms for the translation task expand the corpus either by introducing noise (word insertion, deletion, reordering, and the like) or by using the Back-Translation method to generate pseudo parallel bilingual data from a large amount of target-language monolingual text, and then train a language translation model on the resulting bilingual data. However, whichever way the corpus data is obtained, the translation accuracy and quality of the trained language translation model are not high enough.
Disclosure of Invention
The embodiment of the invention provides a training method and device for a language translation model. The technical scheme is as follows:
according to a first aspect of an embodiment of the present invention, there is provided a training method of a language translation model, including:
step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
step A2, training an initial source language training model by utilizing a source corpus P1 to obtain a target source language training model M2;
step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
In one embodiment, the step A3 includes:
expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening of the candidate source corpus S2 and the target corpus T1 using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
In one embodiment, the method further comprises:
re-using the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1 respectively, and executing the step A2 and the step A3 to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the step A4 comprises the following steps:
and obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
In one embodiment, the method further comprises:
acquiring a preset number of target monolingual sentences;
translating the target monolingual sentences by using the language translation model to obtain a corresponding source corpus P2;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
According to a second aspect of an embodiment of the present invention, there is provided a training apparatus for a language translation model, including:
the first processing module is used for acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
the training module is used for training the initial source language training model by utilizing the source corpus P1 to obtain a target source language training model M2;
the first obtaining module is configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
the second obtaining module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, and the target corpus T2.
In one embodiment, the first acquisition module includes:
the expansion sub-module is used for expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and the screening sub-module is used for screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening submodule is specifically configured to:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
In one embodiment, the apparatus further comprises:
the second processing module is configured to re-use the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1, and execute the training module and the first obtaining module, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second acquisition module includes:
the obtaining sub-module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3, and the source corpus S4.
In one embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring a preset number of target monolingual sentences;
the translation module is used for translating the target monolingual sentences by utilizing the language translation model to obtain a corresponding source corpus P2;
and the training module is used for retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
an initial binary classification translation network M1 can be constructed from the low-resource (namely, small in quantity and/or simple) source corpus S1 and target corpus T1; the initial source language training model can then be trained on the source corpus P1 to obtain a more accurate target source language training model M2; the source corpus S1 can further be expanded by using the binary classification translation network M1 and the target source language training model M2, yielding a resource-rich, mutually translated source corpus S3 and its corresponding target corpus T2; and a language translation model with higher translation precision, accuracy and quality can then be obtained from the enriched corpora (namely the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2).
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart illustrating a method of training a language translation model, according to an exemplary embodiment.
FIG. 2 is a block diagram illustrating a training apparatus for a language translation model, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
In order to solve the above technical problems, an embodiment of the present invention provides a training method for a language translation model. The method may be used in a training program, system or device for the language translation model, and the execution subject of the method may be a terminal or a server. As shown in fig. 1, the method includes steps A1 to A4:
step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other. That the source corpus S1 and the target corpus T1 are bilingual corpora means that S1 and T1 express the same content, only in different languages. The source corpus S1 is a low-resource corpus, meaning that the number of sentences in S1 is smaller than a preset number; the invention is therefore essentially a training method for a low-resource translation model. The binary classification translation network M1 may be any network with the capability of judging whether any source sentence and target sentence are mutual translations, such as a binary classification convolutional neural network.
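For concreteness, the following is a minimal sketch of one way such a binary classification network M1 could be realized, assuming a multilingual cross-encoder fine-tuned for pair classification; the checkpoint name and scoring interface are illustrative assumptions, since the description does not prescribe a specific architecture.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# M1: cross-encoder that scores whether (source, target) are mutual translations.
# Label 1 = "mutual translations"; label 0 = "not translations".
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
m1 = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

def pair_probability(src: str, tgt: str) -> float:
    """Probability under M1 that src and tgt are translations of each other."""
    inputs = tokenizer(src, tgt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = m1(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

Training data for such a classifier can be drawn from (S1, T1): aligned pairs serve as positives, and randomly mismatched pairs as negatives.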
Step A2, training an initial source language training model on a source corpus P1 to obtain a target source language training model M2. The initial source language training model may be an open-source language training model; the target source language training model M2 is obtained by training a pretrained model (such as BERT, Bidirectional Encoder Representations from Transformers) on the source corpus P1. BERT itself is only a network; the model M2 trained from it has a cloze (fill-in-the-blank) capability and can therefore be used to expand the corpus. In addition, both the initial source language training model and the target source language training model have the function of enriching the corpus, that is, turning a given source corpus into a larger number of corpora.
The source corpus P1 may be a monolingual corpus or a bilingual corpus, where a monolingual corpus is one that is currently expressed in only one language, and a bilingual corpus is one that is currently expressed in multiple languages.
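As a hedged illustration of this step, the sketch below continues masked-language-model training of an open pretrained checkpoint on P1; the checkpoint, file name and hyperparameters are assumptions for illustration only.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed source language
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# P1 assumed to be one sentence per line in a plain-text file.
p1 = load_dataset("text", data_files={"train": "p1_source_corpus.txt"})["train"]
p1 = p1.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="m2", num_train_epochs=1),
    train_dataset=p1,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the resulting M2 has the cloze ability used below for corpus expansion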
Step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
Step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2. The language translation model is a model with a (mutual) language translation function, for example, a model capable of translating Chinese into a language such as English or Russian.
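One plausible reading of step A4, sketched below, is fine-tuning a standard encoder-decoder translation model on the union of the original pairs (S1, T1) and the generated pairs (S3, T2); the checkpoint, the zh-en direction and the unbatched training loop are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")  # assumed direction
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# S1, T1, S3, T2 are assumed to be parallel lists of sentences.
train_pairs = list(zip(S1, T1)) + list(zip(S3, T2))

model.train()
for src, tgt in train_pairs:
    batch = tok(src, text_target=tgt, return_tensors="pt", truncation=True)
    loss = model(**batch).loss   # cross-entropy against the target sentence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()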
An initial binary classification translation network M1 can be constructed from the low-resource (namely, small in quantity and/or simple) source corpus S1 and target corpus T1; the initial source language training model can then be trained on the source corpus P1 to obtain a more accurate target source language training model M2; the source corpus S1 can further be expanded by using the binary classification translation network M1 and the target source language training model M2, yielding a resource-rich, mutually translated source corpus S3 and its corresponding target corpus T2; and a language translation model with higher translation precision, accuracy and quality can then be obtained from the enriched corpora (namely the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2).
In one embodiment, the step A3 includes:
expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
Because the number of sentences in the source corpus S1 is small, a language translation model trained only on S1 would have low accuracy. The target source language training model M2 can therefore be used to expand the source corpus S1 into a larger number of candidate source corpora S2, and the binary classification translation network M1 can further screen the candidate source corpus S2 against the target corpus T1, yielding a source corpus S3 and a target corpus T2 that are bilingual corpora with high probability; combining the source corpus S3 and the target corpus T2 improves the translation accuracy and quality of the language translation model.
In one embodiment, the screening of the candidate source corpus S2 and the target corpus T1 using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
acquiring a preset probability threshold for corpora being mutual translations (namely, bilingual corpora);
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
During screening, the preset corpus probability threshold is applied to filter out candidate source and target pairs with a low probability of being bilingual corpora, ensuring that the screened-out source corpus S3 and target corpus T2 have a high probability of being mutual translations.
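The screening can be read as a simple threshold filter over M1's scores, as in the following sketch; pair_probability() is the M1 scorer sketched under step A1, and the threshold value 0.9 is an assumed example (the description only requires that a threshold be preset).

THRESHOLD = 0.9  # preset probability threshold for "mutual translations" (assumed value)

def screen_pairs(candidate_s2, target_t1):
    """Keep only candidate pairs that M1 scores above the preset threshold."""
    s3, t2 = [], []
    for src, tgt in zip(candidate_s2, target_t1):
        if pair_probability(src, tgt) > THRESHOLD:
            s3.append(src)
            t2.append(tgt)
    return s3, t2  # source corpus S3 and target corpus T2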
In one embodiment, the method further comprises:
re-using the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1 respectively, and executing the step A2 and the step A3 to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the step A4 comprises the following steps:
and obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
By re-using the source corpus S3 and its corresponding target corpus T2 as the target corpus T1 and the source corpus S1 respectively, and re-executing steps A2 and A3, more mutually translated source and target corpora can be obtained, namely a target corpus T3 and a source corpus S4 that forms a bilingual corpus with T3. The larger set of bilingual pairs, namely (S1, T1), (S3, T2) and (S4, T3), can then be used to obtain a language translation model with higher translation accuracy and quality.
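Schematically, this direction swap amounts to re-running the same augmentation routine with the two sides exchanged, as in the sketch below; train_mlm() and expand() stand for the step A2 and step A3 procedures sketched above, screen_pairs() is the filter sketched earlier, and all names and file paths are illustrative.

def augment(source_side, target_side, mono_corpus_path):
    m2 = train_mlm(mono_corpus_path)              # step A2 (sketched above)
    candidates = expand(source_side, m2)          # step A3: cloze-based expansion
    return screen_pairs(candidates, target_side)  # step A3: M1 screening

S3, T2 = augment(S1, T1, "p1_source_corpus.txt")    # forward direction
T3, S4 = augment(T2, S3, "target_side_corpus.txt")  # swapped direction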
In one embodiment, the method further comprises:
acquiring a preset number of target monolingual sentences, where a target monolingual sentence is one that is currently expressed in only one language (the target language);
translating the target monolingual sentences by using the language translation model to obtain a corresponding source corpus P2;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
By acquiring a preset number of target monolingual sentences, translating them with the language translation model to obtain a source corpus P2 that is their mutual translation (the Back-Translation approach), and then retraining the language translation model on the source corpus P2 and the target monolingual sentences, the translation precision and quality of the language translation model are further improved.
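A minimal sketch of this Back-Translation stage follows, assuming a reverse (target-to-source) translation model and tokenizer are available for generation; the generation settings are illustrative.

def back_translate(target_monolingual, reverse_model, reverse_tok):
    """Translate target monolingual sentences back into pseudo source sentences P2."""
    p2 = []
    for sent in target_monolingual:
        batch = reverse_tok(sent, return_tensors="pt", truncation=True)
        out = reverse_model.generate(**batch, max_new_tokens=128)
        p2.append(reverse_tok.decode(out[0], skip_special_tokens=True))
    return p2  # retrain the translation model on the pairs (P2, target_monolingual)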
In summary, the invention expands a low-resource bilingual corpus into a larger amount of high-quality bilingual data for building the translation model, which improves the model's translation capability; combined with the higher-quality pseudo bilingual data obtained by Back-Translation, the translation capability of the low-resource translation model is further improved.
The technical scheme of the invention will be further described in detail as follows:
step 1: modeling a source corpus S1 and a target corpus T1 in low-resource bilingual corpus, and training a two-classification network M1, wherein the network has the capability of judging whether any source sentence and target sentence are mutually translated.
Step 2: For the source corpus (monolingual or bilingual), an open-source pretrained language model is used, or a large amount of source-language text is used for training, to obtain a pretrained model M2 (such as BERT, Bidirectional Encoder Representations from Transformers).
Step 3: Corpus expansion is performed on the source corpus S1 using M2: tokens at random positions of a sentence are masked, or mask tokens are inserted at some positions, and M2 predicts the words at the mask positions, yielding the candidate source corpus S2.
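A hedged sketch of this cloze-style expansion follows, assuming M2 is exposed through a fill-mask interface; the checkpoint name, the character-level masking and the number of variants per sentence are illustrative assumptions.

import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")  # stands in for M2

def expand_sentence(sentence: str, n_variants: int = 3):
    """Mask a random position and let M2 predict it, yielding variant sentences."""
    chars = list(sentence)  # character-level masking, suitable for Chinese text
    variants = []
    for _ in range(n_variants):
        i = random.randrange(len(chars))
        masked = "".join(chars[:i]) + fill.tokenizer.mask_token + "".join(chars[i + 1:])
        top = fill(masked, top_k=1)[0]       # best cloze completion
        variants.append(top["sequence"])     # full sentence with the mask filled in
    return variants  # candidate sentences for S2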
Step 4: The generated candidate source corpus S2 and the corresponding target corpus T1 are passed through the binary classification network M1, and the data in S2 whose probability of being mutual translations with the target corpus exceeds a certain threshold are screened out, yielding a pseudo source corpus S3 and the corresponding target corpus T1.
Step 5: The roles of the source corpus and the target corpus are swapped, and steps 2 to 4 are repeated to obtain a pseudo target corpus T3 and the source corpus S1 corresponding to the target corpus T3.
Step 6: The original bilingual corpus (S1, T1) and the pseudo bilingual corpora (S3, T1) and (S1, T3) are used as training corpora to train the translation model, obtaining a bilingual translation model M3.
Step 7: A large number of target monolingual sentences are passed through M3 by means of Back-Translation to obtain high-quality pseudo bilingual pairs (S4, T4); the obtained data are used to train the translation model, improving the low-resource translation performance.
Finally, it should be noted that the above embodiments may be freely combined by those skilled in the art according to actual needs.
Corresponding to the training method of the language translation model provided by the embodiment of the present invention, the embodiment of the present invention further provides a training device of the language translation model, as shown in fig. 2, where the device includes:
the first processing module 201 is configured to obtain a source corpus S1 and a target corpus T1 that are bilingual corpora, and construct a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, where the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
the training module 202 is configured to train the initial source language training model by using the source corpus P1 to obtain a target source language training model M2;
the first obtaining module 203 is configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
the second obtaining module 204 is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, and the target corpus T2.
In one embodiment, the first acquisition module includes:
the expansion sub-module is used for expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
and the screening sub-module is used for screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening submodule is specifically configured to:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary classification translation network M1 and the corpus probability threshold.
In one embodiment, the apparatus further comprises:
the second processing module is configured to re-use the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1, and execute the training module and the first obtaining module, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second acquisition module includes:
the obtaining sub-module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3, and the source corpus S4.
In one embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring a preset number of target monolingual sentences;
the translation module is used for translating the target monolingual sentences by utilizing the language translation model to obtain a corresponding source corpus P2;
and the training module is used for retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (2)

1. A method for training a language translation model, comprising:
step A1, acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
step A2, training an initial source language training model by utilizing a source corpus P1 to obtain a target source language training model M2;
step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2;
the step A3 comprises the following steps:
expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2;
the screening of the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1 to screen out the source corpus S3 and the target corpus T2 by utilizing the binary classification translation network M1 and the corpus probability threshold;
further comprises: re-using the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1 respectively, and executing the step A2 and the step A3 to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the step A4 comprises the following steps:
obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4;
further comprises:
acquiring a preset number of target monolingual sentences;
translating the target monolingual sentences by using the language translation model to obtain a corresponding source corpus P2;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
2. A training device for a language translation model, comprising:
the first processing module is used for acquiring a source corpus S1 and a target corpus T1 which are bilingual corpora, and constructing a binary classification translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary classification translation network M1 has the capability of judging whether any source sentence and target sentence are translations of each other;
the training module is used for training the initial source language training model by utilizing the source corpus P1 to obtain a target source language training model M2;
the first obtaining module is configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary classification translation network M1, the target source language training model M2 and the source corpus S1;
the second obtaining module is configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2;
the first acquisition module includes:
the expansion sub-module is used for expanding the source corpus S1 according to the target source language training model M2 to obtain a candidate source corpus S2;
the screening submodule is used for screening the candidate source corpus S2 and the target corpus T1 by using the binary classification translation network M1 to obtain the source corpus S3 and the target corpus T2;
the screening submodule is specifically used for:
acquiring a preset probability threshold for corpora being mutual translations;
inputting the candidate source corpus S2 and the target corpus T1 into the binary classification translation network M1 to screen out the source corpus S3 and the target corpus T2 by utilizing the binary classification translation network M1 and the corpus probability threshold;
the apparatus further comprises:
the second processing module is configured to re-use the source corpus S3 and a target corpus T2 corresponding to the source corpus S3 as the target corpus T1 and the source corpus S1, and execute the training module and the first obtaining module, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second acquisition module includes:
the obtaining submodule is used for obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4;
the apparatus further comprises:
the third acquisition module is used for acquiring a preset number of target monolingual sentences;
the translation module is used for translating the target monolingual sentences by utilizing the language translation model to obtain a corresponding source corpus P2;
and the training module is used for retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
CN202010307663.3A, filed 2020-04-17 (priority date 2020-04-17): Training method and device for language translation model. Active; granted as CN111597824B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010307663.3A | 2020-04-17 | 2020-04-17 | Training method and device for language translation model (granted as CN111597824B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010307663.3A | 2020-04-17 | 2020-04-17 | Training method and device for language translation model (granted as CN111597824B)

Publications (2)

Publication Number | Publication Date
CN111597824A (en) | 2020-08-28
CN111597824B (en) | 2023-05-26

Family

Family ID: 72190412

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010307663.3A (Active) | Training method and device for language translation model | 2020-04-17 | 2020-04-17

Country Status (1)

Country Link
CN (1) CN111597824B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389303A (en) * 2015-10-27 2016-03-09 北京信息科技大学 (Beijing Information Science and Technology University) Automatic heterogeneous corpus fusion method
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 (University of Electronic Science and Technology of China) A neural machine translation method for rare (low-resource) languages

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2054817A4 (en) * 2006-08-18 2009-10-21 Ca Nat Research Council Means and method for training a statistical machine translation system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389303A (en) * 2015-10-27 2016-03-09 北京信息科技大学 (Beijing Information Science and Technology University) Automatic heterogeneous corpus fusion method
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 (University of Electronic Science and Technology of China) A neural machine translation method for rare (low-resource) languages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lü Xueqiang; Wu Yongxu; Zhou Qiang; Liu Yin. Research on heterogeneous corpus fusion. Journal of Chinese Information Processing, 2016, (05), full text. *
Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin. Domain adaptation of translation models based on semantic distribution similarity. Journal of Shandong University (Natural Science), 2016, (07), full text. *

Also Published As

Publication Number | Publication Date
CN111597824A (en) | 2020-08-28

Similar Documents

Publication Publication Date Title
CN110134968B (en) Poem generation method, device, equipment and storage medium based on deep learning
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
US11669695B2 (en) Translation method, learning method, and non-transitory computer-readable storage medium for storing translation program to translate a named entity based on an attention score using neural network
US20190129695A1 (en) Programming by voice
KR20140049150A (en) Automatic translation postprocessing system based on user participating
Hassani BLARK for multi-dialect languages: towards the Kurdish BLARK
Werlen et al. Self-attentive residual decoder for neural machine translation
CN111144140A (en) Zero-learning-based Chinese and Tai bilingual corpus generation method and device
Çolakoğlu et al. Normalizing non-canonical Turkish texts using machine translation approaches
JP2016164707A (en) Automatic translation device and translation model learning device
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
CN111597824B (en) Training method and device for language translation model
KR20210035721A (en) Machine translation method using multi-language corpus and system implementing using the same
CN109657244B (en) English long sentence automatic segmentation method and system
Nanayakkara et al. Context aware back-transliteration from english to sinhala
Muaidi Levenberg-Marquardt learning neural network for part-of-speech tagging of Arabic sentences
CN109446537B (en) Translation evaluation method and device for machine translation
CN113705251A (en) Training method of machine translation model, language translation method and equipment
CN112528680A (en) Corpus expansion method and system
US11664010B2 (en) Natural language domain corpus data set creation based on enhanced root utterances
CN113270088B (en) Text processing method, data processing method, voice processing method, data processing device, voice processing device and electronic equipment
CN116452906B (en) Railway wagon fault picture generation method based on text description
Araújo et al. From VLibras to OpenSigns: Towards an Open Platform for Machine Translation of Spoken Languages into Sign Languages
CN113392657A (en) Training sample enhancement method and device, computer equipment and storage medium
Bounaas et al. Effects of pre-editing operations on audiovisual translation using TRADOS: an experimental analysis of Saudi students’ translations

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant