CN113204979A - Model training method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113204979A
Authority
CN
China
Prior art keywords
text
target
model
corpus
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110592495.1A
Other languages
Chinese (zh)
Inventor
杨柳祎
李长亮
郭馨泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Software Co Ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Software Co Ltd filed Critical Beijing Kingsoft Software Co Ltd
Priority to CN202110592495.1A priority Critical patent/CN113204979A/en
Publication of CN113204979A publication Critical patent/CN113204979A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a model training method and apparatus, an electronic device, and a storage medium, relating to the field of computer technology. The method comprises: obtaining a first corpus pair, the first corpus pair comprising a first source text in a source language and a corresponding first target text in a target language; training a sample construction model by using the first corpus pair, wherein the sample construction model is used for translating text in the target language into text in the source language; obtaining a second target text in the target language, and translating the second target text by using the sample construction model to obtain a second source text; and training a target translation model by using a second corpus pair, wherein the second corpus pair comprises the second source text and the second target text, and the target translation model is used for translating text in the source language into text in the target language. By applying the scheme provided by the embodiment of the application, the accuracy of model translation can be improved.

Description

Model training method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a model training method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence, network models are applied more and more widely. In translation, a network model may be used to translate a source text in a source language into a target text in a target language. However, when the network model performs translation, an over-translation problem may occur, so that an incorrect character string appears repeatedly in the translated target text. Since the length of the target text is limited, the target text is easily filled with the incorrect character string and cannot cover all the characters in the source text, resulting in poor translation accuracy.
For example, assume that the source text in the Chinese language means "remote control of A, the mouse always has an offset of about 150px, and the operation is very difficult".
When the source text is translated into a target text in the English language, the correct translation result should be "A's remote control, the mouse always has an offset of about 150px, and the operation is very difficult".
If there is an over-translation problem, the obtained translation result may instead be "A remote control, the mouse always a bit of a bit of a bit of a bit of a bit of a bit of a bit", in which the incorrect character string "a bit of" appears repeatedly.
In the related art, in order to solve the over-translation problem, a source text is generally input into a network model to obtain a plurality of candidate texts translated by the network model; the coverage rate of each candidate text over the characters in the source text is determined; each candidate text is scored based on its coverage rate; and the candidate text with the highest score is taken as the final translation result.
In the above scheme, since the network model itself has defects, even the candidate result with the highest score may still have an over-translation problem, so the accuracy of the finally obtained translation result is low.
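The related-art reranking step described above can be sketched as follows. This is a minimal illustration only: the whitespace tokenizer and the plain token-coverage ratio are illustrative assumptions for a same-script toy example, not any cited system's actual scoring method.

```python
# Sketch of the related-art reranking: score each candidate translation by
# how many distinct source tokens it covers, then pick the top-scoring one.
# Whitespace tokenization and the bare coverage ratio are assumptions made
# for illustration.

def coverage_score(source: str, candidate: str) -> float:
    """Fraction of distinct source tokens that appear in the candidate."""
    src_tokens = set(source.lower().split())
    cand_tokens = set(candidate.lower().split())
    if not src_tokens:
        return 0.0
    return len(src_tokens & cand_tokens) / len(src_tokens)

def rerank(source: str, candidates: list[str]) -> str:
    """Return the candidate with the highest coverage of the source."""
    return max(candidates, key=lambda c: coverage_score(source, c))
```

A repetitive over-translated candidate covers few source tokens, so it scores below a complete candidate; the scheme's weakness, as the paragraph above notes, is that when every candidate is flawed the highest-scoring one may still be flawed.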
Disclosure of Invention
An object of the embodiments of the present application is to provide a model training method, an apparatus, an electronic device, and a storage medium, so as to improve accuracy of model translation. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a model training method, where the method includes:
obtaining a first corpus pair, wherein the first corpus pair comprises: a first source text in a source language and a first target text in a target language corresponding to the first source text;
training a sample construction model by using the first corpus pair, wherein the sample construction model is used for: translating text in the target language into text in the source language;
obtaining a second target text in the target language, and translating the second target text by using the sample construction model to obtain a second source text;
training a target translation model by using a second corpus pair, wherein the second corpus pair comprises: the second source text and the second target text, and the target translation model is used for: translating text in the source language into text in the target language.
In an embodiment of the application, the training a target translation model by using the second corpus pair includes:
and training the target translation model by utilizing the first corpus pair and the second corpus pair.
In an embodiment of the application, the obtaining the first corpus pair includes:
obtaining a candidate corpus pair comprising the first source text and the first target text;
and cleaning the candidate corpus pair to obtain a first corpus pair.
In an embodiment of the application, the cleaning the corpus candidate pair to obtain a first corpus pair includes:
identifying garbled corpus pairs containing garbled characters in the candidate corpus pairs, and removing the garbled corpus pairs from the candidate corpus pairs to obtain the first corpus pair; and/or
searching the candidate corpus pairs for abnormal corpus pairs whose length ratio is smaller than a first preset ratio or larger than a second preset ratio, and removing the abnormal corpus pairs from the candidate corpus pairs to obtain the first corpus pair, wherein the length ratio is: the ratio of the length of the first source text to the length of the first target text.
In an embodiment of the application, the obtaining the second target text in the target language includes:
obtaining a candidate text of the target language;
and cleaning the candidate text to obtain a second target text.
In an embodiment of the application, the cleaning the candidate text to obtain a second target text includes:
searching a text with the length within a preset length range in the candidate texts as a second target text; and/or
Searching incomplete texts which do not take preset ending identifiers as endings in the candidate texts, and removing the incomplete texts from the candidate texts to obtain second target texts; and/or
Identifying semantic information of each candidate text, determining semantic missing texts with semantic missing in the candidate texts according to the identified semantic information, and removing the semantic missing texts from the candidate texts to obtain a second target text.
In an embodiment of the present application, after the step of using the second corpus pair to train the target translation model, the method further includes:
obtaining a test text of the source language;
inputting the test text into a trained target translation model, and translating the test text by using the target translation model to obtain a model output text;
and judging, according to the model output text, whether the translation result of the target translation model is accurate, and returning to the step of training the sample construction model by using the first corpus pair when the translation result of the target translation model is inaccurate.
In an embodiment of the present application, the determining whether the translation result of the target translation model is accurate according to the model output text includes:
calculating an accuracy index of the target translation model according to the model output text, wherein the accuracy index comprises: a BLEU score and/or a perplexity;
and judging whether the translation result of the target translation model is accurate by using the accuracy index.
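The accuracy judgment above can be sketched as a score-against-threshold check. The unigram-precision score below is a deliberately simplified stand-in for a full BLEU computation (a real evaluation would use a proper BLEU implementation such as sacreBLEU), and the 0.5 threshold is an illustrative assumption, not a value from this application.

```python
# Sketch of the accuracy check: compute a toy unigram-precision score
# (a simplified stand-in for BLEU) and compare it to a threshold.
# Both the metric and the threshold value are illustrative assumptions.

def unigram_precision(output: str, reference: str) -> float:
    """Fraction of output tokens that also appear in the reference."""
    out_tokens = output.split()
    ref_tokens = set(reference.split())
    if not out_tokens:
        return 0.0
    return sum(t in ref_tokens for t in out_tokens) / len(out_tokens)

def translation_accurate(output: str, reference: str, threshold: float = 0.5) -> bool:
    """Judge the translation accurate when the score reaches the threshold."""
    return unigram_precision(output, reference) >= threshold
```

When `translation_accurate` returns `False`, the method returns to the sample-construction-model training step, as described above.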
In a second aspect, an embodiment of the present application provides a model training apparatus, including:
a first corpus pair obtaining module, configured to obtain a first corpus pair, where the first corpus pair includes: a first source text in a source language and a first target text in a target language corresponding to the first source text;
a first model training module, configured to train a sample construction model using the first corpus pair, where the sample construction model is configured to: translate text in the target language into text in the source language;
a target text obtaining module, configured to obtain a second target text in the target language, and translate the second target text by using the sample construction model to obtain a second source text;
a second model training module, configured to train a target translation model using a second corpus pair, where the second corpus pair includes: the second source text and the second target text, and the target translation model is configured to: translate text in the source language into text in the target language.
In an embodiment of the application, the second model training module is specifically configured to:
and training the target translation model by utilizing the first corpus pair and the second corpus pair.
In an embodiment of the application, the first corpus pair obtaining module includes:
a corpus candidate pair obtaining unit, configured to obtain a corpus candidate pair including the first source text and the first target text;
and the first corpus pair obtaining unit is used for cleaning the candidate corpus pairs to obtain first corpus pairs.
In an embodiment of the application, the first corpus pair obtaining unit is specifically configured to:
identifying garbled corpus pairs containing garbled characters in the candidate corpus pairs, and removing the garbled corpus pairs from the candidate corpus pairs to obtain the first corpus pair; and/or
searching the candidate corpus pairs for abnormal corpus pairs whose length ratio is smaller than a first preset ratio or larger than a second preset ratio, and removing the abnormal corpus pairs from the candidate corpus pairs to obtain the first corpus pair, wherein the length ratio is: the ratio of the length of the first source text to the length of the first target text.
In an embodiment of the application, the target text obtaining module includes:
a candidate text obtaining unit, configured to obtain a candidate text of the target language;
and the target text obtaining unit is used for cleaning the candidate text to obtain a second target text.
In an embodiment of the application, the target text obtaining unit is specifically configured to:
searching a text with the length within a preset length range in the candidate texts as a second target text; and/or
Searching incomplete texts which do not take preset ending identifiers as endings in the candidate texts, and removing the incomplete texts from the candidate texts to obtain second target texts; and/or
Identifying semantic information of each candidate text, determining semantic missing texts with semantic missing in the candidate texts according to the identified semantic information, and removing the semantic missing texts from the candidate texts to obtain a second target text.
In one embodiment of the present application, the apparatus further comprises:
the test text obtaining module is used for obtaining a test text in the source language after the target translation model is trained by using the second corpus pair;
the output text obtaining module is used for inputting the test text into a trained target translation model, and translating the test text by using the target translation model to obtain a model output text;
and the third model training module is used for judging whether the translation result of the target translation model is accurate according to the model output text, and triggering the first model training module under the condition that the translation result of the target translation model is inaccurate.
In an embodiment of the application, the third model training module is specifically configured to:
calculating an accuracy index of the target translation model according to the model output text, wherein the accuracy index comprises: a BLEU score and/or a perplexity;
and judging whether the translation result of the target translation model is accurate by using the accuracy index, and triggering the first model training module when the translation result of the target translation model is inaccurate.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the first aspect when executing a program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the first aspect.
Embodiments of the present application also provide a computer program product containing instructions that, when executed on a computer, cause the computer to perform any of the above-described model training methods.
The embodiment of the application has the following beneficial effects:
in the model training scheme provided in the embodiment of the present application, a first corpus pair may be obtained, where the first corpus pair includes: a first source text in a source language and a first target text in a target language corresponding to the first source text; a sample construction model is trained by using the first corpus pair, where the sample construction model is used for translating text in the target language into text in the source language; a second target text in the target language is obtained, and the second target text is translated by using the sample construction model to obtain a second source text; and a target translation model is trained by using a second corpus pair, where the second corpus pair includes the second source text and the second target text, and the target translation model is used for translating text in the source language into text in the target language. Since the second target text is a monolingual corpus, it can be obtained from a wide range of sources, so a large number of second target texts can be conveniently collected. The trained sample construction model then produces a second source text for each second target text, yielding a large number of second corpus pairs, and using these pairs as training samples for the target translation model can improve the training effect of the model. Therefore, by applying the scheme provided by the embodiment of the present application, the probability that the trained model mistranslates or over-translates can be reduced, and the accuracy of model translation can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by a person skilled in the art according to these drawings without creative effort.
Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart illustrating another model training method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating another model training method according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the description herein are intended to be within the scope of the present disclosure.
In order to improve the accuracy of model training, embodiments of the present application provide a model training method and apparatus, an electronic device, and a storage medium, which are described in detail below.
Referring to fig. 1, fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present disclosure, where the method may be applied to an electronic device such as a desktop computer, a notebook computer, and a tablet computer. The model training method may include the following steps S101 to S104:
s101, obtaining a first corpus pair.
Wherein, the first corpus pair includes: a first source text in a source language and a first target text in a target language corresponding to the first source text.
The source language may be the Chinese language, the English language, the French language, the Spanish language, the Portuguese language, or the like. The first source text is: a text expressed in the source language. For example, assuming the source language is Chinese, the first source text may be "bright moon before bed"; assuming the source language is French, the first source text may be "Se coucher …" (the rest of this example appears only as an image in the original publication).
The target language is a language different from the source language, and may be the Chinese language, the English language, the Japanese language, the Russian language, the Korean language, or the like. The first target text is: a text expressing the first source text in the target language. It is to be understood that the first target text may be: a text obtained by translating the first source text into the target language; likewise, the first source text may be: a text obtained by translating the first target text into the source language. For example, assuming that the source text means "how are you" and the target language is English, the first target text may be "How do you do"; assuming that the source text is "Be careful on the way" and the target language is Korean, the first target text may be the corresponding Korean sentence (shown only as an image in the original publication).
In one embodiment of the present application, the first corpus pair may be crawled from an open corpus pair platform when the first corpus pair is obtained.
In addition, a first source text can be obtained from an open text platform, and the first source text is translated into a first target text of a target language, so that a first corpus pair is obtained;
or acquiring a first target text from an open text platform, and translating the first target text into a first source text of a source language, thereby obtaining a first corpus pair.
In addition, the first corpus pair can also be obtained from source language-target language bilingual corpora such as novels, periodicals, and news. For example, assuming the source language is English and the target language is Chinese, the first corpus pair can be obtained from Chinese-English bilingual novels.
S102, training the sample construction model by using the first corpus pair.
Wherein the sample construction model is used for: translating text in the target language into text in the source language. The sample construction model is a language translation model, and may be, for example, a Seq2Seq translation model based on the TensorFlow framework, an NMT translation model based on the Apache MXNet framework, or the like.
Specifically, when the sample construction model is trained, the first corpus pair may be used as a sample, the first target text in the first corpus pair is input into the sample construction model to be trained to obtain an output result, then the loss of the output result relative to the first source text is calculated, and the loss is used to adjust the parameters of the sample construction model, so as to implement one-time training of the model.
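The single training pass described above can be sketched as follows. The model here is a stand-in stub rather than a real Seq2Seq or NMT network, and the character-mismatch loss is an illustrative assumption standing in for a real sequence loss such as cross-entropy.

```python
# Sketch of one training pass: feed the first target text in, compare the
# model output with the first source text, and update parameters from the
# loss. StubModel and the toy loss are stand-ins for a real network and
# a real sequence loss.

class StubModel:
    def __init__(self):
        self.steps = 0
    def forward(self, target_text: str) -> str:
        return target_text  # placeholder "translation" into the source language
    def update(self, loss: float) -> None:
        self.steps += 1     # placeholder for a gradient-based parameter update

def char_error_loss(output: str, reference: str) -> float:
    """Toy loss: fraction of character positions where output and reference differ."""
    n = max(len(output), len(reference))
    if n == 0:
        return 0.0
    mismatches = sum(a != b for a, b in zip(output.ljust(n), reference.ljust(n)))
    return mismatches / n

def train_step(model: StubModel, first_target: str, first_source: str) -> float:
    output = model.forward(first_target)          # translate target -> source
    loss = char_error_loss(output, first_source)  # loss vs. the first source text
    model.update(loss)                            # adjust model parameters
    return loss
```

Repeating `train_step` over the first corpus pairs corresponds to the iterative training described in the surrounding paragraphs.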
In an embodiment of the application, after the number of training iterations of the sample construction model reaches a preset threshold, the model training is considered to be completed, and the trained sample construction model is obtained. The threshold may be 100,000 iterations, 200,000 iterations, 5,000,000 iterations, etc.
In addition, after the loss of the output result of the sample construction model relative to the first source text is lower than a preset loss threshold, the accuracy of the model is considered to meet the requirement, so that the model training is considered to be finished, and the trained sample construction model is obtained.
S103, obtaining a second target text of the target language, and translating the second target text by using the sample construction model to obtain a second source text.
Specifically, a text in the target language may be obtained as the second target text. Since the second target text is a corpus of a single language, the difficulty of obtaining the second target text is low compared with that of a bilingual corpus, and therefore a large amount of second target texts can be conveniently obtained. For each second target text, the sample construction model obtained by training in S102 may be used to translate the second target text, so as to obtain a text in the source language corresponding to the second target text and output by the model, and the text is used as a second source text. The second target text and the second source text output by the sample construction model can be subsequently used as samples for training the target translation model to be trained.
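The back-translation step of S103 can be sketched as follows. The `sample_construction_model` argument stands for any trained target-to-source translator; the string-reversal callable used in the usage example is purely a stand-in.

```python
# Sketch of S103: run monolingual target-language texts through the trained
# sample construction model to synthesize (second source, second target)
# corpus pairs for training the target translation model.

from typing import Callable, List, Tuple

def build_second_corpus_pairs(
    target_texts: List[str],
    sample_construction_model: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Back-translate each target text into a synthetic source text."""
    pairs = []
    for target in target_texts:
        source = sample_construction_model(target)  # model output = second source text
        pairs.append((source, target))              # (second source, second target)
    return pairs
```

For example, with a stand-in "model" that merely reverses its input, `build_second_corpus_pairs(["hello"], lambda t: t[::-1])` yields one pair whose second element is the original monolingual text.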
In an embodiment of the present application, when obtaining the second target text, the second target text may be obtained from an open corpus platform of a target language, such as a novel, a periodical, a magazine, and news, which is not limited in this application.
S104, training the target translation model by using the second corpus pair.
Wherein the second corpus pair comprises: the second source text and the second target text. Because the second source text in the second corpus pair is a text obtained by translating the second target text, a corresponding relationship exists between the second source text and the second target text.
The target translation model is used for: translating text in the source language into text in the target language. The target translation model is also a language translation model, and may be, for example, a Seq2Seq translation model based on the TensorFlow framework, an NMT translation model based on the Apache MXNet framework, or the like.
Specifically, when the target translation model is trained, the second corpus pair may be used as a sample, the second source text in the second corpus pair is input into the target translation model to be trained, an output result is obtained, then a loss of the output result relative to the second target text corresponding to the second source text is calculated, and a parameter of the target translation model is adjusted by using the loss, so as to implement one-time training of the model.
In an embodiment of the application, after the number of training iterations of the target translation model reaches a preset threshold, the model training is considered to be completed, and the trained target translation model is obtained. The threshold may be 50,000 iterations, 100,000 iterations, 2,000,000 iterations, or the like.
In addition, after the loss of the output result of the target translation model relative to the second target text is lower than a preset loss threshold, the accuracy of the model is considered to meet the requirement, so that the model training is considered to be completed, and the trained target translation model is obtained.
In an embodiment of the present application, when training the target translation model, the target translation model may be trained by using the first corpus pair and the second corpus pair.
Specifically, the first corpus pair and the second corpus pair may be mixed to obtain a mixed corpus pair, where the mixed corpus pair includes: the first source text and the first target text corresponding to the first source text, and the second source text and the second target text corresponding to the second source text. It should be noted that the correspondence between the first source text and the first target text in the mixed corpus pair does not change, and similarly, the correspondence between the second source text and the second target text does not change.
Therefore, when the target translation model is trained, the mixed corpus pair is taken as a sample, the first source text and the second source text in the mixed corpus pair are input into the target translation model to be trained to obtain an output result, then the loss of the output result relative to the target text corresponding to the input source text is calculated, and the loss is utilized to adjust the parameters of the target translation model, so that the model is trained once.
In this way, the number of training samples used is larger, and because the first corpus pair is a directly obtained paired corpus, the accuracy of the first corpus pair is generally higher, and thus the accuracy of the target translation model obtained by training based on the first corpus pair and the second corpus pair is also higher.
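The corpus-mixing step above can be sketched as follows. The key property, as the text notes, is that each source text stays paired with its own target text; shuffling whole pairs preserves that. The fixed seed is only for reproducibility of the sketch.

```python
# Sketch of mixing the first (directly obtained) and second (back-translated)
# corpus pairs into one shuffled training set. Shuffling operates on whole
# (source, target) tuples, so the correspondences are preserved.
import random
from typing import List, Tuple

def mix_corpus_pairs(
    first_pairs: List[Tuple[str, str]],
    second_pairs: List[Tuple[str, str]],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    mixed = list(first_pairs) + list(second_pairs)
    random.Random(seed).shuffle(mixed)  # tuples shuffled intact, pairings unchanged
    return mixed
```

The mixed list is then iterated as the sample set when training the target translation model.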
In the model training scheme provided in the foregoing embodiment, a first corpus pair may be obtained, where the first corpus pair includes: a first source text in a source language and a first target text in a target language corresponding to the first source text; a sample construction model is trained by using the first corpus pair, where the sample construction model is used for translating text in the target language into text in the source language; a second target text in the target language is obtained, and the second target text is translated by using the sample construction model to obtain a second source text; and a target translation model is trained by using a second corpus pair, where the second corpus pair includes the second source text and the second target text, and the target translation model is used for translating text in the source language into text in the target language. Since the second target text is a monolingual corpus, it can be obtained from a wide range of sources, so a large number of second target texts can be conveniently collected. The trained sample construction model then produces a second source text for each second target text, yielding a large number of second corpus pairs, and using these pairs as training samples for the target translation model can improve the training effect of the model. Therefore, by applying the scheme provided by this embodiment, the probability that the trained model mistranslates or over-translates can be reduced, and the accuracy of model translation can be improved.
In an embodiment of the present application, for the step S101, when obtaining the first corpus pair, a candidate corpus pair including the first source text and the first target text may be obtained, and the candidate corpus pair cleaned to obtain the first corpus pair.
Specifically, an initial corpus pair including the first source text and the first target text may first be obtained as a candidate corpus pair; the candidate corpus pair is then cleaned, and the cleaned candidate corpus pair serves as the first corpus pair ultimately used for model training.
In an embodiment of the present application, when the corpus candidate pairs are cleaned, one or more of the following cleaning manners may be adopted, which are respectively described below:
In the first manner, candidate corpus pairs containing garbled characters are identified as garbled corpus pairs, and the garbled corpus pairs are removed from the candidate corpus pairs to obtain the first corpus pair.
Here, a garbled character may be a preset character that interferes with the text expression, such as "#", "&", or "@".
Specifically, for each candidate corpus pair, each character in the pair may be examined to determine whether the pair contains garbled characters; if so, the pair is taken as a garbled corpus pair. Finally, the identified garbled corpus pairs are removed from the candidate corpus pairs.
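A minimal sketch of this first cleaning manner; the character set "#&@" is taken from the examples above and would in practice be configured per corpus.

```python
# "#", "&" and "@" are the preset interfering characters named in the text;
# the exact set is an assumption and would be configured per corpus.
GARBLED_CHARS = set("#&@")

def remove_garbled_pairs(candidate_pairs, garbled_chars=GARBLED_CHARS):
    """Drop every candidate pair in which either side contains a garbled character."""
    return [
        (src, tgt) for src, tgt in candidate_pairs
        if not (set(src) & garbled_chars) and not (set(tgt) & garbled_chars)
    ]

pairs = [("hello", "bonjour"), ("bad # text", "mauvais"), ("ok", "d'accord @")]
clean = remove_garbled_pairs(pairs)  # only the fully clean pair survives
```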
In the second manner, abnormal corpus pairs whose length ratio is smaller than a first preset ratio or larger than a second preset ratio are searched for among the candidate corpus pairs, and the abnormal corpus pairs are removed from the candidate corpus pairs to obtain the first corpus pair.
Here, the length ratio is the ratio of the length of the first source text to the length of the first target text. The first preset ratio may be 0.3, 0.5, 0.2, etc., and the second preset ratio may be 2, 3, 4, etc., which is not limited in the embodiments of the present application.
The length of each text can be understood as: the number of characters of the text.
Specifically, for each candidate corpus pair, the number of characters of the first source text may be taken as a first length and the number of characters of the first target text as a second length, and the ratio of the first length to the second length calculated.
If the ratio is smaller than the first preset ratio, the first source text is much shorter than the first target text, so the contents the two texts express may differ and may not fully correspond; the pair is then treated as an abnormal corpus pair.
Likewise, if the ratio is larger than the second preset ratio, the first source text is much longer than the first target text, the contents may again differ and not fully correspond, and the pair is treated as an abnormal corpus pair.
And finally, removing the searched abnormal corpus pair from the candidate corpus pair to obtain a first corpus pair.
In addition, the ratio of the length of the first target text to the length of the first source text may also be calculated and used to determine whether a candidate corpus pair is abnormal; this is similar to the second manner above and is not repeated here.
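The length-ratio check can be sketched as follows; the bounds 0.3 and 2.0 are two of the example preset ratios given above, used here as assumptions.

```python
def filter_by_length_ratio(candidate_pairs, low=0.3, high=2.0):
    """Drop abnormal pairs whose source/target character-length ratio is
    below the first preset ratio or above the second preset ratio."""
    kept = []
    for src, tgt in candidate_pairs:
        if not src or not tgt:          # avoid division by zero on empty text
            continue
        ratio = len(src) / len(tgt)
        if low <= ratio <= high:
            kept.append((src, tgt))
    return kept

pairs = [("abcdef", "abcde"),          # ratio 1.2 -> kept
         ("ab", "abcdefghij"),         # ratio 0.2 -> too small, dropped
         ("abcdefghij", "ab")]         # ratio 5.0 -> too large, dropped
kept = filter_by_length_ratio(pairs)
```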
In an embodiment of the application, abnormal corpus pairs whose length difference is larger than a preset difference may also be searched for among the candidate corpus pairs and removed to obtain the first corpus pair. Here, the length difference is the absolute value of the difference between the length of the first source text and the length of the first target text, and the preset difference may be 5, 8, 10, etc., which is not limited in the embodiments of the present application.
In addition to the above cleaning manners, corpus pairs in which the length of the first source text is within a first preset length range and the length of the first target text is within a second preset length range may be selected from the candidate corpus pairs as the first corpus pair.
The first preset length range may be 15-25, 20-30, 5-18, etc., the second preset length range may be 5-25, 3-28, 4-17, etc., and the first preset length range and the second preset length range may be equal or unequal, which is not limited in the embodiments of the present application. The first preset length range and the second preset length range may be determined specifically according to the scale of the model to be trained, an application scenario, and the like.
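The length-difference and length-range checks above can be combined in a single pass. This sketch assumes a preset difference of 10 and length ranges of 5-25 characters, which are among the example values mentioned in the text.

```python
def filter_by_length_rules(candidate_pairs, max_diff=10,
                           src_range=(5, 25), tgt_range=(5, 25)):
    """Keep pairs whose absolute length difference does not exceed max_diff
    and whose source/target lengths fall inside the preset length ranges."""
    kept = []
    for src, tgt in candidate_pairs:
        if abs(len(src) - len(tgt)) > max_diff:
            continue
        if not (src_range[0] <= len(src) <= src_range[1]):
            continue
        if not (tgt_range[0] <= len(tgt) <= tgt_range[1]):
            continue
        kept.append((src, tgt))
    return kept

pairs = [("hello there", "bonjour toi"),  # lengths 11/11 -> kept
         ("hi", "salut"),                 # source too short -> dropped
         ("a" * 30, "b" * 20)]            # source outside range -> dropped
kept = filter_by_length_rules(pairs)
```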
In the above scheme, the cleaned corpus pairs serve as the first corpus pair, so the obtained first corpus pair is more accurate, the trained sample construction model is more accurate, and the accuracy of the subsequently trained target translation model can in turn be improved.
In an embodiment of the present application, for the step S103, when obtaining the second target text, a candidate text in the target language may be obtained, and the candidate text cleaned to obtain the second target text.
Specifically, an initial text expressed in the target language may be obtained as a candidate text; the candidate text is then cleaned, and the cleaned candidate text serves as the second target text used for model training.
In one embodiment of the present application, when cleaning candidate texts, one or more of the following cleaning manners may be adopted, which are respectively described as follows:
and thirdly, searching the text with the length within the preset length range in the candidate texts as a second target text.
The preset length range may be 10 to 25, 2 to 30, 3 to 18, and the like, which is not limited in the embodiments of the present application.
Specifically, the number of characters of each candidate text may be determined as the length of the candidate text, and it is determined whether the length is within a preset length range, if so, the candidate text is used as the second target text, otherwise, the candidate text is not used as the second target text.
In the fourth manner, incomplete texts that do not end with a preset ending identifier are searched for among the candidate texts, and the incomplete texts are removed from the candidate texts to obtain the second target text.
Here, the ending identifier may be ".", "!", "?", and the like.
Specifically, in a complete text the final character is usually an ending identifier, so if the final character of a text is not an ending identifier, the text may be considered incomplete. In view of this, for each candidate text, whether its final character is an ending identifier may be determined; if not, the text is judged to be incomplete. When the candidate texts are cleaned, the incomplete texts found are removed from the candidate texts, yielding the second target text.
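Manners three and four can be applied together as a single filter over the monolingual candidates. A minimal sketch, assuming a length range of 3-30 characters and the example ending identifiers given above:

```python
END_MARKS = (".", "!", "?")  # the preset ending identifiers from the text

def clean_candidate_texts(texts, min_len=3, max_len=30, end_marks=END_MARKS):
    """Keep candidates whose length is in the preset range (manner three)
    and that end with an ending identifier, i.e. look complete (manner four)."""
    return [
        t for t in texts
        if min_len <= len(t) <= max_len and t.endswith(end_marks)
    ]

texts = ["A complete sentence.",   # kept
         "truncated fragment",     # no ending identifier -> dropped
         "Hi!",                    # kept (length 3, ends with "!")
         "x."]                     # too short -> dropped
kept = clean_candidate_texts(texts)
```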
In the fifth manner, semantic information of each candidate text is identified, semantics-missing texts are determined among the candidate texts according to the identified semantic information, and the semantics-missing texts are removed from the candidate texts to obtain the second target text.
Specifically, for each candidate text, the semantic information of the text may be obtained, and whether the semantics of the candidate text are missing, such as a missing subject or a missing object, determined from that information; if so, the text is judged to be a semantics-missing text. When the corpus is cleaned, the semantics-missing texts can then be removed from the candidate texts, yielding the second target text.
The semantic information of each candidate text can be extracted with a semantic extraction algorithm, and whether each candidate text is a semantics-missing text can then be judged based on the extracted information.
In addition to the above cleaning manners, when the candidate texts are cleaned, texts containing garbled characters may also be identified and removed from the candidate texts to obtain the second target text.
In the above scheme, the candidate text after being cleaned can be used as the second target text, so that the accuracy of the obtained second target text is higher, and the accuracy of the target translation model obtained through subsequent training can be improved.
Referring to fig. 2, fig. 2 is a schematic flow chart of another model training method according to an embodiment of the present disclosure. As shown in fig. 2, after the step S104 of training the target translation model using the second corpus pair, the model training method may further include the following steps S105 to S107:
and S105, obtaining a test text of the source language.
Here, the test text is a text expressed in the source language.
In an embodiment of the present application, the test text may be a text in a source language obtained from an open corpus platform, may also be a text selected from a first source text in a first corpus pair, or may be a text selected from a second source text, and the like, which is not limited in this embodiment of the present application.
And S106, inputting the test text into the trained target translation model, and translating the test text by using the target translation model to obtain a model output text.
Specifically, the test text may be input into a trained target translation model, and the test text in the source language is translated into a text in the target language by using the model, so as to obtain a model output text.
S107, judging whether the translation result of the target translation model is accurate according to the model output text, and returning to the step S102 of training the sample construction model using the first corpus pair when the translation result of the target translation model is inaccurate.
Specifically, from the model output text obtained in S106, it can be determined whether the target translation model translated the input text accurately. If not, the model's accuracy is low and training needs to continue. In this case, to improve the accuracy of the model, the method may return to step S102 and first retrain the sample construction model to improve its accuracy. Once the sample construction model is more accurate, the second source text it produces by translating the second target text is also more accurate, so when the target translation model continues to be trained with the second corpus pairs comprising the second target text and the second source text, the training samples used are more accurate, and the accuracy of the trained target translation model can be improved.
In an embodiment of the present application, when determining in step S107 whether the translation result of the target translation model is accurate, an accuracy index of the target translation model may be calculated from the model output text, and the accuracy index used to judge whether the translation result is accurate.
The accuracy index includes a BLEU score and/or a perplexity.
In one implementation, when the accuracy index includes the BLEU score, a translation reference text corresponding to the test text may be obtained, where the translation reference text is a text in the target language corresponding to the test text. After the model output text is obtained, the similarity between the model output text and the translation reference text can be calculated, and the BLEU score determined based on the similarity. When the BLEU score reaches a preset first threshold, the translation result of the target translation model may be considered accurate; otherwise, it is considered inaccurate.
In one implementation, when the accuracy index includes the perplexity, the model output texts produced by the target translation model may be obtained, and the perplexity calculated from the confidence corresponding to each model output text. When the perplexity is below a preset second threshold, the translation result of the target translation model may be considered accurate; otherwise, it is considered inaccurate.
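The text does not spell out how the perplexity is computed from the confidences; a common reading, assumed here, is that the confidences are per-token probabilities, giving perplexity as the exponential of the negative mean log-probability:

```python
import math

def perplexity(token_probs):
    """Perplexity computed from the model's per-token probabilities:
    exp of the negative mean log-probability over the output tokens."""
    assert token_probs and all(0.0 < p <= 1.0 for p in token_probs)
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)

confident = perplexity([0.9, 0.8, 0.95])  # model is sure -> low perplexity
uncertain = perplexity([0.2, 0.1, 0.25])  # model is unsure -> high perplexity
```

A lower value means the model is less "confused" by its own output, matching the second-threshold test above.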
In another implementation, a mathematical statistic of the BLEU score and the perplexity may be calculated, and whether the translation result of the target translation model is accurate judged from that statistic.
The mathematical statistic may be a ratio, a product, a mean, etc.
For example, if the mathematical statistic is a ratio, the ratio of the BLEU score to the perplexity may be calculated; when the ratio reaches a preset third threshold, the translation result of the target translation model may be considered accurate, and otherwise inaccurate.
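The ratio-based decision can be sketched as follows; the threshold value 0.05 is an illustrative assumption, not a value from the text.

```python
def translation_accurate(bleu, ppl, threshold=0.05):
    """Judge the translation result by the ratio of BLEU score to perplexity:
    a higher BLEU and a lower perplexity both push the ratio above the threshold."""
    return (bleu / ppl) >= threshold

ok = translation_accurate(bleu=0.42, ppl=5.0)    # ratio 0.084, above threshold
bad = translation_accurate(bleu=0.10, ppl=12.0)  # ratio ~0.008, below threshold
```

Combining both indices this way means a model must translate both faithfully (BLEU) and confidently (perplexity) to pass the check.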
Referring to fig. 3, fig. 3 is a schematic flow chart of another model training method provided in the embodiment of the present application, where the method includes the following steps S301 to S308:
s301, obtaining a candidate corpus pair comprising a first source text and a first target text, and cleaning the candidate corpus pair to obtain a first corpus pair.
S302, training the sample construction model by using the first corpus pair.
And S303, obtaining a candidate text of the target language, and cleaning the candidate text to obtain a second target text.
S304, translating the second target text by using the sample construction model to obtain a second source text.
S305, training a target translation model by using the first corpus pair and the second corpus pair.
Wherein the second corpus pair comprises: a second source text, a second target text.
S306, obtaining a test text of the source language.
And S307, inputting the test text into the trained target translation model, and translating the test text by using the target translation model to obtain a model output text.
S308, judging whether the translation result of the target translation model is accurate according to the model output text, and returning to the step S302 of training the sample construction model using the first corpus pair when the translation result of the target translation model is inaccurate.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application, where the apparatus includes:
a first corpus pair obtaining module 401, configured to obtain a first corpus pair, where the first corpus pair includes: the method comprises the steps of obtaining a first source text of a source language and a first target text of a target language corresponding to the first source text;
a first model training module 402, configured to train a sample construction model by using the first corpus pair, where the sample construction model is configured to: translate the text in the target language into the text in the source language;
a target text obtaining module 403, configured to obtain a second target text in the target language, and translate the second target text by using the sample construction model to obtain a second source text;
a second model training module 404, configured to train the target translation model using a second corpus pair, where the second corpus pair includes: the second source text, the second target text, the target translation model to: translating the text of the source language into the text of the target language.
In an embodiment of the application, the second model training module 404 is specifically configured to:
and training the target translation model by utilizing the first corpus pair and the second corpus pair.
In an embodiment of the application, the first corpus pair obtaining module 401 includes:
a corpus candidate pair obtaining unit, configured to obtain a corpus candidate pair including the first source text and the first target text;
and the first corpus pair obtaining unit is used for cleaning the candidate corpus pairs to obtain first corpus pairs.
In an embodiment of the application, the first corpus pair obtaining unit is specifically configured to:
identifying garbled corpus pairs containing garbled characters in the candidate corpus pairs, and removing the garbled corpus pairs from the candidate corpus pairs to obtain the first corpus pair; and/or
searching for abnormal corpus pairs whose length ratio is smaller than a first preset ratio or larger than a second preset ratio in the candidate corpus pairs, and removing the abnormal corpus pairs from the candidate corpus pairs to obtain the first corpus pair, wherein the length ratio is: the ratio of the length of the first source text to the length of the first target text.
In an embodiment of the present application, the target text obtaining module 403 includes:
a candidate text obtaining unit, configured to obtain a candidate text of the target language;
and the target text obtaining unit is used for cleaning the candidate text to obtain a second target text.
In an embodiment of the application, the target text obtaining unit is specifically configured to:
searching a text with the length within a preset length range in the candidate texts as a second target text; and/or
Searching incomplete texts which do not take preset ending identifiers as endings in the candidate texts, and removing the incomplete texts from the candidate texts to obtain second target texts; and/or
Identifying semantic information of each candidate text, determining semantic missing texts with semantic missing in the candidate texts according to the identified semantic information, and removing the semantic missing texts from the candidate texts to obtain a second target text.
In one embodiment of the present application, the apparatus further comprises:
the test text obtaining module is used for obtaining a test text of the source language after a second corpus pair training target translation model is utilized;
the output text obtaining module is used for inputting the test text into a trained target translation model, and translating the test text by using the target translation model to obtain a model output text;
and the third model training module is used for judging whether the translation result of the target translation model is accurate according to the model output text, and triggering the first model training module under the condition that the translation result of the target translation model is inaccurate.
In an embodiment of the application, the third model training module is specifically configured to:
calculating an accuracy index of the target translation model from the model output text, wherein the accuracy index comprises: a BLEU score, and/or a perplexity;
and judging whether the translation result of the target translation model is accurate or not by using the accuracy index, and triggering the first model training module under the condition that the translation result of the target translation model is inaccurate.
The embodiment of the present application further provides an electronic device, as shown in fig. 5, which includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 complete mutual communication through the communication bus 504,
a memory 503 for storing a computer program;
the processor 501 is configured to implement the method steps of the model training described above when executing the program stored in the memory 503.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above-mentioned model training methods.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above-described model training methods.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, apparatus embodiments, electronic device embodiments, computer-readable storage medium embodiments, and computer program product embodiments are substantially similar to method embodiments and therefore are described with relative ease, as appropriate, with reference to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (18)

1. A method of model training, the method comprising:
obtaining a first corpus pair, wherein the first corpus pair comprises: the method comprises the steps of obtaining a first source text of a source language and a first target text of a target language corresponding to the first source text;
training a sample construction model by using the first corpus pair, wherein the sample construction model is used for: translating the text in the target language into the text in the source language;
obtaining a second target text of the target language, and translating the second target text by using the sample construction model to obtain a second source text;
training a target translation model by using a second corpus pair, wherein the second corpus pair comprises: the second source text, the second target text, the target translation model to: translating the text of the source language into the text of the target language.
2. The method of claim 1, wherein training the target translation model using the second corpus pair comprises:
and training the target translation model by utilizing the first corpus pair and the second corpus pair.
3. The method of claim 1, wherein obtaining the first corpus pair comprises:
obtaining a candidate corpus pair comprising the first source text and the first target text; and
cleaning the candidate corpus pair to obtain the first corpus pair.
4. The method of claim 3, wherein cleaning the candidate corpus pair to obtain the first corpus pair comprises:
identifying garbled corpus pairs containing garbled characters among the candidate corpus pairs, and removing the garbled corpus pairs from the candidate corpus pairs to obtain the first corpus pair; and/or
searching the candidate corpus pairs for abnormal corpus pairs whose length ratio is smaller than a first preset ratio or larger than a second preset ratio, and removing the abnormal corpus pairs from the candidate corpus pairs to obtain the first corpus pair, wherein the length ratio is: the ratio of the length of the first source text to the length of the first target text.
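The cleaning of claim 4 amounts to two filters over candidate pairs. A minimal sketch, assuming garbled text is marked by the Unicode replacement character U+FFFD and lengths are measured in characters — both assumptions, since the claim fixes neither:

```python
def clean_corpus_pairs(pairs, min_ratio=0.5, max_ratio=2.0):
    """Keep candidate pairs that are free of garbled characters and whose
    source/target length ratio lies within [min_ratio, max_ratio]."""
    kept = []
    for src, tgt in pairs:
        if "\ufffd" in src or "\ufffd" in tgt:
            continue  # garbled corpus pair
        ratio = len(src) / len(tgt) if tgt else 0.0
        if ratio < min_ratio or ratio > max_ratio:
            continue  # abnormal length ratio
        kept.append((src, tgt))
    return kept

pairs = [
    ("hello world", "bonjour monde"),            # kept
    ("bad \ufffd text", "mauvais texte"),        # garbled, dropped
    ("hi", "une phrase beaucoup trop longue"),   # ratio too small, dropped
]
print(clean_corpus_pairs(pairs))  # -> [('hello world', 'bonjour monde')]
```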
5. The method of claim 1, wherein obtaining the second target text in the target language comprises:
obtaining candidate texts in the target language; and
cleaning the candidate texts to obtain the second target text.
6. The method of claim 5, wherein cleaning the candidate texts to obtain the second target text comprises:
searching the candidate texts for texts whose length is within a preset length range, as the second target text; and/or
searching the candidate texts for incomplete texts that do not end with a preset ending identifier, and removing the incomplete texts from the candidate texts to obtain the second target text; and/or
identifying semantic information of each candidate text, determining, according to the identified semantic information, semantically incomplete texts among the candidate texts, and removing the semantically incomplete texts from the candidate texts to obtain the second target text.
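The monolingual cleaning of claim 6 can likewise be sketched as simple filters. The length bounds and the set of ending identifiers below are illustrative values, and the semantic-completeness check of the third branch is omitted because it would require a semantic model:

```python
def clean_candidates(texts, min_len=5, max_len=100, endings=(".", "!", "?")):
    """Keep candidate target texts whose length lies within the preset
    range and that end with a preset ending identifier."""
    kept = []
    for t in texts:
        t = t.strip()
        if not (min_len <= len(t) <= max_len):
            continue  # outside preset length range
        if not t.endswith(endings):
            continue  # incomplete text: no ending identifier
        kept.append(t)
    return kept

texts = ["A complete sentence.", "truncated fragment without", "ok."]
print(clean_candidates(texts))  # -> ['A complete sentence.']
```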
7. The method according to any one of claims 1-6, wherein after the step of training the target translation model using the second corpus pair, the method further comprises:
obtaining a test text in the source language;
inputting the test text into the trained target translation model, and translating the test text using the target translation model to obtain a model output text; and
judging, according to the model output text, whether the translation result of the target translation model is accurate, and returning to the step of training the sample construction model using the first corpus pair when the translation result of the target translation model is inaccurate.
8. The method of claim 7, wherein judging, according to the model output text, whether the translation result of the target translation model is accurate comprises:
calculating an accuracy index of the target translation model from the model output text, wherein the accuracy index comprises: a BLEU score and/or a perplexity; and
judging whether the translation result of the target translation model is accurate using the accuracy index.
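Claim 8 gates retraining on an accuracy index such as BLEU or perplexity. A toy stand-in using modified unigram precision with a brevity penalty (real BLEU combines up to 4-gram precisions); the 0.6 threshold is illustrative, not from the claim:

```python
import math
from collections import Counter

def unigram_bleu(output, reference):
    """Simplified BLEU: modified unigram precision times brevity penalty."""
    out, ref = output.split(), reference.split()
    if not out:
        return 0.0
    overlap = sum((Counter(out) & Counter(ref)).values())  # clipped matches
    precision = overlap / len(out)
    bp = 1.0 if len(out) >= len(ref) else math.exp(1 - len(ref) / len(out))
    return bp * precision

def translation_accurate(output, reference, threshold=0.6):
    """Judge the translation accurate when the score meets the threshold."""
    return unigram_bleu(output, reference) >= threshold

print(translation_accurate("the cat sat on the mat",
                           "the cat sat on the mat"))  # -> True
print(translation_accurate("a dog", "the cat sat on the mat"))  # -> False
```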
9. A model training apparatus, the apparatus comprising:
a first corpus pair obtaining module, configured to obtain a first corpus pair, wherein the first corpus pair comprises: a first source text in a source language and a first target text in a target language corresponding to the first source text;
a first model training module, configured to train a sample construction model using the first corpus pair, wherein the sample construction model is configured to: translate text in the target language into text in the source language;
a target text obtaining module, configured to obtain a second target text in the target language and translate the second target text using the sample construction model to obtain a second source text; and
a second model training module, configured to train a target translation model using a second corpus pair, wherein the second corpus pair comprises: the second source text and the second target text, and the target translation model is configured to: translate text in the source language into text in the target language.
10. The apparatus of claim 9, wherein the second model training module is specifically configured to:
train the target translation model using both the first corpus pair and the second corpus pair.
11. The apparatus of claim 9, wherein the first corpus pair obtaining module comprises:
a candidate corpus pair obtaining unit, configured to obtain a candidate corpus pair comprising the first source text and the first target text; and
a first corpus pair obtaining unit, configured to clean the candidate corpus pair to obtain the first corpus pair.
12. The apparatus according to claim 11, wherein the first corpus pair obtaining unit is specifically configured to:
identify garbled corpus pairs containing garbled characters among the candidate corpus pairs, and remove the garbled corpus pairs from the candidate corpus pairs to obtain the first corpus pair; and/or
search the candidate corpus pairs for abnormal corpus pairs whose length ratio is smaller than a first preset ratio or larger than a second preset ratio, and remove the abnormal corpus pairs from the candidate corpus pairs to obtain the first corpus pair, wherein the length ratio is: the ratio of the length of the first source text to the length of the first target text.
13. The apparatus of claim 9, wherein the target text obtaining module comprises:
a candidate text obtaining unit, configured to obtain candidate texts in the target language; and
a target text obtaining unit, configured to clean the candidate texts to obtain the second target text.
14. The apparatus according to claim 13, wherein the target text obtaining unit is specifically configured to:
search the candidate texts for texts whose length is within a preset length range, as the second target text; and/or
search the candidate texts for incomplete texts that do not end with a preset ending identifier, and remove the incomplete texts from the candidate texts to obtain the second target text; and/or
identify semantic information of each candidate text, determine, according to the identified semantic information, semantically incomplete texts among the candidate texts, and remove the semantically incomplete texts from the candidate texts to obtain the second target text.
15. The apparatus according to any one of claims 9-14, wherein the apparatus further comprises:
a test text obtaining module, configured to obtain a test text in the source language after the target translation model is trained using the second corpus pair;
an output text obtaining module, configured to input the test text into the trained target translation model and translate the test text using the target translation model to obtain a model output text; and
a third model training module, configured to judge, according to the model output text, whether the translation result of the target translation model is accurate, and to trigger the first model training module when the translation result of the target translation model is inaccurate.
16. The apparatus of claim 15, wherein the third model training module is specifically configured to:
calculate an accuracy index of the target translation model from the model output text, wherein the accuracy index comprises: a BLEU score and/or a perplexity; and
judge, using the accuracy index, whether the translation result of the target translation model is accurate, and trigger the first model training module when the translation result of the target translation model is inaccurate.
17. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program; and
the processor is configured to implement the method steps of any one of claims 1-8 when executing the program stored in the memory.
18. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-8.
CN202110592495.1A 2021-05-28 2021-05-28 Model training method and device, electronic equipment and storage medium Pending CN113204979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110592495.1A CN113204979A (en) 2021-05-28 2021-05-28 Model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113204979A 2021-08-03

Family

ID=77023515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110592495.1A Pending CN113204979A (en) 2021-05-28 2021-05-28 Model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113204979A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114077843A (en) * 2022-01-04 2022-02-22 阿里巴巴达摩院(杭州)科技有限公司 Translation model training method, translation method, electronic device, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008533A (en) * 2019-12-09 2020-04-14 北京字节跳动网络技术有限公司 Method, device, equipment and storage medium for obtaining translation model
CN112329482A (en) * 2020-10-28 2021-02-05 北京嘀嘀无限科技发展有限公司 Machine translation method, device, electronic equipment and readable storage medium


Similar Documents

Publication Publication Date Title
JP7122341B2 (en) Method and apparatus for evaluating translation quality
CN108121700B (en) Keyword extraction method and device and electronic equipment
WO2020119075A1 (en) General text information extraction method and apparatus, computer device and storage medium
CN113110988B (en) Testing applications with defined input formats
EP3343400A1 (en) System and method for dynamically creating a domain ontology
US9805718B2 (en) Clarifying natural language input using targeted questions
CN110704576B (en) Text-based entity relationship extraction method and device
CN110096573B (en) Text parsing method and device
KR20190000776A (en) Information inputting method
US20180018321A1 (en) Avoiding sentiment model overfitting in a machine language model
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
CN111079433A (en) Event extraction method and device and electronic equipment
CN113204979A (en) Model training method and device, electronic equipment and storage medium
CN112559725A (en) Text matching method, device, terminal and storage medium
CN112527967A (en) Text matching method, device, terminal and storage medium
CN113761923A (en) Named entity recognition method and device, electronic equipment and storage medium
Muhamad et al. Proposal: A hybrid dictionary modelling approach for malay tweet normalization
CN115249019A (en) Method and device for constructing target multi-language neural machine translation model
CN110895924B (en) Method and device for reading document content aloud, electronic equipment and readable storage medium
US20200265117A1 (en) System and method for language independent iterative learning mechanism for nlp tasks
CN111125302A (en) Error detection method and device for user input statement and electronic equipment
CN112784593B (en) Document processing method and device, electronic equipment and readable storage medium
CN113779997B (en) Entity identification method, entity identification device, electronic equipment and storage medium
JP2018055620A (en) Information processing device and program
CN114611497A (en) Training method of language diagnosis model, language diagnosis method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination