CN112633017B - Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium - Google Patents
Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium
- Publication number
- CN112633017B CN202011555680.5A CN202011555680A
- Authority
- CN
- China
- Prior art keywords
- training
- language
- translation
- target
- training corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a translation model training method, a translation processing method, a translation model training device, translation processing equipment and a storage medium, and relates to artificial intelligence technologies such as deep learning. The specific implementation scheme is as follows: multi-language training corpora are obtained and clustered according to languages to obtain a plurality of class-cluster training corpora; training-corpus processing is performed on the target language resources in each class-cluster training corpus to obtain each class-cluster target training corpus; and the translation model is trained according to each class-cluster target training corpus to generate a plurality of sub translation models. In this way, clustering trains languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data of low-resource minor languages used to train the translation model, thereby improving translation quality.
Description
Technical Field
The application relates to artificial intelligence technologies such as deep learning in the technical field of data processing, and in particular to a translation model training method, a translation processing method, and corresponding devices, equipment and storage media.
Background
Artificial intelligence is the discipline that studies how to make computers mimic certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking and planning); it covers both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies and the like.
With the continuous development of deep learning technology and globalization, international communication has become more frequent, and the demand for machine translation, in particular multilingual machine translation, keeps increasing.
In the related art, a one-to-one translation model is used to model bilingual sentence pairs. However, the number of translation directions among multiple languages is very large, so the deployment cost is high; moreover, parallel corpora may not exist between some pairs of languages, so that translation devices for certain translation directions cannot be trained, resulting in poor translation quality and efficiency.
Disclosure of Invention
The present disclosure provides a method, apparatus, device and storage medium for translation model training and translation processing.
According to an aspect of the present disclosure, there is provided a translation model training method, including:
acquiring multiple language training corpuses, clustering the multiple language training corpuses according to languages, and acquiring multiple cluster training corpuses;
performing training corpus processing on the target language resources in each class cluster training corpus to obtain each class cluster target training corpus;
and training the translation model according to the target training corpus of each class cluster to generate a plurality of sub translation models.
According to another aspect of the present disclosure, there is provided a translation processing method using the translation model trained by the above translation model training method, including:
acquiring a text to be translated and a target language;
under the condition that the source language of the text to be translated and the target language belong to the same class of clusters, a translation sub-model is obtained, the text to be translated is translated, and a translation result is obtained;
under the condition that the source language of the text to be translated and the target language are detected not to belong to the same class cluster, a first translation sub-model is obtained to translate the text to be translated, and a candidate translation result is obtained;
and acquiring a second translation sub-model, translating the candidate translation result, and acquiring a target translation result.
According to still another aspect of the present disclosure, there is provided a translation model training device including:
the first acquisition module is used for acquiring multiple language training corpus;
the second acquisition module is used for clustering the multiple language training corpora according to languages to acquire multiple cluster training corpora;
the first processing module is used for carrying out training corpus processing on the target language resources in each class cluster training corpus to obtain each class cluster target training corpus;
and the training module is used for training the translation model according to the target training corpus of each class cluster to generate a plurality of sub-translation models.
According to still another aspect of the present disclosure, there is provided a translation processing apparatus using the translation model, including:
the fourth acquisition module is used for acquiring the text to be translated and the target language;
a fifth obtaining module, configured to obtain a translation sub-model when detecting that the source language of the text to be translated and the target language belong to the same class cluster, and translate the text to be translated to obtain a translation result;
the sixth obtaining module is used for obtaining a first translation sub-model to translate the text to be translated under the condition that the source language of the text to be translated and the target language are detected not to belong to the same class of clusters, and obtaining a candidate translation result;
And a seventh obtaining module, configured to obtain a second translation sub-model, translate the candidate translation result, and obtain a target translation result.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, so that the at least one processor can execute the translation model training and translation processing method described in the above embodiments.
According to a sixth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the translation model training and translation processing methods described in the above embodiments.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, enables a server to perform the translation model training and translation processing methods described in the above embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of a translation model training method according to a first embodiment of the present application;
FIG. 2 is a flow chart of a translation model training method according to a second embodiment of the present application;
FIG. 3 is a flow chart of a translation model training method according to a third embodiment of the present application;
FIG. 4 is a flow chart of a translation processing method according to a fourth embodiment of the present application;
FIG. 5 is an exemplary diagram of a translation process according to an embodiment of the present application;
FIG. 6 is a flowchart of a translation processing method according to a fifth embodiment of the present application;
FIG. 7 is a block diagram of a translation model training device according to a sixth embodiment of the present application;
FIG. 8 is a structural diagram of a translation processing device according to a seventh embodiment of the present application;
FIG. 9 is a block diagram of an electronic device for implementing a method of translation model training, translation processing, according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Based on the above description, in practical applications, if for example 200 languages support mutual translation, the number of translation directions reaches about 40,000. Modeling bilingual sentence pairs with one-to-one translation models would then require about 40,000 translation devices, whose maintenance cost is extremely high; moreover, parallel corpora may not exist between some pairs of languages, so that translation devices for certain translation directions cannot be trained, resulting in poor translation quality and efficiency.
To address these problems, the application provides a translation model training method that uses clustering to train languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data of low-resource minor languages, thereby improving translation quality.
First, fig. 1 is a flowchart of a translation model training method according to a first embodiment of the present application. The translation model training method is used in an electronic device, which may be any device with computing capability, for example a personal computer (Personal Computer, abbreviated as PC), a mobile terminal and the like; the mobile terminal may be, for example, a mobile phone, a tablet computer, a personal digital assistant, a wearable device or an in-vehicle device, i.e., a hardware device with an operating system, a touch screen and/or a display screen.
As shown in fig. 1, the method includes:
Step 101, acquiring multi-language training corpora, clustering the multi-language training corpora according to languages, and acquiring a plurality of class-cluster training corpora.
In the embodiment of the present application, the multi-language training corpora refer to training corpora corresponding to different languages, such as Turkish training corpus 1, Russian training corpus 2 and Chinese training corpus 3, so that training corpus 1, training corpus 2, training corpus 3 and the like together form the multi-language training corpora.
In the embodiment of the application, the training corpora may be collected as text information in the corresponding language in real time, or may be obtained from a database of historical records. Taking text as an example, text information input by users, such as "how is the weather today" or "the weather is good today", may be collected in real time as training corpora, or historical text information may be obtained as training corpora from users' historical search records and the like; the specific choice is set according to the application scenario.
Further, the multi-language training corpora are clustered according to languages to obtain a plurality of class-cluster training corpora; that is, languages with similar linguistic features are clustered together, and the clustering can be configured according to the requirements of the application scenario, for example as follows.
In a first example, for the multi-language training corpora, a tag corresponding to each target language is added at a preset position of the source language; a language translation model from the source language to each target language is trained; after training is completed, the tag code of each target language is obtained; clustering is performed on the tag codes of the target languages through a preset clustering algorithm to obtain a plurality of class clusters; and the multi-language training corpora are divided according to the plurality of class clusters to obtain a plurality of class-cluster training corpora.
In a second example, a pre-training language model is trained with monolingual training corpora, a corresponding tag is added before each sentence, the tag codes are obtained after training, clustering is performed on the tag codes to obtain a plurality of class clusters, and the multi-language training corpora are divided according to the plurality of class clusters to obtain a plurality of class-cluster training corpora. A minimal sketch of such tag-code clustering is shown below.
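The following is an illustrative sketch only, not the patent's actual implementation: it assumes each language's tag code has already been extracted as an embedding vector and groups the per-language corpora with the K-means algorithm; all function and parameter names are assumptions.

```python
# Illustrative sketch: cluster hypothetical target-language tag embeddings with
# K-means, then merge the per-language corpora into class-cluster corpora.
from sklearn.cluster import KMeans
import numpy as np

def cluster_corpora_by_language(tag_embeddings, corpora, num_clusters=4):
    """tag_embeddings: dict language -> 1-D tag-code embedding (numpy array).
    corpora: dict language -> list of training sentences or sentence pairs."""
    languages = sorted(tag_embeddings)
    matrix = np.stack([tag_embeddings[lang] for lang in languages])
    labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit_predict(matrix)

    cluster_corpora = {}  # class-cluster id -> merged class-cluster training corpus
    for lang, cluster_id in zip(languages, labels):
        cluster_corpora.setdefault(int(cluster_id), []).extend(corpora[lang])
    return cluster_corpora
```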
Step 102, performing training corpus processing on the target language resources in the training corpus of each class cluster to obtain the target training corpus of each class cluster.
In the embodiment of the present application, a target language resource refers to a resource whose amount of training corpus data needs to be increased, i.e., a low-resource language. Therefore, training-corpus processing is performed on the target language resource in each class-cluster training corpus to obtain each class-cluster target training corpus, which ensures that each language in each class cluster has a sufficient amount of training corpus data and improves the translation quality of the translation model obtained by subsequent training.
In the embodiment of the present application, there are various ways of performing training-corpus processing on the target language resource in each class-cluster training corpus to obtain each class-cluster target training corpus, and the way can be chosen according to the actual application, for example as follows.
In a first example, a target phrase segment of a target language resource is obtained, a related phrase segment matched with the target phrase segment is obtained, the related language resource corresponding to the related phrase segment is determined, a training corpus of the related language resource is sampled, a candidate training corpus is obtained, the candidate training corpus is added to the training corpus corresponding to the target language resource, and each cluster target training corpus is obtained.
In a second example, candidate language resources corresponding to the target language resources are obtained from each cluster training corpus, candidate training corpuses of the candidate language resources are obtained, word splitting is carried out on the candidate training corpuses to obtain a plurality of word corpuses, and the plurality of word corpuses are added into the training corpuses corresponding to the target language resources to obtain each cluster target training corpus.
In a third example, in the process of training the translation model according to each class-cluster target training corpus, monolingual data in each class-cluster target training corpus is obtained, the monolingual data is encoded through a pre-training language model, and the encoded training vectors are used to train the translation model.
Step 103, training the translation model according to the target training corpus of each class cluster to generate a plurality of sub translation models.
In the embodiment of the present application, after each class-cluster target training corpus is obtained, translation model training is performed for each class cluster, for example with a model composed of two recurrent neural networks, an encoder network that processes the input and a decoder network that generates the output; training according to the target training corpora generates a plurality of sub translation models, as sketched below.
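As an illustration of the kind of recurrent encoder-decoder mentioned above, the following sketch (assuming PyTorch) shows an encoder network that processes the input and a decoder network that generates the output; the architecture and sizes are assumptions, and the patent does not prescribe this exact model.

```python
# Minimal encoder-decoder sketch: one recurrent network encodes the source
# sentence, a second recurrent network decodes the target sentence.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # processes the input
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)  # generates the output
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))           # encode source sentence
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)  # condition decoder on it
        return self.out(dec_out)                                 # per-token logits
```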
According to the translation model training method of the embodiment of the application, multi-language training corpora are obtained and clustered according to languages to obtain a plurality of class-cluster training corpora; training-corpus processing is performed on the target language resources in each class-cluster training corpus to obtain each class-cluster target training corpus; and the translation model is trained according to each class-cluster target training corpus to generate a plurality of sub translation models. In this way, clustering trains languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data of low-resource minor languages used to train the translation model, thereby improving translation quality.
Based on the above embodiments, there are various ways to increase the amount of training corpus data of the target language; processing in a single way and processing in multiple combined ways are described below with reference to fig. 2 and fig. 3, respectively.
Specifically, another method for training a translation model is proposed, in which data processing is performed in a single manner, and fig. 2 is a flowchart of a method for training a translation model according to a second embodiment of the present application, as shown in fig. 2:
step 201, adding labels corresponding to each target language in a preset position of a source language for training corpus of multiple languages, training a language translation model from the source language to each target language, and obtaining label codes of each target language after training.
In this embodiment of the present application, the source language is generally English; that is, the language translation model of the present application translates from English to minor languages. A tag of the target language is added before the English data, and after training is completed, the tag codes can be extracted for clustering. For example, when the target languages are Turkish, Russian and Ukrainian, the code of tag A is obtained after training with tag A added before the English data, and the code of tag B is obtained after training with tag B added before the English data, where each tag identifies a unique language.
Step 202, clustering is carried out according to the label codes of each target language through a preset clustering algorithm, a plurality of class clusters are obtained, the multi-language training corpus is divided according to the class clusters, and a plurality of class cluster training corpuses are obtained.
Further, a preset clustering algorithm, such as the K-means unsupervised clustering algorithm, performs unsupervised clustering on the tag codes of all the languages; tag codes of languages with the same attributes are grouped together into a plurality of class clusters, and the multi-language training corpora are divided according to the class clusters to obtain a plurality of class-cluster training corpora. For example, Russian and Ukrainian form one class cluster, so the Russian training corpus and the Ukrainian training corpus are taken from the multi-language training corpora as the training corpus of that class cluster. In this way, clustering trains languages with similar linguistic features together, which helps improve the generalization capability of the translation model.
Step 203, obtaining a target phrase segment of the target language resource, obtaining a related phrase segment matched with the target phrase segment, and determining the related language resource corresponding to the related phrase segment.
Step 204, sampling the training corpus of the related language resource to obtain a candidate training corpus, and adding the candidate training corpus to the training corpus corresponding to the target language resource to obtain each cluster target training corpus.
In the embodiment of the application, the training corpus of a target language resource that is a low-resource minor language is quite scarce, and collecting bilingual data for it is quite difficult. In general, a low-resource language can find a highly related high-resource language: the two are strongly related in grammar and script and share the same continuous phrase fragments. By oversampling the high-resource training corpora that contain the same phrase fragments, the amount of training data for the low-resource minor language is effectively increased, which alleviates the data scarcity problem of low-resource minor languages; a sketch of this oversampling is given below.
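The following is a hedged sketch of the oversampling idea: sentences of the related high-resource language that share continuous phrase fragments with the low-resource corpus are selected and appended to it. Representing fragments as character n-grams is an assumption made only for illustration.

```python
# Sketch: oversample high-resource sentences that share phrase fragments
# (approximated here by character n-grams) with the low-resource corpus.
def fragments_of(sentence, n=4):
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}

def oversample_related_corpus(low_resource_corpus, related_corpus, n=4):
    # collect continuous fragments appearing in the low-resource corpus
    fragments = set()
    for sent in low_resource_corpus:
        fragments |= fragments_of(sent, n)
    # keep related-language sentences containing at least one shared fragment
    shared = [s for s in related_corpus if fragments_of(s, n) & fragments]
    return low_resource_corpus + shared
```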
Step 205, training the translation model according to the target training corpus of each class cluster to generate a plurality of sub translation models.
In the embodiment of the present application, after each class-cluster target training corpus is obtained, translation model training is performed for each class cluster, for example with a model composed of two recurrent neural networks, an encoder network that processes the input and a decoder network that generates the output; training according to the target training corpora generates a plurality of sub translation models.
According to the translation model training method of this embodiment, for the multi-language training corpora, a tag corresponding to each target language is added at a preset position of the source language; a language translation model from the source language to each target language is trained, and after training is completed the tag code of each target language is obtained; clustering is performed on the tag codes of the target languages through a preset clustering algorithm to obtain a plurality of class clusters, and the multi-language training corpora are divided according to the class clusters to obtain a plurality of class-cluster training corpora; a target phrase fragment of the target language resource is obtained, a related phrase fragment matching the target phrase fragment is obtained, and the related language resource corresponding to the related phrase fragment is determined; the training corpus of the related language resource is sampled to obtain a candidate training corpus, and the candidate training corpus is added to the training corpus corresponding to the target language resource to obtain each class-cluster target training corpus; and the translation model is trained according to each class-cluster target training corpus to generate a plurality of sub translation models. In this way, clustering trains languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data of low-resource minor languages used to train the translation model, thereby improving translation quality.
FIG. 3 is a flow chart of a translation model training method according to a third embodiment of the present application, as shown in FIG. 3:
step 301, obtaining multiple language training corpus, and clustering the multiple language training corpus according to languages to obtain multiple cluster training corpus.
In the embodiment of the present application, the multi-language training corpora refer to training corpora corresponding to different languages, such as Turkish training corpus 1, Russian training corpus 2 and Chinese training corpus 3, so that training corpus 1, training corpus 2, training corpus 3 and the like together form the multi-language training corpora.
In the embodiment of the application, the training corpora may be collected as text information in the corresponding language in real time, or may be obtained from a database of historical records. Taking text as an example, text information input by users, such as "how is the weather today" or "the weather is good today", may be collected in real time as training corpora, or historical text information may be obtained as training corpora from users' historical search records and the like; the specific choice is set according to the application scenario.
Further, the multi-language training corpora are clustered according to languages to obtain a plurality of class-cluster training corpora; that is, languages with similar linguistic features are clustered together, and the clustering can be configured according to the requirements of the application scenario, for example as follows.
In a first example, for the multi-language training corpora, a tag corresponding to each target language is added at a preset position of the source language; a language translation model from the source language to each target language is trained; after training is completed, the tag code of each target language is obtained; clustering is performed on the tag codes of the target languages through a preset clustering algorithm to obtain a plurality of class clusters; and the multi-language training corpora are divided according to the plurality of class clusters to obtain a plurality of class-cluster training corpora.
In a second example, a pre-training language model is trained with monolingual training corpora, a corresponding tag is added before each sentence, the tag codes are obtained after training, clustering is performed on the tag codes to obtain a plurality of class clusters, and the multi-language training corpora are divided according to the plurality of class clusters to obtain a plurality of class-cluster training corpora.
Step 302, obtaining a target phrase segment of a target language resource, obtaining a related phrase segment matched with the target phrase segment, and determining a related language resource corresponding to the related phrase segment.
Step 303, sampling the training corpus of the related language resource to obtain a candidate training corpus, and adding the candidate training corpus to the training corpus corresponding to the target language resource.
In the embodiment of the application, the training corpus of a target language resource that is a low-resource minor language is quite scarce, and collecting bilingual data for it is quite difficult. In general, a low-resource language can find a highly related high-resource language: the two are strongly related in grammar and script and share the same continuous phrase fragments. By oversampling the high-resource training corpora that contain the same phrase fragments, the amount of training data for the low-resource minor language is effectively increased, which alleviates the data scarcity problem of low-resource minor languages.
Step 304, acquiring candidate language resources corresponding to the target language resources from the training corpus of each class cluster, and acquiring the candidate training corpora of the candidate language resources.
Step 305, word splitting is performed on the candidate training corpus, a plurality of word corpora are obtained, the plurality of word corpora are added into the training corpus corresponding to the target language resource, and each cluster target training corpus is generated.
In this embodiment of the present application, the target language resources are low-resource minor languages. A high-resource language (such as Turkish) and a related low-resource language (such as the Abaza language) are mixed for training; because there is great similarity between the two languages, the low-resource language can learn useful knowledge representations from the high-resource language. In general, the root representations of the high-resource language are unique, so the high-resource language is represented at multiple root granularities, which lets the low-resource language learn more knowledge representations from the high-resource language at mixed granularities. That is, word splitting is performed on the candidate training corpora to obtain a plurality of word corpora, and the word corpora are added into the training corpus corresponding to the target language resource to generate each class-cluster target training corpus, which increases the amount of training data for the low-resource minor language and alleviates its data scarcity problem. A sketch of this mixed-granularity processing is given below.
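The following sketch illustrates the mixed-granularity idea under simplifying assumptions: the candidate high-resource corpus is additionally represented at word granularity and the resulting word corpora are added to the target-language training corpus. A real system would more likely use a subword or root segmenter rather than whitespace splitting.

```python
# Sketch: add word-granularity corpora from the candidate (high-resource) corpus
# to the low-resource target corpus; whitespace splitting stands in for a real
# root/subword segmenter.
def mix_granularities(target_corpus, candidate_corpus):
    word_corpora = []
    for sentence in candidate_corpus:
        word_corpora.extend(sentence.split())      # split into word corpora
    return target_corpus + candidate_corpus + word_corpora
```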
Step 306, obtaining monolingual data in each class-cluster target training corpus in the process of training the translation model according to each class-cluster target training corpus.
Step 307, encoding the monolingual data through the pre-training language model, and training the translation model with the encoded training vectors to generate a plurality of sub translation models.
In the embodiment of the application, the pre-training language model is used to encode the monolingual data, and the encoded training vectors are used to train the translation model. Low-resource languages have very little parallel corpus, but monolingual data can be collected in large quantities from the Internet; when the translation model is trained with parallel corpora alone, the translation directions of low-resource languages are trained very insufficiently. Therefore, a language model is pre-trained on the monolingual data and transfer learning is then performed, which improves the training effect of the translation model.
In this way, the monolingual data is fully utilized to pre-train a language model, and transfer learning is then performed, which enriches the semantic representation of low-resource minor languages and alleviates the problem of insufficient training data for them; a minimal encoding sketch is given below.
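The sketch below assumes the Hugging Face transformers library and a multilingual pre-trained model; the model name and the way the resulting vectors are transferred into translation-model training are illustrative assumptions, not the patent's prescribed setup.

```python
# Sketch: encode monolingual data with a pre-trained multilingual language model
# so the resulting vectors can be used for transfer learning into the translator.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # assumed model
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def encode_monolingual(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # encoded training vectors
    return hidden
```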
In the embodiment of the present application, after each class-cluster target training corpus is obtained, translation model training is performed for each class cluster, for example with a model composed of two recurrent neural networks, an encoder network that processes the input and a decoder network that generates the output; training according to the target training corpora generates a plurality of sub translation models.
According to the translation model training method of this embodiment, multi-language training corpora are obtained and clustered according to languages to obtain a plurality of class-cluster training corpora; a target phrase fragment of the target language resource is obtained, a related phrase fragment matching the target phrase fragment is obtained, and the related language resource corresponding to the related phrase fragment is determined; the training corpus of the related language resource is sampled to obtain a candidate training corpus, which is added to the training corpus corresponding to the target language resource; candidate language resources corresponding to the target language resource are obtained from each class-cluster training corpus, and the candidate training corpora of the candidate language resources are obtained; word splitting is performed on the candidate training corpora to obtain a plurality of word corpora, which are added into the training corpus corresponding to the target language resource to generate each class-cluster target training corpus; and in the process of training the translation model according to each class-cluster target training corpus, monolingual data in each class-cluster target training corpus is obtained, the monolingual data is encoded through the pre-training language model, and the encoded training vectors are used to train the translation model to generate a plurality of sub translation models. In this way, clustering trains languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data of low-resource minor languages used to train the translation model, thereby improving translation quality.
Fig. 4 is a flowchart of a translation processing method according to a fourth embodiment of the present application, as shown in fig. 4:
step 401, obtaining text to be translated and a target language.
Step 402, under the condition that the source language and the target language of the text to be translated are detected to belong to the same class of clusters, a translation sub-model is obtained, the text to be translated is translated, and a translation result is obtained.
Step 403, under the condition that the source language and the target language of the text to be translated are detected not to belong to the same class of clusters, a first translation sub-model is obtained to translate the text to be translated, and a candidate translation result is obtained.
Step 404, obtaining a second translation sub-model, translating the candidate translation result, and obtaining a target translation result.
In the embodiment of the application, the text to be translated and the target language input by a client are received, and a translation sub-model is determined according to the source language of the text to be translated and the target language. When it is detected that the source language of the text to be translated and the target language belong to the same class cluster, the corresponding translation sub-model is obtained and used to translate the text to be translated, and the translation result is obtained. When it is detected that the source language and the target language do not belong to the same class cluster, a first translation sub-model is obtained to translate the text to be translated and obtain a candidate translation result, and a second translation sub-model is then obtained to translate the candidate translation result and obtain the target translation result.
Following the above embodiment, each translation sub-model is trained with parallel corpora between non-English languages and English, and any two languages belonging to two different sub translation models pivot through English, as shown in fig. 5. For example, when the source language of the text to be translated is Russian and the target language is Ukrainian, which belong to the same class cluster, the translation result is obtained directly by translating with the translation sub-model corresponding to that class cluster. When the source language of the text to be translated is Turkish and the target language is Ukrainian, which do not belong to the same class cluster, the first translation sub-model of the class cluster to which Turkish belongs is obtained to translate the text to be translated and obtain a candidate translation result, and the second translation sub-model of the class cluster to which Ukrainian belongs is then obtained to translate the candidate translation result and obtain the target translation result. A sketch of this routing is given below.
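The routing described above can be sketched as follows; the cluster assignments and sub-model objects are illustrative assumptions rather than the patent's actual data, and English is used as the pivot language as in the example.

```python
# Sketch of sub-model routing: same class cluster -> translate directly;
# different class clusters -> pivot through English with two sub-models.
LANG_CLUSTER = {"ru": 0, "uk": 0, "tr": 1}  # assumed cluster assignments

def translate(text, source_lang, target_lang, submodels):
    """submodels: dict cluster id -> object exposing translate(text, src, tgt)."""
    src_cluster = LANG_CLUSTER[source_lang]
    tgt_cluster = LANG_CLUSTER[target_lang]
    if src_cluster == tgt_cluster:
        # same class cluster: one sub-model translates directly
        return submodels[src_cluster].translate(text, source_lang, target_lang)
    # different class clusters: source-cluster model to English, then target-cluster model
    candidate = submodels[src_cluster].translate(text, source_lang, "en")
    return submodels[tgt_cluster].translate(candidate, "en", target_lang)
```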
According to the translation processing method of the embodiment of the application, the text to be translated and the target language are obtained; when it is detected that the source language of the text to be translated and the target language belong to the same class cluster, a translation sub-model is obtained and used to translate the text to be translated to obtain the translation result; when it is detected that they do not belong to the same class cluster, a first translation sub-model is obtained to translate the text to be translated and obtain a candidate translation result, and a second translation sub-model is obtained to translate the candidate translation result and obtain the target translation result. In this way, high-quality translated text can be obtained quickly.
Based on the description of the above embodiments, fig. 6 is a flowchart of a translation processing method according to a fifth embodiment of the present application. As shown in fig. 6, further processing may be performed in the process of translating the text to be translated: by restricting the word candidate set, the translation quality is improved and the decoding speed is increased. Specifically:
step 501, in the process of translating the text to be translated, each word to be translated in the text to be translated is obtained.
Step 502, obtaining a word candidate set corresponding to each word to be translated, and obtaining an error probability of each candidate word in the word candidate set corresponding to each word to be translated.
Step 503, in the case that the error probability is greater than a preset threshold, deleting the candidate word from the word candidate set.
In the embodiment of the application, since the target side mixes multiple languages, the translation generated autoregressively may drift into the wrong script. For example, in a translation task from English to Turkish, Turkish is the target language and contains no non-Latin characters (such as the characters used by Arabic), yet Arabic candidate characters may still appear during translation, because in natural language processing generation tasks all characters of the target-language vocabulary are used as the character candidate set, which makes the fidelity of the generated translation difficult to control.
Therefore, the word candidate set of the target-language vocabulary is restricted according to the translation direction, for example excluding non-Latin characters when the target language does not use them. In this way, the translation fidelity of low-resource minor languages is significantly improved, and human evaluation scores improve considerably.
The preset threshold can be set according to the application scenario.
For example, in the process of translating "how are you" into Chinese, when "are" is translated, the error probability of the candidate word "是" in the obtained word candidate set is ninety percent, which is greater than the preset threshold of sixty percent, so the candidate word "是" is deleted from the word candidate set.
In this way, not only is the fidelity of the generated translation improved, but the translation process is also accelerated, since restricting the vocabulary reduces the amount of computation; a sketch of this candidate pruning is given below.
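The candidate pruning can be sketched as follows; how the error probability of each candidate word is estimated is left open here as an assumption, since the text only requires that it be comparable against a preset threshold.

```python
# Sketch: remove candidate words whose estimated error probability exceeds the
# preset threshold before the next target word is generated.
def prune_candidates(word_candidates, error_probability, threshold=0.6):
    """word_candidates: iterable of candidate target words.
    error_probability: dict word -> estimated probability that the word is wrong."""
    return [w for w in word_candidates
            if error_probability.get(w, 1.0) <= threshold]
```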
In order to achieve the above embodiment, the present application further provides a translation model training device. Fig. 7 is a schematic structural diagram of a translation model training device according to a sixth embodiment of the present application, and as shown in fig. 7, the translation model training device includes: a first acquisition module 701, a second acquisition module 702, a first processing module 703 and a training module 704.
The first obtaining module 701 is configured to obtain a plurality of language training corpora.
The second obtaining module 702 is configured to cluster the multiple language training corpora according to languages, and obtain multiple cluster training corpora.
The first processing module 703 is configured to perform training corpus processing on the target language resource in each cluster training corpus, so as to obtain each cluster target training corpus.
And the training module 704 is configured to train the translation model according to the target training corpus of each cluster, and generate a plurality of sub-translation models.
In this embodiment of the present application, the second obtaining module 702 is specifically configured to: for the multi-language training corpora, add a tag corresponding to each target language at a preset position of the source language; train a language translation model from the source language to each target language, and obtain the tag code of each target language after training is completed; and cluster the tag codes of the target languages through a preset clustering algorithm to obtain a plurality of class clusters, and divide the multi-language training corpora according to the class clusters to obtain a plurality of class-cluster training corpora.
In the embodiment of the present application, the first processing module 703 is specifically configured to: obtaining a target phrase fragment of a target language resource; acquiring a related phrase fragment matched with the target phrase fragment, and determining a related language resource corresponding to the related phrase fragment; sampling the training corpus of the related language resources to obtain candidate training corpus; and adding the candidate training corpus into the training corpus corresponding to the target language resource to obtain the target training corpus of each class cluster.
In the embodiment of the present application, the first processing module 703 is specifically configured to: acquiring candidate language resources corresponding to the target language resources from the training corpus of each class cluster; obtaining a candidate training corpus of a candidate language resource, and carrying out word splitting on the candidate training corpus to obtain a plurality of word corpora; and adding the word corpora into the training corpora corresponding to the target language resources to obtain the target training corpora of each class cluster.
In this embodiment of the present application, the translation model training device further includes: a third acquisition module, configured to obtain monolingual data in each class-cluster target training corpus in the process of training the translation model according to each class-cluster target training corpus; and a second processing module, configured to encode the monolingual data through a pre-training language model and train the translation model with the encoded training vectors.
It should be noted that the foregoing explanation of the translation model training method is also applicable to the translation model training device in the embodiment of the present application; the implementation principle is similar and will not be repeated here.
According to the translation model training device of the embodiment of the application, multi-language training corpora are obtained and clustered according to languages to obtain a plurality of class-cluster training corpora; training-corpus processing is performed on the target language resources in each class-cluster training corpus to obtain each class-cluster target training corpus; and the translation model is trained according to each class-cluster target training corpus to generate a plurality of sub translation models. In this way, clustering trains languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data of low-resource minor languages used to train the translation model, thereby improving translation quality.
In order to implement the above embodiment, the present application further provides a translation processing device. Fig. 8 is a schematic structural view of a translation processing device according to a seventh embodiment of the present application, and as shown in fig. 8, the translation processing device includes: a fourth acquisition module 801, a fifth acquisition module 802, a sixth acquisition module 803, and a seventh acquisition module 804.
The fourth obtaining module 801 is configured to obtain a text to be translated and a target language.
And a fifth obtaining module 802, configured to obtain a translation sub-model when detecting that the source language and the target language of the text to be translated belong to the same class of clusters, and translate the text to be translated to obtain a translation result.
And a sixth obtaining module 803, configured to obtain, when detecting that the source language and the target language of the text to be translated do not belong to the same class cluster, a first translation sub-model to translate the text to be translated, and obtain a candidate translation result.
A seventh obtaining module 804, configured to obtain a second translation sub-model, translate the candidate translation result, and obtain a target translation result.
In the embodiment of the present application, the translation processing device further includes: an eighth obtaining module, configured to obtain each word to be translated in the text to be translated in the process of translating the text to be translated; a ninth obtaining module, configured to obtain a word candidate set corresponding to each word to be translated, and obtain an error probability of each candidate word in the word candidate set corresponding to each word to be translated; and the deleting module is used for deleting the candidate words from the word candidate set under the condition that the error probability is larger than a preset threshold value.
It should be noted that the foregoing explanation of the translation processing method is also applicable to the translation processing device in the embodiment of the present application; the implementation principle is similar and will not be repeated here.
According to the translation processing device of the embodiment of the application, the text to be translated and the target language are obtained; when it is detected that the source language of the text to be translated and the target language belong to the same class cluster, a translation sub-model is obtained and used to translate the text to be translated to obtain the translation result; when it is detected that they do not belong to the same class cluster, a first translation sub-model is obtained to translate the text to be translated and obtain a candidate translation result, and a second translation sub-model is obtained to translate the candidate translation result and obtain the target translation result. In this way, high-quality translated text can be obtained quickly.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 9, a block diagram of an electronic device for the translation model training and translation processing methods according to an embodiment of the present application is provided. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 9, the electronic device includes: one or more processors 901, memory 902, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In fig. 9, a processor 901 is taken as an example.
Memory 902 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the translation model training and translation processing methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of translation model training, translation processing provided herein.
The memory 902 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the first acquisition module 701, the second acquisition module 702, the first processing module 703, and the training module 704 shown in fig. 7) corresponding to the method of translation model training and translation processing in the embodiments of the present application. The processor 901 executes various functional applications of the server and data processing, i.e., a method for implementing translation model training and translation processing in the above-described method embodiment, by executing non-transitory software programs, instructions, and modules stored in the memory 902.
The memory 902 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created from the use of the electronic device for translation model training, translation processing, and the like. In addition, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 902 optionally includes memory remotely located relative to processor 901, which may be connected to the translation model training, translation processing electronics via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the translation model training and translation processing methods may further include an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903, and the output device 904 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 9.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for translation model training and translation processing; examples include a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, trackball, and joystick. The output device 904 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services; the server may also be a server of a distributed system or a server combined with a blockchain.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the translation model training and translation processing methods described above.
According to the technical solution of the embodiments of the present application, a multi-language training corpus is acquired and clustered by language to obtain multiple class cluster training corpora; training corpus processing is performed on the target language resources in each class cluster training corpus to obtain the target training corpus of each class cluster; and the translation model is trained on the target training corpus of each class cluster to generate multiple sub translation models. Training languages with similar linguistic characteristics together through clustering improves the generalization ability of the translation model and enlarges the training corpus of low-resource languages, thereby improving translation quality.
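By way of illustration only, the following Python sketch outlines this per-cluster training flow. The helpers `cluster_of`, `augment_low_resource`, and `train_submodel` are hypothetical placeholders, not interfaces defined by the present application.

```python
# Illustrative outline of the per-cluster training flow; the three helper
# callables are assumed placeholders, not part of the present application.
from collections import defaultdict

def train_cluster_submodels(parallel_corpus, cluster_of, augment_low_resource, train_submodel):
    """parallel_corpus: iterable of (source_sentence, target_sentence, target_language)."""
    # 1. Partition the multi-language corpus so languages in one cluster train together.
    by_cluster = defaultdict(list)
    for src, tgt, lang in parallel_corpus:
        by_cluster[cluster_of(lang)].append((src, tgt, lang))

    # 2. Enlarge the training data of low-resource languages inside each cluster,
    #    then train one sub translation model per cluster.
    submodels = {}
    for cluster_id, pairs in by_cluster.items():
        target_corpus = augment_low_resource(pairs)
        submodels[cluster_id] = train_submodel(target_corpus)
    return submodels
```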
It should be appreciated that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.
Claims (14)
1. A translation model training method, comprising:
acquiring multiple language training corpuses, clustering the multiple language training corpuses according to languages, and acquiring multiple cluster training corpuses;
performing training corpus processing on the target language resources in each class cluster training corpus to obtain each class cluster target training corpus;
training the translation model according to the target training corpus of each class cluster to generate a plurality of sub translation models;
the clustering of the multi-language training corpus according to languages to obtain a plurality of cluster training corpora comprises the following steps:
aiming at the multi-language training corpus, adding labels corresponding to each target language at preset positions of source languages;
training a language translation model from the source language to each target language, and acquiring a label code of each target language after training is completed;
clustering is carried out according to the label codes of each target language through a preset clustering algorithm, a plurality of class clusters are obtained, the multi-language training corpus is divided according to the class clusters, and the class cluster training corpus is obtained.
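As a hedged sketch of the clustering step in claim 1: the target-language label is assumed here to take the form `<2xx>` prepended to the source sentence (one possible "preset position"), and k-means stands in for the unspecified preset clustering algorithm; the learned tag embeddings play the role of the label codes.

```python
# Sketch only: tag placement and the choice of k-means are assumptions; the label
# codes are taken to be the embedding vectors learned for each target-language tag.
import numpy as np
from sklearn.cluster import KMeans

def tag_source(source_sentence, target_language):
    # Prepend a target-language label to the source sentence before training.
    return f"<2{target_language}> {source_sentence}"

def cluster_language_tags(tag_embeddings, n_clusters=4):
    """tag_embeddings: dict mapping a tag such as '<2de>' to its learned vector."""
    languages = list(tag_embeddings)
    matrix = np.stack([tag_embeddings[lang] for lang in languages])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(matrix)
    # Languages sharing a label form one class cluster; their corpora are merged.
    return {lang: int(label) for lang, label in zip(languages, labels)}
```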
2. The method for training a translation model according to claim 1, wherein the training corpus processing is performed on the target language resources in each class cluster training corpus to obtain each class cluster target training corpus, and the method comprises:
obtaining a target phrase segment of the target language resource;
acquiring a related phrase fragment matched with the target phrase fragment, and determining a related language resource corresponding to the related phrase fragment;
sampling the training corpus of the related language resources to obtain candidate training corpus;
and adding the candidate training corpus into the training corpus corresponding to the target language resource to obtain the target training corpus of each class cluster.
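The following sketch illustrates one way the fragment matching and sampling of claim 2 could look, under the assumption that phrase fragments are plain n-grams taken from source sentences and that sampling is uniform; the actual fragment extraction, matching criterion, and sampling strategy are not fixed by the claim.

```python
# Assumed n-gram fragments and uniform sampling; both are illustrative choices.
import random

def ngrams(sentence, n=3):
    tokens = sentence.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def augment_with_related(target_corpus, other_corpora, sample_rate=0.1, n=3):
    """target_corpus / other_corpora: lists of (source, target) sentence pairs."""
    target_fragments = set()
    for src, _ in target_corpus:
        target_fragments |= ngrams(src, n)

    # A corpus counts as a related language resource if it shares at least one
    # phrase fragment with the target language resource.
    candidates = []
    for corpus in other_corpora:
        if any(ngrams(src, n) & target_fragments for src, _ in corpus):
            candidates.extend(corpus)

    k = int(len(candidates) * sample_rate)
    sampled = random.sample(candidates, k) if k else []
    # The sampled pairs become the candidate training corpus added to the target corpus.
    return target_corpus + sampled
```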
3. The method for training a translation model according to claim 1, wherein the training corpus processing is performed on the target language resources in each class cluster training corpus to obtain each class cluster target training corpus, and the method comprises:
acquiring candidate language resources corresponding to the target language resources from the training corpus of each class cluster;
obtaining a candidate training corpus of the candidate language resource, and carrying out word splitting on the candidate training corpus to obtain a plurality of word corpora;
and adding the word corpora into the training corpora corresponding to the target language resources to obtain the target training corpora of each class cluster.
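A rough sketch of claim 3 follows, under the assumption that "word splitting" yields word-level pairs; a trivial positional alignment is used purely for illustration, whereas a real system would rely on learned word alignments.

```python
# Positional alignment is an assumption made only to keep the example short.
def split_into_word_corpora(candidate_corpus):
    """candidate_corpus: list of (source_sentence, target_sentence) pairs."""
    word_pairs = []
    for src, tgt in candidate_corpus:
        for s, t in zip(src.split(), tgt.split()):
            word_pairs.append((s, t))
    return word_pairs

def add_word_corpora(target_corpus, candidate_corpus):
    # The word corpora are appended to the target language resource's corpus,
    # increasing the data available to a low-resource language.
    return target_corpus + split_into_word_corpora(candidate_corpus)
```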
4. The translation model training method according to claim 1, further comprising:
acquiring single-language material data in each class cluster target training corpus in the process of training the translation model according to each class cluster target training corpus;
and carrying out coding processing on the single-language material data through a pre-trained language model, and training the translation model by using the training vectors obtained after the coding processing.
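As an illustration of encoding monolingual data with a pre-trained language model, the sketch below mean-pools encoder states into one training vector per sentence; the specific model (`bert-base-multilingual-cased`) and the pooling choice are assumptions, since the claim names neither.

```python
# The model name and mean pooling are stand-ins; the claim specifies neither.
import torch
from transformers import AutoModel, AutoTokenizer

def encode_monolingual(sentences, model_name="bert-base-multilingual-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)
    encoder.eval()
    vectors = []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
            hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
            vectors.append(hidden.mean(dim=1).squeeze(0))  # one vector per sentence
    return torch.stack(vectors)
```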
5. A translation processing method applying a translation model trained by the translation model training method according to any one of claims 1 to 4, comprising:
acquiring a text to be translated and a target language;
under the condition that the source language of the text to be translated and the target language belong to the same class cluster, a translation sub-model is obtained, the text to be translated is translated, and a translation result is obtained;
under the condition that the source language of the text to be translated and the target language are detected not to belong to the same class cluster, a first translation sub-model is obtained to translate the text to be translated, and a candidate translation result is obtained;
and acquiring a second translation sub-model, translating the candidate translation result, and acquiring a target translation result.
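The routing logic of claim 5 might look like the sketch below; the use of a pivot language shared by the two sub-models (for example, a hub language included in both clusters) is an assumption, since the claim only states that a candidate translation result is produced by a first sub-model and then translated by a second. `submodels`, `cluster_of`, and `pivot_language_between` are hypothetical interfaces.

```python
# Hypothetical interfaces; the pivot-language assumption is noted in the lead-in.
def translate(text, source_language, target_language, cluster_of, submodels, pivot_language_between):
    source_cluster = cluster_of(source_language)
    target_cluster = cluster_of(target_language)

    if source_cluster == target_cluster:
        # Same class cluster: a single translation sub-model handles the request.
        return submodels[source_cluster].translate(text, source_language, target_language)

    # Different class clusters: the first sub-model produces a candidate translation
    # (assumed to be in a bridge language covered by both sub-models), and the
    # second sub-model translates the candidate into the target language.
    pivot = pivot_language_between(source_cluster, target_cluster)
    candidate = submodels[source_cluster].translate(text, source_language, pivot)
    return submodels[target_cluster].translate(candidate, pivot, target_language)
```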
6. The translation processing method according to claim 5, further comprising:
in the process of translating the text to be translated, acquiring each word to be translated in the text to be translated;
acquiring a word candidate set corresponding to each word to be translated, and acquiring the error probability of each candidate word in the word candidate set corresponding to each word to be translated;
and deleting the candidate words from the word candidate set under the condition that the error probability is larger than a preset threshold value.
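A minimal sketch of the candidate pruning in claim 6 follows, assuming an external estimator that assigns each candidate word an error probability; how that probability is obtained is outside the scope of this example.

```python
# `error_probability` is an assumed scoring callable, not defined by the claim.
def prune_candidate_sets(candidate_sets, error_probability, threshold=0.5):
    """candidate_sets: dict mapping each word to be translated to its candidate words."""
    pruned = {}
    for word, candidates in candidate_sets.items():
        # Delete candidates whose estimated error probability exceeds the threshold.
        pruned[word] = [c for c in candidates if error_probability(word, c) <= threshold]
    return pruned
```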
7. A translation model training device, comprising:
the first acquisition module is used for acquiring multiple language training corpus;
the second acquisition module is used for clustering the multiple language training corpora according to languages to acquire multiple cluster training corpora;
the first processing module is used for carrying out training corpus processing on the target language resources in each class cluster training corpus to obtain each class cluster target training corpus;
the training module is used for training the translation model according to the target training corpus of each class cluster to generate a plurality of sub translation models;
the second obtaining module is specifically configured to:
aiming at the multi-language training corpus, adding labels corresponding to each target language at preset positions of source languages;
training a language translation model from the source language to each target language, and acquiring a label code of each target language after training is completed;
clustering is carried out according to the label codes of each target language through a preset clustering algorithm, a plurality of class clusters are obtained, the multi-language training corpus is divided according to the class clusters, and the class cluster training corpus is obtained.
8. The translation model training device according to claim 7, wherein the first processing module is specifically configured to:
obtaining a target phrase segment of the target language resource;
acquiring a related phrase fragment matched with the target phrase fragment, and determining a related language resource corresponding to the related phrase fragment;
sampling the training corpus of the related language resources to obtain candidate training corpus;
and adding the candidate training corpus into the training corpus corresponding to the target language resource to obtain the target training corpus of each class cluster.
9. The translation model training device according to claim 7, wherein the first processing module is specifically configured to:
acquiring candidate language resources corresponding to the target language resources from the training corpus of each class cluster;
obtaining a candidate training corpus of the candidate language resource, and carrying out word splitting on the candidate training corpus to obtain a plurality of word corpora;
and adding the word corpora into the training corpora corresponding to the target language resources to obtain the target training corpora of each class cluster.
10. The translation model training device according to claim 7, further comprising:
the third obtaining module is used for obtaining single-language material data in each class cluster target training corpus in the process of training the translation model according to each class cluster target training corpus;
and the second processing module is used for carrying out coding processing on the single-language material data through a pre-trained language model, and training the translation model by using the training vectors obtained after the coding processing.
11. A translation processing device applying a translation model trained by the translation model training device according to any one of claims 7 to 10, comprising:
the fourth acquisition module is used for acquiring the text to be translated and the target language;
a fifth obtaining module, configured to obtain a translation sub-model when detecting that the source language of the text to be translated and the target language belong to the same class cluster, and translate the text to be translated to obtain a translation result;
the sixth obtaining module is used for obtaining a first translation sub-model to translate the text to be translated under the condition that the source language of the text to be translated and the target language are detected not to belong to the same class cluster, and obtaining a candidate translation result;
and a seventh obtaining module, configured to obtain a second translation sub-model, translate the candidate translation result, and obtain a target translation result.
12. The translation processing device according to claim 11, further comprising:
an eighth obtaining module, configured to obtain each word to be translated in the text to be translated in the process of translating the text to be translated;
a ninth obtaining module, configured to obtain a word candidate set corresponding to each word to be translated, and obtain an error probability of each candidate word in the word candidate set corresponding to each word to be translated;
and the deleting module is used for deleting the candidate words from the word candidate set under the condition that the error probability is larger than a preset threshold value.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011555680.5A CN112633017B (en) | 2020-12-24 | 2020-12-24 | Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633017A CN112633017A (en) | 2021-04-09 |
CN112633017B true CN112633017B (en) | 2023-07-25 |
Family
ID=75324702
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011555680.5A Active CN112633017B (en) | 2020-12-24 | 2020-12-24 | Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633017B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113204977B (en) * | 2021-04-29 | 2023-09-26 | 北京有竹居网络技术有限公司 | Information translation method, device, equipment and storage medium |
CN113515959B (en) * | 2021-06-23 | 2022-02-11 | 网易有道信息技术(北京)有限公司 | Training method of machine translation model, machine translation method and related equipment |
CN113642333A (en) * | 2021-08-18 | 2021-11-12 | 北京百度网讯科技有限公司 | Display method and device, and training method and device of semantic unit detection model |
CN113836949B (en) * | 2021-09-10 | 2024-08-20 | 北京捷通华声科技股份有限公司 | Training method, translation method and device of language model |
CN114676707B (en) * | 2022-03-22 | 2024-10-01 | 腾讯科技(深圳)有限公司 | Method and related device for determining multilingual translation model |
CN114818748B (en) * | 2022-05-10 | 2023-04-21 | 北京百度网讯科技有限公司 | Method for generating translation model, translation method and device |
CN114757214B (en) * | 2022-05-12 | 2023-01-31 | 北京百度网讯科技有限公司 | Selection method and related device for sample corpora for optimizing translation model |
CN115756576B (en) * | 2022-09-26 | 2023-12-12 | 山西数字政府建设运营有限公司 | Translation method of software development kit and software development system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649288B (en) * | 2016-12-12 | 2020-06-23 | 北京百度网讯科技有限公司 | Artificial intelligence based translation method and device |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049436A (en) * | 2011-10-12 | 2013-04-17 | 北京百度网讯科技有限公司 | Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation |
WO2015029241A1 (en) * | 2013-08-27 | 2015-03-05 | Nec Corporation | Word translation acquisition method |
US10437933B1 (en) * | 2016-08-16 | 2019-10-08 | Amazon Technologies, Inc. | Multi-domain machine translation system with training data clustering and dynamic domain adaptation |
CN109299481A (en) * | 2018-11-15 | 2019-02-01 | 语联网(武汉)信息技术有限公司 | MT engine recommended method, device and electronic equipment |
CN109785824A (en) * | 2019-03-15 | 2019-05-21 | 科大讯飞股份有限公司 | A kind of training method and device of voiced translation model |
CN111046677A (en) * | 2019-12-09 | 2020-04-21 | 北京字节跳动网络技术有限公司 | Method, device, equipment and storage medium for obtaining translation model |
CN111460838A (en) * | 2020-04-23 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Pre-training method and device of intelligent translation model and storage medium |
Non-Patent Citations (3)
Title |
---|
A novel method to optimize training data for translation model adaptation; Hao Liu et al.; IEEE Xplore; full text *
Multilingual Text Clustering Algorithm Based on Parallel Information Bottleneck; Yan Xiaoqiang; Lu Yao'en; Lou Zhengzheng; Ye Yangdong; Pattern Recognition and Artificial Intelligence (Issue 06); full text *
Research on Domain Adaptation of Translation Models Based on Semantic Distribution Similarity; Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin; Journal of Shandong University (Science Edition) (Issue 07); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112633017A (en) | 2021-04-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||