CN117291193A - Machine translation method, apparatus and storage medium - Google Patents

Machine translation method, apparatus and storage medium

Info

Publication number
CN117291193A
Authority
CN
China
Prior art keywords
target
data
segment
decoding
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211243383.6A
Other languages
Chinese (zh)
Inventor
张帆
涂眉
刘松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to PCT/KR2023/007944 (published as WO2023243946A1)
Priority to US18/209,790 (published as US20230401391A1)
Publication of CN117291193A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a machine translation method, device, and storage medium, relating to the technical fields of artificial intelligence, machine learning, and the like. A target domain converter corresponding to the information to be translated is determined from a plurality of candidate domain converters based on the information to be translated, and a translation result is obtained based on that target domain converter.

Description

Machine translation method, apparatus and storage medium
Technical Field
This application relates to the technical fields of artificial intelligence and machine learning, and in particular to a machine translation method, device, and storage medium.
Background
Natural language processing is an important part of artificial intelligence, and research on it is challenging. Research on natural language processing began with machine translation systems; through a great deal of scientific experimentation, the public and the scientific community came to see the possibility of automatic translation using computers.
Neural machine translation is a machine translation approach proposed in recent years, which mainly uses neural networks to translate between different languages. Although the related art offers various methods for machine translation with neural networks, there is still considerable room for improving the neural network models. How to make better use of neural networks for machine translation therefore remains a research hotspot in the art.
Disclosure of Invention
This application provides a machine translation method, device, and storage medium. The technical solution is as follows:
in one aspect, a method performed by an electronic device is provided, the method comprising:
acquiring information to be translated;
determining a target domain converter corresponding to the information to be translated from a plurality of candidate domain converters based on the information to be translated, wherein each candidate domain converter corresponds to at least one domain;
and obtaining a translation result corresponding to the information to be translated based on the target domain converter corresponding to the information to be translated.
In another aspect, a method performed by an electronic device is provided, the method comprising:
displaying a translation domain list, wherein the translation domain list comprises identification information of at least one domain in a plurality of candidate translation domains;
acquiring a first input of a user, wherein the first input is used for selecting a domain for translation from the translation domain list;
and in response to the first input, downloading a domain converter of the corresponding domain.
In another aspect, a method performed by an electronic device is provided, the method comprising:
acquiring a data set label of a target data set, wherein the data set label represents the data distribution category of each data in the target data set;
training a data distribution prediction module based on the target data set and the data set label, wherein the data distribution prediction module is used for predicting the probability that each data in the target data set belongs to each data distribution category, and each data distribution category corresponds to at least one field;
based on the trained data distribution prediction module, training each candidate domain converter to obtain a machine translation model, wherein each candidate domain converter corresponds to at least one domain.
In another aspect, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory, the processor executing the computer program to implement the method described above.
In another aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, implements the above-mentioned method.
The beneficial effects brought by the technical solution provided in this application are as follows:
According to the method, the information to be translated is acquired, a target domain converter corresponding to the information to be translated is determined from a plurality of candidate domain converters based on the information to be translated, and a translation result is obtained based on that target domain converter, so that information from different domains can be translated with the converter suited to its domain.
The method further comprises displaying a translation domain list, wherein the translation domain list comprises identification information of at least one of a plurality of candidate translation domains, and, in response to a first input of a user, downloading the domain converters of the corresponding domains selected by the user from the list, so that translation can be completed by downloading only some of the domain converters onto the electronic device.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an implementation environment of a machine translation method according to an embodiment of the present application;
fig. 2 is a flowchart of a method performed by an electronic device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a machine translation model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a structure of a data distribution prediction module according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart of an expert selector according to an embodiment of the present application;
fig. 6 is a schematic diagram of a target decoding feature construction manner according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a candidate domain converter according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an implementation process of a machine translation model according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an implementation process of a machine translation model according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an implementation process of a machine translation model according to an embodiment of the present application;
FIG. 11 is a diagram of an example machine translation provided in an embodiment of the present application;
FIG. 12 is a diagram of an example machine translation provided in an embodiment of the present application;
FIG. 13 is a diagram of an example machine translation provided in an embodiment of the present application;
FIG. 14 is a flowchart of a method performed by an electronic device according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a machine translation model according to an embodiment of the present disclosure;
FIG. 16 is a schematic diagram of an interface for model maintenance update according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a model training device according to an embodiment of the present application;
FIG. 18 is a flowchart of a method performed by an electronic device according to an embodiment of the present application;
fig. 19 is a schematic diagram of a training process of a data distribution prediction module according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a variation of the training phase integration function according to the embodiment of the present application;
FIG. 21 is a schematic diagram of a prototype database construction process according to an embodiment of the present application;
FIG. 22 is a schematic diagram of a training process of a hybrid expert module according to an embodiment of the present application;
fig. 23 is a schematic diagram of an update process of a data distribution prediction module according to an embodiment of the present application;
fig. 24 is a schematic diagram of a hybrid expert module update process according to an embodiment of the present application;
FIG. 25 is a schematic diagram illustrating an update process of a data distribution prediction module according to an embodiment of the present application;
FIG. 26 is a schematic diagram of a prototype database update process according to an embodiment of the present application;
fig. 27 is a schematic diagram of a process of adding an expert module according to an embodiment of the present application;
FIG. 28 is a schematic diagram of a machine translation model according to an embodiment of the present disclosure;
fig. 29 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It should be further understood that the terms "comprises" and "comprising", when used in the embodiments of this application, specify the presence of the stated features, information, data, steps, and operations, but do not preclude the presence of other features, information, data, steps, operations, and the like that are also supported by the present technology.
The present application relates to the field of artificial intelligence. Artificial intelligence comprises theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. It is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence.
In particular, the present application may relate to machine learning, which studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and demonstration learning. This application can use artificial intelligence, machine learning, and related techniques to train a machine translation model and use the machine translation model to provide translation services.
The environment in which the present application is implemented will be described below by taking fig. 1 as an example.
Fig. 1 is a schematic diagram of an implementation environment of a machine translation method provided in the present application. As shown in fig. 1, the implementation environment includes: an electronic device 11.
In one possible scenario, as shown in FIG. 1, the implementation environment may also include a user's terminal 12. The electronic device 11 may send the trained machine translation model to the terminal 12, and the terminal 12 uses the model to provide translation services. In one example, the terminal 12 may output the translation result of the information to be translated using an offline machine translation model, for example, translating a Chinese sentence into an English sentence. In yet another example, the terminal 12 may send a translation request to the electronic device 11; the electronic device 11 receives the translation request sent by the terminal 12, outputs the translation result of the information to be translated using the trained machine translation model, and returns the translation result to the terminal 12.
The electronic device 11 may adopt the model training method provided in this application to train a machine translation model. In one possible scenario, the electronic device 11 may be a server that performs decomposition training of the various modules in the machine translation model based on a large number of data sets. The machine translation model may include a coding and decoding (codec) module, a data distribution prediction module, and a mixed expert module. Decomposition training means that the training process of the machine translation model is decomposed by module, so that the modules are trained separately and independently of one another. In this application, the server may first train the codec module; then fix the trained codec module and train the data distribution prediction module; and finally fix the trained codec module and data distribution prediction module and train the mixed expert module in the machine translation model.
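By way of illustration only, the following sketch shows one possible way to organize such decomposition training, assuming a PyTorch implementation; the module and function names (codec, predictor, experts, train_step) are hypothetical and not part of this application:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    # Fix a trained module by excluding its parameters from further gradient updates
    for p in module.parameters():
        p.requires_grad_(False)

def decomposition_training(codec: nn.Module, predictor: nn.Module,
                           experts: nn.Module, train_step) -> None:
    train_step(codec)      # 1. train the encoder-decoder (codec) module first
    freeze(codec)          # 2. fix the codec, then train the data distribution prediction module
    train_step(predictor)
    freeze(predictor)      # 3. fix codec and predictor, then train the mixed expert module
    train_step(experts)
```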
The electronic device 11 and the terminal 12 may be connected by wired or wireless communication. The electronic device 11 may be a server, which may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and basic cloud computing services such as big data and an artificial intelligence platform. The terminal may be a smart phone, an intelligent robot, a tablet computer, a notebook computer, a digital broadcast receiver, a MID (Mobile Internet Devices, mobile internet device), a PDA (personal digital assistant), a desktop computer, a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal, a vehicle-mounted computer, etc.), an intelligent sound box, a smart watch, etc.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 2 is a schematic diagram of a machine translation method according to an embodiment of the present application. The method is a method executed by an electronic device, and the electronic device may be any device such as a terminal, a server, or a cloud computing center device, which is not limited in this application. As shown in fig. 2, the method includes the following steps 201 to 203.
Step 201, the electronic device obtains information to be translated.
The information to be translated is information of an original language to be translated, and the information to be translated of the original language can be translated into a translation result of a target language through the machine translation method. For example, a chinese sentence is translated into a corresponding english sentence.
It should be noted that the information to be translated may be information in any domain, for example, the IT domain, the medical domain, the legal domain, and the like. In this application, a domain converter (Adapter) is provided for each domain, and the domain converter corresponding to the domain of the information to be translated can be used for translation. Thus, by selecting an appropriate target domain converter for the information to be translated through the following steps 202 to 203, the selected domain converter can be used to translate. In this application, the domain converter may also be referred to as an expert module.
Step 202, the electronic device determines, based on the information to be translated, a target domain converter corresponding to the information to be translated from a plurality of candidate domain converters, where each candidate domain converter corresponds to at least one domain.
Step 203, the electronic device obtains a translation result corresponding to the information to be translated based on the target domain converter corresponding to the information to be translated.
In this application, a correspondence between the candidate domain converters and the domains is established; for example, one domain converter corresponds to one domain or to a plurality of domains. On this basis, when translating information in a given domain, the domain converter corresponding to that domain can be selected and used in a targeted manner, so that information from different domains can be translated and the accuracy of translation can be improved.
Each candidate domain converter corresponds to at least one domain. Each candidate domain converter is used for converting the decoding characteristics of the information of the domain corresponding to the candidate domain converter. For example, for the information to be translated, the target domain converter is used for converting the decoding characteristics of the information to be translated, so that the decoding characteristics matched with the domain to which the information to be translated belongs can be obtained.
In the present application, the machine translation model may include an encoder, a decoder, and various candidate domain converters. During translation, acquiring a first coding feature of information to be translated through an encoder; and decoding the first coding feature through a decoder to obtain a decoding feature, converting the decoding feature through a target field converter, and obtaining a corresponding translation result by utilizing the converted decoding feature.
In one possible embodiment, the likelihood of each candidate domain converter being a target domain converter may be predicted based on the first coding feature of the information to be translated, so as to select the target domain converter of the information to be translated. Illustratively, this step 202 may be implemented by the following steps 2021 to 2023, and, correspondingly, step 203 may be implemented by the following step 2031:
step 2021, the electronic device obtains the first coding feature of the information to be translated.
The first encoding feature may be an encoded hidden state vector of the information to be translated. The electronic device can encode the information to be translated through an encoder to obtain the encoding hidden state vector.
Fig. 3 is a schematic structural diagram of a machine translation model provided in the present application. As shown in fig. 3, the machine translation model may include an encoder. The electronic equipment can acquire the word vector sequence of the information to be translated by inquiring the word vector table, input the word vector sequence into the encoder, and extract the characteristics of the information to be translated by the encoder to acquire the coding hidden state vector of the information to be translated, wherein the coding hidden state vector characterizes the semantic characteristics of the information to be translated.
For example, the information to be translated may be a Chinese sentence x including n segmented words. By querying the word vector table, the word vector sequence of the Chinese sentence x is obtained as x = (x_1, …, x_i, …, x_n), where x_i denotes the word vector of the i-th of the n segmented words; for example, x_i may be a 512-dimensional vector, in which case the word vector sequence is an n×512-dimensional vector. The encoder converts x into the encoded hidden state vector h = (h_1, …, h_i, …, h_n), where h_i denotes the encoded hidden state of the i-th word; h_i may be, for example, a 256- or 512-dimensional vector, and in the 512-dimensional case h may be an n×512-dimensional vector.
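By way of illustration, a minimal sketch of this encoding step is given below, assuming a PyTorch Transformer encoder; the vocabulary size, layer counts, and variable names are assumptions for illustration and not mandated by this application:

```python
import torch
import torch.nn as nn

vocab_size, d_model, n = 32000, 512, 10        # vocabulary size and n are assumed
embed = nn.Embedding(vocab_size, d_model)      # the word vector table
enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)

x_ids = torch.randint(0, vocab_size, (1, n))   # ids of the n segmented words
x = embed(x_ids)                               # word vector sequence, 1 x n x 512
h = encoder(x)                                 # encoded hidden state vector, 1 x n x 512
```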
Step 2022, the electronic device determines, according to the first coding feature, first indication information of the information to be translated.
The first indication information characterizes the likelihood of each candidate domain converter being the target domain converter. For example, the first indication information may include a score for each candidate domain converter being the target domain converter, where the score may be a probability, a rating, or any information that characterizes a likelihood; for example, a higher probability indicates a higher likelihood that the candidate domain converter is the target domain converter.
In one possible implementation, the first coding feature comprises a feature vector for each word along the word sequence dimension. The electronic device may perform a pooling operation on the first coding feature along the word sequence dimension of the first coding feature, and map the pooled first coding feature into the first indication information. For example, if the first coding feature is the encoded hidden state vector h = (h_1, …, h_i, …, h_n), where n indicates that the information to be translated includes n words, h may be an n×512-dimensional vector with n as the word sequence dimension, and a pooling operation along the word sequence dimension converts h into a 1×512-dimensional vector. If there are, for example, 12 candidate domain converters, the 1×512-dimensional encoded hidden state vector can then be mapped, by a mapping operation such as a linear or nonlinear mapping, into the scores of the information to be translated over the 12 candidate domain converters.
As shown in fig. 3, the machine translation model may include a data distribution prediction module; the electronic device may input the first coding feature into the data distribution prediction module and obtain the first indication information as output. Fig. 4 is a schematic structural diagram of the data distribution prediction module provided in the embodiment of this application: as shown in fig. 4, the first layer is a pooling layer, the second layer is a fully-connected layer, the third layer is a tanh activation function layer, and the fourth layer is a fully-connected layer. Taking the first coding feature as the encoded hidden state vector as an example, the electronic device may input the encoded hidden state vector h into the data distribution prediction module, and the pooling layer performs a pooling operation on h in the word sequence dimension using the pooling function Pooling(·) through the following Equation 1, so as to compress the word sequence dimension of the encoded hidden state vector:

Equation 1: h̄ = Pooling(h)

where h denotes the encoded hidden state vector and h̄ denotes the encoded hidden state vector after the pooling operation. For example, h may be an n×512-dimensional vector; after the pooling operation the word sequence dimension is reduced from n to 1, yielding the 1×512-dimensional vector h̄.

The vector h̄ output by the first layer then passes through the second, third, and fourth layers in sequence, each layer processing the output of the previous one: the second layer applies a linear transformation to h̄; the third layer applies the tanh activation function to the output of the second layer; and the fourth layer applies a linear transformation to the output of the third layer. The first indication information is obtained based on the output of the fourth layer. Illustratively, from the second layer to the fourth layer, h̄ is processed to obtain the final score:

Equation 2: l_D = tanh(h̄·W_1 + b_1)·W_2 + b_2

where l_D is the first indication information, which may be, for example, a sentence-level first score vector corresponding to the entire sentence; W_1 and b_1 are the linear transformation parameters of the second layer, and W_2 and b_2 are the linear transformation parameters of the fourth layer. Illustratively, W_1 may be a 512×512 matrix and b_1 a 1×512-dimensional vector, so that the 1×512-dimensional vector h̄ again yields a 1×512-dimensional vector after the linear transformation; this vector is then processed by the tanh activation function of the third layer and input into the fourth layer. W_2 may be a 512×12 matrix and b_2 a 1×12-dimensional vector, so that the activated 1×512-dimensional vector yields a 1×12-dimensional score vector l_D after the linear transformation, where l_D contains the scores of the information to be translated over the 12 candidate domain converters.
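The following is a minimal sketch of the four-layer structure described above (pooling, linear, tanh, linear), assuming PyTorch and 12 candidate domain converters; mean pooling is one possible choice of the pooling function, which this application does not fix:

```python
import torch
import torch.nn as nn

class DataDistributionPredictor(nn.Module):
    def __init__(self, d_model: int = 512, num_experts: int = 12):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)        # second layer: linear (W_1, b_1)
        self.fc2 = nn.Linear(d_model, num_experts)    # fourth layer: linear (W_2, b_2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h_bar = h.mean(dim=1)                         # Equation 1: pool over the word sequence dim
        return self.fc2(torch.tanh(self.fc1(h_bar)))  # Equation 2: scores l_D, (batch, 12)
```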
Step 2023, the electronic device determines, according to the first indication information, a target domain converter corresponding to the information to be translated from the plurality of candidate domain converters.
Step 2031, the electronic device decodes the first coding feature to obtain a decoded feature; converting the decoding characteristics based on a target field converter corresponding to the information to be translated; and obtaining a translation result corresponding to the information to be translated based on the decoding characteristics after the conversion processing.
For example, the electronic device may use the candidate domain converter having the highest probability as the target domain converter based on the first indication information.
The first indication information is a preliminary domain judgment based on the information to be translated as a whole, and may be regarded as a sentence-level judgment. For example, the first indication information may be a sentence-level first score vector: the encoded hidden state vector of the entire sentence output by the encoder is input into the data distribution prediction module, which outputs the sentence-level first score vector corresponding to the entire sentence, and the candidate domain converter with the highest score is taken as the sentence-level target domain converter corresponding to the entire sentence.
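Continuing the illustrative sketches above (reusing h and DataDistributionPredictor), sentence-level selection could then look as follows; this is example usage, not a prescribed interface:

```python
predictor = DataDistributionPredictor()
l_D = predictor(h)                          # first indication information, shape 1 x 12
sentence_expert = int(l_D.argmax(dim=-1))   # index of the sentence-level target domain converter
```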
The decoding feature may be a decoding hidden state vector, for example, the encoding hidden state vector output by the encoder is input to a decoder, and the decoding is performed on the encoding hidden state vector by the decoder to obtain the decoding hidden state vector.
In one possible implementation, the decoder includes at least two decoding levels, each decoding level corresponding to a respective plurality of candidate domain converters. The first indication information characterizes the possibility that each candidate domain converter is used as a target domain converter corresponding to the corresponding decoding level of the information to be translated. By way of example, one possible implementation of step 2023 may include: and the electronic equipment determines the target domain converter corresponding to the information to be translated in the corresponding decoding level from the candidate converters corresponding to each decoding level according to the first indication information. Accordingly, one possible implementation of step 203 may include: for each decoding level, according to the decoding characteristics of the information to be translated in the corresponding decoding level, converting the information to be translated in a target field converter corresponding to the corresponding decoding level to obtain converted decoding characteristics, and outputting the converted decoding characteristics; and outputting a translation result of the information to be translated according to the converted decoding characteristics output by the last decoding hierarchy.
In one possible implementation, each decoding level may correspond to its own group of converters, and the candidate domain converters included in one group may jointly cover all domains. For example, suppose there are 6 domains in total: domain A, domain B, domain C, domain D, domain E, and domain F, and each group contains 4 candidate domain converters. Among the 4 candidate domain converters of the first decoding level: the 1st converter corresponds to domains A and B, the 2nd converter corresponds to domains C and E, the 3rd converter corresponds to domain D, and the 4th converter corresponds to domain F. Among the 4 candidate domain converters corresponding to the second decoding level, the coverage may likewise be: the 1st converter corresponds to domains A and B, the 2nd converter to domains C and E, the 3rd converter to domain D, and the 4th converter to domain F. The network parameters of the 1st converter of the first decoding level and the 1st converter of the second decoding level may differ, i.e., they are two different converters. In addition, the above example merely illustrates the correspondence between converters and domains within each decoding level; the method of this application is of course not limited by the values or correspondences in the example. In practical applications, the correspondence between domains and each group of converters, the number of domains, the number of converters corresponding to each decoding level, and so on may be configured as needed, which is not limited in this application. For example, the method of this application may support totals of domains and converters on the order of tens, hundreds, or more.
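The example correspondence above could be captured in a configuration such as the following illustrative sketch; the domain names and counts are those of the example, not requirements of this application:

```python
# One group of converters per decoding level; the group jointly covers all domains.
DOMAINS = ("A", "B", "C", "D", "E", "F")
CONVERTER_DOMAINS = {
    1: ("A", "B"),   # converter 1 of a level
    2: ("C", "E"),   # converter 2
    3: ("D",),       # converter 3
    4: ("F",),       # converter 4
}
NUM_DECODING_LEVELS = 2  # each level holds its own, independently parameterized group
```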
In yet another possible embodiment, this application designs the technical idea of selecting a corresponding target domain converter for each target segment, where each target segment is a segment of the target language corresponding to the information to be translated; a segment may be, but is not limited to, a word (token). That is, the target domain converter of the information to be translated includes the target domain converters of the respective target segments corresponding to the information to be translated. The decoding features of the information to be translated may include the segment decoding features of the respective target segments, and the target domain converter is used for converting the segment decoding features of each target segment. Illustratively, step 202 may be implemented by the following steps 2024 to 2025, and, correspondingly, step 203 may be implemented by the following step 2032.
Step 2024, the electronic device obtains the first coding feature of the information to be translated.
In this step, the first coding feature is obtained in the same manner as in step 2021, and will not be described in detail here.
Step 2025, the electronic device obtains segment decoding features of each target segment corresponding to the information to be translated based on the first coding feature, obtains second indication information of the target segment based on the segment decoding feature of each target segment, and determines a target domain converter of the target segment based on the second indication information of the target segment.
Step 2032, for each target segment, the electronic device outputs a translation result of the target segment through the target domain converter corresponding to the target segment based on the segment decoding feature of the target segment.
Wherein the second indication information for each target segment characterizes a likelihood that the respective candidate domain converter is a target domain converter for the target segment. The second indication information of the target segment may include scores of target domain converters of which the respective candidate domain converters are the target segment; the score may be a similarity, probability, score, or any information that characterizes the likelihood. For each target segment, the electronic device may use the candidate domain converter with the highest probability as the target domain converter of the target segment based on the second indication information of the target segment.
The decoder includes at least one decoding level, and in this step, the electronic device may determine a target domain converter corresponding to the target segment at each decoding level. The target field converter of the target segment at each decoding level can be used for converting the segment decoding characteristics of the target segment at the corresponding decoding level.
The corresponding process flow for each decoding level is described first.
In one possible implementation manner, the electronic device outputs, based on the segment decoding characteristics of the target segment, a translation result of the target segment through a target domain converter corresponding to the target segment, including: for each decoding level, converting the target segment in a target field converter corresponding to the corresponding decoding level according to the segment decoding characteristics of the target segment in the corresponding decoding level to obtain converted segment decoding characteristics, and outputting the converted segment decoding characteristics; and outputting the translation result of the target fragment according to the fragment decoding characteristics after conversion processing output by the last decoding level.
It should be noted that, the segment decoding characteristics of the target segment at each decoding level are obtained by performing decoding processing on the first coding characteristics and the output result of the target segment at the last decoding level. Illustratively, for each target segment, the segment decoding characteristics of that target segment at each decoding level are obtained by:
for a first decoding level, obtaining a segment decoding feature of the target segment at the first decoding level based on the first coding feature and a second coding feature of a translated segment preceding the target segment;
For the second decoding level, obtaining the segment decoding characteristics of the target segment at the second decoding level based on the first coding characteristics and the segment decoding characteristics of the target segment after conversion processing output by the last decoding level;
wherein the first decoding level is a first decoding level of at least two decoding levels, and the second decoding level is any decoding level other than the first decoding level.
The translated segments refer to the segments whose translation results were output before the target segment. In translation, the target segments can be output sequentially according to their order, where the order refers to the output order. For example, if 3 English words are to be output, the English word "happy" with order 1 is output first, then the English word "new" with order 2, and finally the English word "year" with order 3.
It should be noted that, for the i-th segment whose translation result is to be output, the first coding feature and the second coding features of the first i-1 segments whose translation results have already been output may be input into the first decoding level, so as to obtain the segment decoding feature of the i-th segment at the first decoding level. For example, when a Chinese sentence is translated into an English sentence, for the i-th English segment to be output, the translation result of the i-th English segment may be output in combination with the second coding features of the i-1 English segments already output before; feature extraction may be performed on those i-1 English segments to obtain the second coding features.
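A minimal sketch of this sequential output process is given below, assuming greedy selection of each segment; decode_step is a hypothetical function standing in for the decoder (including the domain converters) and is not part of this application's interface:

```python
import torch

def greedy_translate(encoder_out, decode_step, max_len=50, bos_id=1, eos_id=2):
    # decode_step(encoder_out, prefix_ids) -> logits over the target vocabulary;
    # it re-encodes the already-output segments (the "second coding features")
    prefix = [bos_id]
    for _ in range(max_len):
        logits = decode_step(encoder_out, torch.tensor([prefix]))
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        prefix.append(next_id)   # e.g. "happy" (order 1), then "new" (2), then "year" (3)
    return prefix[1:]
```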
A description is given below of how to determine the target domain converter for each decoding level.
The electronic device may determine a target domain converter for the target segment at each decoding level based on segment decoding characteristics of the target segment at least one decoding level. In one possible implementation, the segment decoding characteristics of one decoding level may be used to predict the target domain converter corresponding to the target segment at each level, and correspondingly, the implementation of step 2025 may include step 2025-1. In another possible manner, the segment decoding characteristics of each decoding level may be utilized to predict the target domain converter corresponding to the target segment at the corresponding decoding level, and accordingly, the implementation of step 2025 may include step 2025-2.
Step 2025-1, for each target segment, determining, by the electronic device, second indication information for the target segment based on segment decoding characteristics of the target segment at the first decoding level; the electronic device determines a target domain converter corresponding to the target segment at each decoding level based on second indication information corresponding to the target segment at the first decoding level.
And the second indication information of the target segment at the first decoding level characterizes the possibility that each candidate domain converter is used as the target domain converter corresponding to the target segment at each decoding level. The electronic device can determine a target domain converter corresponding to the target segment at each decoding level by using the second indication information.
For example, the electronic device may determine, from the plurality of candidate domain converters corresponding to each decoding level, a target domain converter corresponding to the target segment at each decoding level based on the second indication information of the target segment at the first decoding level.
For example, suppose each layer corresponds to 12 candidate domain converters: the 1st domain converter of each layer corresponds to the legal domain and the medical domain; the 2nd domain converter of each layer corresponds to the IT domain; ...; the 12th domain converter of each layer corresponds to the artificial intelligence domain. The second indication information may be a 1×12 second score vector containing 12 scores, where the highest score corresponds to the converter of the artificial intelligence domain. Then, in each layer, the 12th of the 12 candidate domain converters corresponding to that layer is taken as the target domain converter of the target segment at that layer.
For another example, if the decoder includes 3 layers in total, each layer corresponds to 12 candidate domain converters, and there are 36 candidate domain converters in total, the second indication information may be a second score vector of 3×12, and 36 scores may be included, where 12 scores of each row represent scores of 12 candidate domain converters of the corresponding layer. For example, based on the highest score of line 1, determining that the target domain converter of the target segment at layer 1 is the 2 nd of the 12 candidate domain converters of layer 1; based on the highest score of line 2, it is determined that the target domain converter of the target segment at layer 2 is the 5 th of the 12 candidate domain converters of layer 2.
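A toy illustration of this per-layer selection, assuming the 3×12 score matrix of the example (random scores stand in for real ones):

```python
import torch

scores = torch.rand(3, 12)         # second indication info: 3 decoder layers x 12 converters
per_layer = scores.argmax(dim=1)   # one converter index per layer, e.g. tensor([1, 4, 11])
```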
Step 2025-2, for each target segment, determining second indication information corresponding to the target segment at the corresponding decoding level according to the segment decoding characteristics of the target segment at each decoding level, and determining a target domain converter corresponding to the target segment at the corresponding decoding level according to the second indication information corresponding to the target segment at the corresponding decoding level.
And the second indication information corresponding to the target segment at the corresponding decoding level characterizes the possibility that each candidate domain converter is used as the target domain converter corresponding to the target segment at the corresponding decoding level. For example, for each decoding level, second indication information corresponding to the target segment at the decoding level is obtained based on the similarity between the segment decoding characteristics of the target segment at the decoding level and the domain characteristic vectors of the candidate domain converters of the decoding level.
In one possible manner, each decoding level corresponds to a respective plurality of candidate domain converters, and determining the implementation manner of the target domain converter in the second manner may include: for each decoding level, the electronic device determines a target domain converter corresponding to the target segment in the corresponding decoding level from all candidate converters corresponding to the corresponding decoding level according to second indication information corresponding to the target segment in the corresponding decoding level.
For example, each layer corresponds to 12 candidate domain converters: the second indication information of each layer may be a second score vector of 1×12, and may include 12 scores. For the 1 st layer, selecting a 12 th domain converter with the highest score in the second score vector of the 1 st layer; for layer 2, the 3 rd domain converter with the highest score in the second score vector of layer 2 is selected.
The manner of determining the second indication information is described below:
in this application, each candidate domain converter corresponds to a domain feature vector. In one possible manner, the likelihood of the candidate domain converter being the target domain converter of the target segment may be predicted from the domain feature vector of the candidate domain converter. In yet another possible way, for each target segment, the likelihood that the respective candidate domain converter is the target domain converter may also be predicted in combination with the segment decoding characteristics of the translated segment preceding the target segment. Accordingly, the manner of acquiring the second indication information may include the following manner 1 and manner 2.
Mode 1: for each target segment, the electronic device may obtain the second indication information based on the similarity between the segment decoding features of the target segment and the domain feature vectors of the respective candidate domain converters.
For example, the similarity between the segment decoding feature and each domain feature vector is taken as the score of the target domain converter for which the corresponding candidate domain converter is the target segment.
If step 2025 is implemented by step 2025-1, in mode 1, the segment decoding characteristics of the target segment are segment decoding characteristics of the target segment at the first decoding level.
If step 2025 is implemented by step 2025-2, in mode 1, the segment decoding characteristics of the target segment are the segment decoding characteristics of the target segment at each decoding level. That is, for each decoding level, the electronic device obtains second indication information of the target segment at the decoding level based on the similarity between the segment decoding features of the target segment at the decoding level and the domain feature vectors of the respective candidate domain converters.
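A minimal sketch of mode 1 follows; cosine similarity is an assumption made for illustration, since this application leaves the concrete similarity measure open:

```python
import torch
import torch.nn.functional as F

d_model = 512
seg_feat = torch.rand(d_model)           # segment decoding feature of the target segment
domain_feats = torch.rand(12, d_model)   # one domain feature vector per candidate converter
p_t = F.cosine_similarity(seg_feat.unsqueeze(0), domain_feats, dim=1)   # 12 scores
```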
Mode 2: for each target segment, the second indication information of the target segment is determined based on the segment decoding features of the target segment and the segment decoding features of the translated segments preceding the target segment.
If step 2025 is implemented by step 2025-1, in mode 2, the segment decoding characteristics of the target segment are segment decoding characteristics of the target segment at the first decoding level. The segment decoding characteristics of the translated segment are segment decoding characteristics of the translated segment at the first decoding level.
If step 2025 is implemented by step 2025-2, in mode 2, the segment decoding characteristics of the target segment are the segment decoding characteristics of the target segment at each decoding level. The segment decoding characteristics of the translated segment are segment decoding characteristics of the translated segment at the first decoding level. That is, for each decoding level, the electronic device determines second indication information for the target segment at the decoding level based on the segment decoding characteristics of the target segment at the decoding level and the segment decoding characteristics of the translated segment at the decoding level. For example, for the ith English segment to be output, based on the segment decoding feature of the ith English segment in the 3 rd layer and the segment decoding feature of the previous i-1 English segment in the 3 rd layer, the second indication information of the ith English segment in the 3 rd layer is obtained.
The electronic device may also determine the second indication information of the target segment by combining modes 1 and 2 above. The manner of determining the second indication information by combining mode 1 and mode 2 includes the following steps S1 to S3:
step S1, for each target segment, the electronic device can acquire a third weight corresponding to the segment decoding characteristics of the target segment and a fourth weight corresponding to the segment decoding characteristics of the translated segment before the target segment;
Step S2, based on the third weight and the fourth weight, weighting the segment decoding characteristics of the target segment and the segment decoding characteristics of the translated segment before the target segment to obtain target decoding characteristics;
and step S3, obtaining the second indication information based on the similarity between the target decoding characteristics and the domain characteristic vectors of the candidate domain converters.
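A sketch of steps S1 to S3 under the same illustrative assumptions (cosine similarity; w3 and w4 are placeholders for the third and fourth weights, whose values this application does not fix):

```python
import torch
import torch.nn.functional as F

def second_indication(seg_feat, prev_feats, domain_feats, w3=0.5, w4=0.5):
    # steps S1-S2: weight the target segment's feature against the pooled
    # features of the translated segments to obtain the target decoding feature
    target_feat = w3 * seg_feat + w4 * prev_feats.mean(dim=0)
    # step S3: similarity to each converter's domain feature vector
    return F.cosine_similarity(target_feat.unsqueeze(0), domain_feats, dim=1)
```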
In one possible implementation, the target domain converter of the target segment may also be determined by combining the first indication information and the second indication information. Illustratively, the implementation of step 202 may include steps 2024, 2022, and 2025, where, in step 2025, determining the target domain converter of the target segment based on the second indication information of the target segment includes: for each target segment, determining the target domain converter of the target segment according to the first indication information and the second indication information of the target segment. For example, the first indication information and the second indication information may be integrated, and the integrated indication information used to determine the target domain converter of the target segment. The manner of determining the target domain converter based on the first indication information and the second indication information may be implemented by the following steps S4 to S6:
Step S4, acquiring a first weight corresponding to the first indication information and a second weight corresponding to the second indication information;
step S5, weighting the first indication information and the second indication information based on the first weight and the second weight to obtain third indication information;
and S6, determining a target domain converter of the target segment based on the third indication information.
In step S4, the first weight and the second weight may be obtained as follows: for each target segment, the second weight is determined based on the position of the target segment among the target segments, and the first weight is obtained based on the second weight, where the second weight corresponding to a target segment is positively correlated with its position.
It should be noted that the third indication information characterizes the likelihood that each candidate domain converter is a target domain converter of the target segment. For example, the third indication information may be a third score vector including scores of the respective candidate domain converters. The electronic device may take the candidate domain converter with the highest probability as the target domain converter.
The determination of the target domain converter corresponding to a target segment is a domain judgment for that target segment, which is a finer-grained judgment than the domain judgment for the information to be translated as a whole. For example, if a Chinese sentence is translated into a corresponding English sentence, the target segment may be an English segment of the translation, e.g., an English segment may include at least one English word. Determining the domain converter corresponding to an English segment amounts to a word-level domain judgment; for example, for a certain English word, the word-level target domain converter corresponding to that word can be obtained based on the word-level second score vector corresponding to that word.
In a technical implementation, as shown in fig. 5, the machine translation model may include an expert selector and respective expert modules, and a domain converter may be implemented by an expert module, e.g., one domain converter corresponds to one expert module. The process of determining the target domain converter of a target segment may be realized by the expert selector. For example, the sentence-level first score vector (the sentence-level expert score result in the figure) is determined by the data distribution prediction module; the word-level second score vector (the word-level expert score result in the figure) is determined by the expert selector; and the final word-level expert score vector (the expert score result in the figure) is obtained based on the sentence-level first score vector and the word-level second score vector.
Illustratively, the first indication information may be expressed as P_s(Ept|E_out), where E_out denotes the encoded hidden state vector of the information to be translated and Ept denotes a candidate domain converter; for example, Ept = 1 denotes the 1st candidate domain converter, and P_s(Ept = 1|E_out) denotes the score of the 1st candidate domain converter based on the encoded hidden state vector.
By way of example, the manner of obtaining the second indication information is illustrated below, taking Equation 3 as an example. For each target segment, the electronic device may determine the similarity between the target decoding feature of the target segment and the domain feature vectors of the respective candidate domain converters through the following Equation 3, to obtain the second indication information of the target segment:

Equation 3: P_t(Ept = k) = Sim([H_out,1~i], DS_k), where Sim([H_out,1~i], DS_k) = max_j sim(H_out,1~i, DS_k^j)

where P_t is the second indication information, which may be, for example, a second score vector containing the scores of the respective candidate domain converters; DS_k denotes the domain feature vectors of the k-th candidate domain converter, and DS_k^j refers to the j-th domain feature vector among the plurality of domain feature vectors of the k-th candidate domain converter; H_out,1~i denotes the target decoding hidden state of the i-th target segment; and (x_kj, y_kj) denotes the j-th hidden state center point among the plurality of hidden state center points of the k-th candidate domain converter, where x and y denote the data sources used in computing that center point. For example, x denotes a Chinese sentence and y denotes the segments of the parallel English sentence corresponding to the Chinese sentence; when computing the hidden state center point, the segment decoding features, i.e., the decoding hidden states, of the English segments are used. Reference may be made to the prototype database construction flow corresponding to fig. 21, which is not detailed here. English parallel sentences refer to the parallel corpus used for training in the model training stage: a parallel corpus is a bilingual or multilingual corpus composed of original texts and their parallel corresponding translations, and here the English parallel sentences are the parallel corpus of the Chinese sentences.

Sim([H_out,1~i], DS_k) denotes the similarity between the target decoding hidden state and the k-th candidate domain converter.
Example 1: in Equation 3, for the i-th target segment, the decoding hidden state vector H_out,i of the i-th target segment may be used directly to compute the similarity, i.e., in Equation 3, H_out,1~i = H_out,i.
Example 2, in equation three, for the ith target segment, the ith target segment H may be utilized out,i And the decoded hidden state vector H of the previous i-1 target segments which have been output before out,0~i-1 Obtaining the target decoding hidden state vector H out,1~i . The following applies to the H in the procedure shown in FIG. 6 out,1~i The acquisition process of (1) is described as follows:
FIG. 6 is a schematic diagram of the processing flow corresponding to the expert selector. As shown in fig. 6, the electronic device may use an attention model to combine the decoding hidden state vector H_out,i of the i-th target segment with the decoding hidden state vectors H_out,0~i-1 of the translated first i-1 target segments, obtaining a weighted decoding hidden state H̃_out,i. The weighted decoding hidden state H̃_out,i is then merged with the previous decoding hidden states H_out,0~i-1 to obtain the target decoding hidden state vector H_out,1~i, which is input into the expert selector to obtain the word-level expert score result. The merge operation may be, for example, a pooling operation over H̃_out,i and H_out,0~i-1, or a splicing (concatenation) operation over H̃_out,i and H_out,0~i-1, and the like. It should be noted that, by using the attention mechanism to obtain the weighted decoding hidden state vector and merging it with the decoding hidden state vectors of the translated segments, the resulting target decoding hidden state vector can attend to the most relevant previous target segments, such as the English words translated and output before.
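The following sketch illustrates, under assumed shapes and a simple dot-product attention, how the weighted decoding hidden state H̃_out,i could be obtained and merged with the history H_out,0~i-1; the function name, the attention form, and the mean-pooling merge are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def target_decoding_state(h_i: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
    """h_i: (d,) decoding hidden state of the i-th target segment;
    h_prev: (i-1, d) decoding hidden states of the translated segments.
    Returns an illustrative H_out,1~i of shape (d,). Assumes i > 1."""
    scores = h_prev @ h_i          # (i-1,) dot-product attention scores
    weights = F.softmax(scores, dim=0)
    h_weighted = weights @ h_prev  # (d,) weighted decoding hidden state
    # merge operation: here a mean pooling over the history plus the
    # weighted state; a concatenation would be the other option named above
    merged = torch.cat([h_prev, h_weighted.unsqueeze(0)], dim=0).mean(dim=0)
    return merged
```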
For example, one candidate domain converter may correspond to at least one domain feature vector. For each candidate domain converter, if the candidate domain converter corresponds to a plurality of domain feature vectors, the electronic device may calculate the similarity between the target decoding feature and each domain feature vector of that candidate domain converter, obtaining a plurality of similarities, and take the maximum of these similarities as the similarity between the target segment and the candidate domain converter. For example, the electronic device calculates the similarity of the i-th target segment to each domain feature vector of the k-th candidate domain converter; if the similarity between the i-th target segment and the j-th domain feature vector of the k-th candidate domain converter is the largest, then the similarity between the i-th target segment and the k-th candidate domain converter is the similarity between the i-th target segment and that j-th domain feature vector. It should be noted that, in this application, a prototype database including the domain feature vectors of each candidate domain converter may be pre-constructed. For example, where a candidate domain converter is implemented by an expert module, the domain feature vector may be represented as a hidden state center point, and the prototype database may then include the hidden state center points of each expert module, where one expert module may correspond to at least one hidden state center point.
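A possible rendering of this max-similarity lookup against the prototype database is sketched below; cosine similarity and the softmax normalization are assumptions, since the text does not fix the similarity function here.

```python
import torch
import torch.nn.functional as F

def word_level_scores(h_target: torch.Tensor,
                      prototypes: list[torch.Tensor]) -> torch.Tensor:
    """h_target: (d,) target decoding hidden state H_out,1~i.
    prototypes[k]: (m_k, d) hidden state center points of the k-th expert.
    Returns an illustrative word-level score vector P_t of shape (K,)."""
    scores = []
    for centers in prototypes:
        # similarity to the k-th converter = max over its center points
        sims = F.cosine_similarity(centers, h_target.unsqueeze(0), dim=-1)
        scores.append(sims.max())
    return torch.softmax(torch.stack(scores), dim=0)
```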
By way of example, the manner of obtaining the third indication information based on the first indication information and the second indication information is illustrated below using formula four. For the implementations of steps S4 and S5, the electronic device obtains the third indication information based on the first indication information and the second indication information through the following formula four:
Formula four:

    P_i^w = (1 - t/T) * P_s + (t/T) * P_t
wherein P_i^w denotes the third indication information corresponding to the target segment w_i. The third indication information may be a third score vector including the score of the target segment w_i for at least one candidate domain converter, which is a word-level score. P_s is the first indication information and may be a sentence-level score. P_t is the second indication information and may be a word-level score.
For example, the word-level score of target segment w_i at the k-th candidate domain converter may be expressed as P_t = Sim([H_out,1~i], DS_k); the sentence-level score of target segment w_i at the k-th candidate domain converter may be expressed as P_s(Ept = k | E_out). Here t denotes the position of the target segment in the output sequence and T denotes the total number of target segments.
It should be noted that when t = 0, i.e. for the first target segment, the value of the third indication information is the same as the first indication information, i.e. the sentence-level first score vector; at this time the target domain converter of the first target segment is determined directly by the first indication information, and that target domain converter is a sentence-level domain converter. When t = T, i.e. for the last target segment, the value of the third indication information is the same as the second indication information, i.e. the word-level second score vector. When 0 < t < T, the sentence-level score vector and the word-level score vector are both taken into account in the final third score vector.
As shown in the flow corresponding to the expert selector in fig. 6, the expert selector stores a prototype database, and the present application may perform the similarity calculation for the word-level expert module based on the hidden state center points of each candidate domain converter in the prototype database. For example, the target decoding hidden state can be characterized by merging the decoding hidden states of the translated segments with that of the current target segment, so that it contains the contextual features of the target segment to be translated; performing the word-level similarity calculation with this target decoding hidden state helps improve the accuracy of determining the candidate domain converter most relevant to the target segment. The first score vector and the second score vector are then integrated. For example, the word-level candidate domain converter scores and the sentence-level candidate domain converter scores are integrated by linear interpolation with a dynamic interpolation factor based on the decoding position, finally yielding the word-level candidate domain converter, such as the word-level expert module corresponding to the i-th word to be translated and output.
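Under the boundary conditions stated above, one consistent form of this dynamic interpolation is the linear schedule sketched below; the linear form itself is an assumption, since only the endpoints and the position-based interpolation factor are stated.

```python
def third_score(p_s: list[float], p_t: list[float], t: int, T: int) -> list[float]:
    """Illustrative formula four: interpolate the sentence-level scores p_s
    and the word-level scores p_t by the decoding position t, so that t = 0
    returns p_s and t = T returns p_t."""
    alpha = t / T if T > 0 else 0.0
    return [(1 - alpha) * s + alpha * w for s, w in zip(p_s, p_t)]
```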
The processing flow corresponding to the target domain converter is described below by way of the following example:
Illustratively, the processing corresponding to the target domain converter includes: for each decoding level, normalizing the decoding hidden state vector of the target segment at that decoding level through the target domain converter corresponding to the target segment at that decoding level, and performing a linear or nonlinear transformation on the normalized decoding hidden state vector to obtain the converted decoding hidden state vector.
For example, taking the conversion flow in which the target domain converter is implemented by an expert module, fig. 7 is a schematic structural diagram of an expert module provided in an embodiment of the present application. As shown in fig. 7, the expert module may include a layer normalization and a feedforward neural network. For the decoding hidden state vector z_i of the i-th layer of the decoder, the electronic device may normalize the decoding hidden state vector z_i input to the target expert module based on the layer normalization function LN() through the following formula five:
Formula five:

    z̃_i = LN(z_i)
wherein z̃_i denotes the normalized decoding hidden state vector, i denotes the i-th decoding level in the decoder, and z_i denotes the decoding hidden state vector of the i-th decoding level. For example, z_i may be a 1×512 vector; after the target expert module normalizes z_i, each value of the 1×512 vector z_i is converted into a value in (0, 1), yielding the 1×512 vector z̃_i.
For the feedforward neural network, the electronic device may transform the normalized state vector output by the previous layer using an FFN (Feed-Forward Network) and then add and fuse the transformed vector with z_i through the following formula six and formula seven:
Formula six:

    o_i = z_i + FFN(z̃_i)
wherein the feedforward neural network FFN() expands as the following formula seven:
Formula seven:

    FFN(z̃_i) = ReLU(z̃_i · W_1^i) · W_2^i
wherein o_i denotes the converted decoding hidden state vector obtained by transforming the normalized decoding hidden state vector and fusing it with z_i, and z_i denotes the decoding hidden state vector of the i-th decoding level, i.e. the decoding hidden state vector input to the target expert module. FFN(z̃_i) denotes the processing of the normalized vector z̃_i by the feedforward neural network, as shown in formula seven. As shown in fig. 7, the first layer in the feedforward neural network is a fully connected layer, the second layer is a ReLU (Rectified Linear Unit) activation function layer, and the third layer is a fully connected layer. W_1^i is the linear transformation parameter of the first layer in the feedforward neural network, and W_2^i is the linear transformation parameter of the third layer. ReLU denotes the activation applied to the vector after the linear transformation by W_1^i; the activated vector is then further linearly transformed by W_2^i, and the converted decoding hidden state vector is output.
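Formulas five to seven together describe a residual adapter-style block; the PyTorch sketch below mirrors that structure. The inner width d_ff and the module name are assumptions not given in the text.

```python
import torch
import torch.nn as nn

class ExpertModule(nn.Module):
    """Sketch of the expert module of fig. 7: layer normalization, then a
    two-linear-layer feedforward network with a ReLU between, fused with
    the input decoding hidden state z_i by a residual addition."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # formula five: LN(z_i)
        self.w1 = nn.Linear(d_model, d_ff)  # first fully connected layer
        self.w2 = nn.Linear(d_ff, d_model)  # third fully connected layer

    def forward(self, z_i: torch.Tensor) -> torch.Tensor:
        z_norm = self.norm(z_i)                         # formula five
        ffn_out = self.w2(torch.relu(self.w1(z_norm)))  # formula seven
        return z_i + ffn_out                            # formula six: add and fuse
```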
For example, as shown in fig. 8, if a sentence-level expert module is used, the corresponding translation process may include: Step 1, first convert the Chinese sentence to be translated into an encoded hidden state vector using the encoder. Step 2, based on the encoded hidden state vector, predict the data distribution category to which the translation request belongs through an independently trained data distribution prediction module that is independent of the decoder, obtaining a sentence-level first score vector. Step 3.1, based on the first score vector, obtain the translation result of the whole sentence using the sentence-level expert module corresponding to the first score vector. For the translation request "an esophageal double balloon catheter for treating esophageal lumen stenosis or occlusion", it is first encoded into an encoded hidden state using the encoder of the reference machine translation model, where the reference machine translation model comprises a trained encoder and a trained decoder; then the sentence-level expert module is selected through the data distribution prediction module, and the sentence-level expert score vector of the translation request (the sentence-level expert score result in the figure) is obtained using the data distribution prediction module.
For another example, as shown in fig. 9, if a word-level expert module is used, the corresponding translation process may include: after the prediction result of the data distribution prediction module is obtained through step 2, execute step 3.2: calculate a word-level expert score vector for the i-th English word of the translation output sequence using the expert selector, and select the corresponding word-level expert module according to the word-level expert score vector to process the decoding hidden state of the translation request, obtaining the i-th English word of the output sequence. As shown in fig. 9, the expert selected when generating the i-th word is 'expert 2', and the generated i-th word is 'catheter'. Then execute step 4: repeat step 3.2 until the final output is obtained. As shown in fig. 10, when execution proceeds to step j, the selected expert is 'expert 1', the corresponding j-th word generated is 'esophageal', and the encoding feature of the j-th word is input into the decoder to continue predicting the next word in combination with the j-th word. In this process, the word-level expert selector switches the word-level expert module to be used for the next word based on the decoding hidden state of the next word.
For example, the translation process may be autoregressive, i.e., the target segment output at step i-1 may be used to predict the content output at step i. Taking the word-level expert module selection flow shown in figs. 11, 12 and 13 as an example, the process of translating the Chinese sentence "an esophageal double balloon catheter for treating esophageal lumen stenosis or occlusion" into an English sentence is described below with reference to the reference machine translation module (Base NMT Module), the data distribution prediction module (Discriminator Module) and the expert selector (Expert Switch) in figs. 11, 12 and 13. The autoregressive translation process begins with a <start> tag and ends with an <end> tag, where a 'token' is the smallest element of the input or of the translation output sequence; in other words, if a sentence is divided by spaces, a 'token' corresponds to a 'word' in the sentence. As shown in fig. 11, beginning with the <start> tag and using expert 2, the output "An esophageal double balloon catheter" is translated; as shown in fig. 12, using the <start> tag and "An esophageal double balloon catheter" from the previous step, expert 1 is used to output "for the treatment of"; as shown in fig. 13, using the <start> tag and "An esophageal double balloon catheter for the treatment of" from the previous steps, expert 2 is used to output "esophageal stenosis or stricture". On this basis the final translation result is obtained: An esophageal double balloon catheter for the treatment of esophageal stenosis or stricture.
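The overall loop can be summarized as the following hypothetical sketch; every interface here (encoder, decoder, expert_selector, the expert call) is an assumed placeholder used only to show the per-token expert switching, not the patent's actual API.

```python
def translate(encoder, decoder, expert_selector, experts,
              src_tokens, start_id, end_id, max_len=128):
    """Illustrative autoregressive loop with word-level expert switching:
    at each step the selector scores the current decoding hidden state and
    the chosen expert module converts it before the next token is predicted."""
    e_out = encoder(src_tokens)          # encoded hidden state of the request
    out = [start_id]                     # begins with the <start> tag
    for _ in range(max_len):
        h = decoder.hidden_state(e_out, out)    # decoding hidden state
        k = expert_selector(h, step=len(out))   # word-level expert index
        h = experts[k](h)                       # converted decoding hidden state
        token = decoder.predict(h)
        out.append(token)
        if token == end_id:                     # ends with the <end> tag
            break
    return out
```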
According to the method, the information to be translated is obtained, the target domain converter corresponding to the information to be translated is determined from the plurality of candidate domain converters based on the information to be translated, and the translation result is obtained based on the target domain converter corresponding to the information to be translated.
Fig. 14 is a schematic flowchart of a method performed by an electronic device, which may be a terminal or a server, and the application is not limited in this regard. As shown in fig. 14, the method includes the following steps 1401-1403.
Step 1401, the electronic device displays a translation domain list including identification information of at least one domain among a plurality of candidate translation domains.
For example, the at least one domain included in the translation domain list may be all or part of the plurality of candidate translation domains; for example, the at least one domain may be: several popular domains recommended to the user, or domains of interest to the user, etc.
Step 1402, the electronic device obtains a first input of a user, where the first input is used to select a domain corresponding to a translation from the translation domain list.
For example, the electronic device obtains the first input, which may be that the electronic device obtains the identification information of the domain selected by the user from the translation domain list.
Step 1403, the electronic device downloads a domain converter of the corresponding domain in response to the first input.
In one possible manner, the electronic device may also prompt the user to update domains. The process includes: the electronic equipment displays update prompt information, the update prompt information prompting the user to update the domains used for translation; and the electronic equipment, in response to an acquired update instruction, updates the domain converters of the corresponding domains.
For example, the update prompt information may be used to prompt at least one of downloading a newly recommended domain, updating a downloaded domain, or deleting a downloaded domain. The user can operate based on the displayed update prompt information; for example, newly added domains, domains of interest learned from user behavior, and the like can be recommended to the user in real time. If the user triggers the download of a newly recommended domain, the electronic device may, based on the update indication triggered by the user, download the domain converter corresponding to that newly recommended domain. For another example, if the domain converters corresponding to some downloaded domains have not been used for a long time, the user may be prompted to delete the domain converters of those rarely used domains.
It should be noted that, for the electronic device, a machine translation model may be preconfigured. In an offline translation scenario, when offline translation is performed using the machine translation model of the present application, the technical solution includes:
1. The scheme when the user downloads the model for the first time;
In the related art, a machine translation model based on a hybrid expert architecture requires the user to: 1. select the translation direction of the model (e.g., Chinese-to-English translation); 2. download the entire machine translation model (including all parameters of all expert modules). The data traffic consumed is therefore large.
For the machine translation model of the present application, however, the user need only: 1. select the translation direction (domain) of the model; 2. download the necessary modules (the reference machine translation model, the data distribution prediction module, and the expert selector). Moreover, for the expert modules corresponding to the respective domains, the present application can realize the following (1)-(2): (1) recommend the corresponding expert modules to the user using big data on user behavior; or (2) the user manually selects the required domains and downloads the corresponding expert modules. For example, the expert modules of domains the user may prefer are recommended based on the user's browsing records, translation records, etc. As shown in fig. 15, in the present application the user only needs to download the reference translation network including the encoder and the decoder, and the data distribution prediction module; it is not necessary to download the domain converters corresponding to all domains. For example, only the domain converters corresponding to default popular domains are downloaded; for another example, only the domain converters corresponding to the domains selected by the user are downloaded.
The electronic device may also illustratively download expert selectors to support the selection of word level expert modules for translation. The expert selector may store a prototype database.
2. Daily maintenance and updates during the model usage phase (assuming the model size is about 200 MB and each expert module is about 1 MB);
In the related art, every model update requires updating all experts of the entire model; that is, the user needs to spend the traffic and time of about 200 MB per update. As shown in fig. 16, each model update requires updating the entire model (170 MB as illustrated in fig. 16) and suffers from performance degradation.
For the machine translation model of the present application, however, the user only needs to update the expert module of the particular domain that requires updating. Assuming one expert module needs to be updated, the user only needs to spend the traffic and time of about 1 MB. As shown in fig. 16, the expert for the biomedical domain needs only 5 MB (only 1 MB in the example), and other domains are unaffected. Moreover, after the user has used the model for a period of time, the user's domain preference can be detected automatically, and the expert modules of the corresponding domains are recommended according to that preference; as shown in fig. 16, domains such as medicine, patent, IT and restaurant are recommended for the user to select and add. This improves the user's experience of using the model.
Fig. 17 is a schematic diagram of the network structure of the machine translation model provided in the present application. As shown in fig. 17, the machine translation model may be an FGD-MoE (Fine-Grained Decoupled Mixture of Experts) model that includes three modules: (1) a reference machine translation model comprising a trained encoder and decoder; (2) a data distribution prediction module; (3) a hybrid expert module, where the hybrid expert module comprises an expert selector and the respective expert modules. The data distribution prediction module provides a prediction result; in the stage of using the trained machine translation model, the prediction result is the first score vector, while in the training phase the prediction result is the score of the training data in at least one data distribution category, for example a sentence-level expert module score. The decoder has a word-level decoding function and can provide the decoding hidden state of each target segment to the expert selector, so as to support the expert selector in giving more accurate word-level expert scores and thereby matching each target segment with the corresponding expert module for translation. Compared with the translation model network structures in the related art, each module in the present application can be trained independently, and only the necessary modules need to be updated at each model update, which can reduce training and deployment cost and improve model maintainability. Moreover, the expert selector can provide the expert module corresponding to each target segment, e.g., a word-level expert module, for translation, which can improve translation accuracy, especially for translation requests spanning multiple domains.
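The decoupling described here can be pictured with the following minimal container sketch; the attribute names follow the module names of fig. 17, while the class itself and its methods are illustrative assumptions.

```python
class FGDMoE:
    """Sketch of the decoupled structure of fig. 17: the three modules are
    held separately so each can be trained, downloaded or updated alone."""
    def __init__(self, base_nmt, discriminator, expert_switch, experts):
        self.base_nmt = base_nmt            # reference encoder + decoder
        self.discriminator = discriminator  # data distribution prediction module
        self.expert_switch = expert_switch  # expert selector (prototype database)
        self.experts = experts              # dict: category -> expert module

    def update_expert(self, category, new_expert):
        # only the ~1 MB expert module changes; the other modules are untouched
        self.experts[category] = new_expert
```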
In one possible implementation, the electronic device may also update modules in the local machine translation model based on the update data.
In one example, the electronic device may receive the first update data and the second update data if there is a new domain converter. The electronic device updates the data distribution prediction module based on the first update data; the electronic device adds a newly added domain converter in the machine translation model based on the second update data.
In yet another example, if the downloaded domain converter needs to be updated, the electronic device may further receive third update data, and update the corresponding downloaded domain converter based on the third update data.
In yet another example, if the electronic device downloads the expert selector, the electronic device may further receive fourth update data if the domain converter is newly added, and update the locally stored expert selector based on the fourth update data, such as updating a prototype database in the expert selector. The fourth update data may include the domain feature vector of the newly added domain converter, such as the hidden state center point of the newly added expert module.
In the method, a translation domain list is displayed to the user, the list including identification information of at least one domain among a plurality of candidate translation domains; in response to the user's first input, the domain converters of the corresponding domains selected in the list are downloaded, so that translation can be completed by downloading only some of the domain converters onto the electronic device.
Through study of the related art, the applicant of the present application found that the following problems exist in this field:
1. A machine translation model generally needs to meet translation requirements across multiple domains; for example, an NMT (Neural Machine Translation) model in the related art can translate data from multiple domains. In the related art, a reference NMT model is first trained using mixed data covering multiple domains, and the reference NMT model is then fine-tuned with data from different domains to obtain NMT models for the corresponding domains. However, the applicant found through research that it is necessary to know in advance which domain the data to be translated belongs to so that the NMT model of the corresponding domain can be invoked, and the overall set of NMT models is large in size, resulting in the technical disadvantage of poor practicality.
2. In the related art, a Mixture of Experts model may be employed to meet translation requirements across a variety of domains. A translation model based on the mixture-of-experts architecture in the related art generally includes a gating network and a plurality of experts. However, through studying the training process of such translation models, the applicant found that in the training stage the ordering of the experts and the decoding capability of each expert are influenced by the learning result of the gating network, and the learning result of the gating network is not controllable. For example, in iterative training, the experts selected by the gating network for the same data in different training batches are not identical, and may even differ greatly. In the related art, to ensure the consistency of the expert modules and the capability of the gating network, all modules of the translation model must be highly coupled and trained together. That is, even if only the data set of a certain domain is updated, all modules and all experts of the translation model have to be retrained and all model parameters adjusted; this necessarily increases the cost of model training, resulting in the technical disadvantage of low training efficiency.
Especially for models deployed on user equipment, all parameters of the on-device model need to be updated at each training round, so the updating process consumes a great deal of network cost, resulting in the technical disadvantages of excessive update cost and low update efficiency. In addition, for the other domains that do not need updating, this highly coupled joint training easily causes performance regression in those domains, resulting in the technical disadvantage of reduced translation quality.
To solve the above problems, the present application provides a model training method that adopts the technical idea of decomposed, module-by-module training of the machine translation model. For example, the trained encoder-decoder module is fixed and the data distribution prediction module is trained; the trained data distribution prediction module is then fixed, and the hybrid expert module in the machine translation model is trained using the trained data distribution prediction module.
It should be noted that the hybrid expert module includes each expert module, and one expert module is configured to implement a process flow corresponding to one candidate domain converter in the translation method flow, that is, a process of converting the decoding feature. In the following training method flow, expert modules are used to refer to corresponding candidate domain converters.
It should be noted that, in the training stage, the data distribution prediction module may be used to classify the data set, and the number of expert modules to be trained is determined based on the classification result; for example, one category corresponds to one expert, so 12 categories in total may correspond to 12 experts. Of course, the technical idea of having each decoding level correspond to a plurality of expert modules is also possible: assuming there are 3 decoding levels and each decoding level corresponds to 12 expert modules, there are 36 expert modules. On this basis, in the training phase, the first score vector output by the data distribution prediction module contains the score for each category; in the stage of using the trained machine translation model, the first score vector output by the data distribution prediction module contains the score for each expert module, i.e., each candidate domain converter.
The model training method is described below with reference to the flowchart shown in fig. 18:
fig. 18 is a schematic diagram of a method performed by an electronic device according to an embodiment of the present application. The method may be a model training method. As shown in fig. 18, the method includes the following steps 1801 to 1803.
Step 1801, the electronic device obtains a data set tag of the target data set.
The data set tag characterizes the data distribution category of each data item in the target data set; it includes a category tag for each data item, the category tag of a data item identifying its data distribution category. The data distribution category of a data item characterizes the category to which its semantic features belong; the present application may group data items with the same or similar semantic features into one data distribution category. The target data set may include at least one data item, which may be source data to be translated; for example, if the translation requirement is to translate Chinese sentences into English sentences, the target data set may include a plurality of Chinese sentences. In this application, the target data set is the data used to train the data distribution prediction module.
In one possible implementation manner, the electronic device may acquire semantic features of at least one data in the target data set, determine data distribution categories of the respective data based on the semantic features of the at least one data, and obtain a data set tag of the target data set. For example, the semantic features may be represented as semantic feature vectors that include feature data of the data in at least one dimension.
In the present application, the machine translation model may include an encoder-decoder module, a data distribution prediction module, and a hybrid expert module. The electronic equipment can train the encoder-decoder module to obtain a trained encoder, and then train the data distribution prediction module. In one possible example, the semantic features of the data may be obtained by the trained encoder in the machine translation model; these semantic features may be represented as encoded hidden state vectors obtained by feature extraction on the data by the encoder. One possible implementation of step 1801 includes: the electronic equipment obtains a first encoding feature of at least one data item in the target data set through the trained encoder, and determines the data distribution category of the at least one data item based on its first encoding feature, which may be the encoded hidden state vector, to obtain the data set tag.
In a possible embodiment, the target data set includes at least a first data set obtained by sampling a source data set to be translated. In yet another possible embodiment, the target data set may include the first data set and a second data set belonging to a target domain. The electronic equipment can obtain the data set tag under different category division modes based on the different sources of the data included in the target data set; accordingly, the embodiments of step 1801 may include the following three modes.
In mode one, the target data set includes the first data set, and the execution of step 1801 may include the following steps 18011a-18012a.
Step 18011a, the electronic device obtains, based on the trained encoder, a first encoding characteristic, such as an encoded hidden state vector, of each first data in the first data set.
Step 18012a, the electronic device classifies the first data according to the first coding feature of the first data, to obtain a data set tag of the target data set.
Before executing step 1801, the electronic device may train the encoder-decoder module to obtain a trained encoder, and then train the data distribution prediction module.
In this step, the electronic device may perform feature extraction on each first data by using the trained encoder, to obtain the encoded hidden state vector of each first data. The manner of obtaining the first coding feature of each first data is the same as that of step 2021, and will not be described here again.
In one possible implementation, a clustering approach may be used to cluster the individual data items in the target data set into multiple data distribution categories. The electronic device may cluster the first data items based on their encoded hidden state vectors to obtain a cluster label for each first data item, and take these cluster labels as the data set tag, where the cluster label of each first data item represents its data distribution category. For example, the electronic device may cluster the first data set in a supervised or in an unsupervised manner. The degree of similarity between data items may be measured by a vector distance: the smaller the vector distance, the closer the feature distributions of the two data items and the greater their similarity. By calculating the vector distance between each pair of encoded hidden state vectors of the first data items, the electronic device may cluster similar data items into one data distribution category.
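As a concrete example of this unsupervised labeling step, the following sketch clusters the pooled encoded hidden state vectors with K-Means; K-Means and scikit-learn are one possible choice consistent with the vector-distance clustering described here, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans

def dataset_labels(encoded_states: np.ndarray, num_categories: int = 12) -> np.ndarray:
    """encoded_states: (n_sentences, 512) pooled encoded hidden state vectors
    of the sampled first data set. Returns one cluster id per sentence, used
    as its data distribution category label."""
    km = KMeans(n_clusters=num_categories, n_init=10, random_state=0)
    return km.fit_predict(encoded_states)
```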
For example, the electronic device may obtain the source data set from a machine translation total data set and sample the source data set to obtain the first data set. The machine translation total data set includes the source data set and the translation result data of the source data set; for example, for Chinese-English translation requirements, the source data are Chinese sentences and the translation result data are English sentences. The source data set includes data from multiple domains; the electronic device can randomly sample the source data set to obtain the first data set, so that the data distribution of the first data set over the domains is the same as or similar to that of the source data set. For example, suppose the source data set contains 1,000,000 items belonging to four domains (legal, medical, spoken language and patent) with data ratios of 20%, 25%, 32% and 5%; a first data set of 1000 items sampled from this source data set may then have data ratios of 20%, 25%, 31% and 4% for the four domains. For example, in mode one, a first data set of 1000 items may be randomly sampled from the 1,000,000-item source data set and clustered into 12 data distribution categories, each data distribution category corresponding to one expert module in the hybrid expert module.
For example, the encoder-decoder module may be a general NMT model, and the first data set may be randomly sampled from the training set of the general NMT model, so that the data distribution of the first data set is the same as or similar to that of the training set of the general NMT model, namely S_D ≈ S_T, where the data distribution prediction module may be a classifier, S_D denotes the data distribution of the target data set used in training the data distribution prediction module, and S_T denotes the data distribution of the training set of the general NMT model.
In mode two, the target data set includes the first data set and a second data set, and the execution of step 1801 may include the following steps 18011b-18012b.
Step 18011b, the electronic device obtains, based on the trained encoder, a first encoding characteristic for each first data in the first data set, and a first encoding characteristic for each second data in the second data set.
Step 18012b, the electronic device clusters each first data and each second data based on the first coding features of each first data and each second data and the domain label corresponding to each second data, to obtain a dataset label of the target dataset.
The electronic device may use the first data set and the second data set as target data sets, and cluster each data (including each first data and each second data) in the target data sets to obtain a data set label. The dataset tab includes a cluster tab for each first data and a cluster tab for each second data.
The second data set is a data set belonging to the target field. Three possible application scenarios are taken as examples below, and the possible cases of the second data set are described.
In the first scenario, the target area may be an area in which the data amount meets a pre-configured condition.
The target domain may be, for example, among the multiple domains corresponding to the source data set, a domain whose sampled data amount meets the first condition, where the sampled data amount denotes the amount of that domain's data sampled from the source data set, i.e., the amount of that domain's data included in the first data set. The second data set may be obtained as follows: the electronic device obtains the data amount of each domain in the first data set and, based on these data amounts, takes the domains whose data amount meets the first condition as target domains and obtains a second data set belonging to those target domains. For example, the first condition may include, but is not limited to: the domain's share of the first data set is below a first data amount threshold, or its data amount is below a second data amount threshold, etc. If the proportion of domain A in the source data set is 5% while its proportion in the sampled first data set is 4% with a data amount of 4, which is below the first data amount threshold of 6% and also below the second data amount threshold of 10, a second data set of domain A can be additionally obtained.
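A minimal sketch of applying the first condition is given below, using the 6% ratio threshold and the count threshold of 10 from the example above; the function and its defaults are illustrative.

```python
def select_target_domains(domain_counts: dict, total: int,
                          ratio_threshold: float = 0.06,
                          count_threshold: int = 10) -> list:
    """domain_counts: data amount of each domain in the sampled first data
    set. A domain whose share or absolute amount falls below the thresholds
    is marked as a target domain for which a second data set is collected."""
    return [domain for domain, n in domain_counts.items()
            if n / total < ratio_threshold or n < count_threshold]
```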
It should be noted that, for domains with little sampled data, additionally obtaining second data sets of those minor domains increases their training samples, enables small domains with little data to learn better translation capability, removes the limitation of scarce data in some domains, and ensures the translation quality of the minor domains, so that the translation quality of the trained machine translation model can reach a high level in every domain.
Scene two, the target domain may be a domain in which the translation quality meets a second condition.
The second data set is obtained as follows: the electronic equipment obtains a second data set for the domains whose translation quality meets the second condition, based on the translation quality of the machine translation model in each domain. For example, the second condition may include, but is not limited to: the translation quality is below a first quality threshold, or the preconfigured target translation quality to be reached is above a second quality threshold, etc. For example, the electronic device may track the translation quality of each domain's data during iterative training; a domain with low translation quality may be set as a target domain, and by additionally acquiring a second data set of that low-quality domain, the training samples of that domain are increased so as to optimize its translation quality. For another example, for some critical domains that must reach a higher translation quality, the translation quality may be further improved by additionally obtaining a second data set of the critical domain to increase its training samples.
For example, during iterative training, domain B whose translation accuracy is below 50% may be set as a target domain, and the data set of domain B additionally acquired for training. For domain C with a target accuracy of 90%, if its translation accuracy during iterative training falls below 90%, domain C may be set as a target domain and its data set additionally acquired for training.
Scene three, the target domain may be the domain to which the data set to be updated corresponds.
The second data set is obtained as follows: the electronic equipment determines the target domains corresponding to the data set to be updated and samples the updated data of those domains to obtain the second data set. For example, a machine translation model may provide translation for 20 domains. For newly added domains M and N, where domain M includes 5000 data items and domain N includes 6000, domains M and N may be set as target domains, and their newly added data sampled, e.g., a data set M of 50 items from domain M's 5000 new items and a data set N of 60 items from domain N's 6000 new items; during clustering, data set M and data set N may be clustered into category 13, corresponding to the dedicated expert module 13.
It should be noted that, the above-mentioned scenarios such as data amount, translation quality, or data update are merely examples, and several possible cases of the second data set are exemplified, and for other possible application scenarios, the target field may be obtained in other ways. For example, in yet another possible scenario, the target domain may also be a user-preconfigured specified domain, and the electronic device obtains the second data set based on the preconfigured specified domain. Of course, different target fields may be configured correspondingly according to different application scenarios, and the present application is only described by way of the above examples, but the method for acquiring the target field, the applicable scenario, and the like are not limited in particular.
In one possible implementation, the electronic device may cluster the first data items and second data items based on their encoded hidden state vectors, using the domain labels of the second data items as clustering support points, and cluster the data belonging to the target domains in the target data set into at least one independent data distribution category, obtaining the cluster labels of the first and second data items. In one possible example, the clustering support points are used to cluster the target domains into separate data distribution categories during clustering. The data of the target domains is clustered into at least one independent category, where an independent category may be understood as a category in which the data of the target domains exceeds a target share threshold. For example, the target domain may include at least one domain; when it includes multiple domains, the electronic device may cluster them into one or more independent data distribution categories. For example, for domains E and F each with fewer than 10 data items, data set E of domain E and data set F of domain F are additionally acquired and clustered into one independent category 13, in which more than 90% of the data belongs to domains E and F.
In one possible example, taking the K-Means clustering algorithm as an example, the electronic device may divide the target data set into K groups, with the data of the target domains as at least one independent group, and select K cluster centers from the K groups, the K cluster centers including at least data belonging to the target domains. The electronic device may assign each data item to the data distribution category of the nearest cluster center based on the vector distance between each data item in the target data set and the K cluster centers, update the cluster center of each data distribution category based on the newly added data in that category, and repeat these steps until a termination condition is reached, obtaining the final data distribution categories, in which the target domains correspond to at least one independent category. For example, because of their small data scale, the data distribution of some minor domains would be absorbed into the data distribution categories of major domains during classification. For minor domains that need attention, the present application can additionally take some source data of those minor domains as clustering support points and inject this mixed set of minor-domain data into the randomly sampled first data set to obtain the target data set. For example, let S_A denote the data feature distribution of the second data set of the target domains; then S_D ≈ (n_r·S_T + n_a·S_A)/(n_r + n_a), where n_r and n_a are the data scales of the first and second data sets respectively (data scale referring to the data amount used to measure the size of a data set), the first data set may be randomly sampled from the training set of the general NMT model, S_D denotes the data distribution of the target data set used to train the data distribution prediction module, and S_T denotes the data feature distribution of the training set of the general NMT model. After two data sets are merged, their data features are fused; if one data set is relatively small and the other relatively large, the feature distribution of the merged data set is biased toward the larger one.
It should be noted that, when the target domain includes multiple domains, the data of the target domains may be clustered into one or more categories; that is, domains and categories need not correspond one-to-one, but a domain can be given a dedicated data distribution category, and since the data distribution categories in the machine translation model correspond one-to-one to the expert modules in the hybrid expert module, a domain can thereby be given a dedicated expert module. For example, for domains E and F each with fewer than 10 items in the first data set, the data belonging to domains E and F in the target data set is clustered into an independent category 2; category 2 corresponds to the dedicated expert module 2, and the trained expert module 2 is subsequently used exclusively to translate the data of category 2, greatly improving the translation quality of domains E and F. Therefore, in mode two, clustering the data of the target domains into independent data distribution categories allows those independent categories to correspond to dedicated expert modules, improving the translation quality of the target domains.
In mode three, the target data set includes the first data set and a second data set.
For mode three, the execution of step 1801 may include the following steps 18011c-18013c.
Step 18011c, the electronic device obtains, based on the trained encoder, a first encoding characteristic of each first data in the first data set.
Step 18012c, the electronic device clusters each first data based on the first coding feature of each first data, to obtain a data set label of the first data set.
Step 18013c, the electronic device uses the domain label corresponding to each second data in the second data set as a data set label of the second data set.
In the third mode, the domain labels of the target domain and the data distribution categories are in one-to-one correspondence, and the electronic device can use each domain in the target domain as an independent data distribution category, that is, the domain label of each data in the second data set is the category label of the data. For example, for the additionally acquired data set E of the domain E, the data set F of the domain F, the data set E belongs to the class 13 corresponding to the domain E, and the data set F belongs to the class 14 corresponding to the domain F.
In one possible scenario, the second data set in mode three may likewise cover three cases. In example one, the electronic device may, based on the data amounts of the respective domains, take the domains whose data amount meets the preconfigured first condition as target domains and acquire the second data set belonging to those domains. In example two, the electronic device may obtain a second data set whose translation quality meets the second condition, based on the translation quality of the machine translation model in each domain. In example three, the electronic device determines the target domains corresponding to the data set to be updated and samples the update data of those domains to obtain the second data set. In the above three cases, the second data set is obtained in the same manner as described for mode two, and is not detailed here.
It should be noted that the manner of extracting the first encoding feature from the first or second data is the same as the process of step 2021 and is not repeated here for the training phase. The clustering of the first data items in step 18012c is similar to the clustering process in step 18012a and is not detailed here. In mode three, the second data set of the target domains is maintained as independent data distribution categories and joined with the data distribution categories obtained by clustering; in other words, on top of the clustering result of the randomly sampled data, the second data set of the target domains is taken as additional data distribution categories, and it is spliced together with the categories from clustering the first data set to obtain the target data set and its data set tag.
In the above three ways, after the target data set used by the training data distribution prediction module is constructed, the training data distribution prediction module may be trained using the target data set in step 1802.
Step 1802, the electronic device trains the data distribution prediction module based on the target data set and the data set label.
In the application, the data set label can be used as a sample truth value label of the target data set, and a prediction result of the target data set is obtained through a data distribution prediction module, wherein the prediction result represents the data distribution category of each data in the target data set; the data distribution prediction module may be trained based on the dataset tag and the prediction result. In the training stage, the data distribution prediction module is used for predicting the probability that each data in the target data set belongs to each data distribution category, and each data distribution category corresponds to at least one field.
In one possible implementation, step 1802 may include the following steps 18021 through 18022.
Step 18021, the electronic device obtains a prediction result of the data distribution prediction module on the target data set.
The electronic equipment can input the coding hidden state vector of each data in the target data set into the data distribution prediction module, and a prediction result is obtained through the data distribution prediction module.
In one possible implementation, the manner in which the data distribution prediction module obtains the prediction result is the same as the process of obtaining the first indication information based on the first coding feature in step 2022, which is not described in detail herein. For example, a 1×512-dimensional coded hidden state vector may be mapped into a score of data in 12 data distribution categories by a mapping operation such as linear mapping or nonlinear mapping.
The prediction result can be obtained by a network structure as shown in fig. 3, for example a 1×12-dimensional score vector l_D, which includes the scores of the data in the 12 data distribution categories.
Step 18022, the electronic device trains the data distribution prediction module based on the dataset label and the prediction result.
For example, the training loss may be obtained by comparing the differences between the dataset labels and the predictions, and iteratively training the data distribution prediction module based on the training loss.
When training the data distribution prediction module, how to obtain the data distribution labels of unlabeled sample data, i.e. the sample labels, is an important problem that the training of the data distribution prediction module urgently needs to solve. In the present application, a target data set is first acquired to construct the sentence set of source data used to train the data distribution prediction module, and a pooling operation is performed on the word-sequence dimension of each sentence's encoded hidden state vector to compress that dimension, e.g. reducing the word-sequence dimension from n to 1 to obtain a 1×512-dimensional vector as the feature vector characterizing the sentence's semantic features. Then, using these feature vectors, the sentences are divided into at least one data distribution category by unsupervised clustering, and the cluster label of each sentence can be used as its ground-truth label.
It should be noted that the data distribution prediction module may be trained on the target data set in a supervised manner. On this basis, the training result of the data distribution prediction module is controllable, and even if the data distribution prediction module, the hybrid expert module and the encoder-decoder module are trained separately and independently, the modules do not affect one another. For example, after the data distribution prediction module is trained, even if the data set is updated, the data distribution category of the updated data set can be determined through the data distribution prediction module, and then the expert module corresponding to that category determined, so that only the expert module corresponding to the updated data set is trained, greatly reducing training cost.
Illustratively, as shown in fig. 19, during the training phase, the encoder module of the reference machine translation model may be used to encode the training data into encoded hidden state vectors (also known as encoded features), and the optimization objective of the multi-class classification task,

    Loss = - Σ_{i=1}^{N} y_i · log(ŷ_i),

is then employed to train the data distribution prediction module, where y_i is the category label, N is the number of categories (i.e. the number of data distribution categories), and ŷ_i is the prediction probability of the data distribution prediction module.
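A sketch of this supervised training step is given below; the optimizer, learning rate and loop structure are assumptions, and the predictor is assumed to output unnormalized class scores so that the standard cross-entropy objective above applies.

```python
import torch
import torch.nn as nn

def train_discriminator(predictor, encoder, loader, epochs: int = 3):
    """Train the data distribution prediction module on cluster labels with
    the multi-class cross-entropy objective, keeping the encoder frozen."""
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)
    encoder.eval()
    for _ in range(epochs):
        for sentences, labels in loader:    # labels: cluster ids
            with torch.no_grad():
                e_out = encoder(sentences)  # frozen encoded features
            logits = predictor(e_out)       # unnormalized class scores
            loss = loss_fn(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```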
After obtaining the trained data distribution prediction module through the step 1802, network parameters of the data distribution prediction module may be further fixed to train each candidate domain converter.
Step 1803, the electronic device trains corresponding domain converters in the candidate domain converters based on the trained data distribution prediction module to obtain a machine translation model.
Each candidate domain converter corresponds to at least one domain.
The electronic device may use the training data set as a sample to train each candidate domain converter. The training data set may be source data to be translated, and the training data set corresponds to a translation truth data set as a sample truth. For example, the translation requirement is to translate chinese sentences into corresponding english, and the training data set may include a large number of chinese sentences, each chinese sentence corresponding to an english sentence as a sample truth value.
The electronic equipment can obtain a prediction result of the training data set through a trained data distribution prediction module, and the prediction result represents the data distribution category of each data in the training data set; determining a target domain converter corresponding to each data based on the corresponding relation between the data distribution category and each expert; and obtaining translation results of the data based on the target domain converters corresponding to the data, and training the candidate domain converters based on the translation results of the data and the translation truth value data. In one possible implementation, the performing of step 1803 may include the following steps 18031 to 18034.
Step 18031, the electronic device obtains a prediction result of the training data set based on the trained data distribution prediction module.
In one possible implementation manner, the electronic device obtains the coding hidden state vector of the training data set based on the trained encoder, and performs data distribution prediction on the training data set based on the coding hidden state vector through the trained data distribution prediction module to obtain a prediction result of the training data set. Illustratively, the prediction results characterize data distribution categories of individual data in the training dataset. For example, the prediction result may include a score for each data in the training data set in at least one data distribution category.
It should be noted that, the implementation of obtaining the prediction result of the training data set by the trained data distribution prediction module is the same process as the process of obtaining the prediction result of the data distribution prediction module on the target data set in the step 1802, which is not described in detail herein.
Step 18032, the electronic device determines, based on the prediction result of the training data set, a target domain converter corresponding to each data in the training data set in each candidate domain converter.
In this step, the electronic device may determine the target domain converter corresponding to each data, with a datum as the unit, so that the translation result can subsequently be obtained using that target domain converter. Alternatively, the electronic device may further refine the granularity of expert division; for example, a target domain converter may be determined for each target segment in the output sequence of the translation result to be output, so that each target segment can later be obtained using its own target domain converter. For example, the data may be a Chinese sentence, the translation result the corresponding English sentence, and the target segments the English words included in the English sentence. Thus the electronic device may directly determine a sentence-level domain converter; alternatively, it may determine a word-level domain converter corresponding to each English word in the output sequence of the English sentence. Accordingly, this step may be implemented in the following two ways.
In the first mode, the electronic device may determine, based on the data distribution category of each data in the training data set represented by the prediction result, the candidate domain converter corresponding to each data from the correspondence between data distribution categories and candidate domain converters. For example, a candidate domain converter may be configured for each data distribution category among the initial candidate domain converters, and the correspondence between categories and converters recorded, so that the corresponding candidate domain converter can be selected based on the data distribution category of the data.
In one possible embodiment, the prediction result includes a score of each data in the training data set in at least one data distribution category. The electronic device may determine a data distribution category for each data based on the score of each data in at least one data distribution category.
In one possible embodiment, based on the score of each data in at least one data distribution category, the electronic device may directly take the data distribution category whose score meets a target score condition as the data distribution category of the data, and determine the target domain converter corresponding to the data based on that category. For example, the target score condition may include, but is not limited to: the data distribution category with the highest score; the category corresponding to any score within the first target number of positions in the descending score sequence; or a category that is within the first target number of positions and whose score is not lower than a second value; etc. For example, the highest-scoring data distribution category is taken as the data distribution category of the data.
In yet another possible embodiment, noise may be applied to the prediction result, and the data distribution category determined from the noise-processed result. In a possible implementation, step 18032 may include: the electronic device determines a probability vector for each data based on the prediction result, and performs noise processing on the probability vector to obtain the target domain converter corresponding to the data. Illustratively, the probability vector of any data characterizes the probability that the data belongs to each of at least one alternative data distribution category; for example, noise may be added to the probability vector of each data, and the domain converter corresponding to the category with the highest probability selected based on the noise-added probability vector.
The electronic device may filter at least one score in the predicted outcome, remap to a probability vector and add noise. In one possible implementation, the performing of step 18032 may include the following steps A1 to A5:
and A1, for any data in the training data set, the electronic equipment screens out at least one score meeting a preset condition from a prediction result of the data to obtain a target score vector of the data.
The target score vector includes the scores of the data in at least one alternative category. Illustratively, the preset conditions may include, but are not limited to: the scores within the first target number of positions in the descending score sequence; the scores within the second target number of positions whose value is higher than a third target value; etc. Taking the scores within the first target number of positions in the descending score sequence as an example, the electronic device may obtain the target score vector of each data by the following formula eight:
Formula eight: l' = topk'(l);
where l represents the data distribution scoring result and may be represented as a prediction score vector; for example, if the training data set includes 1000 data and the data distribution prediction module obtains scores for each datum over 12 cluster categories, l may be represented as a 12×1000 prediction score vector. l' represents the target score vector, and formula eight denotes screening out the top k' highest scores from l; for example, if the 3 highest scores are chosen, l' is a 3×1000 target score vector.
And A2, the electronic equipment maps the target score vector into a probability vector of the data.
The electronic device may derive the probability of the data in each data distribution category from the magnitude of each score in the data's target score vector: the larger the score, the larger the probability. For example, each score in the target score vector may be normalized to a probability value no greater than 1; e.g., the scores of the data in the data distribution categories may be presented as probabilities via the normalized exponential function softmax. The electronic device may map the target score vector to the probability vector of the data by the following formula nine:
Formula nine: p = softmax(l'/τ);
where l' represents the target score vector and p the probability vector; if l' is a 3×1000 target score vector, p is likewise a 3×1000 probability vector. τ is a hyperparameter used to control the distribution of probabilities in the probability vector; its magnitude may be preconfigured and its value set as needed. The larger the hyperparameter, the smaller the differences among the probabilities in the probability vector of the same data; the smaller the hyperparameter, the closer that probability vector is to a one-hot distribution.
And A3, adding noise into the first probability vector of the data by the electronic equipment to obtain a second probability vector.
For convenience of distinction, the probability vector before noise addition is referred to as a first probability vector, and the probability vector after noise addition is referred to as a second probability vector. The first probability vector includes probabilities of the data being in at least one alternative category; the second probability vector includes noise probabilities for the data in at least one alternative category.
Illustratively, the electronic device may add Gumbel noise to the first probability vector to obtain the second probability vector by the following formula ten:
Formula ten: G(p) = log(p) + g;
where g represents added noise obeying the Gumbel distribution, G(p) represents the second probability vector after adding noise, and p represents the first probability vector. After noise is added, the data distribution of G(p) is no longer fixed, which gives randomness to the target domain converter determined from the data distribution category obtained via G(p). For example, if the first probability vector of data a is (0.41, 0.39, 0.08), where 0.41, 0.39, and 0.08 are the probabilities that data a belongs to category 1, category 2, and category 5, respectively, then in the second probability vector after adding noise, the noise probability of data a belonging to category 1 may still be greater than that of category 2, or it may be smaller; for example, over the current 80 training iterations, data a may be assigned 42 times to expert 1 corresponding to category 1 and 37 times to expert 2 corresponding to category 2.
Noise processing of the probability vector increases the randomness of the resulting distribution. In particular, when the probabilities of the data in at least two categories differ only slightly, a category with the smaller probability value among them may still be chosen as the category of the data, so the data is routed to that category's expert module. In this way, data whose category is ambiguous can be distributed evenly across different expert modules, increasing the training data of the expert modules for closely scored categories, improving the overall robustness of the model, and thus improving translation quality.
And A4, the electronic equipment determines a target category in the at least one alternative category based on the second probability vector.
For the noise probability of at least one candidate class in the second probability vector, the electronic device may select, from among the at least one candidate class, the candidate class with the largest noise probability as the target class.
And A5, the electronic equipment determines that the candidate domain converter corresponding to the target category is the target domain converter corresponding to the data based on the corresponding relation between the data distribution category and the candidate domain converter.
The electronic device may be preconfigured with a one-to-one correspondence between data distribution categories and candidate domain converters; for example, 12 categories correspond to 12 candidate domain converters. The target class may be obtained, for example, by the following formula eleven:
Formula eleven: c = argmax(G(p));
where c represents the finally determined expert module and may be the index of the finally determined target domain converter; argmax(G(p)) selects the category with the largest noise probability, whose corresponding candidate domain converter is chosen.
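The following sketch strings steps A1 to A5 together under stated assumptions (one datum, five candidate categories, k' = 3, τ = 1): top-k' screening as in formula eight, a temperature softmax as in formula nine, Gumbel noise as in formula ten, and the noisy argmax of formula eleven.

```python
import torch

def assign_expert(scores: torch.Tensor, k: int = 3, tau: float = 1.0) -> int:
    # A1: keep the k highest scores as the target score vector l'.
    top_scores, top_categories = scores.topk(k)
    # A2: map l' to a probability vector p; tau controls how flat p is.
    p = torch.softmax(top_scores / tau, dim=-1)
    # A3: G(p) = log(p) + g, with g sampled from a Gumbel distribution.
    g = -torch.log(-torch.log(torch.rand_like(p)))
    noisy = torch.log(p) + g
    # A4/A5: the alternative category with the largest noise probability
    # indexes the target domain converter (one converter per category).
    return top_categories[noisy.argmax()].item()

scores = torch.tensor([0.1, 2.3, 2.2, -0.5, 0.7])  # one datum, 5 categories
expert = assign_expert(scores)  # may pick category 1 or 2 across calls
```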
In order to assign a sentence to the most suitable target domain converter, the data distribution category to which the sentence belongs needs to be determined. One simple strategy is to select the category with the greatest score in the distribution l_D. However, since a sentence can be regarded as drawn from a mixture of several domains, its classification may be ambiguous. During the candidate domain converter training phase, the applicant found through research that a greedy strategy directly selects the category with the greatest score: for example, if a certain datum scores higher in category 1 than in category 2, the greedy strategy selects category 1 even when the score of category 2 is close to that of category 1, i.e., when the probability that the datum belongs to category 2 is close to the probability that it belongs to category 1; category 2 is simply ignored. Consequently, for categories whose scores are close to, but lower than, the maximum score, sentences are never assigned to those categories' candidate domain converters, which shrinks the converters' training sets to a certain extent. By adding Gumbel noise during training and taking the maximum of the noise probabilities when allocating candidate domain converters, sentences of ambiguous category may be allocated to different candidate domain converters according to the noise probability. In this way, ambiguous sentences can be divided among several suitable candidate domain converters, enlarging, enriching, and balancing the training sets of the candidate domain converters, so that they are trained more robustly; the overall robustness of the model is improved, and therefore its translation quality.
In the second mode, the translation result corresponding to each data includes at least one target segment corresponding to the data, and the target domain converter corresponding to each data includes the target domain converter corresponding to each target segment. For example, step 18032 may include the following step B1:
and B1, for each piece of data, the electronic equipment determines a target domain converter corresponding to each target segment through an expert selector based on a prediction result of the data and second indication information of each target segment corresponding to the data.
The second indication information characterizes the likelihood of each candidate domain converter as a target domain converter of the target segment. The second indication information may be a second score vector, for example.
In one possible implementation, the hybrid expert module includes an expert selector for providing the expert modules corresponding to the respective target segments, in the target language, of each data; the translation result of each data includes at least one target segment, and a target segment may be part of the data in the to-be-output translation result sequence. For example, the data may be a Chinese sentence, the translation result the corresponding English sentence, and a target segment an English word or phrase in the English sentence; accordingly, the expert module corresponding to the data may be a sentence-level expert module, and the expert module of a target segment a word-level expert module. In the following steps, sentence-level and word-level expert modules are used as examples, without limiting the target segments or their corresponding expert modules; for another example, the data may be a paragraph including multiple Chinese sentences, the translation result a corresponding segment of English text, and a target segment an English phrase or sentence; accordingly, the expert module corresponding to the data may be a segment-level expert module, and the expert module of a target segment a sentence-level expert module.
In one possible manner, the expert selector stores a prototype database holding the hidden-state center points of each expert module; the hidden-state center point of an expert module and the decoding hidden state of a target segment can be used to determine the target domain converter most similar to each target segment. Illustratively, the implementation of step B1 may include the following steps C1 and C2:
and C1, for each target segment, the electronic equipment obtains second indication information corresponding to the target segment through the expert selector based on the similarity between the target decoding characteristics corresponding to the target segment and the domain characteristic vectors of the candidate domain converters respectively.
The second indication information includes a score of the target segment at the at least one expert module. For example, the second indication information may be a second score vector for the target segment.
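As one possible reading of step C1, the sketch below scores every candidate domain converter by the cosine similarity between the segment's target decoding feature and each expert's center point from the prototype database; the similarity measure and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def word_level_scores(decode_feature: torch.Tensor,
                      center_points: torch.Tensor) -> torch.Tensor:
    # decode_feature: (512,) target decoding feature of the segment.
    # center_points:  (n_experts, 512) hidden-state center points.
    return F.cosine_similarity(decode_feature.unsqueeze(0),
                               center_points, dim=-1)   # (n_experts,)

prototype_db = torch.randn(12, 512)       # illustrative center points
h_i = torch.randn(512)                    # decoding feature of segment i
second_indication = word_level_scores(h_i, prototype_db)  # 12 scores
```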
And C2, the electronic equipment integrates the prediction result of the data and the second indication information corresponding to the target segment to obtain the target domain converter corresponding to the target segment.
In the step, the prediction result and the second indication information are integrated to obtain third indication information; the process is the same as the step S4 and the step S6, and will not be described in detail here. The prediction result may include scores of respective candidate domain converters as target domain converters. For example, a prediction result of the data may be predicted by the data distribution prediction module, and the prediction result may include a score of the data in an expert module corresponding to at least one data distribution category.
The electronic device may integrate the prediction result of the i-th target segment with the second indication information to obtain the final third indication information used when predicting the i-th target segment w_i, by the following formula twelve:

Formula twelve: P̂_{w_i} = interaction(P_s, P_t);

where P̂_{w_i} represents the third indication information corresponding to the target segment w_i, and interaction represents the integration function. For example, the word-level score at the k-th expert module may be expressed as P_t = Sim([H_{out,1~i}], DS_k); the sentence-level score at the k-th expert module may be expressed as P_s(Ept = k | E_out); and the third indication information at the k-th expert module may be expressed as P̂(Ept_i = k) = interaction(P_s, P_t).
For example, the third indication information may include an integration score of the i-th target segment for at least one expert module. The electronic device may filter the at least one integration score, remap it into a probability vector, and add noise; e.g., the electronic device may select the final expert module for the i-th target segment based on the third indication information using a procedure similar to steps A1 to A5 of step 18032.
Illustratively, the integration function may be as shown in the following formula thirteen:

Formula thirteen: interaction(P_s, P_t) = (1 − α_t)·P_s + α_t·P_t, where α_t ~ N(t/T, σ²);

where N() represents a Gaussian function, t represents the t-th target segment currently to be translated, and T represents the total number of target segments included in the translation result. In formula thirteen, t/T is the mean of the Gaussian function and σ² is its variance.
The electronic device may set the Gaussian function according to the mean and variance, take a random value according to the probability distribution of the Gaussian function when computing the third indication information of the i-th target segment, and determine the expert module corresponding to the target segment based on the third indication information.
In the training phase, the ratio between the word-level score P_t and the sentence-level score P_s obtained via the Gaussian function N() is perturbed by the variance, so that the model can adapt to small fluctuations in the integration score during training, enhancing the robustness of the result and further improving the accuracy and efficiency of model training.
In the prediction stage, i.e., once the machine translation model is trained, the variance of the Gaussian function can be set to zero to eliminate the randomness, so that a stable expert score is obtained and translating the same data multiple times does not produce different translation results.
The trend of the value α of the integration function during the training phase (α = interaction_t) is shown in fig. 20. According to the integration function, when t = 0, i.e., when computing the expert second indication information of the first word, the integration function is biased toward directly using the sentence-level second indication information; when t = T, i.e., when computing the expert second indication information of the T-th word, it is biased toward directly using the word-level second indication information; when 0 < t < T, the integration function considers both the sentence-level and the word-level second indication information in forming the final second indication information.
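A hedged sketch of one such integration follows; the linear blend and the clamping of the weight into [0, 1] are assumptions, with only the Gaussian perturbation around t/T and the zero-variance behavior at inference taken from the text.

```python
import torch

def integrate(p_s: torch.Tensor, p_t: torch.Tensor,
              t: int, T: int, sigma: float = 0.1,
              training: bool = True) -> torch.Tensor:
    # Weight drawn from a Gaussian whose mean t/T grows with the step,
    # so early segments lean on P_s and late segments on P_t.
    alpha = torch.normal(mean=torch.tensor(t / T),
                         std=torch.tensor(sigma if training else 0.0))
    alpha = alpha.clamp(0.0, 1.0)
    return (1 - alpha) * p_s + alpha * p_t  # third indication information

p_s = torch.rand(12)   # sentence-level expert scores (12 experts)
p_t = torch.rand(12)   # word-level expert scores for the current segment
noisy = integrate(p_s, p_t, t=3, T=10)                   # training phase
stable = integrate(p_s, p_t, t=3, T=10, training=False)  # zero variance
```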
In the training stage, the method for obtaining the target decoding features may refer to the process from step S1 to step S2, which is not described herein.
In one possible implementation, after the electronic device trains the data distribution prediction module, the electronic device may also construct a prototype database stored by the expert selector using the trained encoder, decoder, and data distribution prediction module. The prototype database is constructed by the following steps E1 to E3:
and E1, the electronic equipment translates each third data in the third data set into corresponding fourth data based on the trained encoder and decoder, and obtains a decoding hidden state vector corresponding to each data segment in each fourth data to obtain a decoding hidden state vector set.
Illustratively, the decoding hidden state vector corresponding to each data segment is the one employed when translating and outputting that segment. For each third data, when it is translated into its corresponding fourth data, the translation of each data segment of the fourth data may include: the electronic device decodes, through the trained decoder, the feature vectors of the data segments already translated and output together with the encoded hidden state vector of the third data, obtains the decoding hidden state vector corresponding to the current data segment, and translates and outputs the corresponding data segment based on that vector. In this way, the electronic device can obtain the decoding hidden state vector employed when each data segment is translated and output.
E2, the electronic equipment builds a mapping relation between the field and the expert module based on the trained data distribution prediction module, and determines the expert module corresponding to each decoding hidden state vector in the decoding hidden state vector set based on the mapping relation and the field label of each data segment in each fourth data.
Data sets from various domains are used when training the data distribution prediction module and are clustered to obtain the data distribution category of each data; each data distribution category corresponds to an expert module, for example 12 data distribution categories corresponding one-to-one to 12 expert modules, and, for example, the IT domain may map to expert 1 and expert 2. When a domain corresponds to multiple experts, a word of that domain is randomly mapped to one of its corresponding experts: e.g., the word k-means belongs to the IT domain, whose corresponding experts are expert 1 and expert 2, so k-means is labeled as corresponding to expert 1 with 50% probability and to expert 2 with 50% probability. The electronic device can thus determine the data distribution category of each data segment from its domain label and obtain the expert module corresponding to the segment, thereby collecting a large number of decoding hidden states for each expert module.
And E3, the electronic equipment determines the hidden state center point of each expert module based on the decoding hidden state vector corresponding to each expert module.
For example, for a plurality of decoding hidden state vectors corresponding to each expert module, the electronic device may cluster the decoding hidden state vectors to obtain a hidden state center point of the expert module. Each expert module corresponds to one or more hidden state center points.
As shown in fig. 21, the prototype database construction flow is as follows from step 1 to step 3:
Step 1: mark the corresponding domain of the words at the decoding end of the data set;
Step 2: for a word with a marked domain, mark the decoding hidden state vector q = f(x, y_{1:i-1}) used when predicting it with the same domain as the word, where x is the input sentence at the encoding end and y_{1:i-1} denotes the decoding hidden states corresponding to the 1st to (i-1)-th words already output at the decoding end;
Step 3: after enough decoding hidden state vectors are collected, determine the decoding hidden state set corresponding to each expert module based on the domain-to-expert correspondence constructed by the data distribution prediction module, compute each expert module's center point, and store it in the prototype database.
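The following sketch shows, under assumptions (random stand-ins for the decoder states q = f(x, y_{1:i-1}) and two center points per expert), how the decoding hidden states collected for each expert module could be clustered into the center points stored in the prototype database.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Expert id -> decoding hidden state vectors labeled with that expert.
collected = {1: rng.normal(size=(200, 512)),
             2: rng.normal(size=(150, 512))}

prototype_db = {}
for expert_id, states in collected.items():
    # One or more center points per expert module; two per expert here.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(states)
    prototype_db[expert_id] = km.cluster_centers_       # (2, 512) each
```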
Step 18033, the electronic device obtains a translation result corresponding to each data in the training data set based on the expert module corresponding to each data.
In this step, the electronic device may directly use the expert module corresponding to the data to obtain the translation result; alternatively, it may further obtain the expert module corresponding to each target segment of the translation result output sequence and use each segment's expert module to obtain the corresponding target segment. For example, the data may be a Chinese sentence, the translation result the corresponding English sentence, and the target segments the English words included in the English sentence. In one possible way, the electronic device directly uses the sentence-level expert module to obtain the translation result; in another, it determines the word-level expert module corresponding to each English word in the output sequence of the English sentence and obtains the translation result using each word-level expert module.
Illustratively, step 18033 may include the following two approaches.
In the first mode, the electronic equipment directly utilizes the expert module corresponding to the data to obtain the translation result. The execution of this method is the same as that of step 2031, and will not be described here again.
In the second mode, the expert module corresponding to each data includes an expert module corresponding to each target segment. The execution of this method is the same as that of step 2032, and will not be described here again.
Step 18034, the electronic device trains the candidate domain converters based on the translation truth data set of the training data set and the translation result.
For example, a training penalty may be calculated based on the translation truth data set of the training data set and the translation result, and each candidate domain converter may be iteratively trained based on the training penalty to obtain trained each candidate domain converter, thereby ultimately obtaining a trained machine translation model.
In one possible embodiment, a training process for the codec module is provided: the electronic device may obtain a sample data set, input it into the encoder to obtain the encoded hidden state vectors of the sample data set, decode these by the decoder to obtain the decoding hidden state vectors, obtain the translation result of the sample data set based on the decoding hidden state vectors, and train the codec module. The sample data set includes a source data set to be translated and the corresponding translation truth data set. For example, bilingual parallel corpora from different sources may constitute the sample data set used for training the codec module. Illustratively, the sample data set S_T may be expressed as S_T = {(S_i, λ_i)}, where i denotes the i-th source corpus, S_i the data feature distribution of the i-th source corpus, and λ_i the mixing weight of the i-th source corpus among the multiple corpora. Illustratively, if the multiple corpora are mixed randomly, λ_i may be proportional to the data size of the corresponding i-th source corpus. Illustratively, if the electronic device receives a translation request x = (x_1, …, x_n), x may first be converted by the encoder into the encoded hidden states h = (h_1, …, h_n); the decoder then decodes the encoded hidden state vector h in a self-looping (autoregressive) manner to obtain the final output result y = (y_1, …, y_m).
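A toy version of this encode-then-autoregressively-decode flow is sketched below; the Transformer stand-ins, sizes, start token, and fixed output length are assumptions, and a real decoder would also apply a causal mask and stop at an end-of-sequence token.

```python
import torch
import torch.nn as nn

vocab, d = 1000, 512
embed = nn.Embedding(vocab, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 8, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d, 8, batch_first=True), num_layers=2)
project = nn.Linear(d, vocab)

x = torch.randint(0, vocab, (1, 9))       # translation request x_1..x_n
h = encoder(embed(x))                     # encoded hidden states h_1..h_n
y = torch.zeros(1, 1, dtype=torch.long)   # assumed start-of-sequence id 0
for _ in range(20):                       # self-looping (greedy) decoding
    out = decoder(embed(y), h)            # decoding hidden states
    next_token = project(out[:, -1]).argmax(-1, keepdim=True)
    y = torch.cat([y, next_token], dim=1) # y grows to y_1..y_m
```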
The expert selector is used to score word-level expert modules for each translation output at the output end using the decoder's decoding hidden state; this score is integrated with the sentence-level expert module score computed by the data distribution prediction module; finally, a word-level expert module is dynamically assigned to the translation request based on the overall score. That is, the expert selector module implements a word-level expert selection policy (the expert assigned to each output may differ, i.e., the word-level expert module corresponding to each word may be different). In particular, when the expert selector relies only on the sentence-level expert module obtained by the data distribution prediction module, its expert selection strategy degrades into a sentence-level strategy (the same sentence-level expert module is assigned to every output).
As shown in fig. 22, in the training phase of the hybrid expert module, the hybrid expert module may be trained, on the domain-mixed machine translation data set, with the optimization objective of the machine translation task given by the following formula fourteen:

Formula fourteen: L = −∑_{t=1}^{N} log p_θ(y_t | y_{<t}, x);

where θ denotes all trainable parameters in the network, for example the trainable parameters of the hybrid expert module; y_t denotes the translation output of step t and y_{<t} all outputs before step t; for example, y_t is the t-th English word currently to be translated and output, and y_{<t} are the first t−1 English words already translated and output. x denotes the translation input, such as a Chinese sentence; N denotes the total length, or total number of steps, of the translation output, such as the total number of words of the corresponding English sentence.
In a possible embodiment, after the trained machine translation model is obtained based on the model training method in steps 1801-1803, when there is a data update in a certain training set, for example, a batch of data is newly collected, a partial update mode may be used to reduce the training cost and update cost of the model. Wherein the device to be updated may be an electronic device performing the process of steps 1401-1403.
Illustratively, after step 1803, the model may be partially updated by performing the following steps (1) through (3), and updating the relevant model parameters of the device in which the model is deployed.
And (1) the electronic equipment acquires a first category corresponding to the updated data set in at least one data distribution category based on the trained data distribution prediction module.
The electronic device may perform data distribution prediction on the updated data set through the data distribution prediction module to obtain a prediction result of the updated data set; the prediction result characterizes the data distribution category of each data in the updated data set. Based on the prediction result, the electronic device may determine the category, among the at least one data distribution category, that meets a most-relevant condition as the first category. By way of example, the most-relevant condition may include: the amount of data in the updated data set belonging to the category exceeds a first data amount threshold. For example, as shown in fig. 23, the reference machine translation model, comprising the encoder and decoder, and the data distribution prediction module are fixed; the encoded hidden state vector is obtained through the reference machine translation model and input into the data distribution prediction module to obtain the first category of the updated data set. When more than 90% of the data in updated data set A belong to category 1 and category 2 of the 12 data distribution categories, with the remaining 10% spread over categories 3 to 12, category 1 and category 2, to which 90% of the data belong, are taken as the most relevant first categories of the updated data set.
And (2), the electronic device trains, based on the trained data distribution prediction module and the updated data set, the first expert module corresponding to the first category in the hybrid expert module, to obtain third updated data.
The electronic device determines the first expert module corresponding to the first category based on the correspondence between data distribution categories and expert modules, fixes the network parameters of the trained data distribution prediction module and codec module, and trains the first expert module on the updated data set; the third updated data of the trained first expert module is then obtained and may include the model parameters of the trained first expert module. As shown in fig. 24, if the first category of the updated data set is category 2, corresponding to expert 2, then expert 2 is trained to obtain the third updated data corresponding to the trained expert 2.
And (3) the electronic equipment sends the third updating data to the equipment to be updated so that the equipment to be updated updates the first expert module based on the third updating data.
For example, the device to be updated is a device on which the machine translation model is deployed; when the user's offline model is updated, only the model parameters of the first expert module may be sent to the device to be updated, which then updates the model parameters of the first expert module to the third updated data.
It should be noted that when a data set of a certain domain is updated, the first category most relevant to the updated data set can be found through steps (1) to (3), yielding the most relevant first expert module. The model parameters of the codec module and the data distribution prediction module can then be fixed, and only the most relevant first expert module in the hybrid expert module trained, so that the designated module is trained according to the category of the data set, greatly reducing training cost. Meanwhile, since the expert modules are decoupled, training only the most relevant first expert module has no influence on the translation quality of the other, unrelated expert modules, avoiding the translation quality regression of other expert modules and other domains seen in the related art.
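The sketch below illustrates this freezing scheme with linear layers standing in for the real modules (an assumption): the codec and the data distribution prediction module keep fixed parameters, and only the matched first expert module remains trainable.

```python
import torch.nn as nn

def partial_update_params(encoder, decoder, predictor, experts, category):
    for module in (encoder, decoder, predictor):
        for p in module.parameters():
            p.requires_grad = False        # fixed network parameters
    # e.g. first category 2 -> expert 2; only its parameters are trained,
    # and its final weights form the third updated data to be shipped.
    return list(experts[category].parameters())

experts = {k: nn.Linear(512, 512) for k in range(12)}  # stand-in experts
trainable = partial_update_params(nn.Linear(512, 512), nn.Linear(512, 512),
                                  nn.Linear(512, 12), experts, category=2)
```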
In addition, in the offline model updating stage, only the model parameters of the first expert module are sent to the equipment to be updated, so that the network consumption cost is greatly reduced, the updating cost of the user is reduced, and the practicability of model training is improved.
In another possible embodiment, after obtaining the trained machine translation model based on the model training method in steps 1801-1803, when a data set of a data class is newly added, an expert module needs to be additionally applied to the data set of the newly added class. Illustratively, following step 1803, steps (4) through (7) may be performed to partially update the model and update the relevant model parameters of the device in which the model is deployed.
And (4) training the data distribution prediction module by the electronic equipment based on the target data set and the newly added type data set to obtain first updated data.
The newly added category differs from the data distribution category of any data in the target data set. For example, if the number of data distribution categories of the current data distribution prediction module is N, the new category may be marked as N+1, and the data set of the new category added to the constructed target data set of the data distribution prediction module, yielding the updated target data set. The electronic device retrains the data distribution prediction module on the updated target data set to obtain the first updated data, which includes the model parameters of the retrained data distribution prediction module.
It should be noted that, because the supervised training method makes the training result controllable, this retraining amounts only to adding one data distribution category, i.e., the number of data distribution categories changes from N to N+1; the influence on the data distribution prediction module's prediction ability for categories 1 to N is small and can be ignored.
And (5) adding a second expert module corresponding to the new category into the mixed expert module by the electronic equipment.
The electronic equipment adds a second expert module in the mixed expert module, and establishes a corresponding relation between the second expert module and the newly added category.
And (6) training the second expert module by the electronic equipment based on the newly added class data set to obtain second updated data.
The electronic equipment can fix the network parameters of the trained data distribution prediction module and the coding and decoding module, and train the second expert module based on the data set of the newly added class; and obtaining second updated data for the trained second expert module, the second updated data may include model parameters for the trained second expert module.
And (7) the electronic equipment sends the first updating data and the second updating data to the equipment to be updated so that the equipment to be updated updates the data distribution prediction module based on the first updating data and adds a second expert module in the machine translation model based on the second updating data.
For example, when the user updates the offline model, only the model parameters of the second expert module and the model parameters of the data distribution prediction module may be sent to the device to be updated. The device to be updated updates the model parameters of the data distribution prediction module to first updated data, and adds a second expert module in the machine translation model based on the second updated data.
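A schematic sketch of the module bookkeeping in steps (4) to (7), again with linear layers standing in for the real modules, might look as follows; only the widened predictor and the freshly initialized expert would be trained and shipped.

```python
import torch.nn as nn

n_categories = 12
experts = {k: nn.Linear(512, 512) for k in range(n_categories)}

# Step (4): retrain the predictor with N+1 output categories; its new
# weights form the first updated data.
predictor = nn.Linear(512, n_categories + 1)

# Step (5): add the second expert and map the new category N+1 to it.
experts[n_categories] = nn.Linear(512, 512)   # freshly initialized

# Step (6): train experts[n_categories] alone on the new-category data;
# step (7): send the predictor weights (first updated data) and the new
# expert's weights (second updated data) to the device to be updated.
```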
In one possible manner, the electronic device may further obtain a set of decoding hidden state vectors of the data set of the new class, update the prototype database based on the set of decoding hidden state vectors, and correspondingly update an expert selector local to the device to be updated. The process may include the following steps F1 to F3:
step F1, the electronic equipment acquires a decoding hidden state vector set corresponding to the data set of the newly added class;
step F2, the electronic equipment determines the hidden state central point of the second expert module based on the decoding hidden state vector set, and updates the hidden state central point of the second expert module to a prototype database;
and F3, the electronic equipment sends fourth updating data to the equipment to be updated so that the equipment to be updated updates the prototype database in the expert selector based on the fourth updating data.
For example, the set of decoding hidden state vectors corresponding to the new added category includes decoding hidden state vectors corresponding to data segments of each data in the data set. In addition, the electronic device may further send fourth update data to the device to be updated, where the fourth update data may include a hidden state center point of the second expert module, so that the device to be updated updates the locally stored expert selector, such as updating a prototype database in the expert selector, based on the fourth update data.
It should be noted that, in the model maintenance stage, when a new domain or new class of data set is collected, a domain or class needs to be newly added, and the corresponding expert module needs to be added. One possible update procedure is as follows:
as shown in fig. 25, the updating of the data distribution prediction module includes: step 1, adding a data set of a new field or a new added category to a training set of a data distribution prediction module; step 2, calculating parameters of the data distribution prediction module in the new field; for example, when the data distribution prediction module employs k-means clustering, this parameter refers to the category center point. For example, when GMM (gaussian mixture model) is employed, this parameter refers to a gaussian parameter of a class. And step 3, updating the clustering result, such as updating the data distribution category. And 4, retraining a data distribution prediction module based on the updated clustering result, such as retraining a multi-classification model of the data distribution category used for acquiring the data in the data distribution prediction module.
As shown in fig. 26, for updating of the prototype database, including: calculating a decoding hidden state vector set corresponding to the data set in the new category or the new field; and calculating the hidden state center point of the newly added second expert module based on the decoded hidden state vector set of the data set, and updating the hidden state center point of the second expert module into a prototype database.
As shown in fig. 27, for a data set of a new category, the data distribution prediction module may first be trained through steps (4) to (7) above. An additional expert module is then applied in the hybrid expert module, recorded as the second expert module corresponding to category N+1 (e.g., expert N+1); parameters are initialized for the newly added second expert module, and it is trained alone on the newly added category data set, greatly reducing training cost and improving training efficiency. Meanwhile, since the expert modules are decoupled from one another, the translation quality of expert modules other than the second expert module is unaffected, avoiding the translation quality regression of other expert modules and other domains seen in the related art.
In addition, in the offline model updating stage, only the model parameters of the newly added expert module and the model parameter data of the data distribution prediction module are sent to the device to be updated, and only the data distribution prediction module and the newly added expert module are updated on the user device, greatly reducing network consumption cost, lowering the user's update cost, and improving the practicability of model training. Furthermore, the application provides expert modules corresponding to target segments; for example, for a sentence-level translation request, a word-level expert selection strategy is further provided, effectively handling the case where one translation request spans multiple domains and further improving the quality of multi-domain machine translation.
Based on the model training method provided in this application, a machine translation model based on a decoupled hybrid expert architecture is trained, reducing training and update cost and alleviating the problem of poor multi-domain machine translation quality; compared with the traditional hybrid expert model, it is easier to train, and parts of its modules are easier to update. Compared with other models, the model obtained by this training method can significantly improve in-domain translation quality; as the comparison of results without domain data shows, even in the absence of domain data the model can still slightly improve domain translation quality.
Fig. 28 is a schematic diagram of the network structure of a machine translation model provided in the present application. As shown in fig. 28, the machine translation model is based on a decoupled hybrid expert architecture and may include: a coding and decoding module, a data distribution prediction module, and a hybrid expert module; the data distribution prediction module may be a classifier as shown in fig. 28. The data distribution prediction module is an independently trained module outside the decoder and is used to determine the data distribution category of the input data. The hybrid expert module may include n expert modules: expert 1, expert 2, …, expert n. The coding and decoding module comprises an encoder and a decoder; the encoder extracts semantic features of the input data, and the decoder decodes the encoded features.
The encoder and decoder each include several levels. The word vectors of the information to be translated are input to the encoder, and the encoded hidden state vector is obtained through the encoder's encoding levels. The encoded hidden state vector is input into the classifier to obtain the data distribution category of the information to be translated, and the expert module corresponding to the information to be translated is obtained based on that category. In this application, data distribution categories correspond one-to-one to expert modules. Based on the layers of the classifier, the encoded hidden state vector is processed in turn by a pooling operation, a linear transformation, a tanh activation, and another linear transformation to obtain the data distribution category of the information to be translated. When training the hybrid expert module, noise can be added to the probability vector over each data's distribution categories, for example Gumbel noise added via Gumbel-max sampling, to improve the translation quality of each trained expert.
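A minimal sketch of this classifier path, using the document's example sizes (512-dimensional hidden states, 12 categories) but an otherwise assumed architecture, is:

```python
import torch
import torch.nn as nn

class DistributionClassifier(nn.Module):
    def __init__(self, d_model: int = 512, n_categories: int = 12):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)
        self.fc2 = nn.Linear(d_model, n_categories)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_words, d_model) encoded hidden states of one sentence.
        pooled = h.mean(dim=0)             # pooling over the word sequence
        return self.fc2(torch.tanh(self.fc1(pooled)))  # category scores

scores = DistributionClassifier()(torch.randn(9, 512))  # (12,) scores
```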
The input of the decoder is the encoded hidden state vector output by the encoder. The decoder includes N decoding levels, and at the end of each decoding level the hybrid expert module may provide the corresponding expert module for data of each category to process that level's output decoding hidden state vector. For any decoding level in the decoder, the hidden state vector output by the previous level is decoded by this level to obtain a first decoding hidden state vector; the first decoding hidden state vector is processed by the expert module corresponding to the information to be translated among the N expert modules, e.g., expert 2, to obtain a second decoding hidden state vector, which is input to the next decoding level; these steps are repeated at each level until the final decoding hidden state vector is obtained from the last decoding level and its corresponding expert module. The decoder may decode the input data in an autoregressive manner. Within each expert module, the first decoding hidden state vector may be processed in turn by the feed-forward neural network, i.e., a linear transformation, a ReLU activation, and another linear transformation, and the processed result is fused additively with the original first decoding hidden state to obtain the second decoding hidden state vector. Finally, the translation result of the information to be translated is obtained by applying a linear transformation and a Softmax activation to the final decoding hidden state vector.
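One hedged reading of the per-expert processing at a decoding level, a feed-forward branch whose output is fused additively (a residual connection) with its input, is sketched below; the hidden width is an assumption.

```python
import torch
import torch.nn as nn

class ExpertFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_hidden),
                                nn.ReLU(),
                                nn.Linear(d_hidden, d_model))

    def forward(self, h1: torch.Tensor) -> torch.Tensor:
        # h1: first decoding hidden state vector from the decoding level.
        return h1 + self.ff(h1)    # additive fusion -> second hidden state

h2 = ExpertFFN()(torch.randn(9, 512))  # fed to the next decoding level
```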
According to the model training method, the data distribution prediction module is trained by acquiring the data set label of the target data set and based on the target data set and the data set label; the method comprises the steps of performing supervised training on a data distribution prediction module based on a target data set with a data set label, enabling training results of the data distribution prediction module to be controllable, and providing possibility for decoupling among the modules; and then training a hybrid expert module based on the trained data distribution prediction module, and on the premise of ensuring the translation quality of the machine translation module obtained by training, decomposing the training process of each module, and independently training each module to realize decoupling among each module, thereby reducing the model training cost, reducing the model updating cost of the model deployment equipment and improving the practicability of the model training process.
In accordance with the present disclosure, in a method performed by an electronic device, a machine translation method that recognizes a user's speech and interprets the user's intent may receive a speech signal as an analog signal via a speech collecting device (e.g., a microphone) and convert the speech portion into computer-readable text using an Automatic Speech Recognition (ASR) model. The user's speech intent may be obtained by interpreting the converted text using a Natural Language Understanding (NLU) model. The ASR model or NLU model may be an artificial intelligence model, which may be processed by an artificial-intelligence-specific processor designed in a hardware architecture specified for artificial intelligence model processing. The artificial intelligence model may be obtained through training. Here, "obtained through training" means that a basic artificial intelligence model is trained with a plurality of training data by a training algorithm to obtain a predefined operating rule or artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the calculation of a layer is performed between the calculation result of the previous layer and that layer's weight values.
Language understanding is a technique for recognizing and applying/processing human language/text, including, for example, natural language processing, machine translation, dialog systems, question-answering, or speech recognition/synthesis.
The apparatus provided in the embodiments of the present application may implement at least one module of the plurality of modules through an AI model. The functions associated with the AI may be performed by a non-volatile memory, a volatile memory, and a processor.
The processor may include one or more processors. In this case, the one or more processors may be general-purpose processors such as a Central Processing Unit (CPU) or an Application Processor (AP), graphics-dedicated processors such as Graphics Processing Units (GPUs) and Vision Processing Units (VPUs), and/or AI-specific processors such as Neural Processing Units (NPUs).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operational rules or artificial intelligence models are provided through training or learning.
Here, providing by learning refers to deriving a predefined operation rule or an AI model having a desired characteristic by applying a learning algorithm to a plurality of learning data. The learning may be performed in the apparatus itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed using the calculation result of the previous layer and the current layer's weights. Examples of neural networks include, but are not limited to, convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), bidirectional recurrent deep neural networks (BRDNNs), generative adversarial networks (GANs), and deep Q-networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to make, allow, or control the target device to make a determination or prediction. Examples of such learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Fig. 29 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 29, the electronic device includes: a memory and a processor; at least one program is stored in the memory and, when executed by the processor, performs the method performed by the electronic device as described above.
In an alternative embodiment, an electronic device is provided, as shown in fig. 29, the electronic device 1000 shown in fig. 29 includes: a processor 1001 and a memory 1003. The processor 1001 is coupled to the memory 1003, such as via a bus 1002. Optionally, the electronic device 1000 may further include a transceiver 1004, where the transceiver 1004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 1004 is not limited to one, and the structure of the electronic device 1000 is not limited to the embodiments of the present application.
The processor 1001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 1001 may also be a combination that implements computing functionality, for example a combination including one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 1002 may include a path for transferring information between the above components. The bus 1002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 29, but this does not mean that there is only one bus or only one type of bus.
The memory 1003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 1003 is used for storing application program code (a computer program) for executing the solutions of the present application, and execution is controlled by the processor 1001. The processor 1001 is configured to execute the application program code stored in the memory 1003 to implement the content shown in the foregoing method embodiments.
The electronic device includes, but is not limited to, a server, a server cluster, a terminal, and the like.
Embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon which, when run on a computer, enables the computer to perform the corresponding content of the machine translation method and the model training method in the foregoing method embodiments.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, so that the electronic device performs the machine translation method and the model training method.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (23)

1. A method performed by an electronic device, the method comprising:
acquiring information to be translated;
determining a target domain converter corresponding to the information to be translated from a plurality of candidate domain converters based on the information to be translated, wherein each candidate domain converter corresponds to at least one domain;
and obtaining a translation result corresponding to the information to be translated based on the target domain converter corresponding to the information to be translated.
2. The method according to claim 1, wherein the determining, based on the information to be translated, a target domain converter corresponding to the information to be translated from a plurality of candidate domain converters includes:
acquiring a first coding feature of the information to be translated;
determining first indication information of the information to be translated according to the first coding characteristics, wherein the first indication information characterizes the possibility of each candidate domain converter as a target domain converter;
and determining a target domain converter corresponding to the information to be translated from the plurality of candidate domain converters according to the first indication information.
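By way of non-limiting illustration only, the selection step of claim 2 might be realized as below; the pooled feature shape, the softmax scoring, and all names are assumptions of this sketch rather than recitations of the claim:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def select_target_domain_converter(first_coding_feature, router_weights):
        # first_coding_feature: pooled first coding feature, shape (d,)
        # router_weights: one row per candidate domain converter, shape (K, d)
        first_indication = softmax(router_weights @ first_coding_feature)
        return first_indication, int(np.argmax(first_indication))

    rng = np.random.default_rng(1)
    indication, target = select_target_domain_converter(
        rng.standard_normal(16), rng.standard_normal((4, 16)))
    print(indication, target)  # likelihood per candidate, chosen converter index

Here the softmax vector plays the role of the first indication information, and the target domain converter is the candidate with the highest likelihood.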
3. The method according to claim 1 or 2, wherein the determining, based on the information to be translated, a target domain converter corresponding to the information to be translated from a plurality of candidate domain converters includes:
acquiring a first coding feature of the information to be translated;
obtaining segment decoding characteristics of each target segment corresponding to the information to be translated based on the first coding feature, obtaining second indication information of the target segments based on the segment decoding characteristics of each target segment, and determining a target domain converter of the target segment based on the second indication information of the target segment;
wherein the second indication information for each target segment characterizes a likelihood of the respective candidate domain converter as a target domain converter for the target segment;
the obtaining a translation result corresponding to the information to be translated based on the target domain converter corresponding to the information to be translated includes:
for each target segment, outputting a translation result of the target segment through a target domain converter corresponding to the target segment based on the segment decoding characteristics of the target segment.
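A non-limiting sketch of the segment-level routing of claim 3 follows; the dot-product scoring, the converter forms, and the greedy token output are assumptions introduced only for illustration:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def translate_segments(segment_features, router, converters, output_proj):
        results = []
        for h in segment_features:            # one row per target segment
            second_ind = softmax(router @ h)  # second indication information
            k = int(np.argmax(second_ind))    # target domain converter index
            h = converters[k](h)              # convert with the selected converter
            results.append(int(np.argmax(output_proj @ h)))  # output token id
        return results

    rng = np.random.default_rng(3)
    d, K, V, T = 8, 3, 20, 5
    converters = [(lambda W: (lambda h: h + h @ W))(0.1 * rng.standard_normal((d, d)))
                  for _ in range(K)]
    print(translate_segments(rng.standard_normal((T, d)),
                             rng.standard_normal((K, d)), converters,
                             rng.standard_normal((V, d))))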
4. The method of claim 3, wherein the obtaining second indication information of the target segments based on the segment decoding characteristics of each target segment, determining the target domain converter of the target segment based on the second indication information of the target segment, comprises:
for each target segment, determining second indication information of the target segment based on segment decoding characteristics of the target segment at a first decoding level;
and determining a target domain converter corresponding to the target segment at each decoding level based on the second indication information corresponding to the target segment at the first decoding level.
5. The method of claim 3, wherein the obtaining second indication information of the target segments based on the segment decoding characteristics of each target segment, determining the target domain converter of the target segment based on the second indication information of the target segment, comprises:
for each target segment, determining second indication information corresponding to the target segment at the corresponding decoding level according to segment decoding characteristics of the target segment at each decoding level, and determining a target domain converter corresponding to the target segment at the corresponding decoding level according to the second indication information corresponding to the target segment at the corresponding decoding level;
and the second indication information corresponding to the target segment at the corresponding decoding level characterizes the possibility that each candidate domain converter is used as the target domain converter corresponding to the target segment at the corresponding decoding level.
6. The method of claim 5, wherein determining the target domain converter corresponding to the target segment at the corresponding decoding level according to the second indication information corresponding to the target segment at the corresponding decoding level comprises:
and determining, according to the second indication information corresponding to the target segment at the corresponding decoding level, a target domain converter corresponding to the target segment at the corresponding decoding level from the candidate domain converters corresponding to that decoding level.
7. The method according to any one of claims 3 to 6, wherein the outputting, based on the segment decoding characteristics of the target segment, a translation result of the target segment through a target domain converter corresponding to the target segment, comprises:
for each decoding level, converting the target segment with the target domain converter corresponding to the corresponding decoding level according to the segment decoding characteristics of the target segment at the corresponding decoding level, to obtain converted segment decoding characteristics, and outputting the converted segment decoding characteristics;
and outputting the translation result of the target segment according to the converted segment decoding characteristics output by the final decoding level.
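Purely as an illustrative assumption, the conversion processing at one decoding level could take the form of a bottleneck projection with a residual connection applied to the segment decoding feature; nothing in the claim fixes the converter to this shape:

    import numpy as np

    def convert_at_level(h, down, up):
        # Bottleneck conversion of a segment decoding feature h with a
        # residual connection (one hypothetical converter realization).
        return h + np.maximum(h @ down, 0.0) @ up

    rng = np.random.default_rng(2)
    h = rng.standard_normal(16)
    print(convert_at_level(h, rng.standard_normal((16, 4)),
                           rng.standard_normal((4, 16))))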
8. The method according to any one of claims 3 to 7, further comprising: for each target segment, obtaining the segment decoding characteristics of the target segment at each decoding level in the following manner:
for a first decoding level, obtaining a segment decoding feature of the target segment at the first decoding level based on the first coding feature and a second coding feature of a translated segment preceding the target segment;
for a second decoding level, obtaining the segment decoding characteristics of the target segment at the second decoding level based on the first coding feature and the converted segment decoding characteristics of the target segment output by the previous decoding level;
wherein the first decoding level is the first of at least two decoding levels, and the second decoding level is any decoding level other than the first decoding level.
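A non-limiting sketch of the level-wise feature computation of claim 8 is given below; the tanh combinations stand in for the attention/feed-forward sublayers of an actual decoder and are assumptions of this sketch only:

    import numpy as np

    rng = np.random.default_rng(4)
    d = 8
    first_coding = rng.standard_normal(d)         # first coding feature
    prev_segment_coding = rng.standard_normal(d)  # second coding feature of the
                                                  # previously translated segment

    def make_level(seed):
        W = 0.1 * np.random.default_rng(seed).standard_normal((d, d))
        return lambda h, enc: np.tanh(h @ W + enc)

    # First decoding level: built from the first coding feature and the second
    # coding feature of the translated segment preceding the target segment.
    h = np.tanh(first_coding + prev_segment_coding)
    # Each later level consumes the converted output of the previous level
    # together with the first coding feature.
    for level in [make_level(s) for s in (1, 2, 3)]:
        h = level(h, first_coding)
    print(h)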
9. The method according to any one of claims 3 to 8, further comprising:
determining first indication information of the information to be translated according to the first coding feature of the information to be translated;
the determining the target domain converter of the target segment based on the second indication information of the target segment includes:
for each target segment, determining a target domain converter of the target segment according to the first indication information and the second indication information of the target segment.
10. The method of claim 9, wherein the determining the target domain converter for the target segment based on the first indication information and the second indication information for the target segment comprises:
acquiring a first weight corresponding to the first indication information and a second weight corresponding to the second indication information;
weighting the first indication information and the second indication information based on the first weight and the second weight to obtain third indication information;
and determining a target domain converter of the target segment based on the third indication information.
11. The method of claim 10, wherein the obtaining the first weight corresponding to the first indication information and the second weight corresponding to the second indication information comprises:
for each target segment, determining the second weight based on the position order of the target segment among the target segments, and obtaining the first weight based on the second weight;
wherein the second weight corresponding to a target segment is positively correlated with its position order.
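Claims 10 and 11 may be illustrated, without limitation, by the weighting sketch below; the linear schedule for the second weight is an assumption, and only its positive correlation with the segment's position is required by claim 11:

    import numpy as np

    def third_indication(first_ind, second_ind, position, num_segments):
        # The second weight grows with the segment's position among the
        # target segments; this linear schedule is an assumption.
        w2 = (position + 1) / num_segments
        w1 = 1.0 - w2
        combined = w1 * np.asarray(first_ind) + w2 * np.asarray(second_ind)
        return combined, int(np.argmax(combined))

    # Early segments lean on the sentence-level (first) indication; later
    # segments lean on the segment-level (second) indication.
    print(third_indication([0.7, 0.3], [0.2, 0.8], position=0, num_segments=4))
    print(third_indication([0.7, 0.3], [0.2, 0.8], position=3, num_segments=4))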
12. The method according to any one of claims 3 to 11, wherein the obtaining second indication information of the target segments based on the segment decoding characteristics of each target segment comprises:
and for each target segment, obtaining second indication information of the target segment based on the similarity between the segment decoding characteristics of the target segment and the domain feature vectors of the candidate domain converters.
13. The method according to any one of claims 3 to 11, wherein the obtaining second indication information of the target segments based on the segment decoding characteristics of each target segment comprises:
for each target segment, second indication information of the target segment is determined based on segment decoding characteristics of the target segment and segment decoding characteristics of translated segments preceding the target segment.
14. The method according to any one of claims 1 to 13, characterized in that the method comprises:
displaying a translation domain list, wherein the translation domain list comprises identification information of at least one domain in a plurality of candidate translation domains;
acquiring a first input of a user, wherein the first input is used for selecting a domain corresponding to the translation from the translation domain list;
and in response to the first input, downloading a domain converter of the corresponding domain.
15. The method of claim 14, wherein the method further comprises:
displaying update prompt information, wherein the update prompt information is used for prompting an update of the domain corresponding to the translation;
and in response to the acquired update instruction, updating the domain converter of the corresponding domain.
16. A method performed by an electronic device, the method comprising:
displaying a translation domain list, wherein the translation domain list comprises identification information of at least one domain in a plurality of candidate translation domains;
acquiring a first input of a user, wherein the first input is used for selecting a domain corresponding to the translation from the translation domain list;
and in response to the first input, downloading a domain converter of the corresponding domain.
17. The method of claim 16, wherein the method further comprises:
displaying update prompt information, wherein the update prompt information is used for prompting an update of the domain corresponding to the translation;
and in response to the acquired update instruction, updating the domain converter of the corresponding domain.
18. A method performed by an electronic device, the method comprising:
acquiring a data set label of a target data set, wherein the data set label represents the data distribution category of each data in the target data set;
training a data distribution prediction module based on the target data set and the data set label, wherein the data distribution prediction module is used for predicting the probability that each data in the target data set belongs to each data distribution category, and each data distribution category corresponds to at least one domain;
based on the trained data distribution prediction module, training each candidate domain converter to obtain a machine translation model, wherein each candidate domain converter corresponds to at least one domain.
19. The method of claim 18, wherein the target data set includes at least a first data set obtained by sampling a source data set to be translated;
and the acquiring of the data set label of the target data set comprises the following steps:
acquiring first coding features of each first data in the first data set based on a trained encoder;
and classifying the first data based on the first coding features of the first data to obtain a data set label of the target data set.
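One non-limiting way to realize the classification of the first coding features in claim 19 is unsupervised clustering; the use of scikit-learn's KMeans here is an assumption of this sketch, not a recitation of the claim:

    import numpy as np
    from sklearn.cluster import KMeans

    def label_dataset(first_coding_features, num_categories):
        # first_coding_features: (N, d) first coding features of the first
        # data, produced by the trained encoder; the returned cluster
        # assignments serve as the data set label.
        return KMeans(n_clusters=num_categories, n_init=10).fit_predict(
            first_coding_features)

    rng = np.random.default_rng(5)
    print(label_dataset(rng.standard_normal((100, 16)), num_categories=4))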
20. The method of claim 18 or 19, wherein training each candidate domain converter based on the trained data distribution prediction module comprises:
based on the trained data distribution prediction module, obtaining a prediction result of a training data set;
determining a target domain converter corresponding to each data in the training data set in each candidate domain converter based on a prediction result of the training data set;
acquiring translation results corresponding to each data in the training data set based on the target domain converters corresponding to each data;
and training the candidate domain converters based on the translation results.
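As a non-limiting sketch of the routing in claim 20, each training pair can be grouped under the candidate domain converter selected from the trained data distribution prediction module's output before the per-converter training step; predict_distribution below is a hypothetical callable standing in for that module:

    import numpy as np

    def route_training_data(training_set, predict_distribution):
        # Group each (source, target) pair under the candidate domain
        # converter chosen from the prediction module's probabilities.
        batches = {}
        for source, target in training_set:
            k = int(np.argmax(predict_distribution(source)))
            batches.setdefault(k, []).append((source, target))
        return batches

    # Toy usage with a stand-in prediction module over 3 categories.
    rng = np.random.default_rng(6)
    data = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(10)]
    batches = route_training_data(data, lambda src: np.abs(src[:3]))
    print({k: len(v) for k, v in batches.items()})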
21. The method of any one of claims 18 to 20, wherein after training each candidate domain converter based on the trained data distribution prediction module, the method further comprises:
training the data distribution prediction module based on the target data set and a data set of a newly added category, so as to obtain first updated data, wherein the newly added category is different from any data distribution category of the target data set;
adding a first domain converter corresponding to the newly added category to the candidate domain converters;
training the first domain converter based on the data set of the newly added category to obtain second updated data;
and sending the first updated data and the second updated data to a device to be updated, so that the device to be updated updates its data distribution prediction module based on the first updated data and adds the first domain converter to its machine translation model based on the second updated data.
22. An electronic device, the electronic device comprising:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the method according to any one of claims 1 to 21.
23. A computer readable storage medium for storing computer instructions which, when run on a computer, cause the computer to perform the method of any one of the preceding claims 1 to 21.
CN202211243383.6A 2022-06-14 2022-10-11 Machine translation method, apparatus and storage medium Pending CN117291193A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/KR2023/007944 WO2023243946A1 (en) 2022-06-14 2023-06-09 Machine translation method, devices, and storage media
US18/209,790 US20230401391A1 (en) 2022-06-14 2023-06-14 Machine translation method, devices, and storage media

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022106749297 2022-06-14
CN202210674929 2022-06-14

Publications (1)

Publication Number Publication Date
CN117291193A true CN117291193A (en) 2023-12-26

Family

ID=89237754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211243383.6A Pending CN117291193A (en) 2022-06-14 2022-10-11 Machine translation method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN117291193A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972434A (en) * 2024-03-28 2024-05-03 腾讯科技(深圳)有限公司 Training method, training device, training equipment, training medium and training program product for text processing model
CN117972434B (en) * 2024-03-28 2024-06-11 腾讯科技(深圳)有限公司 Training method, training device, training equipment, training medium and training program product for text processing model


Legal Events

Date Code Title Description
PB01 Publication