CN108228576B - Text translation method and device - Google Patents


Publication number
CN108228576B
CN108228576B (application CN201711488585.6A)
Authority
CN
China
Prior art keywords
text
source text
translation
candidate target
clustering
Prior art date
Legal status
Active
Application number
CN201711488585.6A
Other languages
Chinese (zh)
Other versions
CN108228576A (en)
Inventor
黄宜鑫
孟廷
刘俊华
魏思
胡国平
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201711488585.6A
Publication of CN108228576A
Application granted
Publication of CN108228576B
Legal status: Active

Classifications

    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation (under G Physics; G06 Computing; Calculating or Counting; G06F Electric Digital Data Processing)
    • G06F16/353: Clustering; Classification into predefined classes (information retrieval of unstructured textual data)
    • G06F16/355: Class or cluster creation or modification (information retrieval of unstructured textual data)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a text translation method and device, belonging to the technical field of language processing. The method comprises the following steps: determining the cluster category to which a source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category; vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the word segments in the source text with that cluster category vector, inputting the integration result into a translation model, and outputting at least one candidate target text together with a translation score for each candidate target text; and selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text. The source text can thus be translated by combining its overall semantics with other latent translation reference factors, improving the domain robustness and translation accuracy of the translation model.

Description

Text translation method and device
Technical Field
The embodiment of the invention relates to the technical field of language processing, in particular to a text translation method and a text translation device.
Background
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Current work focuses on machine translation of the source text (the text in the source language) in combination with the user's application field, that is, the application field of the user's speech content is taken into account during machine translation. Application fields may include education, scientific research, the humanities, and so on. For a source text obtained through speech recognition, the related art provides the following two text translation methods:
The first is a text translation method at the corpus level: the application field of the source text is determined, the training corpora belonging to that field are selected, and a translation model is built on the selected corpora, so that the source text is translated with the resulting model.
The second is a text translation method at the model level: several translation models for different application fields are combined. For example, each translation model is weighted according to the correlation between the application field of the source text and the field of that model; all translation models are then combined according to their weights to produce a new mixed model, and the source text is translated with the mixed model.
Both methods require the application field of the source text to be determined in advance. In practice, however, that field can be difficult to determine, and the same vocabulary may belong to multiple application fields, so accurate translation is difficult.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a text translation method and apparatus that overcome the above problems or at least partially solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a text translation method, including:
determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
vectorizing the clustering category to which the source text belongs to obtain a clustering category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the clustering category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
According to the method provided by the embodiment of the invention, the cluster category to which the source text belongs is determined based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category. The cluster category is vectorized to obtain the cluster category vector corresponding to the source text; the word vectors of the word segments in the source text are integrated with that cluster category vector; the integration result is input into a translation model; and at least one candidate target text is output, each with a corresponding translation score. One candidate target text is then selected from all candidate target texts as the translation result of the source text based on the translation score of each. Because the cluster category to which the source text belongs is determined before translation, and the source text and its cluster category serve together as input parameters of the translation model, the translation process can draw on the overall semantics of the source text and other latent translation reference factors. The domain robustness and translation accuracy of the translation model are thereby improved.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the method further includes:
and averaging the word vectors of all word segments in the source text to obtain the feature vector of the source text.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, determining a cluster type to which a source text belongs based on a feature vector of the source text and a cluster center feature vector corresponding to each cluster type includes:
calculating the distance between the feature vector corresponding to the source text and each clustering center feature vector, determining the clustering center feature vector corresponding to the minimum distance in all the calculated distances, and taking the clustering center feature vector as a target clustering center feature vector;
and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
With reference to the first possible implementation manner of the first aspect, in a fourth possible implementation manner, selecting one candidate target text from all candidate target texts as a translation result of a source text based on a translation score of each candidate target text includes:
respectively inputting each candidate target text into a domain language model corresponding to the clustering category to which the source text belongs, and outputting a domain language model score of each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the score of the domain language model.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, selecting one candidate target text from all candidate target texts as a translation result of a source text according to a translation score of each candidate target text and a domain language model score includes:
and carrying out weighted summation on the translation score of each candidate target text and the score of the field language model to obtain the comprehensive score of each candidate target text, and selecting the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
With reference to the first possible implementation manner of the first aspect, in a sixth possible implementation manner, integrating word vectors of word segments in a source text with cluster category vectors corresponding to the source text includes:
adding the cluster category vector corresponding to the source text before the word vector of the first word segment in the source text; or,
concatenating the cluster category vector corresponding to the source text with the word vector of each word segment in the source text; or,
adding the cluster category vector corresponding to the source text before the word vector of the first word segment in the source text, and also concatenating the cluster category vector corresponding to the source text with the word vector of each word segment in the source text.
With reference to the first possible implementation manner of the first aspect, in a seventh possible implementation manner, the translation model is an encoding and decoding model, an encoding model in the translation model adopts a bidirectional recurrent neural network structure, and a decoding model in the translation model adopts a recurrent neural network structure; accordingly, inputting the integration result into the translation model, and outputting at least one candidate target text, including:
inputting the integration result into a translation model, and respectively obtaining a forward representation and a reverse representation of each word in the source text under the clustering category to which the source text belongs;
splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain a representation vector of each word in the source text;
and decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text.
According to a second aspect of embodiments of the present invention, there is provided a text translation apparatus including:
the determining module is used for determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
the translation module is used for vectorizing the cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the cluster category vector corresponding to the source text, inputting an integration result into the translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and the selection module is used for selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
According to a third aspect of embodiments of the present invention, there is provided a text translation apparatus including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the text translation method provided by any of the various possible implementations of the first aspect.
According to a fourth aspect of the present invention, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a text translation method as provided in any one of the various possible implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of embodiments of the invention.
Drawings
Fig. 1 is a schematic flow chart of a text translation method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another text translation method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another text translation method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a text translation method according to another embodiment of the present invention;
FIG. 5 is a block diagram of a text translation apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of a text translation apparatus according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the drawings and examples. The following examples are intended to illustrate the embodiments of the present invention, not to limit their scope.
At present, the related art mainly translates the source text in combination with its application field. Application fields may be divided by application scenario into scientific research, the humanities, education, and the like. Since the application field of a source text may be difficult to determine during actual translation (the source text may relate to several application fields at once, making it hard to settle on a specific one), accurate translation is difficult. Moreover, a word segment in the source text may belong to multiple application fields, with different translations in each: for example, the word "china" is translated as the country China in the news field but as porcelain in the antiques field, which further increases the difficulty of accurate translation.
In addition to the application field, other common latent information in the source text, such as subject, genre, and writing style, may also serve as reference factors for translation. In view of this, an embodiment of the present invention provides a text translation method. The method is suitable for a speech translation scenario in which a source speech signal is translated into a target text, and also for a general translation scenario in which text in one language is translated into text in another language; the embodiment of the present invention does not specifically limit this. Referring to fig. 1, the method includes:
101. determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category; each cluster category corresponds to one cluster center feature vector, and the cluster categories and their cluster center feature vectors are determined by clustering the feature vectors of the training source texts;
102. vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vector of each word segment in the source text with that cluster category vector, inputting the integration result into a translation model, and outputting at least one candidate target text together with a translation score for each;
103. selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text.
Clustering into categories mainly classifies the common latent information that may be useful in translation, so that this information can be exploited during translation and the translation result becomes more accurate. Before step 101 is performed, the cluster categories and the corresponding cluster center feature vectors can be determined. Specifically, a large training corpus covering many fields and many kinds of common latent information can be clustered without supervision using the KMeans algorithm, yielding the different cluster categories and the cluster center feature vector corresponding to each. Of course, other clustering algorithms may also be adopted in actual implementation, which is not specifically limited in the embodiment of the present invention.
Taking the KMeans algorithm as an example: to perform unsupervised clustering, the feature vector of each training source text in the corpus is first computed. To do so, word2vec can be trained on the data set formed by the training source texts; after training, a word vector is obtained for each word segment of each training source text. Averaging the word vectors of all word segments in a training source text then yields that text's feature vector.
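As a minimal sketch of the averaging step described above (the helper name and toy vectors are assumptions; the patent does not prescribe an implementation), the feature vector of a text is simply the element-wise mean of its word vectors:

```python
import numpy as np

def sentence_feature_vector(word_vectors):
    """Average the word vectors of all word segments in a text to
    obtain the text-level feature vector."""
    return np.mean(np.asarray(word_vectors, dtype=float), axis=0)

# Three 4-dimensional word vectors for one training source text.
vecs = [[1.0, 0.0, 2.0, 0.0],
        [3.0, 2.0, 0.0, 0.0],
        [2.0, 4.0, 1.0, 0.0]]
print(sentence_feature_vector(vecs))  # [2. 2. 1. 0.]
```

The same helper can later compute the feature vector of a source text to be translated.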
Clustering is performed on the feature vectors of the training source texts. After clustering, the cluster center feature vectors (dv1, dv2, dv3, ..., dvK) are obtained, and the cluster categories are denoted {d1, d2, d3, ..., dK}, where K is the total number of cluster categories. For example, d1 denotes the first cluster category and dv1 the cluster center feature vector corresponding to it; dK denotes the K-th cluster category and dvK the cluster center feature vector corresponding to it.
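The unsupervised clustering step can be sketched as a bare-bones KMeans in NumPy; the iteration count and seeding are assumptions for this sketch, not the patent's exact procedure (in practice a library implementation such as scikit-learn's would normally be used):

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Bare-bones KMeans: returns the cluster center feature vectors
    (dv1, ..., dvK) and the cluster label of each training text."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every feature vector to its nearest center (Euclidean).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of the vectors assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers, labels

# Feature vectors of four training source texts, two per latent topic.
feats = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
centers, labels = kmeans(feats, k=2)
```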
After the clustering process is completed, the feature vector of the source text to be translated can be obtained. The embodiment of the present invention does not specifically limit the manner of obtaining this feature vector, which includes but is not limited to: averaging the word vectors of all word segments in the source text to obtain the feature vector of the source text.
After the feature vector of the source text is obtained, the cluster category to which the source text belongs can be determined based on that feature vector and the cluster center feature vector corresponding to each cluster category. As noted above, the cluster categories are denoted {d1, d2, d3, ..., dK}; that is, each cluster category corresponds to an identifier. So that the source text can subsequently be translated on the basis of its cluster category, the cluster category is vectorized to obtain the cluster category vector corresponding to the source text. The cluster category may be vectorized by table lookup or by word2vec, which is not specifically limited in this embodiment of the present invention.
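The table-lookup vectorization mentioned above can be sketched as an embedding table indexed by the cluster category identifier; the table size, vector dimension, and random initialization here are illustrative assumptions (in practice the rows would be learned jointly with the translation model):

```python
import numpy as np

K, dim = 4, 8  # number of cluster categories and category-vector size (assumed)
rng = np.random.default_rng(0)
# Lookup table: row j holds the cluster category vector for category d_(j+1).
category_table = rng.normal(size=(K, dim))

def category_vector(category_id):
    """Table-lookup vectorization of a 1-based cluster category identifier."""
    return category_table[category_id - 1]

dk = category_vector(2)
```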
After the clustering category vector corresponding to the source text is obtained, the word vector of the word segmentation in the source text and the clustering category vector corresponding to the source text can be integrated, the integrated result is used as input to a translation model, at least one candidate target text is output, and a translation score corresponding to each candidate target text is output at the same time. The translation model may be obtained by training an initial model based on training source texts and training target texts of different clustering categories, and the initial model may be a Recurrent Neural Networks (RNN) type or the like, which is not specifically limited in this embodiment of the present invention.
After the candidate target texts and the corresponding translation scores are obtained, one candidate target text can be selected from all the candidate target texts as a translation result of the source text based on the translation score of each candidate target text. In the specific selection, the candidate target text with the highest translation score may be selected as the target text after the source text is translated, which is not specifically limited in the embodiment of the present invention.
According to the method provided by the embodiment of the invention, the cluster category to which the source text belongs is determined based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category. The cluster category is vectorized to obtain the cluster category vector corresponding to the source text; the word vectors of the word segments in the source text are integrated with that cluster category vector; the integration result is input into a translation model; and at least one candidate target text is output, each with a corresponding translation score. One candidate target text is then selected from all candidate target texts as the translation result of the source text based on the translation score of each. Because the cluster category to which the source text belongs is determined before translation, and the source text and its cluster category serve together as input parameters of the translation model, the translation process can draw on the overall semantics of the source text and other latent translation reference factors, improving the domain robustness and translation accuracy of the translation model.
Based on the content of the above embodiment, as an optional embodiment, the embodiment of the present invention further provides a method for determining a cluster category to which the source text belongs. Referring to fig. 2, the method includes: 1011. calculating the distance between the feature vector corresponding to the source text and each clustering center feature vector, determining the clustering center feature vector corresponding to the minimum distance in all the calculated distances, and taking the clustering center feature vector as a target clustering center feature vector; 1012. and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
In step 1011 above, when the distance between the feature vector corresponding to the source text and each cluster center feature vector is calculated, the euclidean distance between the feature vector corresponding to the source text and each cluster center feature vector may be calculated. For the cluster center feature vector corresponding to the minimum distance, the cluster category corresponding to the cluster center feature vector can be used as the cluster category to which the source text belongs.
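Steps 1011 and 1012 can be sketched as a nearest-center search under Euclidean distance (function and variable names are hypothetical):

```python
import numpy as np

def assign_cluster(source_fv, center_fvs):
    """Return the 1-based cluster category whose center feature vector is
    nearest (Euclidean distance) to the source-text feature vector."""
    dists = np.linalg.norm(np.asarray(center_fvs, dtype=float)
                           - np.asarray(source_fv, dtype=float), axis=1)
    return int(np.argmin(dists)) + 1

centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 9.0]]
print(assign_cluster([4.2, 4.8], centers))  # → 2 (closest to center (5, 5))
```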
Based on the content of the foregoing embodiment, in view of that the matching degree between the candidate target texts and the cluster category to which the source text belongs may not be high enough, and in order to avoid the situation that the translation result is inaccurate, as an optional embodiment, the embodiment of the present invention further provides a method for selecting one candidate target text from all candidate target texts as the translation result of the source text. Referring to fig. 3, the method includes: 1031. respectively inputting each candidate target text into a domain language model corresponding to the clustering category to which the source text belongs, and outputting a domain language model score of each candidate target text; 1032. and selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the score of the domain language model.
In the above step 1031, the cluster category to which the source text belongs may be taken as the cluster category to which the target text belongs. A large number of target texts under the cluster category, namely current domain target texts, can be selected according to the cluster category to which the target texts belong, and a domain language model can be constructed by utilizing the current domain target texts under the cluster category. The construction method is the same as the construction method of the existing language model, and the embodiment of the present invention is not particularly limited in this respect. After the domain language model is obtained, the domain language model score of each candidate target text can be calculated through the domain language model. The higher the score of the domain language model is, the higher the accuracy of the candidate target text corresponding to the score of the domain language model as a translation result is.
After the domain language model score of each candidate target text is obtained, one candidate target text can be selected as a translation result of the source text according to the translation score of each candidate target text and the domain language model score. The embodiment of the present invention does not specifically limit the manner of selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text, and includes but is not limited to: and carrying out weighted summation on the translation score of each candidate target text and the score of the field language model to obtain the comprehensive score of each candidate target text, and selecting the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
The weighted summation may be a linear fusion or a nonlinear fusion, which is not specifically limited in this embodiment of the present invention. For linear fusion, the composite score can be calculated by the following formula:
Sf = λ·Strans + (1 − λ)·Slm
In the above formula, for any candidate target text, Sf denotes the composite score of that candidate target text, Strans denotes its translation score, Slm denotes its domain language model score, and λ denotes the weight of the translation score. The value of λ may be determined in advance according to application requirements.
Based on the content of the foregoing embodiment, as an optional embodiment, an embodiment of the present invention further provides a method for integrating word vectors of word segments in a source text with cluster category vectors corresponding to the source text, where the method includes:
adding a clustering category vector corresponding to the source text before a word vector of a first word segmentation in the source text; or, respectively splicing the clustering category vector corresponding to the source text with the word vector of each participle in the source text; or adding a clustering category vector corresponding to the source text before the word vector of the first word segmentation in the source text, and splicing the clustering category vector corresponding to the source text with the word vector of each word segmentation in the source text respectively.
Take the word vectors of all the word segments in the source text as x = (x_1, x_2, x_3, ..., x_m) and the cluster category vector corresponding to the source text as d_k as an example. After integrating the word vectors of the word segments in the source text with the cluster category vector corresponding to the source text, the integration result corresponding to the first integration mode is (d_k, x_1, x_2, x_3, ..., x_m); the integration result corresponding to the second integration mode is (d_k⊕x_1, d_k⊕x_2, d_k⊕x_3, ..., d_k⊕x_m), where ⊕ denotes the splicing (concatenation) of two vectors; and the integration result corresponding to the third integration mode is (d_k, d_k⊕x_1, d_k⊕x_2, d_k⊕x_3, ..., d_k⊕x_m). In the third integration mode, d_k⊕x_1 denotes the result of splicing the cluster category vector corresponding to the source text with the word vector of the first word segment in the source text, d_k⊕x_2 denotes the result of splicing the cluster category vector corresponding to the source text with the word vector of the second word segment in the source text, and so on.
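The three integration modes can be sketched as follows. This is a hypothetical illustration using NumPy; the function name, the mode numbering, and the list-of-vectors return layout are assumptions, since the patent does not prescribe an implementation:

```python
import numpy as np

def integrate(word_vecs, d_k, mode):
    """Integrate word vectors x_1..x_m with cluster category vector d_k.

    word_vecs: list of 1-D arrays (one per word segment)
    d_k: 1-D cluster category vector
    Returns the integrated input sequence as a list of 1-D vectors.
    """
    if mode == 1:  # prepend d_k before the first word vector
        return [d_k] + list(word_vecs)
    if mode == 2:  # splice (concatenate) d_k onto each word vector
        return [np.concatenate([d_k, x]) for x in word_vecs]
    if mode == 3:  # both: prepend d_k, and splice d_k onto each word vector
        return [d_k] + [np.concatenate([d_k, x]) for x in word_vecs]
    raise ValueError("mode must be 1, 2 or 3")
```

Note that mode 2 and mode 3 change the per-position input dimension seen by the encoder, while mode 1 only lengthens the input sequence by one position.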
In the above embodiments, the source text is translated mainly according to the cluster category to which it belongs, that is, mainly from the perspective of semantic analysis; in an actual translation process, context information usually also needs to be combined. Based on the content of the above embodiments, as an optional embodiment, the embodiment of the present invention further provides a method for translating the integration result of the word vectors of the word segments in the source text and the cluster category vector corresponding to the source text into candidate target texts. Specifically, the translation model used in the translation process may be an encoder-decoder model, in which the encoding model adopts a bidirectional recurrent neural network structure and the decoding model adopts a recurrent neural network structure.
Accordingly, referring to fig. 4, the method comprises:
1021. inputting the integration result into the translation model, and respectively obtaining a forward representation and a reverse representation of each word segment in the source text under the cluster category to which the source text belongs;
1022. splicing the forward representation and the reverse representation of each word segment under the cluster category to which the source text belongs to obtain a characterization vector of each word segment in the source text;
1023. decoding the source text based on the characterization vector of each word segment in the source text to obtain at least one candidate target text.
Specifically, for the cluster category to which the source text belongs, the forward characterization f_i of each word segment, which sees the historical vocabulary information under the cluster category, can be obtained through the forward recurrent neural network in the bidirectional recurrent neural network structure; the reverse characterization b_i of each word segment, which sees the future vocabulary information under the cluster category, can be obtained through the reverse recurrent neural network in the bidirectional recurrent neural network structure. Finally, the two are spliced to form the characterization vector h_i of each word segment in the source text. On this basis, the characterization vector h_i of each word segment in the source text is input into the recurrent neural network, and at least one candidate target text is output. At the same time, a translation score of each candidate target text may also be output.
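A minimal sketch of the bidirectional encoding step, assuming plain tanh recurrent cells and hypothetical weight-matrix names (the patent does not specify the cell type or parameterization):

```python
import numpy as np

def birnn_encode(inputs, W_f, U_f, W_b, U_b):
    """Compute characterization vectors h_i = [f_i ; b_i] for a sequence.

    inputs: list of 1-D integrated input vectors
    W_f, U_f: input and recurrent weights of the forward RNN (assumed names)
    W_b, U_b: input and recurrent weights of the reverse RNN (assumed names)
    """
    # Forward pass: f_i sees the historical vocabulary x_1..x_i
    h = np.zeros(U_f.shape[0])
    fwd = []
    for x in inputs:
        h = np.tanh(W_f @ x + U_f @ h)
        fwd.append(h)
    # Reverse pass: b_i sees the future vocabulary x_i..x_m
    h = np.zeros(U_b.shape[0])
    bwd = []
    for x in reversed(inputs):
        h = np.tanh(W_b @ x + U_b @ h)
        bwd.append(h)
    bwd.reverse()
    # Splice forward and reverse characterizations into h_i
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

The resulting h_i would then be consumed by the decoding recurrent network to produce candidate target texts; that decoding step is omitted here.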
According to the method provided by the embodiment of the invention, the integration result is input into the translation model, and the forward representation and the reverse representation of each word segment in the source text under the cluster category to which the source text belongs are respectively obtained; the forward representation and the reverse representation of each word segment are spliced to obtain the characterization vector of each word segment in the source text; and the source text is decoded based on these characterization vectors to obtain at least one candidate target text. In this way, the source text is translated not only from the perspective of semantic analysis but also, on the premise of the cluster category to which it belongs, in combination with context information, so that the accuracy of text translation is further improved.
It should be noted that, all the above-mentioned alternative embodiments may be combined arbitrarily to form alternative embodiments of the present invention, and are not described in detail herein.
Based on the content of the foregoing embodiments, an embodiment of the present invention provides a text translation apparatus, where the text translation apparatus is configured to execute the text translation method in the foregoing method embodiments. Referring to fig. 5, the apparatus includes:
a determining module 501, configured to determine a cluster type to which a source text belongs based on a feature vector of the source text and a cluster center feature vector corresponding to each cluster type; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
the translation module 502 is configured to perform vectorization on a cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrate word vectors of word segments in the source text with the cluster category vector corresponding to the source text, input an integration result into the translation model, and output at least one candidate target text and a translation score corresponding to each candidate target text;
a selecting module 503, configured to select one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
As an alternative embodiment, the apparatus further comprises:
and the calculation module is used for averaging the word vectors of all the participles in the source text to obtain the feature vector of the source text.
As an optional embodiment, the determining module 501 is configured to calculate a distance between a feature vector corresponding to a source text and each cluster center feature vector, determine a cluster center feature vector corresponding to a minimum distance among all the calculated distances, and use the cluster center feature vector as a target cluster center feature vector; and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
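The feature-vector averaging and nearest-cluster-center assignment performed by the calculation module and the determining module 501 can be sketched as follows. This is a minimal illustration assuming Euclidean distance, which the patent leaves unspecified; the function and variable names are hypothetical:

```python
import numpy as np

def assign_cluster(word_vecs, centers):
    """Return the index of the cluster category the source text belongs to.

    word_vecs: (m, dim) array of word vectors of the source text's word segments
    centers: (K, dim) array of cluster center feature vectors
    """
    # Feature vector of the source text: average of its word vectors
    feat = np.mean(word_vecs, axis=0)
    # Distance from the feature vector to each cluster center feature vector
    dists = np.linalg.norm(centers - feat, axis=1)
    # The cluster center with the minimum distance is the target cluster center
    return int(np.argmin(dists))
```

The returned index identifies both the cluster category vector d_k used during translation and the domain language model used when rescoring candidates.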
As an alternative embodiment, the selecting module 503 includes:
the computing unit is used for respectively inputting each candidate target text into the domain language model corresponding to the clustering category to which the source text belongs and outputting the domain language model score of each candidate target text;
and the selecting unit is used for selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the domain language model score.
As an optional embodiment, the selecting unit is configured to perform weighted summation on the translation score and the domain language model score of each candidate target text to obtain a comprehensive score of each candidate target text, and select the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
As an alternative embodiment, the translation module 502 is configured to add a cluster category vector corresponding to a source text before a word vector of a first word segmentation in the source text; or, respectively splicing the clustering category vector corresponding to the source text with the word vector of each participle in the source text; or adding a clustering category vector corresponding to the source text before the word vector of the first word segmentation in the source text, and splicing the clustering category vector corresponding to the source text with the word vector of each word segmentation in the source text respectively.
As an alternative embodiment, the translation model includes a bidirectional recurrent neural network and a recurrent neural network, the bidirectional recurrent neural network includes a forward recurrent neural network and a reverse recurrent neural network; correspondingly, the translation module 502 is configured to input the integration result into a translation model, and obtain a forward characterization and a reverse characterization of each participle in the source text under the cluster category to which the source text belongs; splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain a representation vector of each word in the source text; and decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text.
The device provided by the embodiment of the invention determines the cluster category to which the source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category; vectorizes the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text; integrates the word vectors of the word segments in the source text with the cluster category vector corresponding to the source text; inputs the integration result into the translation model, and outputs at least one candidate target text together with a translation score corresponding to each candidate target text; and selects one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text. Since the cluster category to which the source text belongs is determined before translation, and the source text and its cluster category together serve as input parameters of the translation model, the translation process can translate the source text in combination with the overall semantics of the source text and other implicit translation reference elements. Therefore, the domain robustness and the translation accuracy of the translation model are improved.
In addition, the integration result is input into the translation model, and the forward representation and the reverse representation of each word in the source text under the cluster type of the source text are respectively obtained. And splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain the representation vector of each word in the source text. And decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text. The source text can be translated in the aspect of semantic analysis, and the source text can be translated by combining the context information on the premise of the cluster type of the source text, so that the accuracy of text translation is further improved.
The embodiment of the invention provides text translation equipment. Referring to fig. 6, the apparatus includes: a processor (processor)601, a memory (memory)602, and a bus 603;
the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is used for calling the program instructions in the memory 602 to execute the text translation method provided by the above embodiment, for example, including: determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts; vectorizing the clustering category to which the source text belongs to obtain a clustering category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the clustering category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions cause a computer to execute a text translation method provided in the foregoing embodiment, for example, the method includes:
determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts; vectorizing the clustering category to which the source text belongs to obtain a clustering category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the clustering category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the text translation apparatus and the like are merely illustrative, where units illustrated as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the various embodiments or some parts of the methods of the embodiments.
Finally, it should be noted that the above embodiments are merely preferred embodiments of the present application and are not intended to limit the scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A method of text translation, comprising:
determining a clustering category to which a source text belongs based on a feature vector of the source text and a clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
vectorizing the cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the cluster category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
2. The method of claim 1, further comprising:
and averaging word vectors of all the participles in the source text to obtain the feature vector of the source text.
3. The method according to claim 1, wherein the determining the cluster type to which the source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster type comprises:
calculating the distance between the feature vector corresponding to the source text and each clustering center feature vector, determining the clustering center feature vector corresponding to the minimum distance in all the calculated distances, and taking the clustering center feature vector as a target clustering center feature vector;
and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
4. The method of claim 1, wherein the selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text comprises:
inputting each candidate target text into a domain language model corresponding to the clustering category to which the source text belongs respectively, and outputting a domain language model score of each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the score of the domain language model.
5. The method of claim 4, wherein selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text comprises:
and carrying out weighted summation on the translation score of each candidate target text and the score of the field language model to obtain the comprehensive score of each candidate target text, and selecting the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
6. The method of claim 1, wherein the integrating the word vector of the participle in the source text with the cluster category vector corresponding to the source text comprises:
adding a clustering category vector corresponding to the source text before a word vector of a first word segmentation in the source text; alternatively,
respectively splicing the clustering category vector corresponding to the source text with the word vector of each participle in the source text; alternatively,
adding a clustering category vector corresponding to the source text before a word vector of a first word segmentation in the source text, and splicing the clustering category vector corresponding to the source text with a word vector of each word segmentation in the source text respectively.
7. The method according to claim 1, wherein the translation model is a coding/decoding model, a coding model in the translation model adopts a bidirectional recurrent neural network structure, and a decoding model in the translation model adopts a recurrent neural network structure; accordingly, the inputting the integration result into the translation model and the outputting at least one candidate target text comprises:
inputting the integration result into the translation model, and respectively obtaining a forward representation and a reverse representation of each word in the source text under the clustering category to which the source text belongs;
splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain a representation vector of each word in the source text;
and decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text.
8. A text translation apparatus, comprising:
the determining module is used for determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
the translation module is used for vectorizing the cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the cluster category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and the selection module is used for selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text.
9. A computer device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
CN201711488585.6A 2017-12-29 2017-12-29 Text translation method and device Active CN108228576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711488585.6A CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711488585.6A CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Publications (2)

Publication Number Publication Date
CN108228576A CN108228576A (en) 2018-06-29
CN108228576B true CN108228576B (en) 2021-07-02

Family

ID=62647444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711488585.6A Active CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Country Status (1)

Country Link
CN (1) CN108228576B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598002A (en) * 2018-11-15 2019-04-09 重庆邮电大学 Neural machine translation method and system based on bidirectional circulating neural network
CN109902309B (en) * 2018-12-17 2023-06-02 北京百度网讯科技有限公司 Translation method, device, equipment and storage medium
CN111428518B (en) * 2019-01-09 2023-11-21 科大讯飞股份有限公司 Low-frequency word translation method and device
CN109885811B (en) * 2019-01-10 2024-05-14 平安科技(深圳)有限公司 Article style conversion method, apparatus, computer device and storage medium
CN111949789A (en) * 2019-05-16 2020-11-17 北京京东尚科信息技术有限公司 Text classification method and text classification system
CN110211570B (en) * 2019-05-20 2021-06-25 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN111460264B (en) * 2020-03-30 2023-08-01 口口相传(北京)网络技术有限公司 Training method and device for semantic similarity matching model
CN111597826B (en) * 2020-05-15 2021-10-01 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN85101759A (en) * 1985-04-01 1987-01-10 株式会社日立制作所 Interpretation method
WO2006014343A2 (en) * 2004-07-02 2006-02-09 Text-Tech, Llc Automated evaluation systems and methods
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN104090870A (en) * 2014-06-26 2014-10-08 武汉传神信息技术有限公司 Pushing method of online translation engines
CN104516870A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Translation check method and system
CN105528342A (en) * 2015-12-29 2016-04-27 科大讯飞股份有限公司 Intelligent translation method and system in input method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572631B (en) * 2014-12-03 2018-04-13 北京捷通华声语音技术有限公司 The training method and system of a kind of language model
RU2601166C2 (en) * 2015-03-19 2016-10-27 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Anaphora resolution based on a deep analysis technology
CN105786798B (en) * 2016-02-25 2018-11-02 上海交通大学 Natural language is intended to understanding method in a kind of human-computer interaction
US10460038B2 (en) * 2016-06-24 2019-10-29 Facebook, Inc. Target phrase classifier


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Domain Adaptation for Statistical Machine Translation with Domain Dictionary and Monolingual Corpora;Hua Wu,et al;《Proceedings of the 22nd International Conference on Computational Linguistics》;20080831;第993-1000页 *
Dynamic Topic Adaptation for SMT using Distributional Profiles;Eva Hasler,et al;《Proceedings of the Ninth Workshop on Statistical Machine Translation》;20140627;第445-456页 *
Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information;Jinsong Su,et al;《Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics》;20120714;第459-468页 *
Research on Cluster-Based Domain Adaptation for Statistical Machine Translation; Zhang Wenwen; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 20140315 (No. 03); pp. I138-1171 *
Research on Translation Model Domain Adaptation Based on Semantic Distribution Similarity; Yao Liang; Journal of Shandong University (Natural Science); 20160731; Vol. 51 (No. 7); pp. 43-50 *
Research on Domain Adaptation Methods for Statistical Machine Translation; Liu Hao; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 20170215 (No. 02); pp. I138-4484 *

Also Published As

Publication number Publication date
CN108228576A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108228576B (en) Text translation method and device
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110163181B (en) Sign language identification method and device
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN107229610A (en) The analysis method and device of a kind of affection data
CN106875940B (en) Machine self-learning construction knowledge graph training method based on neural network
CN107480143A (en) Dialogue topic dividing method and system based on context dependence
CN109062902B (en) Text semantic expression method and device
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN110222328B (en) Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium
CN110489554B (en) Attribute-level emotion classification method based on location-aware mutual attention network model
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN113392265A (en) Multimedia processing method, device and equipment
CN110750998A (en) Text output method and device, computer equipment and storage medium
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN110929532A (en) Data processing method, device, equipment and storage medium
CN115525740A (en) Method and device for generating dialogue response sentence, electronic equipment and storage medium
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN107943299B (en) Emotion presenting method and device, computer equipment and computer readable storage medium
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN112364666B (en) Text characterization method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant