CN108228576B - Text translation method and device - Google Patents


Publication number
CN108228576B
CN108228576B (application CN201711488585.6A)
Authority
CN
China
Prior art keywords
text
source text
translation
candidate target
clustering
Prior art date
Legal status
Active
Application number
CN201711488585.6A
Other languages
Chinese (zh)
Other versions
CN108228576A (en)
Inventor
黄宜鑫
孟廷
刘俊华
魏思
胡国平
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201711488585.6A
Publication of CN108228576A
Application granted
Publication of CN108228576B
Legal status: Active

Classifications

    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation (under G Physics; G06 Computing; Calculating or Counting; G06F Electric Digital Data Processing)
    • G06F16/353: Clustering; Classification into predefined classes (information retrieval of unstructured textual data)
    • G06F16/355: Class or cluster creation or modification (information retrieval of unstructured textual data)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a text translation method and device, belonging to the technical field of language processing. The method comprises the following steps: determining the cluster category to which a source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category; vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the word segments in the source text with that cluster category vector, inputting the integration result into a translation model, and outputting at least one candidate target text together with a translation score for each candidate target text; and selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text. The source text can thus be translated by combining its overall semantics with other latent translation reference factors, improving the domain robustness and translation accuracy of the translation model.

Description

Text translation method and device
Technical Field
The embodiment of the invention relates to the technical field of language processing, in particular to a text translation method and a text translation device.
Background
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Current work focuses on machine translation of the source text (the text in the source language) in combination with the user's application field, that is, the application field of the user's speech content is taken into account during machine translation. Application fields may include education, scientific research, the humanities, and so on. For a source text obtained through speech recognition, the related art provides the following two text translation methods:
The first is a text translation method at the corpus level: the application field of the source text is determined, the training corpora belonging to that field are selected, and a translation model is built on the selected corpora, so that the source text is translated with the resulting model.
The second is a text translation method at the model level: several translation models for different application fields are combined. For example, each translation model is weighted according to the correlation between the application field of the source text and the field of that model; all translation models are then combined according to their weights to produce a new mixed model, and the source text is translated with the mixed model.
Both methods require the application field of the source text to be determined in advance. In practice, however, that field can be difficult to determine, and the same vocabulary may belong to multiple application fields, so accurate translation is difficult.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a text translation method and apparatus that overcome the above problems or at least partially solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a text translation method, including:
determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
vectorizing the clustering category to which the source text belongs to obtain a clustering category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the clustering category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
According to the method provided by the embodiment of the invention, the cluster category to which the source text belongs is determined based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category. The cluster category is vectorized to obtain the cluster category vector corresponding to the source text; the word vectors of the word segments in the source text are integrated with that cluster category vector; the integration result is input into a translation model; and at least one candidate target text is output, each with a corresponding translation score. One candidate target text is then selected from all candidate target texts as the translation result of the source text based on the translation score of each. Because the cluster category to which the source text belongs is determined before translation, and the source text and its cluster category serve together as input parameters of the translation model, the translation process can draw on the overall semantics of the source text and other latent translation reference factors. The domain robustness and translation accuracy of the translation model are thereby improved.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the method further includes:
and averaging the word vectors of all word segments in the source text to obtain the feature vector of the source text.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, determining a cluster type to which a source text belongs based on a feature vector of the source text and a cluster center feature vector corresponding to each cluster type includes:
calculating the distance between the feature vector corresponding to the source text and each clustering center feature vector, determining the clustering center feature vector corresponding to the minimum distance in all the calculated distances, and taking the clustering center feature vector as a target clustering center feature vector;
and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
With reference to the first possible implementation manner of the first aspect, in a fourth possible implementation manner, selecting one candidate target text from all candidate target texts as a translation result of a source text based on a translation score of each candidate target text includes:
respectively inputting each candidate target text into a domain language model corresponding to the clustering category to which the source text belongs, and outputting a domain language model score of each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the score of the domain language model.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, selecting one candidate target text from all candidate target texts as a translation result of a source text according to a translation score of each candidate target text and a domain language model score includes:
and carrying out weighted summation on the translation score of each candidate target text and the score of the field language model to obtain the comprehensive score of each candidate target text, and selecting the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
With reference to the first possible implementation manner of the first aspect, in a sixth possible implementation manner, integrating word vectors of word segments in a source text with cluster category vectors corresponding to the source text includes:
adding the cluster category vector corresponding to the source text before the word vector of the first word segment in the source text; or,
concatenating the cluster category vector corresponding to the source text with the word vector of each word segment in the source text; or,
adding the cluster category vector corresponding to the source text before the word vector of the first word segment in the source text, and also concatenating the cluster category vector corresponding to the source text with the word vector of each word segment in the source text.
With reference to the first possible implementation manner of the first aspect, in a seventh possible implementation manner, the translation model is an encoding and decoding model, an encoding model in the translation model adopts a bidirectional recurrent neural network structure, and a decoding model in the translation model adopts a recurrent neural network structure; accordingly, inputting the integration result into the translation model, and outputting at least one candidate target text, including:
inputting the integration result into a translation model, and respectively obtaining a forward representation and a reverse representation of each word in the source text under the clustering category to which the source text belongs;
splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain a representation vector of each word in the source text;
and decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text.
According to a second aspect of embodiments of the present invention, there is provided a text translation apparatus including:
the determining module is used for determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
the translation module is used for vectorizing the cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the cluster category vector corresponding to the source text, inputting an integration result into the translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and the selection module is used for selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
According to a third aspect of embodiments of the present invention, there is provided a text translation apparatus including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the text translation method provided by any of the various possible implementations of the first aspect.
According to a fourth aspect of the present invention, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a text translation method as provided in any one of the various possible implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of embodiments of the invention.
Drawings
Fig. 1 is a schematic flow chart of a text translation method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another text translation method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another text translation method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a text translation method according to another embodiment of the present invention;
FIG. 5 is a block diagram of a text translation apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of a text translation apparatus according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the drawings and examples. The following examples are intended to illustrate the embodiments of the present invention, not to limit their scope.
At present, the related art mainly translates the source text in combination with its application field. Application fields may be divided by application scenario into scientific research, the humanities, education, and the like. Since the application field of a source text may be difficult to determine during actual translation (the source text may relate to several application fields at once, making it hard to settle on a specific one), accurate translation is difficult. Moreover, a word segment in the source text may belong to multiple application fields, with different translations in each: for example, the word "china" is translated as the country China in the news field but as porcelain in the antiques field, which further increases the difficulty of accurate translation.
In addition to the application field, other common latent information in the source text, such as subject, genre, and writing style, may also serve as reference factors for translation. In view of this, an embodiment of the present invention provides a text translation method. The method is suitable for a speech translation scenario in which a source speech signal is translated into a target text, and also for a general translation scenario in which text in one language is translated into text in another language; the embodiment of the present invention does not specifically limit this. Referring to fig. 1, the method includes:
101. determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category; each cluster category corresponds to one cluster center feature vector, and the cluster categories and their cluster center feature vectors are determined by clustering the feature vectors of the training source texts;
102. vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vector of each word segment in the source text with that cluster category vector, inputting the integration result into a translation model, and outputting at least one candidate target text together with a translation score for each;
103. selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text.
Clustering into categories mainly classifies the common latent information that may be useful in translation, so that this information can be exploited during translation and the translation result becomes more accurate. Before step 101 is performed, the cluster categories and the corresponding cluster center feature vectors can be determined. Specifically, a large training corpus covering many fields and many kinds of common latent information can be clustered without supervision using the KMeans algorithm, yielding the different cluster categories and the cluster center feature vector corresponding to each. Of course, other clustering algorithms may also be adopted in actual implementation, which is not specifically limited in the embodiment of the present invention.
Taking the KMeans algorithm as an example: to perform unsupervised clustering, the feature vector of each training source text in the corpus is first computed. To do so, word2vec can be trained on the data set formed by the training source texts; after training, a word vector is obtained for each word segment of each training source text. Averaging the word vectors of all word segments in a training source text then yields that text's feature vector.
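As a minimal sketch of the averaging step described above (the helper name and toy vectors are assumptions; the patent does not prescribe an implementation), the feature vector of a text is simply the element-wise mean of its word vectors:

```python
import numpy as np

def sentence_feature_vector(word_vectors):
    """Average the word vectors of all word segments in a text to
    obtain the text-level feature vector."""
    return np.mean(np.asarray(word_vectors, dtype=float), axis=0)

# Three 4-dimensional word vectors for one training source text.
vecs = [[1.0, 0.0, 2.0, 0.0],
        [3.0, 2.0, 0.0, 0.0],
        [2.0, 4.0, 1.0, 0.0]]
print(sentence_feature_vector(vecs))  # [2. 2. 1. 0.]
```

The same helper can later compute the feature vector of a source text to be translated.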
Clustering is performed on the feature vectors of the training source texts. After clustering, the cluster center feature vectors (dv1, dv2, dv3, ..., dvK) are obtained, and the cluster categories are denoted {d1, d2, d3, ..., dK}, where K is the total number of cluster categories. For example, d1 denotes the first cluster category and dv1 the cluster center feature vector corresponding to it; dK denotes the K-th cluster category and dvK the cluster center feature vector corresponding to it.
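The unsupervised clustering step can be sketched as a bare-bones KMeans in NumPy; the iteration count and seeding are assumptions for this sketch, not the patent's exact procedure (in practice a library implementation such as scikit-learn's would normally be used):

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Bare-bones KMeans: returns the cluster center feature vectors
    (dv1, ..., dvK) and the cluster label of each training text."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every feature vector to its nearest center (Euclidean).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of the vectors assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers, labels

# Feature vectors of four training source texts, two per latent topic.
feats = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
centers, labels = kmeans(feats, k=2)
```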
After the clustering process is completed, the feature vector of the source text to be translated can be obtained. The embodiment of the present invention does not specifically limit the manner of obtaining this feature vector, which includes but is not limited to: averaging the word vectors of all word segments in the source text to obtain the feature vector of the source text.
After the feature vector of the source text is obtained, the cluster category to which the source text belongs can be determined based on that feature vector and the cluster center feature vector corresponding to each cluster category. As noted above, the cluster categories are denoted {d1, d2, d3, ..., dK}; that is, each cluster category corresponds to an identifier. So that the source text can subsequently be translated on the basis of its cluster category, the cluster category is vectorized to obtain the cluster category vector corresponding to the source text. The cluster category may be vectorized by table lookup or by word2vec, which is not specifically limited in this embodiment of the present invention.
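The table-lookup vectorization mentioned above can be sketched as an embedding table indexed by the cluster category identifier; the table size, vector dimension, and random initialization here are illustrative assumptions (in practice the rows would be learned jointly with the translation model):

```python
import numpy as np

K, dim = 4, 8  # number of cluster categories and category-vector size (assumed)
rng = np.random.default_rng(0)
# Lookup table: row j holds the cluster category vector for category d_(j+1).
category_table = rng.normal(size=(K, dim))

def category_vector(category_id):
    """Table-lookup vectorization of a 1-based cluster category identifier."""
    return category_table[category_id - 1]

dk = category_vector(2)
```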
After the clustering category vector corresponding to the source text is obtained, the word vector of the word segmentation in the source text and the clustering category vector corresponding to the source text can be integrated, the integrated result is used as input to a translation model, at least one candidate target text is output, and a translation score corresponding to each candidate target text is output at the same time. The translation model may be obtained by training an initial model based on training source texts and training target texts of different clustering categories, and the initial model may be a Recurrent Neural Networks (RNN) type or the like, which is not specifically limited in this embodiment of the present invention.
After the candidate target texts and the corresponding translation scores are obtained, one candidate target text can be selected from all the candidate target texts as a translation result of the source text based on the translation score of each candidate target text. In the specific selection, the candidate target text with the highest translation score may be selected as the target text after the source text is translated, which is not specifically limited in the embodiment of the present invention.
According to the method provided by the embodiment of the invention, the cluster category to which the source text belongs is determined based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category. The cluster category is vectorized to obtain the cluster category vector corresponding to the source text; the word vectors of the word segments in the source text are integrated with that cluster category vector; the integration result is input into a translation model; and at least one candidate target text is output, each with a corresponding translation score. One candidate target text is then selected from all candidate target texts as the translation result of the source text based on the translation score of each. Because the cluster category to which the source text belongs is determined before translation, and the source text and its cluster category serve together as input parameters of the translation model, the translation process can draw on the overall semantics of the source text and other latent translation reference factors, improving the domain robustness and translation accuracy of the translation model.
Based on the content of the above embodiment, as an optional embodiment, the embodiment of the present invention further provides a method for determining a cluster category to which the source text belongs. Referring to fig. 2, the method includes: 1011. calculating the distance between the feature vector corresponding to the source text and each clustering center feature vector, determining the clustering center feature vector corresponding to the minimum distance in all the calculated distances, and taking the clustering center feature vector as a target clustering center feature vector; 1012. and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
In step 1011 above, when the distance between the feature vector corresponding to the source text and each cluster center feature vector is calculated, the euclidean distance between the feature vector corresponding to the source text and each cluster center feature vector may be calculated. For the cluster center feature vector corresponding to the minimum distance, the cluster category corresponding to the cluster center feature vector can be used as the cluster category to which the source text belongs.
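Steps 1011 and 1012 can be sketched as a nearest-center search under Euclidean distance (function and variable names are hypothetical):

```python
import numpy as np

def assign_cluster(source_fv, center_fvs):
    """Return the 1-based cluster category whose center feature vector is
    nearest (Euclidean distance) to the source-text feature vector."""
    dists = np.linalg.norm(np.asarray(center_fvs, dtype=float)
                           - np.asarray(source_fv, dtype=float), axis=1)
    return int(np.argmin(dists)) + 1

centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 9.0]]
print(assign_cluster([4.2, 4.8], centers))  # → 2 (closest to center (5, 5))
```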
Based on the content of the foregoing embodiment, in view of that the matching degree between the candidate target texts and the cluster category to which the source text belongs may not be high enough, and in order to avoid the situation that the translation result is inaccurate, as an optional embodiment, the embodiment of the present invention further provides a method for selecting one candidate target text from all candidate target texts as the translation result of the source text. Referring to fig. 3, the method includes: 1031. respectively inputting each candidate target text into a domain language model corresponding to the clustering category to which the source text belongs, and outputting a domain language model score of each candidate target text; 1032. and selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the score of the domain language model.
In the above step 1031, the cluster category to which the source text belongs may be taken as the cluster category to which the target text belongs. A large number of target texts under the cluster category, namely current domain target texts, can be selected according to the cluster category to which the target texts belong, and a domain language model can be constructed by utilizing the current domain target texts under the cluster category. The construction method is the same as the construction method of the existing language model, and the embodiment of the present invention is not particularly limited in this respect. After the domain language model is obtained, the domain language model score of each candidate target text can be calculated through the domain language model. The higher the score of the domain language model is, the higher the accuracy of the candidate target text corresponding to the score of the domain language model as a translation result is.
After the domain language model score of each candidate target text is obtained, one candidate target text can be selected as a translation result of the source text according to the translation score of each candidate target text and the domain language model score. The embodiment of the present invention does not specifically limit the manner of selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text, and includes but is not limited to: and carrying out weighted summation on the translation score of each candidate target text and the score of the field language model to obtain the comprehensive score of each candidate target text, and selecting the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
The weighted summation may be a linear fusion or a nonlinear fusion, which is not specifically limited in this embodiment of the present invention. For linear fusion, the composite score can be calculated by the following formula:
Sf = λ·Strans + (1 − λ)·Slm
In the above formula, for any candidate target text, Sf denotes the composite score of that candidate target text, Strans denotes its translation score, Slm denotes its domain language model score, and λ denotes the weight of the translation score. The value of λ may be determined in advance according to application requirements.
Based on the content of the foregoing embodiment, as an optional embodiment, an embodiment of the present invention further provides a method for integrating word vectors of word segments in a source text with cluster category vectors corresponding to the source text, where the method includes:
adding a clustering category vector corresponding to the source text before a word vector of a first word segmentation in the source text; or, respectively splicing the clustering category vector corresponding to the source text with the word vector of each participle in the source text; or adding a clustering category vector corresponding to the source text before the word vector of the first word segmentation in the source text, and splicing the clustering category vector corresponding to the source text with the word vector of each word segmentation in the source text respectively.
Take the word vectors of all the word segments in the source text as x = (x_1, x_2, x_3, ..., x_m) and the cluster category vector corresponding to the source text as d_k as an example. After integrating the word vectors of the word segments in the source text with the cluster category vector corresponding to the source text, the integration result corresponding to the first integration mode is (d_k, x_1, x_2, x_3, ..., x_m); the integration result corresponding to the second integration mode is (d_k⊕x_1, d_k⊕x_2, d_k⊕x_3, ..., d_k⊕x_m), where ⊕ denotes the splicing (concatenation) of two vectors; and the integration result corresponding to the third integration mode is (d_k, d_k⊕x_1, d_k⊕x_2, d_k⊕x_3, ..., d_k⊕x_m). In the third integration mode, d_k⊕x_1 denotes the result of splicing the cluster category vector corresponding to the source text with the word vector of the first word segment in the source text, d_k⊕x_2 denotes the result of splicing the cluster category vector corresponding to the source text with the word vector of the second word segment in the source text, and so on.
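The three integration modes can be sketched as follows. This is a hypothetical illustration using NumPy; the function name, the mode numbering, and the list-of-vectors return layout are assumptions, since the patent does not prescribe an implementation:

```python
import numpy as np

def integrate(word_vecs, d_k, mode):
    """Integrate word vectors x_1..x_m with cluster category vector d_k.

    word_vecs: list of 1-D arrays (one per word segment)
    d_k: 1-D cluster category vector
    Returns the integrated input sequence as a list of 1-D vectors.
    """
    if mode == 1:  # prepend d_k before the first word vector
        return [d_k] + list(word_vecs)
    if mode == 2:  # splice (concatenate) d_k onto each word vector
        return [np.concatenate([d_k, x]) for x in word_vecs]
    if mode == 3:  # both: prepend d_k, and splice d_k onto each word vector
        return [d_k] + [np.concatenate([d_k, x]) for x in word_vecs]
    raise ValueError("mode must be 1, 2 or 3")
```

Note that mode 2 and mode 3 change the per-position input dimension seen by the encoder, while mode 1 only lengthens the input sequence by one position.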
In the above embodiments, the source text is translated mainly according to the cluster category to which it belongs, that is, mainly from the perspective of semantic analysis; in an actual translation process, context information usually also needs to be combined. Based on the content of the above embodiments, as an optional embodiment, the embodiment of the present invention further provides a method for translating the integration result of the word vectors of the word segments in the source text and the cluster category vector corresponding to the source text into candidate target texts. Specifically, the translation model used in the translation process may be an encoder-decoder model, in which the encoding model adopts a bidirectional recurrent neural network structure and the decoding model adopts a recurrent neural network structure.
Accordingly, referring to fig. 4, the method comprises:
1021. inputting the integration result into the translation model, and respectively obtaining a forward representation and a reverse representation of each word segment in the source text under the cluster category to which the source text belongs;
1022. splicing the forward representation and the reverse representation of each word segment under the cluster category to which the source text belongs to obtain a characterization vector of each word segment in the source text;
1023. decoding the source text based on the characterization vector of each word segment in the source text to obtain at least one candidate target text.
Specifically, for the cluster category to which the source text belongs, the forward characterization f_i of each word segment, which sees the historical vocabulary information under the cluster category, can be obtained through the forward recurrent neural network in the bidirectional recurrent neural network structure; the reverse characterization b_i of each word segment, which sees the future vocabulary information under the cluster category, can be obtained through the reverse recurrent neural network in the bidirectional recurrent neural network structure. Finally, the two are spliced to form the characterization vector h_i of each word segment in the source text. On this basis, the characterization vector h_i of each word segment in the source text is input into the recurrent neural network, and at least one candidate target text is output. At the same time, a translation score of each candidate target text may also be output.
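A minimal sketch of the bidirectional encoding step, assuming plain tanh recurrent cells and hypothetical weight-matrix names (the patent does not specify the cell type or parameterization):

```python
import numpy as np

def birnn_encode(inputs, W_f, U_f, W_b, U_b):
    """Compute characterization vectors h_i = [f_i ; b_i] for a sequence.

    inputs: list of 1-D integrated input vectors
    W_f, U_f: input and recurrent weights of the forward RNN (assumed names)
    W_b, U_b: input and recurrent weights of the reverse RNN (assumed names)
    """
    # Forward pass: f_i sees the historical vocabulary x_1..x_i
    h = np.zeros(U_f.shape[0])
    fwd = []
    for x in inputs:
        h = np.tanh(W_f @ x + U_f @ h)
        fwd.append(h)
    # Reverse pass: b_i sees the future vocabulary x_i..x_m
    h = np.zeros(U_b.shape[0])
    bwd = []
    for x in reversed(inputs):
        h = np.tanh(W_b @ x + U_b @ h)
        bwd.append(h)
    bwd.reverse()
    # Splice forward and reverse characterizations into h_i
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

The resulting h_i would then be consumed by the decoding recurrent network to produce candidate target texts; that decoding step is omitted here.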
According to the method provided by the embodiment of the invention, the integration result is input into the translation model, and the forward representation and the reverse representation of each word segment in the source text under the cluster category to which the source text belongs are respectively obtained; the forward representation and the reverse representation of each word segment are spliced to obtain the characterization vector of each word segment in the source text; and the source text is decoded based on these characterization vectors to obtain at least one candidate target text. In this way, the source text is translated not only from the perspective of semantic analysis but also, on the premise of the cluster category to which it belongs, in combination with context information, so that the accuracy of text translation is further improved.
It should be noted that, all the above-mentioned alternative embodiments may be combined arbitrarily to form alternative embodiments of the present invention, and are not described in detail herein.
Based on the content of the foregoing embodiments, an embodiment of the present invention provides a text translation apparatus, where the text translation apparatus is configured to execute the text translation method in the foregoing method embodiments. Referring to fig. 5, the apparatus includes:
a determining module 501, configured to determine a cluster type to which a source text belongs based on a feature vector of the source text and a cluster center feature vector corresponding to each cluster type; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
the translation module 502 is configured to perform vectorization on a cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrate word vectors of word segments in the source text with the cluster category vector corresponding to the source text, input an integration result into the translation model, and output at least one candidate target text and a translation score corresponding to each candidate target text;
a selecting module 503, configured to select one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
As an alternative embodiment, the apparatus further comprises:
and the calculation module is used for averaging the word vectors of all the participles in the source text to obtain the feature vector of the source text.
As an optional embodiment, the determining module 501 is configured to calculate a distance between a feature vector corresponding to a source text and each cluster center feature vector, determine a cluster center feature vector corresponding to a minimum distance among all the calculated distances, and use the cluster center feature vector as a target cluster center feature vector; and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
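The feature-vector averaging and nearest-cluster-center assignment performed by the calculation module and the determining module 501 can be sketched as follows. This is a minimal illustration assuming Euclidean distance, which the patent leaves unspecified; the function and variable names are hypothetical:

```python
import numpy as np

def assign_cluster(word_vecs, centers):
    """Return the index of the cluster category the source text belongs to.

    word_vecs: (m, dim) array of word vectors of the source text's word segments
    centers: (K, dim) array of cluster center feature vectors
    """
    # Feature vector of the source text: average of its word vectors
    feat = np.mean(word_vecs, axis=0)
    # Distance from the feature vector to each cluster center feature vector
    dists = np.linalg.norm(centers - feat, axis=1)
    # The cluster center with the minimum distance is the target cluster center
    return int(np.argmin(dists))
```

The returned index identifies both the cluster category vector d_k used during translation and the domain language model used when rescoring candidates.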
As an alternative embodiment, the selecting module 503 includes:
the computing unit is used for respectively inputting each candidate target text into the domain language model corresponding to the clustering category to which the source text belongs and outputting the domain language model score of each candidate target text;
and the selecting unit is used for selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the domain language model score.
As an optional embodiment, the selecting unit is configured to perform weighted summation on the translation score and the domain language model score of each candidate target text to obtain a comprehensive score of each candidate target text, and select the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
As an alternative embodiment, the translation module 502 is configured to add a cluster category vector corresponding to a source text before a word vector of a first word segmentation in the source text; or, respectively splicing the clustering category vector corresponding to the source text with the word vector of each participle in the source text; or adding a clustering category vector corresponding to the source text before the word vector of the first word segmentation in the source text, and splicing the clustering category vector corresponding to the source text with the word vector of each word segmentation in the source text respectively.
As an alternative embodiment, the translation model includes a bidirectional recurrent neural network and a recurrent neural network, the bidirectional recurrent neural network includes a forward recurrent neural network and a reverse recurrent neural network; correspondingly, the translation module 502 is configured to input the integration result into a translation model, and obtain a forward characterization and a reverse characterization of each participle in the source text under the cluster category to which the source text belongs; splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain a representation vector of each word in the source text; and decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text.
The device provided by the embodiment of the invention determines the cluster category to which the source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster category; vectorizes the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text; integrates the word vectors of the word segments in the source text with the cluster category vector corresponding to the source text; inputs the integration result into the translation model, and outputs at least one candidate target text together with a translation score corresponding to each candidate target text; and selects one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text. Since the cluster category to which the source text belongs is determined before translation, and the source text and its cluster category together serve as input parameters of the translation model, the translation process can translate the source text in combination with the overall semantics of the source text and other implicit translation reference elements. Therefore, the domain robustness and the translation accuracy of the translation model are improved.
In addition, the integration result is input into the translation model, and the forward representation and the reverse representation of each word in the source text under the cluster type of the source text are respectively obtained. And splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain the representation vector of each word in the source text. And decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text. The source text can be translated in the aspect of semantic analysis, and the source text can be translated by combining the context information on the premise of the cluster type of the source text, so that the accuracy of text translation is further improved.
The embodiment of the invention provides text translation equipment. Referring to fig. 6, the apparatus includes: a processor (processor)601, a memory (memory)602, and a bus 603;
the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is used for calling the program instructions in the memory 602 to execute the text translation method provided by the above embodiment, for example, including: determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts; vectorizing the clustering category to which the source text belongs to obtain a clustering category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the clustering category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions cause a computer to execute a text translation method provided in the foregoing embodiment, for example, the method includes:
determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts; vectorizing the clustering category to which the source text belongs to obtain a clustering category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the clustering category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the text translation apparatus and the like are merely illustrative, where units illustrated as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the various embodiments or some parts of the methods of the embodiments.
Finally, it should be noted that the above embodiments are merely preferred embodiments of the present application and are not intended to limit the scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A method of text translation, comprising:
determining a clustering category to which a source text belongs based on a feature vector of the source text and a clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
vectorizing the cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the cluster category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text based on the translation score of each candidate target text.
2. The method of claim 1, further comprising:
and averaging word vectors of all the participles in the source text to obtain the feature vector of the source text.
3. The method according to claim 1, wherein the determining the cluster type to which the source text belongs based on the feature vector of the source text and the cluster center feature vector corresponding to each cluster type comprises:
calculating the distance between the feature vector corresponding to the source text and each clustering center feature vector, determining the clustering center feature vector corresponding to the minimum distance in all the calculated distances, and taking the clustering center feature vector as a target clustering center feature vector;
and taking the cluster category corresponding to the target cluster center feature vector as the cluster category to which the source text belongs.
4. The method of claim 1, wherein the selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text comprises:
inputting each candidate target text into a domain language model corresponding to the clustering category to which the source text belongs respectively, and outputting a domain language model score of each candidate target text;
and selecting one candidate target text from all candidate target texts as a translation result of the source text according to the translation score of each candidate target text and the score of the domain language model.
5. The method of claim 4, wherein selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text comprises:
and carrying out weighted summation on the translation score of each candidate target text and the score of the field language model to obtain the comprehensive score of each candidate target text, and selecting the candidate target text corresponding to the maximum comprehensive score from all the comprehensive scores as the translation result of the source text.
6. The method of claim 1, wherein the integrating the word vector of the participle in the source text with the cluster category vector corresponding to the source text comprises:
adding a clustering category vector corresponding to the source text before a word vector of a first word segmentation in the source text; alternatively,
respectively splicing the clustering category vector corresponding to the source text with the word vector of each participle in the source text; alternatively,
adding a clustering category vector corresponding to the source text before a word vector of a first word segmentation in the source text, and splicing the clustering category vector corresponding to the source text with a word vector of each word segmentation in the source text respectively.
7. The method according to claim 1, wherein the translation model is a coding/decoding model, a coding model in the translation model adopts a bidirectional recurrent neural network structure, and a decoding model in the translation model adopts a recurrent neural network structure; accordingly, the inputting the integration result into the translation model and the outputting at least one candidate target text comprises:
inputting the integration result into the translation model, and respectively obtaining a forward representation and a reverse representation of each word in the source text under the clustering category to which the source text belongs;
splicing the forward representation and the reverse representation of each word under the clustering category to which the source text belongs to obtain a representation vector of each word in the source text;
and decoding the source text based on the characterization vector of each word segmentation in the source text to obtain at least one candidate target text.
8. A text translation apparatus, comprising:
the determining module is used for determining the clustering category to which the source text belongs based on the feature vector of the source text and the clustering center feature vector corresponding to each clustering category; each clustering category corresponds to one clustering center feature vector, and each clustering category and the clustering center feature vector corresponding to each clustering category are determined after clustering the feature vectors of the training source texts;
the translation module is used for vectorizing the cluster category to which the source text belongs to obtain a cluster category vector corresponding to the source text, integrating word vectors of word segmentation in the source text with the cluster category vector corresponding to the source text, inputting an integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
and the selection module is used for selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text.
9. A computer device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
CN201711488585.6A 2017-12-29 2017-12-29 Text translation method and device Active CN108228576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711488585.6A CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711488585.6A CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Publications (2)

Publication Number Publication Date
CN108228576A CN108228576A (en) 2018-06-29
CN108228576B true CN108228576B (en) 2021-07-02

Family

ID=62647444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711488585.6A Active CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Country Status (1)

Country Link
CN (1) CN108228576B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598002A (en) * 2018-11-15 2019-04-09 重庆邮电大学 Neural machine translation method and system based on bidirectional circulating neural network
CN109902309B (en) * 2018-12-17 2023-06-02 北京百度网讯科技有限公司 Translation method, device, equipment and storage medium
CN111428518B (en) * 2019-01-09 2023-11-21 科大讯飞股份有限公司 Low-frequency word translation method and device
CN109885811B (en) * 2019-01-10 2024-05-14 平安科技(深圳)有限公司 Article style conversion method, apparatus, computer device and storage medium
CN111949789A (en) * 2019-05-16 2020-11-17 北京京东尚科信息技术有限公司 Text classification method and text classification system
CN110211570B (en) * 2019-05-20 2021-06-25 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN111460264B (en) * 2020-03-30 2023-08-01 口口相传(北京)网络技术有限公司 Training method and device for semantic similarity matching model
CN111597826B (en) * 2020-05-15 2021-10-01 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN85101759A (en) * 1985-04-01 1987-01-10 株式会社日立制作所 Interpretation method
WO2006014343A2 (en) * 2004-07-02 2006-02-09 Text-Tech, Llc Automated evaluation systems and methods
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN104090870A (en) * 2014-06-26 2014-10-08 武汉传神信息技术有限公司 Pushing method of online translation engines
CN104516870A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Translation check method and system
CN105528342A (en) * 2015-12-29 2016-04-27 科大讯飞股份有限公司 Intelligent translation method and system in input method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572631B (en) * 2014-12-03 2018-04-13 北京捷通华声语音技术有限公司 The training method and system of a kind of language model
RU2601166C2 (en) * 2015-03-19 2016-10-27 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Anaphora resolution based on a deep analysis technology
CN105786798B (en) * 2016-02-25 2018-11-02 上海交通大学 Natural language is intended to understanding method in a kind of human-computer interaction
US10460038B2 (en) * 2016-06-24 2019-10-29 Facebook, Inc. Target phrase classifier


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Domain Adaptation for Statistical Machine Translation with Domain Dictionary and Monolingual Corpora;Hua Wu,et al;《Proceedings of the 22nd International Conference on Computational Linguistics》;20080831;第993-1000页 *
Dynamic Topic Adaptation for SMT using Distributional Profiles;Eva Hasler,et al;《Proceedings of the Ninth Workshop on Statistical Machine Translation》;20140627;第445-456页 *
Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information;Jinsong Su,et al;《Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics》;20120714;第459-468页 *
Research on Cluster-Based Domain Adaptation for Statistical Machine Translation; Zhang Wenwen; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 20140315 (No. 03); pp. I138-1171 *
Research on Translation Model Domain Adaptation Based on Semantic Distribution Similarity; Yao Liang; Journal of Shandong University (Natural Science); 20160731; Vol. 51 (No. 7); pp. 43-50 *
Research on Domain Adaptation Methods for Statistical Machine Translation; Liu Hao; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 20170215 (No. 02); pp. I138-4484 *

Also Published As

Publication number Publication date
CN108228576A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108228576B (en) Text translation method and device
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110163181B (en) Sign language identification method and device
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN107229610A (en) The analysis method and device of a kind of affection data
CN106875940B (en) Machine self-learning construction knowledge graph training method based on neural network
CN107480143A (en) Dialogue topic dividing method and system based on context dependence
CN109062902B (en) Text semantic expression method and device
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN110222328B (en) Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium
CN110489554B (en) Attribute-level emotion classification method based on location-aware mutual attention network model
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN113392265A (en) Multimedia processing method, device and equipment
CN110750998A (en) Text output method and device, computer equipment and storage medium
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN110929532A (en) Data processing method, device, equipment and storage medium
CN115525740A (en) Method and device for generating dialogue response sentence, electronic equipment and storage medium
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN107943299B (en) Emotion presenting method and device, computer equipment and computer readable storage medium
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN112364666B (en) Text characterization method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant