CN108228576A - Text translation method and device - Google Patents

Text translation method and device

Info

Publication number
CN108228576A
CN108228576A (application CN201711488585.6A; granted publication CN108228576B)
Authority
CN
China
Prior art keywords
text
source text
cluster
translation
candidate target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711488585.6A
Other languages
Chinese (zh)
Other versions
CN108228576B (en)
Inventor
黄宜鑫
孟廷
刘俊华
魏思
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201711488585.6A
Publication of CN108228576A
Application granted
Publication of CN108228576B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a text translation method and device, belonging to the field of language processing technology. The method includes: determining the cluster category of a source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category; vectorizing the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; and, based on the translation score of each candidate target text, selecting one candidate target text from all candidate target texts as the translation result of the source text. Because the translation process can draw on the overall semantics of the source text and other hidden translation reference features when translating the source text, the domain robustness and translation accuracy of the translation model are improved.

Description

Text translation method and device
Technical field
Embodiments of the present invention relate to the field of language processing technology, and in particular to a text translation method and device.
Background technology
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Current approaches emphasize using the user's application domain as a reference when machine-translating the source text (the text in the source language); that is, the domain of the user's speech content is taken into account during translation. Application domains may include, for example, education, scientific research, and the humanities. For a source text obtained through speech recognition, the related art provides the following two text translation methods:
The first is a corpus-level text translation method: the application domain of the source text is determined first, the training corpora belonging to that application domain are selected, and a translation model is built on the selected training corpora, so that the source text can be translated with the resulting translation model.
The second is a model-level text translation method: translation models for multiple different application domains are combined. For example, a weight is assigned to each translation model according to the correlation between the application domain of the source text and that of each translation model, all translation models are combined according to these weights into a new mixed model, and the new mixed model is used to translate the source text.
Since both methods require the application domain of the source text to be determined in advance, while in actual translation that domain may be difficult to determine, and the same word may belong to multiple application domains, accurate translation is difficult to achieve.
Summary of the invention
To solve the above problems, embodiments of the present invention provide a text translation method and device that overcome, or at least partly solve, the above problems.
According to a first aspect of the embodiments of the present invention, a text translation method is provided, the method including:
determining the cluster category of a source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster-center feature vector, and the cluster categories and their corresponding cluster-center feature vectors are determined by clustering the feature vectors of training source texts;
vectorizing the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
based on the translation score of each candidate target text, selecting one candidate target text from all candidate target texts as the translation result of the source text.
In the method provided by the embodiments of the present invention, the cluster category of the source text is determined based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category. The cluster category of the source text is vectorized to obtain a cluster category vector corresponding to the source text, the word vectors of the segmented words in the source text are integrated with the cluster category vector corresponding to the source text, the integration result is input into a translation model, and at least one candidate target text is output, each candidate target text corresponding to a translation score. Based on the translation score of each candidate target text, one candidate target text is selected from all candidate target texts as the translation result of the source text. Because the cluster category of the source text can be determined before translation, and the source text and its cluster category can be fed into the translation model together as input parameters, the translation process can draw on the overall semantics of the source text and other hidden translation factors when translating the source text. This improves the domain robustness and translation accuracy of the translation model.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the method further includes:
averaging the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
With reference to the first possible implementation of the first aspect, in a third possible implementation, determining the cluster category of the source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category includes:
calculating the distance between the feature vector corresponding to the source text and each cluster-center feature vector, determining the cluster-center feature vector corresponding to the minimum of all calculated distances, and taking it as the target cluster-center feature vector;
taking the cluster category corresponding to the target cluster-center feature vector as the cluster category of the source text.
With reference to the first possible implementation of the first aspect, in a fourth possible implementation, selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text includes:
inputting each candidate target text into the domain language model corresponding to the cluster category of the source text, and outputting the domain language model score of each candidate target text;
selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text includes:
computing a weighted sum of the translation score and the domain language model score of each candidate target text to obtain a comprehensive score for each candidate target text, and selecting from all comprehensive scores the candidate target text corresponding to the maximum comprehensive score as the translation result of the source text.
With reference to the first possible implementation of the first aspect, in a sixth possible implementation, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text includes:
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or,
concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or,
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text.
With reference to the first possible implementation of the first aspect, in a seventh possible implementation, the translation model is an encoder-decoder model, in which the encoder uses a bidirectional recurrent neural network structure and the decoder uses a recurrent neural network structure; correspondingly, inputting the integration result into the translation model and outputting at least one candidate target text includes:
inputting the integration result into the translation model to obtain, for each segmented word in the source text, its forward characterization and backward characterization under the cluster category of the source text;
concatenating the forward characterization and backward characterization of each segmented word under the cluster category of the source text to obtain a characterization vector for each segmented word in the source text;
decoding the source text based on the characterization vector of each segmented word in the source text to obtain at least one candidate target text.
According to a second aspect of the embodiments of the present invention, a text translation device is provided, the device including:
a determining module, configured to determine the cluster category of a source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster-center feature vector, and the cluster categories and their corresponding cluster-center feature vectors are determined by clustering the feature vectors of training source texts;
a translation module, configured to vectorize the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrate the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, input the integration result into a translation model, and output at least one candidate target text and a translation score corresponding to each candidate target text;
a selection module, configured to select, based on the translation score of each candidate target text, one candidate target text from all candidate target texts as the translation result of the source text.
According to a third aspect of the embodiments of the present invention, a text translation apparatus is provided, including:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and by calling the program instructions the processor is able to perform the text translation method provided by any possible implementation among the various possible implementations of the first aspect.
According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided, the non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the text translation method provided by any possible implementation among the various possible implementations of the first aspect.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory, and do not limit the embodiments of the present invention.
Description of the drawings
Fig. 1 is a schematic flowchart of a text translation method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another text translation method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another text translation method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another text translation method according to an embodiment of the present invention;
Fig. 5 is a block diagram of a text translation device according to an embodiment of the present invention;
Fig. 6 is a block diagram of a text translation apparatus according to an embodiment of the present invention.
Specific embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the embodiments of the present invention, but not to limit their scope.
In the related art, the source text is translated mainly in combination with its application domain, which may be divided by application scenario into scientific research, the humanities, education, and so on. Since the application domain of the source text may be difficult to determine during actual translation, and the source text may relate to multiple application domains, it is difficult to decide which application domain specifically applies, so accurate translation is difficult. At the same time, a segmented word in the source text may also belong to multiple application domains and have different translations in different domains; for example, the word "china" is translated as "China" in the news domain but as "porcelain" in the antiques domain, which further increases the difficulty of accurate translation.
Considering that, in addition to the application domain, some common hidden information in the source text, such as topic, genre, and writing style, can also serve as hidden translation reference features, the embodiments of the present invention provide a text translation method for the above situation. The method is applicable to the speech translation scenario in which a source speech signal is translated into a target text, and also to the ordinary translation scenario in which a text in one language is translated into a text in another language; the embodiments of the present invention do not specifically limit this. Referring to Fig. 1, the method includes: 101, determining the cluster category of the source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category, wherein each cluster category corresponds to one cluster-center feature vector, and the cluster categories and their corresponding cluster-center feature vectors are determined by clustering the feature vectors of training source texts; 102, vectorizing the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrating the word vector of each segmented word in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; 103, based on the translation score of each candidate target text, selecting one candidate target text from all candidate target texts as the translation result of the source text.
Partitioning into cluster categories mainly classifies the common hidden information that may be used in translation, so that this common hidden information can be exploited during the translation process and the translation result becomes more accurate. Before the above step 101 is performed, the cluster categories and the corresponding cluster-center feature vectors may be determined. Specifically, unsupervised clustering may be performed with the KMeans algorithm on a large training corpus covering various domains and containing all kinds of common hidden information, thereby determining the different cluster categories and the cluster-center feature vector corresponding to each cluster category. Of course, other clustering algorithms may also be used in actual implementation; the embodiments of the present invention do not specifically limit this.
Taking the KMeans algorithm as an example, to perform unsupervised clustering, the feature vector of each training source text in the training corpus may first be computed. When computing these feature vectors, the word2vec technique may be used to train on the data set formed by the training source texts in the corpus; after training, the word vector of each segmented word in each training source text is obtained. Averaging the word vectors of all segmented words in a training source text yields the feature vector of that training source text.
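The averaging step just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy 3-dimensional embeddings stand in for word vectors that would normally come from a word2vec model trained on the corpus, and the words themselves are invented.

```python
def text_feature_vector(words, embeddings):
    """Average the word vectors of all segmented words in a text."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy stand-ins for word2vec vectors (assumed values, 3 dimensions).
embeddings = {
    "machine": [0.2, 0.4, 0.0],
    "translation": [0.6, 0.0, 0.2],
}

fv = text_feature_vector(["machine", "translation"], embeddings)
print(fv)  # approximately [0.4, 0.2, 0.1]
```

The same function serves both for training source texts (to feed the clustering step) and, later, for the source text to be translated.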
Clustering is then performed on the feature vectors of the training source texts; after clustering, the cluster-center feature vector of each cluster category is obtained, denoted (dv1, dv2, dv3, ..., dvK), and the cluster categories may be denoted {d1, d2, d3, ..., dK}, where K is the total number of cluster categories. For example, d1 denotes the first cluster category and dv1 the cluster-center feature vector corresponding to the first cluster category; dK denotes the K-th cluster category and dvK the cluster-center feature vector corresponding to the K-th cluster category.
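The clustering step can be sketched with a minimal KMeans over the training-text feature vectors. This is an illustrative toy, not the patented system: the naive initialization (first K points), the fixed iteration count, and the 2-dimensional data are all assumptions.

```python
def kmeans(points, k, iters=20):
    """Group feature vectors into k cluster categories; return the centers."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # Recompute each center as the mean of its group.
        for j, g in enumerate(groups):
            if g:
                centers[j] = [sum(v[i] for v in g) / len(g) for i in range(len(g[0]))]
    return centers

# Toy 2-d feature vectors of four "training source texts" (assumed values).
feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
centers = kmeans(feats, k=2)
print(centers)  # two centers, near [0.05, 0.0] and [0.95, 1.05]
```

Each returned center plays the role of one cluster-center feature vector dv_i, and its index identifies the cluster category d_i.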
After the above clustering process is completed, for a source text to be translated, the feature vector of that source text can be obtained. The embodiments of the present invention do not specifically limit how the feature vector of the source text is obtained; this includes, but is not limited to, averaging the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
After the feature vector of the source text is obtained, the cluster category of the source text may be determined based on this feature vector and the cluster-center feature vector corresponding to each cluster category. As described above, the cluster categories may be denoted {d1, d2, d3, ..., dK}; that is, each cluster category corresponds to an identifier. In order to translate the source text based on its cluster category later, the cluster category of the source text is vectorized to obtain the cluster category vector corresponding to the source text. The vectorization may be done through a lookup table or through word2vec; the embodiments of the present invention do not specifically limit this.
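The lookup-table option for vectorizing a cluster category identifier can be sketched as below. The table of category vectors is an illustrative assumption (in practice these vectors would be learned or assigned when the model is trained); the identifiers follow the {d1, ..., dK} notation used above.

```python
# Assumed lookup table mapping each cluster category identifier to its vector.
category_vectors = {
    "d1": [0.9, 0.1],
    "d2": [0.1, 0.9],
}

def vectorize_category(category_id):
    """Turn a cluster category identifier into its cluster category vector."""
    return category_vectors[category_id]

print(vectorize_category("d2"))  # [0.1, 0.9]
```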
After the cluster category vector corresponding to the source text is obtained, the word vectors of the segmented words in the source text may be integrated with that cluster category vector, the integration result is input into the translation model, and at least one candidate target text is output, together with the translation score corresponding to each candidate target text. The translation model may be obtained by training an initial model on training source texts and training target texts of the different cluster categories, and the initial model may be of a type such as a recurrent neural network (RNN); the embodiments of the present invention do not specifically limit this.
After the candidate target texts and the corresponding translation scores are obtained, one candidate target text may be selected from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text. In specific selection, the candidate target text with the highest translation score may be taken as the target text resulting from translating the source text; the embodiments of the present invention do not specifically limit this.
In the method provided by the embodiments of the present invention, the cluster category of the source text is determined based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category. The cluster category of the source text is vectorized to obtain a cluster category vector corresponding to the source text, the word vectors of the segmented words in the source text are integrated with the cluster category vector corresponding to the source text, the integration result is input into a translation model, and at least one candidate target text is output, each candidate target text corresponding to a translation score. Based on the translation score of each candidate target text, one candidate target text is selected from all candidate target texts as the translation result of the source text. Because the cluster category of the source text can be determined before translation, and the source text and its cluster category can be fed into the translation model together as input parameters, the translation process can draw on the overall semantics of the source text and other hidden translation reference features when translating the source text. This improves the domain robustness and translation accuracy of the translation model.
Based on the content of the above embodiments, as an optional embodiment, the embodiments of the present invention further provide a method for determining the cluster category of the source text. Referring to Fig. 2, the method includes: 1011, calculating the distance between the feature vector corresponding to the source text and each cluster-center feature vector, determining the cluster-center feature vector corresponding to the minimum of all calculated distances, and taking it as the target cluster-center feature vector; 1012, taking the cluster category corresponding to the target cluster-center feature vector as the cluster category of the source text.
In the above step 1011, when calculating the distance between the feature vector corresponding to the source text and each cluster-center feature vector, the Euclidean distance between the two may be calculated; the embodiments of the present invention do not specifically limit this. After the cluster-center feature vector corresponding to the minimum distance is determined, the cluster category corresponding to that cluster-center feature vector may be taken as the cluster category of the source text.
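Steps 1011 and 1012 can be sketched as a nearest-center lookup under Euclidean distance. The toy centers and the source-text feature vector below are assumptions for illustration only.

```python
import math

def nearest_cluster(feature, centers):
    """Return the index of the cluster-center feature vector nearest to `feature`."""
    dists = [math.dist(feature, c) for c in centers]  # Euclidean distance
    return dists.index(min(dists))

# Assumed cluster-center feature vectors dv1, dv2 and a source-text feature vector.
centers = [[0.05, 0.0], [0.95, 1.05]]
source_fv = [0.8, 1.0]
print(nearest_cluster(source_fv, centers))  # 1, i.e. the second cluster category
```

The returned index identifies the target cluster-center feature vector, whose cluster category is then taken as the cluster category of the source text.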
Based on the content of the above embodiments, and considering that the match between a candidate target text and the cluster category of the source text may not be high enough, which would make the translation result inaccurate, the embodiments of the present invention, as an optional embodiment and in order to avoid this situation, further provide a method for selecting one candidate target text from all candidate target texts as the translation result of the source text. Referring to Fig. 3, the method includes: 1031, inputting each candidate target text into the domain language model corresponding to the cluster category of the source text, and outputting the domain language model score of each candidate target text; 1032, selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
In the above step 1031, the cluster category of the source text may be taken as the cluster category of the target text. According to the cluster category of the target text, a large number of target texts under that cluster category, i.e., current-domain target texts, may be selected, and a domain language model may be built on the current-domain target texts under that cluster category. The construction method is the same as existing language model construction methods; the embodiments of the present invention do not specifically limit this. After the domain language model is obtained, the domain language model score of each candidate target text may be computed through the domain language model. The higher the domain language model score, the higher the accuracy of the corresponding candidate target text when taken as the translation result.
After the domain language model score of each candidate target text is obtained, one candidate target text may be selected as the translation result of the source text according to the translation score and the domain language model score of each candidate target text. The embodiments of the present invention do not specifically limit the way of selecting one candidate target text from all candidate target texts as the translation result according to the translation score and the domain language model score, which includes but is not limited to: computing a weighted sum of the translation score and the domain language model score of each candidate target text to obtain a comprehensive score for each candidate target text, and selecting from all comprehensive scores the candidate target text corresponding to the maximum comprehensive score as the translation result of the source text.
The weighted sum may be a linear fusion or a non-linear fusion; the embodiments of the present invention do not specifically limit this. Taking linear fusion as an example, the comprehensive score may be computed by the following formula:

Sf = λ·Strans + (1 − λ)·Slm

In the above formula, for any candidate target text, Sf denotes the comprehensive score of the candidate target text, Strans denotes the translation score of the candidate target text, Slm denotes the domain language model score of the candidate target text, and λ denotes the weight of the translation score. The value of λ may be predetermined according to application requirements.
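The linear fusion and maximum-score selection can be sketched as follows. The candidate texts, their scores, and the weight λ = 0.7 are all invented values for illustration.

```python
def select_translation(candidates, lam=0.7):
    """candidates: list of (target_text, translation_score, lm_score).

    Returns the candidate maximizing Sf = lam*Strans + (1-lam)*Slm.
    """
    def comprehensive(c):
        _, s_trans, s_lm = c
        return lam * s_trans + (1 - lam) * s_lm
    return max(candidates, key=comprehensive)[0]

# Assumed candidate target texts with (translation score, domain LM score).
candidates = [
    ("china develops rapidly", 0.80, 0.40),
    ("porcelain develops rapidly", 0.75, 0.90),
]
print(select_translation(candidates))  # "porcelain develops rapidly"
```

Note how the domain language model score can overturn the raw translation score: here the second candidate wins because it fits the domain better, which is exactly the situation step 1031 is designed to handle.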
Based on the content of the above embodiments, as an optional embodiment, the embodiments of the present invention further provide a method for integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, the method including:
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or, concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or, adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text.
Denote the word vectors of all segmented words in the source text as x = (x1, x2, x3, ..., xm) and the cluster category vector corresponding to the source text as dk. After the word vectors of the segmented words in the source text are integrated with the cluster category vector corresponding to the source text, the integration result of the first integration mode above is (dk, x1, x2, x3, ..., xm), the integration result of the second integration mode is (dkx1, dkx2, dkx3, ..., dkxm), and the integration result of the third integration mode is (dk, dkx1, dkx2, dkx3, ..., dkxm). In the second and third integration modes, dkx1 denotes the result of concatenating the cluster category vector corresponding to the source text with the word vector of the first segmented word in the source text, dkx2 denotes the result of concatenating it with the word vector of the second segmented word, and so on.
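The three integration modes can be sketched as below, with a 2-dimensional cluster category vector dk and 2-dimensional word vectors; all values and the mode names are illustrative assumptions.

```python
def integrate(word_vectors, dk, mode):
    """Combine word vectors x1..xm with the cluster category vector dk."""
    if mode == "prepend":   # first mode: add dk before the first word vector
        return [dk] + word_vectors
    if mode == "concat":    # second mode: concatenate dk with each word vector
        return [dk + x for x in word_vectors]
    if mode == "both":      # third mode: prepend dk and also concatenate it with each word vector
        return [dk] + [dk + x for x in word_vectors]
    raise ValueError(mode)

dk = [0.5, 0.5]                      # assumed cluster category vector
x = [[1.0, 0.0], [0.0, 1.0]]         # assumed word vectors x1, x2
print(integrate(x, dk, "prepend"))   # [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
print(integrate(x, dk, "concat"))    # [[0.5, 0.5, 1.0, 0.0], [0.5, 0.5, 0.0, 1.0]]
```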
In the above embodiments, the source text is translated mainly according to the cluster category to which it belongs, that is, mainly from the angle of semantic analysis; in actual translation, contextual information usually also needs to be combined. Based on the content of the above embodiments, as an optional embodiment, an embodiment of the present invention further provides a method for translating, into candidate target texts, the integration result obtained by integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text. Specifically, the translation model used in the translation process may be an encoding/decoding model, in which the encoding model uses a bidirectional recurrent neural network structure and the decoding model uses a recurrent neural network structure.
Correspondingly, referring to Fig. 4, the method includes: 1021, inputting the integration result into the translation model to obtain, for each segmented word in the source text, a forward representation and a backward representation under the cluster category to which the source text belongs; 1022, splicing the forward representation and the backward representation of each segmented word under the cluster category to which the source text belongs, to obtain a representation vector of each segmented word in the source text; 1023, decoding the source text based on the representation vector of each segmented word in the source text, to obtain at least one candidate target text.
Specifically, for the cluster category to which the source text belongs, the forward recurrent neural network in the bidirectional recurrent neural network structure yields, for each segmented word in the source text, a forward representation f_i that sees the historical lexical information under that cluster category. The backward recurrent neural network in the bidirectional recurrent neural network structure yields, for each segmented word, a backward representation b_i that sees the following lexical information under that cluster category. Finally, splicing the two forms the representation vector h_i of each segmented word in the source text. On this basis, inputting the representation vectors h_i of the segmented words into the recurrent neural network can output at least one candidate target text, and at the same time output the translation score of each candidate target text.
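A minimal numpy sketch of this bidirectional encoding step is given below. The simple tanh cell, the function names, and the randomly initialised (untrained) weight shapes are all illustrative assumptions, not the patent's actual trained model; the point is only the structure f_i from a left-to-right pass, b_i from a right-to-left pass, and h_i = [f_i; b_i].

```python
import numpy as np

def rnn_pass(xs, W, U, h0):
    """Simple tanh RNN cell; returns the hidden state at every position."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def encode(xs, Wf, Uf, Wb, Ub, hidden):
    """Bidirectional encoding: representation vector h_i = [f_i; b_i]."""
    f = rnn_pass(xs, Wf, Uf, np.zeros(hidden))              # f_i sees the history
    b = rnn_pass(xs[::-1], Wb, Ub, np.zeros(hidden))[::-1]  # b_i sees the future
    return [np.concatenate(pair) for pair in zip(f, b)]
```

In the embodiment, the spliced vectors h_i would then be consumed by the decoding recurrent neural network to produce the candidate target texts and their translation scores.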
In the method provided by the embodiment of the present invention, the integration result is input into the translation model to obtain the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs; the forward representation and backward representation of each segmented word under that cluster category are spliced to obtain the representation vector of each segmented word in the source text; and the source text is decoded based on these representation vectors to obtain at least one candidate target text. Since the source text is translated not only from the angle of semantic analysis but also, under the premise of the cluster category to which it belongs, in combination with contextual information, the accuracy of text translation is further improved.
It should be noted that all of the above optional embodiments may be combined arbitrarily to form optional embodiments of the present invention, which will not be repeated here.
Based on the content of the above embodiments, an embodiment of the present invention provides a text translation apparatus, which is configured to perform the text translation method in the above method embodiments. Referring to Fig. 5, the apparatus includes:
a determining module 501, configured to determine the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts;
a translation module 502, configured to vectorize the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrate the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, input the integration result into the translation model, and output at least one candidate target text and the translation score corresponding to each candidate target text;
a choosing module 503, configured to choose, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
As an optional embodiment, the apparatus further includes:
a computing module, configured to average the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
As an optional embodiment, the determining module 501 is configured to calculate the distance between the feature vector corresponding to the source text and each cluster centre feature vector, determine the cluster centre feature vector corresponding to the minimum distance among all calculated distances as the target cluster centre feature vector, and take the cluster category corresponding to the target cluster centre feature vector as the cluster category to which the source text belongs.
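The logic of the computing module and the determining module can be sketched in a few lines. Euclidean distance is assumed here, since the embodiment speaks only of "distance" without fixing the metric, and the helper names are illustrative.

```python
import numpy as np

def source_feature(word_vecs):
    """Average the word vectors of all segmented words in the source text."""
    return np.mean(word_vecs, axis=0)

def assign_cluster(word_vecs, centres):
    """Return the index of the nearest cluster centre, i.e. the target
    cluster centre feature vector's cluster category."""
    feat = source_feature(word_vecs)
    dists = [np.linalg.norm(feat - c) for c in centres]  # distance to each centre
    return int(np.argmin(dists))                         # minimum distance wins
```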
As an optional embodiment, the choosing module 503 includes:
a computing unit, configured to separately input each candidate target text into the domain language model corresponding to the cluster category to which the source text belongs, and output the domain language model score of each candidate target text;
a selection unit, configured to choose one candidate target text from all the candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
As an optional embodiment, the selection unit is configured to perform weighted summation on the translation score and the domain language model score of each candidate target text to obtain the comprehensive score of each candidate target text, and choose the candidate target text corresponding to the maximum comprehensive score among all comprehensive scores as the translation result of the source text.
As an optional embodiment, the translation module 502 is configured to add the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or splice the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or add the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and also splice the cluster category vector with the word vector of each segmented word in the source text.
As an optional embodiment, the translation model includes a bidirectional recurrent neural network and a recurrent neural network, the bidirectional recurrent neural network including a forward recurrent neural network and a backward recurrent neural network. Correspondingly, the translation module 502 is configured to input the integration result into the translation model to obtain the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs; splice the forward representation and backward representation of each segmented word under that cluster category to obtain the representation vector of each segmented word in the source text; and decode the source text based on the representation vector of each segmented word in the source text to obtain at least one candidate target text.
In the apparatus provided by the embodiment of the present invention, the cluster category to which the source text belongs is determined based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category. The cluster category to which the source text belongs is vectorized to obtain the cluster category vector corresponding to the source text; the word vectors of the segmented words in the source text are integrated with this cluster category vector; and the integration result is input into the translation model, which outputs at least one candidate target text, each candidate target text corresponding to one translation score. Based on the translation score of each candidate target text, one candidate target text is chosen from all the candidate target texts as the translation result of the source text. Since the cluster category to which the source text belongs can be determined before translation, and the source text and its cluster category can be input to the translation model together as input parameters, the translation process can translate the source text in combination with its overall semantics and other implicit translation reference features. This improves the domain robustness and translation accuracy of the translation model.
In addition, by inputting the integration result into the translation model, the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs are obtained; the two are spliced to obtain the representation vector of each segmented word in the source text; and the source text is decoded based on these representation vectors to obtain at least one candidate target text. Since the source text is translated not only from the angle of semantic analysis but also, under the premise of the cluster category to which it belongs, in combination with contextual information, the accuracy of text translation is further improved.
An embodiment of the present invention provides a text translation device. Referring to Fig. 6, the device includes a processor 601, a memory 602, and a bus 603,
wherein the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is configured to call the program instructions in the memory 602 to perform the text translation method provided in the above embodiments, for example including: determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category, wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts; vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into the translation model, and outputting at least one candidate target text and the translation score corresponding to each candidate target text; and choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the text translation method provided in the above embodiments, for example including:
determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category, wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts; vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into the translation model, and outputting at least one candidate target text and the translation score corresponding to each candidate target text; and choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The embodiments such as the text translation device described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative labour.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solutions, or the part thereof contributing to the prior art, can essentially be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some parts of the methods of each embodiment.
Finally, the above methods are merely preferred embodiments and are not intended to limit the protection scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the embodiments of the present invention shall be included within the protection scope of the embodiments of the present invention.

Claims (10)

1. A text translation method, characterized by comprising:
determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts;
vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and the translation score corresponding to each candidate target text;
choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
2. The method according to claim 1, characterized in that the method further comprises:
averaging the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
3. The method according to claim 1, characterized in that determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category comprises:
calculating the distance between the feature vector corresponding to the source text and each cluster centre feature vector, and determining the cluster centre feature vector corresponding to the minimum distance among all calculated distances as the target cluster centre feature vector;
taking the cluster category corresponding to the target cluster centre feature vector as the cluster category to which the source text belongs.
4. The method according to claim 1, characterized in that choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text comprises:
separately inputting each candidate target text into the domain language model corresponding to the cluster category to which the source text belongs, and outputting the domain language model score of each candidate target text;
choosing one candidate target text from all the candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
5. The method according to claim 4, characterized in that choosing one candidate target text from all the candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text comprises:
performing weighted summation on the translation score and the domain language model score of each candidate target text to obtain the comprehensive score of each candidate target text, and choosing the candidate target text corresponding to the maximum comprehensive score among all comprehensive scores as the translation result of the source text.
6. The method according to claim 1, characterized in that integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text comprises:
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or
splicing the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and splicing the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text.
7. The method according to claim 1, characterized in that the translation model is an encoding/decoding model, the encoding model in the translation model uses a bidirectional recurrent neural network structure, and the decoding model in the translation model uses a recurrent neural network structure; correspondingly, inputting the integration result into the translation model and outputting at least one candidate target text comprises:
inputting the integration result into the translation model to obtain the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs;
splicing the forward representation and backward representation of each segmented word under the cluster category to which the source text belongs, to obtain the representation vector of each segmented word in the source text;
decoding the source text based on the representation vector of each segmented word in the source text, to obtain at least one candidate target text.
8. A text translation apparatus, characterized by comprising:
a determining module, configured to determine the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts;
a translation module, configured to vectorize the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrate the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, input the integration result into a translation model, and output at least one candidate target text and the translation score corresponding to each candidate target text;
a choosing module, configured to choose, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
9. A text translation device, characterized by comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the method according to any one of claims 1 to 7.
CN201711488585.6A 2017-12-29 2017-12-29 Text translation method and device Active CN108228576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711488585.6A CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device


Publications (2)

Publication Number Publication Date
CN108228576A true CN108228576A (en) 2018-06-29
CN108228576B CN108228576B (en) 2021-07-02

Family

ID=62647444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711488585.6A Active CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Country Status (1)

Country Link
CN (1) CN108228576B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598002A (en) * 2018-11-15 2019-04-09 重庆邮电大学 Neural machine translation method and system based on bidirectional circulating neural network
CN109885811A (en) * 2019-01-10 2019-06-14 平安科技(深圳)有限公司 Written style conversion method, device, computer equipment and storage medium
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN110211570A (en) * 2019-05-20 2019-09-06 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN111428518A (en) * 2019-01-09 2020-07-17 科大讯飞股份有限公司 Low-frequency word translation method and device
CN111460264A (en) * 2020-03-30 2020-07-28 口口相传(北京)网络技术有限公司 Training method and device of semantic similarity matching model
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN111949789A (en) * 2019-05-16 2020-11-17 北京京东尚科信息技术有限公司 Text classification method and text classification system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN85101759A (en) * 1985-04-01 1987-01-10 Hitachi, Ltd. Interpretation method
WO2006014343A2 (en) * 2004-07-02 2006-02-09 Text-Tech, Llc Automated evaluation systems and methods
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN104090870A (en) * 2014-06-26 2014-10-08 武汉传神信息技术有限公司 Pushing method of online translation engines
CN104516870A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Translation check method and system
CN104572631A (en) * 2014-12-03 2015-04-29 北京捷通华声语音技术有限公司 Training method and system for language model
CN105528342A (en) * 2015-12-29 2016-04-27 科大讯飞股份有限公司 Intelligent translation method and system in input method
CN105786798A (en) * 2016-02-25 2016-07-20 上海交通大学 Natural language intention understanding method in man-machine interaction
US20160275074A1 (en) * 2015-03-19 2016-09-22 Abbyy Infopoisk Llc Anaphora resolution based on linguistic technologies
US20170371865A1 (en) * 2016-06-24 2017-12-28 Facebook, Inc. Target phrase classifier

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN85101759A (en) * 1985-04-01 1987-01-10 Hitachi, Ltd. Interpretation method
WO2006014343A2 (en) * 2004-07-02 2006-02-09 Text-Tech, Llc Automated evaluation systems and methods
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN104516870A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Translation check method and system
CN104090870A (en) * 2014-06-26 2014-10-08 武汉传神信息技术有限公司 Pushing method of online translation engines
CN104572631A (en) * 2014-12-03 2015-04-29 北京捷通华声语音技术有限公司 Training method and system for language model
US20160275074A1 (en) * 2015-03-19 2016-09-22 Abbyy Infopoisk Llc Anaphora resolution based on linguistic technologies
CN105528342A (en) * 2015-12-29 2016-04-27 科大讯飞股份有限公司 Intelligent translation method and system in input method
CN105786798A (en) * 2016-02-25 2016-07-20 上海交通大学 Natural language intention understanding method in man-machine interaction
US20170371865A1 (en) * 2016-06-24 2017-12-28 Facebook, Inc. Target phrase classifier

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
EVA HASLER,ET AL: "Dynamic Topic Adaptation for SMT using Distributional Profiles", 《PROCEEDINGS OF THE NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION》 *
HUA WU,ET AL: "Domain Adaptation for Statistical Machine Translation with Domain Dictionary and Monolingual Corpora", 《PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS》 *
JINSONG SU,ET AL: "Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information", 《PROCEEDINGS OF THE 50TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
MANAAL FARUQUI, ET AL: "Improving Vector SpaceWord Representations Using Multilingual Correlation", 《PROCEEDINGS OF THE 14TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS,》 *
DING Liang, et al.: "Research on machine translation domain adaptation integrating domain knowledge and deep learning", Information Science *
LIU Hao: "Research on domain adaptation methods for statistical machine translation", China Masters' Theses Full-text Database, Information Science and Technology (Monthly) *
YAO Liang: "Research on translation model domain adaptation based on semantic distribution similarity", Journal of Shandong University (Natural Science) *
CUI Lei: "Research on domain adaptation for statistical machine translation", China Doctoral Dissertations Full-text Database, Information Science and Technology (Monthly) *
ZHANG Wenwen: "Research on clustering-based domain adaptation for statistical machine translation", China Masters' Theses Full-text Database, Information Science and Technology (Monthly) *
ZHAO Yinggong: "Research on domain adaptation in statistical machine translation", China Doctoral Dissertations Full-text Database, Information Science and Technology (Monthly) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598002A (en) * 2018-11-15 2019-04-09 重庆邮电大学 Neural machine translation method and system based on bidirectional circulating neural network
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN111428518A (en) * 2019-01-09 2020-07-17 科大讯飞股份有限公司 Low-frequency word translation method and device
CN111428518B (en) * 2019-01-09 2023-11-21 科大讯飞股份有限公司 Low-frequency word translation method and device
CN109885811A (en) * 2019-01-10 2019-06-14 平安科技(深圳)有限公司 Written style conversion method, device, computer equipment and storage medium
CN109885811B (en) * 2019-01-10 2024-05-14 平安科技(深圳)有限公司 Article style conversion method, apparatus, computer device and storage medium
CN111949789A (en) * 2019-05-16 2020-11-17 北京京东尚科信息技术有限公司 Text classification method and text classification system
CN110211570A (en) * 2019-05-20 2019-09-06 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN110211570B (en) * 2019-05-20 2021-06-25 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN111460264A (en) * 2020-03-30 2020-07-28 口口相传(北京)网络技术有限公司 Training method and device of semantic similarity matching model
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation

Also Published As

Publication number Publication date
CN108228576B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN108228576A (en) Text interpretation method and device
CN107229610B (en) A kind of analysis method and device of affection data
CN105122279B (en) Deep neural network is conservatively adapted in identifying system
CN107977356A (en) Method and device for correcting recognized text
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN109977234A (en) A kind of knowledge mapping complementing method based on subject key words filtering
CN107844481B (en) Text recognition error detection method and device
GB2517212A (en) A Computer Generated Emulation of a subject
CN107423363A (en) Art generation method, device, equipment and storage medium based on artificial intelligence
CN109902672A (en) Image labeling method and device, storage medium, computer equipment
CN111816169B (en) Method and device for training Chinese and English hybrid speech recognition model
JP2020038343A (en) Method and device for training language identification model, and computer program for it
CN108595436A (en) The generation method and system of emotion conversation content, storage medium
CN106683667A (en) Automatic rhythm extracting method, system and application thereof in natural language processing
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN110119443A (en) A kind of sentiment analysis method towards recommendation service
CN116704085B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
CN110399488A (en) File classification method and device
CN113590078A (en) Virtual image synthesis method and device, computing equipment and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN117152308B (en) Virtual person action expression optimization method and system
CN109726386B (en) Word vector model generation method, device and computer readable storage medium
CN111046674B (en) Semantic understanding method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant