CN108228576A - Text translation method and device - Google Patents

Text translation method and device

Info

Publication number
CN108228576A
CN108228576A (application CN201711488585.6A; granted publication CN108228576B)
Authority
CN
China
Prior art keywords
text
source text
cluster
translation
candidate target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711488585.6A
Other languages
Chinese (zh)
Other versions
CN108228576B (en)
Inventor
黄宜鑫
孟廷
刘俊华
魏思
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201711488585.6A
Publication of CN108228576A
Application granted
Publication of CN108228576B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a text translation method and device, belonging to the field of language processing technology. The method includes: determining the cluster category of a source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category; vectorizing the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; and, based on the translation score of each candidate target text, selecting one candidate target text from all candidate target texts as the translation result of the source text. Because the translation process can draw on the overall semantics of the source text and other hidden translation reference features when translating the source text, the domain robustness and translation accuracy of the translation model are improved.

Description

Text translation method and device
Technical field
Embodiments of the present invention relate to the field of language processing technology, and in particular to a text translation method and device.
Background technology
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Current approaches emphasize using the user's application domain as a reference when machine-translating the source text (the text in the source language); that is, the domain of the user's speech content is taken into account during translation. Application domains may include, for example, education, scientific research, and the humanities. For a source text obtained through speech recognition, the related art provides the following two text translation methods:
The first is a corpus-level text translation method: the application domain of the source text is determined first, the training corpora belonging to that application domain are selected, and a translation model is built on the selected training corpora, so that the source text can be translated with the resulting translation model.
The second is a model-level text translation method: translation models for multiple different application domains are combined. For example, a weight is assigned to each translation model according to the correlation between the application domain of the source text and that of each translation model, all translation models are combined according to these weights into a new mixed model, and the new mixed model is used to translate the source text.
Since both methods require the application domain of the source text to be determined in advance, while in actual translation that domain may be difficult to determine, and the same word may belong to multiple application domains, accurate translation is difficult to achieve.
Summary of the invention
To solve the above problems, embodiments of the present invention provide a text translation method and device that overcome, or at least partly solve, the above problems.
According to a first aspect of the embodiments of the present invention, a text translation method is provided, the method including:
determining the cluster category of a source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster-center feature vector, and the cluster categories and their corresponding cluster-center feature vectors are determined by clustering the feature vectors of training source texts;
vectorizing the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text;
based on the translation score of each candidate target text, selecting one candidate target text from all candidate target texts as the translation result of the source text.
In the method provided by the embodiments of the present invention, the cluster category of the source text is determined based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category. The cluster category of the source text is vectorized to obtain a cluster category vector corresponding to the source text, the word vectors of the segmented words in the source text are integrated with the cluster category vector corresponding to the source text, the integration result is input into a translation model, and at least one candidate target text is output, each candidate target text corresponding to a translation score. Based on the translation score of each candidate target text, one candidate target text is selected from all candidate target texts as the translation result of the source text. Because the cluster category of the source text can be determined before translation, and the source text and its cluster category can be fed into the translation model together as input parameters, the translation process can draw on the overall semantics of the source text and other hidden translation factors when translating the source text. This improves the domain robustness and translation accuracy of the translation model.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the method further includes:
averaging the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
With reference to the first possible implementation of the first aspect, in a third possible implementation, determining the cluster category of the source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category includes:
calculating the distance between the feature vector corresponding to the source text and each cluster-center feature vector, determining the cluster-center feature vector corresponding to the minimum of all calculated distances, and taking it as the target cluster-center feature vector;
taking the cluster category corresponding to the target cluster-center feature vector as the cluster category of the source text.
With reference to the first possible implementation of the first aspect, in a fourth possible implementation, selecting one candidate target text from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text includes:
inputting each candidate target text into the domain language model corresponding to the cluster category of the source text, and outputting the domain language model score of each candidate target text;
selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text includes:
computing a weighted sum of the translation score and the domain language model score of each candidate target text to obtain a comprehensive score for each candidate target text, and selecting from all comprehensive scores the candidate target text corresponding to the maximum comprehensive score as the translation result of the source text.
With reference to the first possible implementation of the first aspect, in a sixth possible implementation, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text includes:
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or,
concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or,
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text.
With reference to the first possible implementation of the first aspect, in a seventh possible implementation, the translation model is an encoder-decoder model, in which the encoder uses a bidirectional recurrent neural network structure and the decoder uses a recurrent neural network structure; correspondingly, inputting the integration result into the translation model and outputting at least one candidate target text includes:
inputting the integration result into the translation model to obtain, for each segmented word in the source text, its forward characterization and backward characterization under the cluster category of the source text;
concatenating the forward characterization and backward characterization of each segmented word under the cluster category of the source text to obtain a characterization vector for each segmented word in the source text;
decoding the source text based on the characterization vector of each segmented word in the source text to obtain at least one candidate target text.
According to a second aspect of the embodiments of the present invention, a text translation device is provided, the device including:
a determining module, configured to determine the cluster category of a source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster-center feature vector, and the cluster categories and their corresponding cluster-center feature vectors are determined by clustering the feature vectors of training source texts;
a translation module, configured to vectorize the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrate the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, input the integration result into a translation model, and output at least one candidate target text and a translation score corresponding to each candidate target text;
a selection module, configured to select, based on the translation score of each candidate target text, one candidate target text from all candidate target texts as the translation result of the source text.
According to a third aspect of the embodiments of the present invention, a text translation apparatus is provided, including:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and by calling the program instructions the processor is able to perform the text translation method provided by any possible implementation among the various possible implementations of the first aspect.
According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided, the non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the text translation method provided by any possible implementation among the various possible implementations of the first aspect.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory, and do not limit the embodiments of the present invention.
Description of the drawings
Fig. 1 is a schematic flowchart of a text translation method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another text translation method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another text translation method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another text translation method according to an embodiment of the present invention;
Fig. 5 is a block diagram of a text translation device according to an embodiment of the present invention;
Fig. 6 is a block diagram of a text translation apparatus according to an embodiment of the present invention.
Specific embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the embodiments of the present invention, but not to limit their scope.
In the related art, the source text is translated mainly in combination with its application domain, which may be divided by application scenario into scientific research, the humanities, education, and so on. Since the application domain of the source text may be difficult to determine during actual translation, and the source text may relate to multiple application domains, it is difficult to decide which application domain specifically applies, so accurate translation is difficult. At the same time, a segmented word in the source text may also belong to multiple application domains and have different translations in different domains; for example, the word "china" is translated as "China" in the news domain but as "porcelain" in the antiques domain, which further increases the difficulty of accurate translation.
Considering that, in addition to the application domain, some common hidden information in the source text, such as topic, genre, and writing style, can also serve as hidden translation reference features, the embodiments of the present invention provide a text translation method for the above situation. The method is applicable to the speech translation scenario in which a source speech signal is translated into a target text, and also to the ordinary translation scenario in which a text in one language is translated into a text in another language; the embodiments of the present invention do not specifically limit this. Referring to Fig. 1, the method includes: 101, determining the cluster category of the source text based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category, wherein each cluster category corresponds to one cluster-center feature vector, and the cluster categories and their corresponding cluster-center feature vectors are determined by clustering the feature vectors of training source texts; 102, vectorizing the cluster category of the source text to obtain a cluster category vector corresponding to the source text, integrating the word vector of each segmented word in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and a translation score corresponding to each candidate target text; 103, based on the translation score of each candidate target text, selecting one candidate target text from all candidate target texts as the translation result of the source text.
Partitioning into cluster categories mainly classifies the common hidden information that may be used in translation, so that this common hidden information can be exploited during the translation process and the translation result becomes more accurate. Before the above step 101 is performed, the cluster categories and the corresponding cluster-center feature vectors may be determined. Specifically, unsupervised clustering may be performed with the KMeans algorithm on a large training corpus covering various domains and containing all kinds of common hidden information, thereby determining the different cluster categories and the cluster-center feature vector corresponding to each cluster category. Of course, other clustering algorithms may also be used in actual implementation; the embodiments of the present invention do not specifically limit this.
Taking the KMeans algorithm as an example, to perform unsupervised clustering, the feature vector of each training source text in the training corpus may first be computed. When computing these feature vectors, the word2vec technique may be used to train on the data set formed by the training source texts in the corpus; after training, the word vector of each segmented word in each training source text is obtained. Averaging the word vectors of all segmented words in a training source text yields the feature vector of that training source text.
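The averaging step just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy 3-dimensional embeddings stand in for word vectors that would normally come from a word2vec model trained on the corpus, and the words themselves are invented.

```python
def text_feature_vector(words, embeddings):
    """Average the word vectors of all segmented words in a text."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy stand-ins for word2vec vectors (assumed values, 3 dimensions).
embeddings = {
    "machine": [0.2, 0.4, 0.0],
    "translation": [0.6, 0.0, 0.2],
}

fv = text_feature_vector(["machine", "translation"], embeddings)
print(fv)  # approximately [0.4, 0.2, 0.1]
```

The same function serves both for training source texts (to feed the clustering step) and, later, for the source text to be translated.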
Clustering is then performed on the feature vectors of the training source texts; after clustering, the cluster-center feature vector of each cluster category is obtained, denoted (dv1, dv2, dv3, ..., dvK), and the cluster categories may be denoted {d1, d2, d3, ..., dK}, where K is the total number of cluster categories. For example, d1 denotes the first cluster category and dv1 the cluster-center feature vector corresponding to the first cluster category; dK denotes the K-th cluster category and dvK the cluster-center feature vector corresponding to the K-th cluster category.
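The clustering step can be sketched with a minimal KMeans over the training-text feature vectors. This is an illustrative toy, not the patented system: the naive initialization (first K points), the fixed iteration count, and the 2-dimensional data are all assumptions.

```python
def kmeans(points, k, iters=20):
    """Group feature vectors into k cluster categories; return the centers."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # Recompute each center as the mean of its group.
        for j, g in enumerate(groups):
            if g:
                centers[j] = [sum(v[i] for v in g) / len(g) for i in range(len(g[0]))]
    return centers

# Toy 2-d feature vectors of four "training source texts" (assumed values).
feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
centers = kmeans(feats, k=2)
print(centers)  # two centers, near [0.05, 0.0] and [0.95, 1.05]
```

Each returned center plays the role of one cluster-center feature vector dv_i, and its index identifies the cluster category d_i.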
After the above clustering process is completed, for a source text to be translated, the feature vector of that source text can be obtained. The embodiments of the present invention do not specifically limit how the feature vector of the source text is obtained; this includes, but is not limited to, averaging the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
After the feature vector of the source text is obtained, the cluster category of the source text may be determined based on this feature vector and the cluster-center feature vector corresponding to each cluster category. As described above, the cluster categories may be denoted {d1, d2, d3, ..., dK}; that is, each cluster category corresponds to an identifier. In order to translate the source text based on its cluster category later, the cluster category of the source text is vectorized to obtain the cluster category vector corresponding to the source text. The vectorization may be done through a lookup table or through word2vec; the embodiments of the present invention do not specifically limit this.
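The lookup-table option for vectorizing a cluster category identifier can be sketched as below. The table of category vectors is an illustrative assumption (in practice these vectors would be learned or assigned when the model is trained); the identifiers follow the {d1, ..., dK} notation used above.

```python
# Assumed lookup table mapping each cluster category identifier to its vector.
category_vectors = {
    "d1": [0.9, 0.1],
    "d2": [0.1, 0.9],
}

def vectorize_category(category_id):
    """Turn a cluster category identifier into its cluster category vector."""
    return category_vectors[category_id]

print(vectorize_category("d2"))  # [0.1, 0.9]
```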
After the cluster category vector corresponding to the source text is obtained, the word vectors of the segmented words in the source text may be integrated with that cluster category vector, the integration result is input into the translation model, and at least one candidate target text is output, together with the translation score corresponding to each candidate target text. The translation model may be obtained by training an initial model on training source texts and training target texts of the different cluster categories, and the initial model may be of a type such as a recurrent neural network (RNN); the embodiments of the present invention do not specifically limit this.
After the candidate target texts and the corresponding translation scores are obtained, one candidate target text may be selected from all candidate target texts as the translation result of the source text based on the translation score of each candidate target text. In specific selection, the candidate target text with the highest translation score may be taken as the target text resulting from translating the source text; the embodiments of the present invention do not specifically limit this.
In the method provided by the embodiments of the present invention, the cluster category of the source text is determined based on the feature vector of the source text and the cluster-center feature vector corresponding to each cluster category. The cluster category of the source text is vectorized to obtain a cluster category vector corresponding to the source text, the word vectors of the segmented words in the source text are integrated with the cluster category vector corresponding to the source text, the integration result is input into a translation model, and at least one candidate target text is output, each candidate target text corresponding to a translation score. Based on the translation score of each candidate target text, one candidate target text is selected from all candidate target texts as the translation result of the source text. Because the cluster category of the source text can be determined before translation, and the source text and its cluster category can be fed into the translation model together as input parameters, the translation process can draw on the overall semantics of the source text and other hidden translation reference features when translating the source text. This improves the domain robustness and translation accuracy of the translation model.
Based on the content of the above embodiments, as an optional embodiment, the embodiments of the present invention further provide a method for determining the cluster category of the source text. Referring to Fig. 2, the method includes: 1011, calculating the distance between the feature vector corresponding to the source text and each cluster-center feature vector, determining the cluster-center feature vector corresponding to the minimum of all calculated distances, and taking it as the target cluster-center feature vector; 1012, taking the cluster category corresponding to the target cluster-center feature vector as the cluster category of the source text.
In the above step 1011, when calculating the distance between the feature vector corresponding to the source text and each cluster-center feature vector, the Euclidean distance between the two may be calculated; the embodiments of the present invention do not specifically limit this. After the cluster-center feature vector corresponding to the minimum distance is determined, the cluster category corresponding to that cluster-center feature vector may be taken as the cluster category of the source text.
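Steps 1011 and 1012 can be sketched as a nearest-center lookup under Euclidean distance. The toy centers and the source-text feature vector below are assumptions for illustration only.

```python
import math

def nearest_cluster(feature, centers):
    """Return the index of the cluster-center feature vector nearest to `feature`."""
    dists = [math.dist(feature, c) for c in centers]  # Euclidean distance
    return dists.index(min(dists))

# Assumed cluster-center feature vectors dv1, dv2 and a source-text feature vector.
centers = [[0.05, 0.0], [0.95, 1.05]]
source_fv = [0.8, 1.0]
print(nearest_cluster(source_fv, centers))  # 1, i.e. the second cluster category
```

The returned index identifies the target cluster-center feature vector, whose cluster category is then taken as the cluster category of the source text.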
Based on the content of the above embodiments, and considering that the match between a candidate target text and the cluster category of the source text may not be high enough, which would make the translation result inaccurate, the embodiments of the present invention, as an optional embodiment and in order to avoid this situation, further provide a method for selecting one candidate target text from all candidate target texts as the translation result of the source text. Referring to Fig. 3, the method includes: 1031, inputting each candidate target text into the domain language model corresponding to the cluster category of the source text, and outputting the domain language model score of each candidate target text; 1032, selecting one candidate target text from all candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
In the above step 1031, the cluster category of the source text may be taken as the cluster category of the target text. According to the cluster category of the target text, a large number of target texts under that cluster category, i.e., current-domain target texts, may be selected, and a domain language model may be built on the current-domain target texts under that cluster category. The construction method is the same as existing language model construction methods; the embodiments of the present invention do not specifically limit this. After the domain language model is obtained, the domain language model score of each candidate target text may be computed through the domain language model. The higher the domain language model score, the higher the accuracy of the corresponding candidate target text when taken as the translation result.
After the domain language model score of each candidate target text is obtained, one candidate target text may be selected as the translation result of the source text according to the translation score and the domain language model score of each candidate target text. The embodiments of the present invention do not specifically limit the way of selecting one candidate target text from all candidate target texts as the translation result according to the translation score and the domain language model score, which includes but is not limited to: computing a weighted sum of the translation score and the domain language model score of each candidate target text to obtain a comprehensive score for each candidate target text, and selecting from all comprehensive scores the candidate target text corresponding to the maximum comprehensive score as the translation result of the source text.
The weighted sum may be a linear fusion or a non-linear fusion; the embodiments of the present invention do not specifically limit this. Taking linear fusion as an example, the comprehensive score may be computed by the following formula:

Sf = λ·Strans + (1 − λ)·Slm

In the above formula, for any candidate target text, Sf denotes the comprehensive score of the candidate target text, Strans denotes the translation score of the candidate target text, Slm denotes the domain language model score of the candidate target text, and λ denotes the weight of the translation score. The value of λ may be predetermined according to application requirements.
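The linear fusion and maximum-score selection can be sketched as follows. The candidate texts, their scores, and the weight λ = 0.7 are all invented values for illustration.

```python
def select_translation(candidates, lam=0.7):
    """candidates: list of (target_text, translation_score, lm_score).

    Returns the candidate maximizing Sf = lam*Strans + (1-lam)*Slm.
    """
    def comprehensive(c):
        _, s_trans, s_lm = c
        return lam * s_trans + (1 - lam) * s_lm
    return max(candidates, key=comprehensive)[0]

# Assumed candidate target texts with (translation score, domain LM score).
candidates = [
    ("china develops rapidly", 0.80, 0.40),
    ("porcelain develops rapidly", 0.75, 0.90),
]
print(select_translation(candidates))  # "porcelain develops rapidly"
```

Note how the domain language model score can overturn the raw translation score: here the second candidate wins because it fits the domain better, which is exactly the situation step 1031 is designed to handle.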
Based on the content of the above embodiments, as an optional embodiment, the embodiments of the present invention further provide a method for integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, the method including:
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or, concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or, adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and concatenating the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text.
Denote the word vectors of all segmented words in the source text as x = (x1, x2, x3, ..., xm) and the cluster category vector corresponding to the source text as dk. After the word vectors of the segmented words in the source text are integrated with the cluster category vector corresponding to the source text, the integration result of the first integration mode above is (dk, x1, x2, x3, ..., xm), the integration result of the second integration mode is (dkx1, dkx2, dkx3, ..., dkxm), and the integration result of the third integration mode is (dk, dkx1, dkx2, dkx3, ..., dkxm). In the second and third integration modes, dkx1 denotes the result of concatenating the cluster category vector corresponding to the source text with the word vector of the first segmented word in the source text, dkx2 denotes the result of concatenating it with the word vector of the second segmented word, and so on.
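The three integration modes can be sketched as below, with a 2-dimensional cluster category vector dk and 2-dimensional word vectors; all values and the mode names are illustrative assumptions.

```python
def integrate(word_vectors, dk, mode):
    """Combine word vectors x1..xm with the cluster category vector dk."""
    if mode == "prepend":   # first mode: add dk before the first word vector
        return [dk] + word_vectors
    if mode == "concat":    # second mode: concatenate dk with each word vector
        return [dk + x for x in word_vectors]
    if mode == "both":      # third mode: prepend dk and also concatenate it with each word vector
        return [dk] + [dk + x for x in word_vectors]
    raise ValueError(mode)

dk = [0.5, 0.5]                      # assumed cluster category vector
x = [[1.0, 0.0], [0.0, 1.0]]         # assumed word vectors x1, x2
print(integrate(x, dk, "prepend"))   # [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
print(integrate(x, dk, "concat"))    # [[0.5, 0.5, 1.0, 0.0], [0.5, 0.5, 0.0, 1.0]]
```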
In the above embodiments, the source text is translated mainly according to the cluster category to which it belongs, that is, mainly from the angle of semantic analysis; in actual translation, contextual information usually also needs to be combined. Based on the content of the above embodiments, as an optional embodiment, an embodiment of the present invention further provides a method for translating, into candidate target texts, the integration result obtained by integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text. Specifically, the translation model used in the translation process may be an encoding/decoding model, in which the encoding model uses a bidirectional recurrent neural network structure and the decoding model uses a recurrent neural network structure.
Correspondingly, referring to Fig. 4, the method includes: 1021, inputting the integration result into the translation model to obtain, for each segmented word in the source text, a forward representation and a backward representation under the cluster category to which the source text belongs; 1022, splicing the forward representation and the backward representation of each segmented word under the cluster category to which the source text belongs, to obtain a representation vector of each segmented word in the source text; 1023, decoding the source text based on the representation vector of each segmented word in the source text, to obtain at least one candidate target text.
Specifically, for the cluster category to which the source text belongs, the forward recurrent neural network in the bidirectional recurrent neural network structure yields, for each segmented word in the source text, a forward representation f_i that sees the historical lexical information under that cluster category. The backward recurrent neural network in the bidirectional recurrent neural network structure yields, for each segmented word, a backward representation b_i that sees the following lexical information under that cluster category. Finally, splicing the two forms the representation vector h_i of each segmented word in the source text. On this basis, inputting the representation vectors h_i of the segmented words into the recurrent neural network can output at least one candidate target text, and at the same time output the translation score of each candidate target text.
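A minimal numpy sketch of this bidirectional encoding step is given below. The simple tanh cell, the function names, and the randomly initialised (untrained) weight shapes are all illustrative assumptions, not the patent's actual trained model; the point is only the structure f_i from a left-to-right pass, b_i from a right-to-left pass, and h_i = [f_i; b_i].

```python
import numpy as np

def rnn_pass(xs, W, U, h0):
    """Simple tanh RNN cell; returns the hidden state at every position."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def encode(xs, Wf, Uf, Wb, Ub, hidden):
    """Bidirectional encoding: representation vector h_i = [f_i; b_i]."""
    f = rnn_pass(xs, Wf, Uf, np.zeros(hidden))              # f_i sees the history
    b = rnn_pass(xs[::-1], Wb, Ub, np.zeros(hidden))[::-1]  # b_i sees the future
    return [np.concatenate(pair) for pair in zip(f, b)]
```

In the embodiment, the spliced vectors h_i would then be consumed by the decoding recurrent neural network to produce the candidate target texts and their translation scores.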
In the method provided by the embodiment of the present invention, the integration result is input into the translation model to obtain the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs; the forward representation and backward representation of each segmented word under that cluster category are spliced to obtain the representation vector of each segmented word in the source text; and the source text is decoded based on these representation vectors to obtain at least one candidate target text. Since the source text is translated not only from the angle of semantic analysis but also, under the premise of the cluster category to which it belongs, in combination with contextual information, the accuracy of text translation is further improved.
It should be noted that all of the above optional embodiments may be combined arbitrarily to form optional embodiments of the present invention, which will not be repeated here.
Based on the content of the above embodiments, an embodiment of the present invention provides a text translation apparatus, which is configured to perform the text translation method in the above method embodiments. Referring to Fig. 5, the apparatus includes:
a determining module 501, configured to determine the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts;
a translation module 502, configured to vectorize the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrate the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, input the integration result into the translation model, and output at least one candidate target text and the translation score corresponding to each candidate target text;
a choosing module 503, configured to choose, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
As an optional embodiment, the apparatus further includes:
a computing module, configured to average the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
As an optional embodiment, the determining module 501 is configured to calculate the distance between the feature vector corresponding to the source text and each cluster centre feature vector, determine the cluster centre feature vector corresponding to the minimum distance among all calculated distances as the target cluster centre feature vector, and take the cluster category corresponding to the target cluster centre feature vector as the cluster category to which the source text belongs.
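The logic of the computing module and the determining module can be sketched in a few lines. Euclidean distance is assumed here, since the embodiment speaks only of "distance" without fixing the metric, and the helper names are illustrative.

```python
import numpy as np

def source_feature(word_vecs):
    """Average the word vectors of all segmented words in the source text."""
    return np.mean(word_vecs, axis=0)

def assign_cluster(word_vecs, centres):
    """Return the index of the nearest cluster centre, i.e. the target
    cluster centre feature vector's cluster category."""
    feat = source_feature(word_vecs)
    dists = [np.linalg.norm(feat - c) for c in centres]  # distance to each centre
    return int(np.argmin(dists))                         # minimum distance wins
```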
As an optional embodiment, the choosing module 503 includes:
a computing unit, configured to separately input each candidate target text into the domain language model corresponding to the cluster category to which the source text belongs, and output the domain language model score of each candidate target text;
a selection unit, configured to choose one candidate target text from all the candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
As an optional embodiment, the selection unit is configured to perform weighted summation on the translation score and the domain language model score of each candidate target text to obtain the comprehensive score of each candidate target text, and choose the candidate target text corresponding to the maximum comprehensive score among all comprehensive scores as the translation result of the source text.
As an optional embodiment, the translation module 502 is configured to add the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or splice the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or add the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and also splice the cluster category vector with the word vector of each segmented word in the source text.
As an optional embodiment, the translation model includes a bidirectional recurrent neural network and a recurrent neural network, the bidirectional recurrent neural network including a forward recurrent neural network and a backward recurrent neural network. Correspondingly, the translation module 502 is configured to input the integration result into the translation model to obtain the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs; splice the forward representation and backward representation of each segmented word under that cluster category to obtain the representation vector of each segmented word in the source text; and decode the source text based on the representation vector of each segmented word in the source text to obtain at least one candidate target text.
In the apparatus provided by the embodiment of the present invention, the cluster category to which the source text belongs is determined based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category. The cluster category to which the source text belongs is vectorized to obtain the cluster category vector corresponding to the source text; the word vectors of the segmented words in the source text are integrated with this cluster category vector; and the integration result is input into the translation model, which outputs at least one candidate target text, each candidate target text corresponding to one translation score. Based on the translation score of each candidate target text, one candidate target text is chosen from all the candidate target texts as the translation result of the source text. Since the cluster category to which the source text belongs can be determined before translation, and the source text and its cluster category can be input to the translation model together as input parameters, the translation process can translate the source text in combination with its overall semantics and other implicit translation reference features. This improves the domain robustness and translation accuracy of the translation model.
In addition, by inputting the integration result into the translation model, the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs are obtained; the two are spliced to obtain the representation vector of each segmented word in the source text; and the source text is decoded based on these representation vectors to obtain at least one candidate target text. Since the source text is translated not only from the angle of semantic analysis but also, under the premise of the cluster category to which it belongs, in combination with contextual information, the accuracy of text translation is further improved.
An embodiment of the present invention provides a text translation device. Referring to Fig. 6, the device includes a processor 601, a memory 602, and a bus 603,
wherein the processor 601 and the memory 602 communicate with each other through the bus 603;
the processor 601 is configured to call the program instructions in the memory 602 to perform the text translation method provided in the above embodiments, for example including: determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category, wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts; vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into the translation model, and outputting at least one candidate target text and the translation score corresponding to each candidate target text; and choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the text translation method provided in the above embodiments, for example including:
determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category, wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts; vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into the translation model, and outputting at least one candidate target text and the translation score corresponding to each candidate target text; and choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The embodiments such as the text translation device described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative labour.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solutions, or the part thereof contributing to the prior art, can essentially be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some parts of the methods of each embodiment.
Finally, the above methods are merely preferred embodiments and are not intended to limit the protection scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the embodiments of the present invention shall be included within the protection scope of the embodiments of the present invention.

Claims (10)

1. A text translation method, characterized by comprising:
determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts;
vectorizing the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, inputting the integration result into a translation model, and outputting at least one candidate target text and the translation score corresponding to each candidate target text;
choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
2. The method according to claim 1, characterized in that the method further comprises:
averaging the word vectors of all segmented words in the source text to obtain the feature vector of the source text.
3. The method according to claim 1, characterized in that determining the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category comprises:
calculating the distance between the feature vector corresponding to the source text and each cluster centre feature vector, and determining the cluster centre feature vector corresponding to the minimum distance among all calculated distances as the target cluster centre feature vector;
taking the cluster category corresponding to the target cluster centre feature vector as the cluster category to which the source text belongs.
4. The method according to claim 1, characterized in that choosing, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text comprises:
separately inputting each candidate target text into the domain language model corresponding to the cluster category to which the source text belongs, and outputting the domain language model score of each candidate target text;
choosing one candidate target text from all the candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text.
5. The method according to claim 4, characterized in that choosing one candidate target text from all the candidate target texts as the translation result of the source text according to the translation score and the domain language model score of each candidate target text comprises:
performing weighted summation on the translation score and the domain language model score of each candidate target text to obtain the comprehensive score of each candidate target text, and choosing the candidate target text corresponding to the maximum comprehensive score among all comprehensive scores as the translation result of the source text.
6. The method according to claim 1, characterized in that integrating the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text comprises:
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text; or
splicing the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text; or
adding the cluster category vector corresponding to the source text before the word vector of the first segmented word in the source text, and splicing the cluster category vector corresponding to the source text with the word vector of each segmented word in the source text.
7. The method according to claim 1, characterized in that the translation model is an encoding/decoding model, the encoding model in the translation model uses a bidirectional recurrent neural network structure, and the decoding model in the translation model uses a recurrent neural network structure; correspondingly, inputting the integration result into the translation model and outputting at least one candidate target text comprises:
inputting the integration result into the translation model to obtain the forward representation and backward representation of each segmented word in the source text under the cluster category to which the source text belongs;
splicing the forward representation and backward representation of each segmented word under the cluster category to which the source text belongs, to obtain the representation vector of each segmented word in the source text;
decoding the source text based on the representation vector of each segmented word in the source text, to obtain at least one candidate target text.
8. A text translation apparatus, characterized by comprising:
a determining module, configured to determine the cluster category to which the source text belongs based on the feature vector of the source text and the cluster centre feature vector corresponding to each cluster category; wherein each cluster category corresponds to one cluster centre feature vector, and each cluster category and its corresponding cluster centre feature vector are determined after clustering the feature vectors of training source texts;
a translation module, configured to vectorize the cluster category to which the source text belongs to obtain the cluster category vector corresponding to the source text, integrate the word vectors of the segmented words in the source text with the cluster category vector corresponding to the source text, input the integration result into a translation model, and output at least one candidate target text and the translation score corresponding to each candidate target text;
a choosing module, configured to choose, based on the translation score of each candidate target text, one candidate target text from all the candidate target texts as the translation result of the source text.
9. A text translation device, characterized by comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the method according to any one of claims 1 to 7.
CN201711488585.6A 2017-12-29 2017-12-29 Text translation method and device Active CN108228576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711488585.6A CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device


Publications (2)

Publication Number Publication Date
CN108228576A true CN108228576A (en) 2018-06-29
CN108228576B CN108228576B (en) 2021-07-02

Family

ID=62647444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711488585.6A Active CN108228576B (en) 2017-12-29 2017-12-29 Text translation method and device

Country Status (1)

Country Link
CN (1) CN108228576B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598002A (en) * 2018-11-15 2019-04-09 重庆邮电大学 Neural machine translation method and system based on bidirectional circulating neural network
CN109885811A (en) * 2019-01-10 2019-06-14 平安科技(深圳)有限公司 Written style conversion method, device, computer equipment and storage medium
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN110211570A (en) * 2019-05-20 2019-09-06 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN111428518A (en) * 2019-01-09 2020-07-17 科大讯飞股份有限公司 Low-frequency word translation method and device
CN111460264A (en) * 2020-03-30 2020-07-28 口口相传(北京)网络技术有限公司 Training method and device of semantic similarity matching model
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN111949789A (en) * 2019-05-16 2020-11-17 北京京东尚科信息技术有限公司 Text classification method and text classification system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN85101759A (en) * 1985-04-01 1987-01-10 Hitachi, Ltd. Interpretation method
WO2006014343A2 (en) * 2004-07-02 2006-02-09 Text-Tech, Llc Automated evaluation systems and methods
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN104090870A (en) * 2014-06-26 2014-10-08 武汉传神信息技术有限公司 Pushing method of online translation engines
CN104516870A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Translation check method and system
CN104572631A (en) * 2014-12-03 2015-04-29 北京捷通华声语音技术有限公司 Training method and system for language model
CN105528342A (en) * 2015-12-29 2016-04-27 科大讯飞股份有限公司 Intelligent translation method and system in input method
CN105786798A (en) * 2016-02-25 2016-07-20 上海交通大学 Natural language intention understanding method in man-machine interaction
US20160275074A1 (en) * 2015-03-19 2016-09-22 Abbyy Infopoisk Llc Anaphora resolution based on linguistic technologies
US20170371865A1 (en) * 2016-06-24 2017-12-28 Facebook, Inc. Target phrase classifier

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN85101759A (en) * 1985-04-01 1987-01-10 Hitachi, Ltd. Interpretation method
WO2006014343A2 (en) * 2004-07-02 2006-02-09 Text-Tech, Llc Automated evaluation systems and methods
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN104516870A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Translation check method and system
CN104090870A (en) * 2014-06-26 2014-10-08 武汉传神信息技术有限公司 Pushing method of online translation engines
CN104572631A (en) * 2014-12-03 2015-04-29 北京捷通华声语音技术有限公司 Training method and system for language model
US20160275074A1 (en) * 2015-03-19 2016-09-22 Abbyy Infopoisk Llc Anaphora resolution based on linguistic technologies
CN105528342A (en) * 2015-12-29 2016-04-27 科大讯飞股份有限公司 Intelligent translation method and system in input method
CN105786798A (en) * 2016-02-25 2016-07-20 上海交通大学 Natural language intention understanding method in man-machine interaction
US20170371865A1 (en) * 2016-06-24 2017-12-28 Facebook, Inc. Target phrase classifier

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
EVA HASLER,ET AL: "Dynamic Topic Adaptation for SMT using Distributional Profiles", 《PROCEEDINGS OF THE NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION》 *
HUA WU,ET AL: "Domain Adaptation for Statistical Machine Translation with Domain Dictionary and Monolingual Corpora", 《PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS》 *
JINSONG SU,ET AL: "Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information", 《PROCEEDINGS OF THE 50TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
MANAAL FARUQUI, ET AL: "Improving Vector SpaceWord Representations Using Multilingual Correlation", 《PROCEEDINGS OF THE 14TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS,》 *
DING Liang, et al.: "Research on machine translation domain adaptation integrating domain knowledge and deep learning", Information Science *
LIU Hao: "Research on domain adaptation methods for statistical machine translation", China Masters' Theses Full-text Database, Information Science and Technology (Monthly) *
YAO Liang: "Research on translation model domain adaptation based on semantic distribution similarity", Journal of Shandong University (Natural Science) *
CUI Lei: "Research on domain adaptation for statistical machine translation", China Doctoral Dissertations Full-text Database, Information Science and Technology (Monthly) *
ZHANG Wenwen: "Research on clustering-based domain adaptation for statistical machine translation", China Masters' Theses Full-text Database, Information Science and Technology (Monthly) *
ZHAO Yinggong: "Research on domain adaptation in statistical machine translation", China Doctoral Dissertations Full-text Database, Information Science and Technology (Monthly) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598002A (en) * 2018-11-15 2019-04-09 重庆邮电大学 Neural machine translation method and system based on bidirectional circulating neural network
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN111428518A (en) * 2019-01-09 2020-07-17 科大讯飞股份有限公司 Low-frequency word translation method and device
CN111428518B (en) * 2019-01-09 2023-11-21 科大讯飞股份有限公司 Low-frequency word translation method and device
CN109885811A (en) * 2019-01-10 2019-06-14 平安科技(深圳)有限公司 Written style conversion method, device, computer equipment and storage medium
CN109885811B (en) * 2019-01-10 2024-05-14 平安科技(深圳)有限公司 Article style conversion method, apparatus, computer device and storage medium
CN111949789A (en) * 2019-05-16 2020-11-17 北京京东尚科信息技术有限公司 Text classification method and text classification system
CN110211570A (en) * 2019-05-20 2019-09-06 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN110211570B (en) * 2019-05-20 2021-06-25 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN111460264A (en) * 2020-03-30 2020-07-28 口口相传(北京)网络技术有限公司 Training method and device of semantic similarity matching model
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation

Also Published As

Publication number Publication date
CN108228576B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN108228576A (en) Text interpretation method and device
CN107229610B (en) A kind of analysis method and device of affection data
CN105122279B (en) Deep neural network is conservatively adapted in identifying system
CN107977356A (en) Method and device for correcting recognized text
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN109977234A (en) A kind of knowledge mapping complementing method based on subject key words filtering
CN107844481B (en) Text recognition error detection method and device
GB2517212A (en) A Computer Generated Emulation of a subject
CN107423363A (en) Art generation method, device, equipment and storage medium based on artificial intelligence
CN109902672A (en) Image labeling method and device, storage medium, computer equipment
CN111816169B (en) Method and device for training Chinese and English hybrid speech recognition model
JP2020038343A (en) Method and device for training language identification model, and computer program for it
CN108595436A (en) The generation method and system of emotion conversation content, storage medium
CN106683667A (en) Automatic rhythm extracting method, system and application thereof in natural language processing
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN110119443A (en) A kind of sentiment analysis method towards recommendation service
CN116704085B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
CN110399488A (en) File classification method and device
CN113590078A (en) Virtual image synthesis method and device, computing equipment and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN117152308B (en) Virtual person action expression optimization method and system
CN109726386B (en) Word vector model generation method, device and computer readable storage medium
CN111046674B (en) Semantic understanding method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant