CN104391842A - Translation model establishing method and system - Google Patents

Translation model establishing method and system Download PDF

Info

Publication number
CN104391842A
Authority
CN
China
Prior art keywords
phrase
semantic
word
source language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410797926.8A
Other languages
Chinese (zh)
Inventor
熊德意
王超超
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201410797926.8A priority Critical patent/CN104391842A/en
Publication of CN104391842A publication Critical patent/CN104391842A/en
Pending legal-status Critical Current

Abstract

The invention discloses a translation model construction method and system. The method first generates a rule alignment table, a word semantic vector table and a phrase table from the alignment information of a bilingual parallel corpus; it then uses the word semantic vector table and the phrase table to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space; finally, it trains on the phrase semantic vector tables of the two semantic spaces to generate a translation model that incorporates semantic information. The invention thus integrates phrase semantic information into statistical machine translation. Research shows that the semantic information of a word or phrase reflects its correlation with context words or phrases, and a translation model that incorporates phrase semantic information achieves higher translation quality than conventional word-based or phrase-based translation methods, so the invention further improves the translation performance of statistical machine translation compared with the prior art.

Description

Translation model construction method and system
Technical Field
The invention belongs to the technical field of statistical machine translation, and particularly relates to a translation model construction method and system.
Background
In recent years, with the improvement of computing power and the continuous enrichment of corpus resources, statistical machine translation has become one of the most important research hotspots in the field of natural language processing.
The implementation of statistical machine translation typically involves two main processes: training and decoding. Training refers to learning a statistical translation model from corpus resources according to a certain algorithm; decoding, i.e., translation, refers to translating the text to be translated according to the trained translation model. The earliest statistical machine translation methods were built on the noisy-channel model; researchers later generalized this model in practice using the maximum-entropy framework. On this basis, statistical machine translation methods based on words, phrases and syntax were developed, each improving machine translation performance to some extent compared with earlier translation models. However, the translation goal of "faithfulness, expressiveness and elegance" is still far from being achieved.
Disclosure of Invention
In view of this, the present invention provides a translation model construction method and system to effectively improve the translation quality of statistical machine translation and move closer to the translation goal of "faithfulness, expressiveness and elegance".
Therefore, the invention discloses the following technical scheme:
a translation model building method, comprising:
obtaining a bilingual parallel corpus, wherein the bilingual parallel corpus comprises a contrast translation from a source language sentence to a target language sentence;
generating a rule alignment table, a word semantic vector table and a phrase table by using the bilingual parallel corpus, wherein the rule alignment table comprises a bilingual comparison hierarchical phrase rule, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
Preferably, the generating a rule alignment table, a word semantic vector table, and a phrase table by using the bilingual parallel corpus includes:
preprocessing the bilingual parallel corpus to obtain word alignment information, wherein the word alignment information comprises word pairs aligned between the two languages;
generating a rule alignment table according to the word alignment information, wherein the rule alignment table comprises bilingual hierarchical phrase rules, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, wherein X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
generating a word semantic vector table according to the word alignment information, wherein the word semantic vector table comprises word semantic vectors of bilingual comparison, and the word semantic vectors are obtained by calculating pointwise mutual information (PMI):

\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word;
and generating a phrase table according to the word alignment information.
Preferably, the generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table includes:
for each source language phrase in a phrase table, retrieving a semantic vector of each word contained in the source language phrase from the word semantic vector table;
performing weighted vector addition operation on semantic vectors of words included in the source language phrase to obtain semantic vectors of the source language phrase, wherein the semantic vectors of the source language phrase in the phrase table form a source language phrase semantic vector table in a source language semantic space;
for each target language phrase in a phrase table, retrieving a semantic vector of each word contained in the phrase table from the word semantic vector table;
and performing weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, wherein the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
Preferably, the processing a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space to obtain a translation model includes:
training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space;
and mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, wherein the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
The method preferably further includes translating the text to be translated by using the translation model.
Preferably, the translating the text to be translated by using the translation model includes:
carrying out phrase segmentation on sentences of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
extracting phrases in the phrase sequence in sequence, and for the extracted phrases, retrieving aligned phrases corresponding to the extracted phrases from the rule alignment table, wherein the aligned phrases comprise source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, and N is a natural number not less than 1;
searching phrase semantic vectors corresponding to the source language phrases and the N candidate target language phrases from the translation model respectively;
and respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vectors, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
Preferably, in the above method, the formula for calculating the semantic similarity is as follows:
\mathrm{Sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_i (a_i \times b_i)}{\sqrt{\sum_i a_i^2 \times \sum_i b_i^2}}

wherein Sim(u, v) represents the semantic similarity of the phrases corresponding to the semantic vectors u and v, and a_i and b_i respectively represent the value of the i-th dimension of u and v.
A translation model building system comprising:
the system comprises an acquisition module, a translation module and a translation module, wherein the acquisition module is used for acquiring a bilingual parallel corpus which comprises a contrast translation from a source language sentence to a target language sentence;
the first generation module is used for generating a rule alignment table, a word semantic vector table and a phrase table by utilizing the bilingual parallel corpus, wherein the rule alignment table comprises bilingual comparison hierarchical phrase rules, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
the second generation module is used for generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and the processing module is used for processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
In the above system, preferably, the first generating module includes:
the preprocessing unit is used for preprocessing the bilingual parallel corpus to obtain word alignment information, and the word alignment information comprises bilingual matched word words;
a first generating unit, configured to generate a rule alignment table according to the word alignment information, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, where X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
a second generating unit, configured to generate a word semantic vector table according to the word alignment information, where the word semantic vector table includes word semantic vectors for bilingual comparison, and the word semantic vectors are obtained through PMI calculation:
\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word;
and the third generating unit is used for generating a phrase table according to the word alignment information.
In the above system, preferably, the second generating module includes:
the first retrieval unit is used for retrieving semantic vectors of all words contained in the source language phrases from the word semantic vector table for each source language phrase in the phrase table;
the first calculating unit is used for performing weighted vector addition operation on semantic vectors of all words included in the source language phrase to obtain semantic vectors of the source language phrase, and the semantic vectors of all the source language phrases in the phrase table form a source language phrase semantic vector table in a source language semantic space;
a second retrieval unit, which is used for retrieving the semantic vector of each word contained in each target language phrase in the phrase table from the word semantic vector table;
and the second calculation unit is used for carrying out weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, and the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
In the above system, preferably, the processing module includes:
the intermediate module generating unit is used for training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space;
and the translation model generating unit is used for mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, and the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
Preferably, the system further includes a translation module, and the translation module is configured to translate the text to be translated by using the translation model.
The above system, preferably, the translation module includes:
the segmentation unit is used for performing phrase segmentation on the sentences of the text to be translated to obtain a phrase sequence corresponding to the text to be translated;
an aligned phrase retrieval unit, configured to sequentially extract phrases in the phrase sequence, and retrieve, for the extracted phrases, aligned phrases corresponding to the extracted phrases from the rule alignment table, where the aligned phrases include source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, where N is a natural number not less than 1;
a semantic vector retrieval unit, configured to retrieve, from the translation model, phrase semantic vectors corresponding to the source language phrase and the N candidate target language phrases, respectively;
and the similarity calculation unit is used for respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vector, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
In summary, starting from the consideration of integrating semantic information into statistical machine translation, the invention uses the alignment information of a bilingual parallel corpus to generate a rule alignment table, a word semantic vector table and a phrase table, then uses the word semantic vector table and the phrase table to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space, and finally trains on the phrase semantic vector tables of the two semantic spaces to generate a translation model that incorporates semantic information. The invention thus integrates phrase semantic information into statistical machine translation. The applicant has found through research that the semantic information of a word or phrase reflects its correlation with context words or phrases, and that a translation model incorporating phrase semantic information achieves higher translation quality than a conventional word-based or phrase-based translation model, so the invention further improves the translation performance of statistical machine translation compared with the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for constructing a translation model according to an embodiment of the present invention;
FIG. 2 is another flowchart of a translation model building method disclosed in the second embodiment of the present invention;
FIG. 3 is a flowchart illustrating a translation of a text to be translated by using a translation model according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a translation model building system according to a third embodiment of the present invention;
FIG. 5 is another schematic structural diagram of a translation model building system according to a third embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a translation module disclosed in the third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The applicant has found through research that the semantic information of a word or phrase can reflect its correlation with context words or phrases, and that, compared with conventional word-based or phrase-based translation methods, a translation model that incorporates phrase semantic information achieves higher translation quality, thereby further improving the translation performance of statistical machine translation. The purpose of the invention is therefore to integrate phrase semantic information into a statistical machine translation system and to realize a translation process based on phrase semantic information.
To this end, this embodiment discloses a translation model building method fusing phrase semantic information, and with reference to fig. 1, the method may include the following steps:
s101: a bilingual parallel corpus is obtained that includes a comparison translation of a source language sentence to a target language sentence.
This step collects a bilingual parallel corpus to provide the raw corpus support for training and generating the translation model. The collected bilingual parallel corpus is a sentence-aligned bilingual corpus containing the aligned translations from source language sentences to target language sentences.
S102: and generating a rule alignment table, a word semantic vector table and a phrase table by using the bilingual parallel corpus, wherein the rule alignment table comprises a bilingual comparison hierarchical phrase rule, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information.
Specifically, the sentence-aligned bilingual parallel corpus is first preprocessed to obtain a bilingual corpus containing word alignment information, where the word alignment information comprises word pairs aligned between the two languages.
On the basis of the preprocessing, a bilingually aligned rule alignment table is generated according to the word alignment information. The rule alignment table specifically refers to a table structure formed by source language hierarchical phrase rules and target language hierarchical phrase rules, and the expression form of a hierarchical phrase rule is shown in formula (1):
X → <γ, α, ~>    (1)
In formula (1), X is a non-terminal symbol, γ and α are character strings composed of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α.
In this embodiment, a hierarchical phrase rule contains at most 2 non-terminals, each non-terminal must cover at least 2 words, the two non-terminals at the source language side of a rule cannot be adjacent to each other, and each rule in the rule table must contain at least one word alignment link; an illustrative rule is given below.
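For illustration only (this example does not appear in the patent; the Chinese/English wording is hypothetical), a hierarchical phrase rule of the form in formula (1) extracted from Chinese-English data could look like:

X → <X_1 的 X_2 , the X_2 of X_1>

where the co-indexation "~" pairs X_1 on the source side with X_1 on the target side and X_2 with X_2, so that the sub-phrases they cover are translated recursively and reordered around "的" / "of". This example respects the constraints above: it contains exactly 2 non-terminals, they are not adjacent on the source side, and it carries word alignment information ("的" aligned to "of").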
And then, continuing to train and generate corresponding word semantic vectors at the source language end and the target language end by using the bilingual parallel corpus with the word alignment information to obtain a bilingual aligned word semantic vector table.
The word semantic vectors can be obtained by calculating pointwise mutual information (PMI); that is, in this embodiment each component of a word semantic vector is a PMI value, and the PMI is calculated specifically as:
\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}    (2)

In formula (2), t denotes the target word and c_i denotes a related word adjacent to t within the context window; this embodiment sets the length of the context window (one word corresponds to one basic length unit) to 5 in advance. freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word.
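As a hedged illustration only (this code is not part of the patent; the function and variable names are hypothetical), formula (2) could be computed over a tokenized corpus roughly as follows, using a context window of 5 as in this embodiment:

    from collections import Counter
    from math import log

    def build_pmi_vectors(sentences, window=5):
        """Build PMI-based word semantic vectors from tokenized sentences.
        Each target word t is mapped to {context word c_i: pmi(c_i, t)},
        following formula (2)."""
        freq = Counter()     # freq_t / freq_{c_i}: occurrences of each word
        co_freq = Counter()  # freq_{c_i t}: co-occurrences inside the window
        total = 0            # freq_total: occurrences of all words
        for tokens in sentences:
            total += len(tokens)
            freq.update(tokens)
            for i, t in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        co_freq[(tokens[j], t)] += 1
        vectors = {}
        for (c, t), f_ct in co_freq.items():
            vectors.setdefault(t, {})[c] = log(f_ct * total / (freq[c] * freq[t]))
        return vectors

    # Usage sketch (run separately on the source-language side and the
    # target-language side of the word-aligned corpus):
    # source_vectors = build_pmi_vectors(source_sentences)

In this sketch each word's semantic vector is a sparse mapping from context words to PMI values; the patent itself does not prescribe a particular data structure.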
Next, a phrase table is generated from the bilingual parallel corpus with word alignment information. In this embodiment, the length interval of the phrases in the phrase table is [2, 5]; that is, the number of words contained in each phrase in the phrase table is between 2 and 5.
S103: and generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table.
Specifically, for each source language phrase in the phrase table, this step retrieves from the word semantic vector table the semantic vector of each word contained in the phrase, and performs a weighted vector addition over the retrieved word semantic vectors to obtain the semantic vector of the source language phrase. Finally, the semantic vectors of all source language phrases in the phrase table constitute the source language phrase semantic vector table in the source language semantic space.
Correspondingly, for each target language phrase in the phrase table, this step retrieves from the word semantic vector table the semantic vector of each word contained in the phrase, and performs a weighted vector addition over the retrieved word semantic vectors to obtain the semantic vector of the target language phrase. Finally, the semantic vectors of all target language phrases in the phrase table constitute the target language phrase semantic vector table in the target language semantic space.
Taking a phrase P consisting of two words as an example, the formula for calculating the semantic vector of the phrase by weighted vector addition can be expressed as:
\vec{p} = \alpha\vec{u} + \beta\vec{v}    (3)

In formula (3), p denotes the phrase semantic vector of the phrase P, u and v respectively denote the word semantic vectors of the two words contained in the phrase P, and α and β respectively denote the weights corresponding to u and v, where α and β can be obtained by training with the open source tool disect.
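A minimal sketch of the weighted vector addition of formula (3), assuming the word vectors are plain lists of floats and the weights (e.g. α and β) are supplied externally by the weight-training step mentioned above; the names are illustrative and not taken from the patent:

    def compose_phrase_vector(word_vectors, weights):
        """Weighted vector addition of formula (3): for a two-word phrase P,
        p = alpha * u + beta * v."""
        dim = len(word_vectors[0])
        phrase_vec = [0.0] * dim
        for vec, w in zip(word_vectors, weights):
            for i, value in enumerate(vec):
                phrase_vec[i] += w * value
        return phrase_vec

    # Usage sketch for a two-word phrase P = (w1, w2):
    # p = compose_phrase_vector([u, v], [alpha, beta])

For a two-word phrase this call corresponds directly to formula (3); longer phrases would simply pass more word vectors and weights.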
S104: and processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
Firstly, a source language phrase semantic vector under a source language semantic space and a target language phrase semantic vector under a target language semantic space are respectively used as input and output of a neural network to train and generate a neural network model with a hidden layer.
On the basis, the neural network model is utilized to map the phrase semantic vector tables in different semantic spaces to the same semantic space, and a required translation model is generated.
Specifically, the neural network model is used for mapping the source language phrase semantic vector table in the source language semantic space to the target language semantic space to obtain the source language phrase semantic vector table in the target language semantic space. The adopted mapping formula is as follows:
\vec{p} = g\left(W_2\, g(W_1\vec{x} + b_1) + b_2\right)    (4)

In formula (4), W_1 denotes the mapping matrix from the input layer to the hidden layer of the neural network model, W_2 denotes the mapping matrix from the hidden layer to the output layer, b_1 and b_2 are the bias values, and g(·) is usually chosen as a non-linear function (e.g., sigmoid or tanh).
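The mapping of formula (4) can be sketched as a single feed-forward step; the sketch below assumes NumPy arrays and already-trained parameters W1, b1, W2, b2 (all names are illustrative, not the patent's):

    import numpy as np

    def map_to_target_space(x, W1, b1, W2, b2, g=np.tanh):
        """Formula (4): map a source-language phrase semantic vector x into
        the target-language semantic space through one hidden layer."""
        hidden = g(W1 @ x + b1)      # input layer -> hidden layer
        return g(W2 @ hidden + b2)   # hidden layer -> output layer

    # Usage sketch:
    # mapped_vec = map_to_target_space(source_phrase_vec, W1, b1, W2, b2)

Applying such a function to every entry of the source language phrase semantic vector table yields the source language phrase semantic vector table in the target language semantic space.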
Finally, the source language phrase semantic vector table and the target language phrase semantic vector table in the same semantic space (i.e. the target semantic space) form a required translation model, and then, a translation process of fusing phrase semantic information can be realized based on the translation model.
In summary, starting from the consideration of integrating semantic information into statistical machine translation, the invention uses the alignment information of a bilingual parallel corpus to generate a rule alignment table, a word semantic vector table and a phrase table, then uses the word semantic vector table and the phrase table to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space, and finally trains on the phrase semantic vector tables of the two semantic spaces to generate a translation model that incorporates semantic information. The invention thus integrates phrase semantic information into statistical machine translation. Because the semantic information of a word or phrase reflects its correlation with context words or phrases, the translation model that incorporates phrase semantic information achieves higher translation quality than conventional word-based or phrase-based translation methods, so the invention further improves the translation performance of statistical machine translation compared with the prior art.
Example two
In this second embodiment, referring to fig. 2, the method may further include the following steps:
s105: and translating the text to be translated by utilizing the translation model.
Specifically, referring to fig. 3, the implementation process of translating the text to be translated in this step specifically includes:
s301: carrying out phrase segmentation on sentences of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
s302: extracting phrases in the phrase sequence in sequence, and for the extracted phrases, retrieving aligned phrases corresponding to the extracted phrases from the rule alignment table, wherein the aligned phrases comprise source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, and N is a natural number not less than 1;
s303: searching phrase semantic vectors corresponding to the source language phrases and the N candidate target language phrases from the translation model respectively;
s304: and respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vectors, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
After the translation model is generated through training, the translation model can be used for translating the text to be translated.
Specifically, for a text to be translated, each sentence of the text is first segmented into phrases in sequence; each source language phrase in the resulting phrase sequence is then matched against the rule alignment table by string matching to find its corresponding aligned phrases, so as to obtain every possible target language phrase of each source language phrase as its candidate target language phrases; then, for the source language phrase and each of its candidate target language phrases, the corresponding phrase semantic vectors are retrieved from the phrase semantic vector tables of the translation model.
On the basis, according to the retrieval and matching results, calculating the semantic similarity between the source language phrase and each candidate target language phrase, wherein the adopted semantic similarity calculation formula is as follows:
\mathrm{Sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_i (a_i \times b_i)}{\sqrt{\sum_i a_i^2 \times \sum_i b_i^2}}    (5)

In formula (5), Sim(u, v) denotes the semantic similarity of the phrases corresponding to the vectors u and v, and a_i and b_i respectively denote the value of the i-th dimension of u and v. The smaller the angle between u and v, the higher the semantic similarity of the corresponding phrases and the larger the cosine value; the larger the angle, the lower the semantic similarity and the smaller the cosine value.
And finally, selecting the candidate target language phrase corresponding to the maximum semantic similarity value as a translation result of the source language phrase to realize translation.
For example, for a source language phrase to be translated, suppose that 5 possible target language phrases are obtained by matching against the rule alignment table and that their similarities to the source language phrase are calculated to be 0.96, 0.85, 0.54, 0.38 and 0.15, respectively; this embodiment then selects the phrase with the highest similarity, i.e. the phrase corresponding to 0.96, as the target translation result, as sketched below.
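For illustration only (not the patent's implementation; names are hypothetical), formula (5) and the selection of the best candidate could be sketched as:

    import math

    def cosine_similarity(u, v):
        """Semantic similarity of two phrase semantic vectors, formula (5)."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def pick_best_candidate(source_vec, candidates):
        """candidates: list of (target_phrase, target_phrase_vector) pairs.
        Returns the candidate phrase with the highest semantic similarity."""
        return max(candidates, key=lambda c: cosine_similarity(source_vec, c[1]))[0]

    # Usage sketch:
    # best = pick_best_candidate(mapped_source_vec,
    #                            [("phrase A", vec_a), ("phrase B", vec_b)])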
EXAMPLE III
The third embodiment discloses a translation model building system, which corresponds to the translation model building method disclosed in each of the above embodiments.
First, referring to fig. 4, the system includes an obtaining module 100, a first generating module 200, a second generating module 300, and a processing module 400, corresponding to the first embodiment.
An obtaining module 100 is configured to obtain a bilingual parallel corpus, which includes a comparison translation from a source language sentence to a target language sentence.
A first generating module 200, configured to generate a rule alignment table, a word semantic vector table, and a phrase table by using the bilingual parallel corpus, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, the word semantic vector table includes word semantic vectors for bilingual comparison, and the phrase table includes phrase information for bilingual comparison.
The first generation module 200 includes a preprocessing unit, a first generation unit, a second generation unit, and a third generation unit.
The preprocessing unit is used for preprocessing the bilingual parallel corpus to obtain word alignment information, and the word alignment information comprises word pairs aligned between the two languages;
a first generating unit, configured to generate a rule alignment table according to the word alignment information, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, where X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
a second generating unit, configured to generate a word semantic vector table according to the word alignment information, where the word semantic vector table includes word semantic vectors for bilingual comparison, and the word semantic vectors are obtained through PMI calculation:
\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word.
And the third generating unit is used for generating a phrase table according to the word alignment information.
A second generating module 300, configured to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space by using the word semantic vector table and the phrase table.
Specifically, the second generation module 300 includes a first retrieval unit, a first calculation unit, a second retrieval unit, and a second calculation unit.
The first retrieval unit is used for retrieving semantic vectors of all words contained in the source language phrases from the word semantic vector table for each source language phrase in the phrase table;
the first calculating unit is used for performing weighted vector addition operation on semantic vectors of all words included in the source language phrase to obtain semantic vectors of the source language phrase, and the semantic vectors of all the source language phrases in the phrase table form a source language phrase semantic vector table in a source language semantic space;
a second retrieval unit, which is used for retrieving the semantic vector of each word contained in each target language phrase in the phrase table from the word semantic vector table;
and the second calculation unit is used for carrying out weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, and the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
The processing module 400 is configured to process the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
Wherein the processing module 400 comprises an intermediate module generating unit and a translation model generating unit.
The intermediate module generating unit is used for training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space;
and the translation model generating unit is used for mapping the source language phrase semantic vector table in the source language semantic space to the target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space; the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form the translation model.
Corresponding to the second embodiment, referring to fig. 5, the system further includes a translation module 500, configured to translate the text to be translated by using the translation model.
Specifically, referring to fig. 6, the translation module includes a segmentation unit 601, an aligned phrase retrieval unit 602, a semantic vector retrieval unit 603, and a similarity calculation unit 604.
The segmentation unit 601 is configured to perform phrase segmentation on a sentence of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
an aligned phrase retrieving unit 602, configured to sequentially extract phrases in the phrase sequence, and for the extracted phrases, retrieve an aligned phrase corresponding to the extracted phrases from the rule alignment table, where the aligned phrase includes source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, where N is a natural number not less than 1;
a semantic vector retrieving unit 603, configured to retrieve, from the translation model, phrase semantic vectors corresponding to the source language phrase and the N candidate target language phrases, respectively;
and a similarity calculation unit 604, configured to calculate semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vector, and use the candidate target language phrase with the largest semantic similarity as the translation result of the extracted phrase, so as to translate the text to be translated.
For the translation model construction system disclosed in the third embodiment of the present invention, since it corresponds to the translation model construction method disclosed in each of the above embodiments, the description is relatively simple, and for the relevant similarities, please refer to the description of the translation model construction method in each of the above embodiments, and the details are not described here.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system is described as being divided into various modules or units by function, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A translation model construction method is characterized by comprising the following steps:
obtaining a bilingual parallel corpus, wherein the bilingual parallel corpus comprises a contrast translation from a source language sentence to a target language sentence;
generating a rule alignment table, a word semantic vector table and a phrase table by using the bilingual parallel corpus, wherein the rule alignment table comprises a bilingual comparison hierarchical phrase rule, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
2. The method of claim 1, wherein generating a rule alignment table, a word semantic vector table, and a phrase table using the bilingual parallel corpus comprises:
preprocessing the bilingual parallel corpus to obtain word alignment information, wherein the word alignment information comprises word pairs aligned between the two languages;
generating a rule alignment table according to the word alignment information, wherein the rule alignment table comprises bilingual hierarchical phrase rules, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, wherein X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
generating a word semantic vector table according to the word alignment information, wherein the word semantic vector table comprises word semantic vectors of bilingual comparison, and the word semantic vectors are obtained by calculating pointwise mutual information (PMI):

\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word;
and generating a phrase table according to the word alignment information.
3. The method of claim 2, wherein generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space using the word semantic vector table and the phrase table comprises:
for each source language phrase in a phrase table, retrieving a semantic vector of each word contained in the source language phrase from the word semantic vector table;
performing weighted vector addition operation on semantic vectors of words included in the source language phrase to obtain semantic vectors of the source language phrase, wherein the semantic vectors of the source language phrase in the phrase table form a source language phrase semantic vector table in a source language semantic space;
for each target language phrase in a phrase table, retrieving a semantic vector of each word contained in the phrase table from the word semantic vector table;
and performing weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, wherein the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
4. The method of claim 3, wherein the processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model comprises:
training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space;
and mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, wherein the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
5. The method of claim 4, further comprising translating text to be translated using the translation model.
6. The method of claim 5, wherein translating the text to be translated using the translation model comprises:
carrying out phrase segmentation on sentences of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
extracting phrases in the phrase sequence in sequence, and for the extracted phrases, retrieving aligned phrases corresponding to the extracted phrases from the rule alignment table, wherein the aligned phrases comprise source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, and N is a natural number not less than 1;
searching phrase semantic vectors corresponding to the source language phrases and the N candidate target language phrases from the translation model respectively;
and respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vectors, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
7. The method according to claim 6, wherein the semantic similarity is calculated by the formula:
\mathrm{Sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_i (a_i \times b_i)}{\sqrt{\sum_i a_i^2 \times \sum_i b_i^2}}

wherein Sim(u, v) represents the semantic similarity of the phrases corresponding to the semantic vectors u and v, and a_i and b_i respectively represent the value of the i-th dimension of u and v.
8. A translation model building system, comprising:
the system comprises an acquisition module, a translation module and a translation module, wherein the acquisition module is used for acquiring a bilingual parallel corpus which comprises a contrast translation from a source language sentence to a target language sentence;
the first generation module is used for generating a rule alignment table, a word semantic vector table and a phrase table by utilizing the bilingual parallel corpus, wherein the rule alignment table comprises bilingual comparison hierarchical phrase rules, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
the second generation module is used for generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and the processing module is used for processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
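A minimal, assumed skeleton showing how the four modules of claim 8 could be wired together; the class name, argument names, and the dictionary returned are illustrative only and are not prescribed by the patent.

```python
# Assumed structure (not the patented code) mirroring claim 8's modules:
# acquisition, first generation (rule alignment / word vectors / phrase
# table), second generation (phrase vectors per language), and processing
# (mapping network, i.e. the translation model).
class TranslationModelBuilder:
    def __init__(self, acquire, generate_tables, generate_phrase_vectors, train_mapping):
        # Each argument is a callable implementing one module.
        self.acquire = acquire
        self.generate_tables = generate_tables
        self.generate_phrase_vectors = generate_phrase_vectors
        self.train_mapping = train_mapping

    def build(self, corpus_path):
        corpus = self.acquire(corpus_path)                       # bilingual parallel corpus
        rules, word_vecs, phrases = self.generate_tables(corpus)  # rule alignment, word vectors, phrase table
        src_vecs, tgt_vecs = self.generate_phrase_vectors(word_vecs, phrases)
        mapping = self.train_mapping(src_vecs, tgt_vecs)          # neural mapping network
        return {"rules": rules, "src_vecs": src_vecs,
                "tgt_vecs": tgt_vecs, "mapping": mapping}
```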
9. The system of claim 8, wherein the first generation module comprises:
the preprocessing unit is used for preprocessing the bilingual parallel corpus to obtain word alignment information, wherein the word alignment information comprises bilingually aligned word pairs;
a first generating unit, configured to generate a rule alignment table according to the word alignment information, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, and a hierarchical phrase rule may be expressed as X → ⟨γ, α, ~⟩, wherein X is a non-terminal symbol, γ and α are strings consisting of terminal and non-terminal symbols, and the symbol "~" denotes a one-to-one correspondence between the non-terminals appearing in γ and those appearing in α;
a second generating unit, configured to generate a word semantic vector table according to the word alignment information, where the word semantic vector table includes word semantic vectors for bilingual comparison, and the word semantic vectors are obtained through PMI calculation:
$$\mathrm{pmi}(c,t) = \log\frac{p(c,t)}{p(c)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t}\times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i}\times \mathrm{freq}_{t}}$$
where $t$ denotes the target word, $c_i$ denotes a context word adjacent to $t$ within the context window, $\mathrm{freq}_{c_i t}$ denotes the number of co-occurrences of the context word $c_i$ with the target word $t$, $\mathrm{freq}_{total}$ denotes the total number of word occurrences, $\mathrm{freq}_{c_i}$ denotes the number of occurrences of the context word, and $\mathrm{freq}_{t}$ denotes the number of occurrences of the target word (a minimal illustrative sketch of this computation is given after this claim);
and the third generating unit is used for generating a phrase table according to the word alignment information.
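As referenced in the second generating unit above, the following is a minimal illustrative sketch of the PMI-based word semantic vector computation; the symmetric context window, the sparse dictionary representation, and all names are assumptions rather than details fixed by the claim.

```python
# Minimal sketch: build sparse PMI word vectors from tokenized sentences,
# following the pmi(c, t) formula above. Window size is an assumption.
import math
from collections import Counter

def pmi_word_vectors(sentences, window=2):
    """sentences: list of tokenized sentences (lists of words).
    Returns {target word: {context word: pmi value}} as a sparse vector table."""
    freq = Counter()   # freq_t / freq_ci: individual word counts
    cooc = Counter()   # freq_{ci,t}: (context word, target word) co-occurrence counts
    for sent in sentences:
        freq.update(sent)
        for i, t in enumerate(sent):
            for c in sent[max(0, i - window): i] + sent[i + 1: i + 1 + window]:
                cooc[(c, t)] += 1
    freq_total = sum(freq.values())
    vectors = {}
    for (c, t), n_ct in cooc.items():
        pmi = math.log((n_ct * freq_total) / (freq[c] * freq[t]))
        vectors.setdefault(t, {})[c] = pmi
    return vectors
```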
10. The system of claim 9, wherein the second generation module comprises:
the first retrieval unit is used for retrieving semantic vectors of all words contained in the source language phrases from the word semantic vector table for each source language phrase in the phrase table;
the first calculating unit is used for performing a weighted vector addition operation on the semantic vectors of all words included in a source language phrase to obtain the semantic vector of the source language phrase, and the semantic vectors of all the source language phrases in the phrase table form a source language phrase semantic vector table in a source language semantic space;
a second retrieval unit, which is used for retrieving the semantic vector of each word contained in each target language phrase in the phrase table from the word semantic vector table;
and the second calculation unit is used for carrying out weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, and the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
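A minimal sketch of the weighted vector addition described in claim 10, assuming dense NumPy word vectors; the uniform default weighting is an assumption, since the claim does not fix a weighting scheme.

```python
# Minimal sketch: compose a phrase semantic vector as a weighted sum of
# the semantic vectors of the words it contains.
import numpy as np

def phrase_vector(phrase, word_vec_table, weights=None):
    """phrase: list of words; word_vec_table: {word: np.ndarray}."""
    vecs = [word_vec_table[w] for w in phrase if w in word_vec_table]
    if not vecs:
        raise KeyError("no word vectors found for phrase: %r" % (phrase,))
    if weights is None:
        weights = [1.0 / len(vecs)] * len(vecs)   # uniform weights by default (assumption)
    return sum(w * v for w, v in zip(weights, vecs))

def phrase_vector_table(phrase_table, word_vec_table):
    """Build {phrase string: semantic vector} for one language side."""
    return {" ".join(p): phrase_vector(p, word_vec_table) for p in phrase_table}
```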
11. The system of claim 10, wherein the processing module comprises:
the intermediate module generating unit is used for training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space;
and the translation model generating unit is used for mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, and the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
12. The system of claim 11, further comprising a translation module configured to translate text to be translated using the translation model.
13. The system of claim 12, wherein the translation module comprises:
the segmentation unit is used for performing phrase segmentation on the sentences of the text to be translated to obtain a phrase sequence corresponding to the text to be translated;
an aligned phrase retrieval unit, configured to sequentially extract phrases in the phrase sequence, and retrieve, for the extracted phrases, aligned phrases corresponding to the extracted phrases from the rule alignment table, where the aligned phrases include source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, where N is a natural number not less than 1;
a semantic vector retrieval unit, configured to retrieve, from the translation model, phrase semantic vectors corresponding to the source language phrase and the N candidate target language phrases, respectively;
and the similarity calculation unit is used for respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vector, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
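Finally, a minimal end-to-end sketch of the translation module of claims 12 and 13, combining the mapping network and the cosine similarity sketched above in a greedy phrase-by-phrase loop; the table layouts and the pass-through handling of unknown phrases are illustrative assumptions, not requirements of the claims.

```python
# Minimal sketch: greedy phrase-by-phrase translation with the assumed
# table layouts {source phrase: [candidate target phrases]} and
# {phrase: semantic vector}.
import numpy as np

def _cos(u, v):
    d = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / d) if d else 0.0

def translate(text_phrases, rule_alignment, src_phrase_vecs,
              tgt_phrase_vecs, map_to_target_space):
    """text_phrases: phrase sequence obtained by segmenting the input text.
    map_to_target_space: maps a source-space vector into the target space."""
    output = []
    for phrase in text_phrases:
        candidates = rule_alignment.get(phrase, [])
        if not candidates or phrase not in src_phrase_vecs:
            output.append(phrase)   # pass unknown phrases through unchanged
            continue
        mapped = map_to_target_space(src_phrase_vecs[phrase])
        scored = [(c, tgt_phrase_vecs[c]) for c in candidates if c in tgt_phrase_vecs]
        if scored:
            # Pick the candidate with maximum semantic similarity.
            output.append(max(scored, key=lambda cv: _cos(mapped, cv[1]))[0])
        else:
            output.append(candidates[0])
    return " ".join(output)
```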
CN201410797926.8A 2014-12-18 2014-12-18 Translation model establishing method and system Pending CN104391842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410797926.8A CN104391842A (en) 2014-12-18 2014-12-18 Translation model establishing method and system

Publications (1)

Publication Number Publication Date
CN104391842A true CN104391842A (en) 2015-03-04

Family

ID=52609748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410797926.8A Pending CN104391842A (en) 2014-12-18 2014-12-18 Translation model establishing method and system

Country Status (1)

Country Link
CN (1) CN104391842A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120123766A1 (en) * 2007-03-22 2012-05-17 Konstantin Anisimovich Indicating and Correcting Errors in Machine Translation Systems
CN103314369A (en) * 2010-12-17 2013-09-18 北京交通大学 Method and device for machine translation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Chaochao et al.: "Translation Similarity Model Based on Bilingual Compositional Semantics", Journal of Peking University (Natural Science Edition) *

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107430599A (en) * 2015-05-18 2017-12-01 谷歌公司 For providing the technology for the visual translation card for including context-sensitive definition and example
CN105183720B (en) * 2015-08-05 2019-07-09 百度在线网络技术(北京)有限公司 Machine translation method and device based on RNN model
CN105183720A (en) * 2015-08-05 2015-12-23 百度在线网络技术(北京)有限公司 Machine translation method and apparatus based on RNN model
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN106484681B (en) * 2015-08-25 2019-07-09 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment generating candidate translation
US10268685B2 (en) 2015-08-25 2019-04-23 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
US10255275B2 (en) 2015-08-25 2019-04-09 Alibaba Group Holding Limited Method and system for generation of candidate translations
US10810379B2 (en) 2015-08-25 2020-10-20 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
US10860808B2 (en) 2015-08-25 2020-12-08 Alibaba Group Holding Limited Method and system for generation of candidate translations
CN106484682B (en) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics
CN106484681A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 A kind of method generating candidate's translation, device and electronic equipment
CN105808530B (en) * 2016-03-23 2019-11-08 苏州大学 Interpretation method and device in a kind of statistical machine translation
CN105808530A (en) * 2016-03-23 2016-07-27 苏州大学 Translation method and device in statistical machine translation
CN105912533A (en) * 2016-04-12 2016-08-31 苏州大学 Method and device for long statement segmentation aiming at neural machine translation
CN105912533B (en) * 2016-04-12 2019-02-12 苏州大学 Long sentence cutting method and device towards neural machine translation
CN106066851A (en) * 2016-06-06 2016-11-02 清华大学 A kind of neural network training method considering evaluation index and device
WO2018032765A1 (en) * 2016-08-19 2018-02-22 华为技术有限公司 Sequence conversion method and apparatus
US11288458B2 (en) 2016-08-19 2022-03-29 Huawei Technologies Co., Ltd. Sequence conversion method and apparatus in natural language processing based on adjusting a weight associated with each word
US11132516B2 (en) 2016-11-04 2021-09-28 Huawei Technologies Co., Ltd. Sequence translation probability adjustment
CN106649289A (en) * 2016-12-16 2017-05-10 中国科学院自动化研究所 Realization method and realization system for simultaneously identifying bilingual terms and word alignment
CN106708811A (en) * 2016-12-19 2017-05-24 新译信息科技(深圳)有限公司 Data processing method and data processing device
CN106776586A (en) * 2016-12-19 2017-05-31 新译信息科技(深圳)有限公司 Machine translation method and device
US11403520B2 (en) 2017-02-03 2022-08-02 Baidu Online Network Technology (Beijing) Co., Ltd. Neural network machine translation method and apparatus
CN108388561A (en) * 2017-02-03 2018-08-10 百度在线网络技术(北京)有限公司 Neural network machine interpretation method and device
CN107193800B (en) * 2017-05-18 2023-09-01 苏州黑云智能科技有限公司 Semantic fitness evaluation method and device for third-party language text
CN107193800A (en) * 2017-05-18 2017-09-22 苏州黑云信息科技有限公司 A kind of semantic goodness of fit evaluating method and device towards third party's language text
WO2019119852A1 (en) * 2017-12-23 2019-06-27 华为技术有限公司 Language processing method and device
CN109960812A (en) * 2017-12-23 2019-07-02 华为技术有限公司 Language processing method and equipment
US11704505B2 (en) 2017-12-23 2023-07-18 Huawei Technologies Co., Ltd. Language processing method and device
CN109960812B (en) * 2017-12-23 2021-05-04 华为技术有限公司 Language processing method and device
WO2019144906A1 (en) * 2018-01-25 2019-08-01 腾讯科技(深圳)有限公司 Information conversion method and device, storage medium and electronic device
US11880667B2 (en) 2018-01-25 2024-01-23 Tencent Technology (Shenzhen) Company Limited Information conversion method and apparatus, storage medium, and electronic apparatus
CN108415901A (en) * 2018-02-07 2018-08-17 大连理工大学 A kind of short text topic model of word-based vector sum contextual information
CN108363704A (en) * 2018-03-02 2018-08-03 北京理工大学 A kind of neural network machine translation corpus expansion method based on statistics phrase table
CN108874786A (en) * 2018-06-12 2018-11-23 深圳市译家智能科技有限公司 Machine translation method and device
CN108804427A (en) * 2018-06-12 2018-11-13 深圳市译家智能科技有限公司 Speech robot interpretation method and device
CN108874786B (en) * 2018-06-12 2022-05-31 深圳市译家智能科技有限公司 Machine translation method and device
CN110874537B (en) * 2018-08-31 2023-06-27 阿里巴巴集团控股有限公司 Method for generating multilingual translation model, translation method and equipment
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN111274813B (en) * 2018-12-05 2023-05-02 阿里巴巴集团控股有限公司 Language sequence labeling method, device storage medium and computer equipment
CN111274813A (en) * 2018-12-05 2020-06-12 阿里巴巴集团控股有限公司 Language sequence marking method, device storage medium and computer equipment
CN109902090A (en) * 2019-02-19 2019-06-18 北京明略软件系统有限公司 Field name acquisition methods and device
CN109902090B (en) * 2019-02-19 2022-06-07 北京明略软件系统有限公司 Method and device for acquiring field name
CN110210041A (en) * 2019-05-23 2019-09-06 北京百度网讯科技有限公司 The neat method, device and equipment of intertranslation sentence pair
CN110210041B (en) * 2019-05-23 2023-04-18 北京百度网讯科技有限公司 Inter-translation sentence alignment method, device and equipment
CN110705273A (en) * 2019-09-02 2020-01-17 腾讯科技(深圳)有限公司 Information processing method and device based on neural network, medium and electronic equipment
WO2021184769A1 (en) * 2020-03-17 2021-09-23 江苏省舜禹信息技术有限公司 Operation method and apparatus for neural network text translation model, and device and medium
CN111444730A (en) * 2020-03-27 2020-07-24 新疆大学 Data enhancement Weihan machine translation system training method and device based on Transformer model
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN113515952A (en) * 2021-08-18 2021-10-19 内蒙古工业大学 Mongolian dialogue model combined modeling method, system and equipment
CN113515952B (en) * 2021-08-18 2023-09-12 内蒙古工业大学 Combined modeling method, system and equipment for Mongolian dialogue model
CN117195922A (en) * 2023-11-07 2023-12-08 四川语言桥信息技术有限公司 Human-in-loop neural machine translation method, system and readable storage medium
CN117195922B (en) * 2023-11-07 2024-01-26 四川语言桥信息技术有限公司 Human-in-loop neural machine translation method, system and readable storage medium

Similar Documents

Publication Publication Date Title
CN104391842A (en) Translation model establishing method and system
US11775760B2 (en) Man-machine conversation method, electronic device, and computer-readable medium
Liu et al. Unsupervised paraphrasing by simulated annealing
CN107291693B (en) Semantic calculation method for improved word vector model
CN107766324B (en) Text consistency analysis method based on deep neural network
CN106919646B (en) Chinese text abstract generating system and method
CN100527125C (en) On-line translation model selection method of statistic machine translation
CN107844469A (en) The text method for simplifying of word-based vector query model
CN103870000B (en) The method and device that candidate item caused by a kind of pair of input method is ranked up
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
US20070174040A1 (en) Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN105068997B (en) The construction method and device of parallel corpora
CN110134946A (en) A kind of machine reading understanding method for complex data
CN106611041A (en) New text similarity solution method
CN110909116B (en) Entity set expansion method and system for social media
CN104156349A (en) Unlisted word discovering and segmenting system and method based on statistical dictionary model
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN110929022A (en) Text abstract generation method and system
CN104699797A (en) Webpage data structured analytic method and device
CN111813923A (en) Text summarization method, electronic device and storage medium
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN113988012B (en) Unsupervised social media abstract method integrating social context and multi-granularity relationship
CN117271736A (en) Question-answer pair generation method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150304