CN104391842A - Translation model establishing method and system - Google Patents

Translation model establishing method and system Download PDF

Info

Publication number
CN104391842A
Authority
CN
China
Prior art keywords
phrase
semantic
word
source language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410797926.8A
Other languages
Chinese (zh)
Inventor
熊德意
王超超
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201410797926.8A priority Critical patent/CN104391842A/en
Publication of CN104391842A publication Critical patent/CN104391842A/en
Pending legal-status Critical Current

Abstract

The invention discloses a translation model construction method and system. The method first generates a rule alignment table, a word semantic vector table and a phrase table from the alignment information of a bilingual parallel corpus; it then uses the word semantic vector table and the phrase table to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space; finally, it trains on the phrase semantic vector tables of the two semantic spaces to generate a translation model that incorporates semantic information. The invention thus integrates phrase semantic information into statistical machine translation. Research shows that the semantic information of a word or phrase reflects its correlation with context words or phrases, and a translation model that incorporates phrase semantic information achieves higher translation quality than conventional word-based or phrase-based translation methods, so the invention further improves the translation performance of statistical machine translation compared with the prior art.

Description

Translation model construction method and system
Technical Field
The invention belongs to the technical field of statistical machine translation, and particularly relates to a translation model construction method and system.
Background
In recent years, with the improvement of computing power and the continuous enrichment of corpus resources, statistical machine translation has become one of the most important research hotspots in the field of natural language processing.
The implementation of statistical machine translation typically involves two main processes: training and decoding. Training refers to learning a statistical translation model from corpus resources according to a certain algorithm; decoding, i.e., translation, refers to translating the text to be translated according to the trained translation model. The earliest statistical machine translation methods were built on the noisy-channel model; researchers later generalized this model in practice using the maximum-entropy framework. On this basis, statistical machine translation methods based on words, phrases and syntax were developed, each improving machine translation performance to some extent compared with earlier translation models. However, the translation goal of "faithfulness, expressiveness and elegance" is still far from being achieved.
Disclosure of Invention
In view of this, the present invention provides a translation model construction method and system to effectively improve the translation quality of statistical machine translation and move closer to the translation goal of "faithfulness, expressiveness and elegance".
Therefore, the invention discloses the following technical scheme:
a translation model building method, comprising:
obtaining a bilingual parallel corpus, wherein the bilingual parallel corpus comprises a contrast translation from a source language sentence to a target language sentence;
generating a rule alignment table, a word semantic vector table and a phrase table by using the bilingual parallel corpus, wherein the rule alignment table comprises a bilingual comparison hierarchical phrase rule, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
Preferably, the generating a rule alignment table, a word semantic vector table, and a phrase table by using the bilingual parallel corpus includes:
preprocessing the bilingual parallel corpus to obtain word alignment information, wherein the word alignment information comprises word pairs aligned between the two languages;
generating a rule alignment table according to the word alignment information, wherein the rule alignment table comprises bilingual hierarchical phrase rules, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, wherein X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
generating a word semantic vector table according to the word alignment information, wherein the word semantic vector table comprises word semantic vectors of bilingual comparison, and the word semantic vectors are obtained by calculating pointwise mutual information (PMI):

\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word;
and generating a phrase table according to the word alignment information.
Preferably, the generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table includes:
for each source language phrase in a phrase table, retrieving a semantic vector of each word contained in the source language phrase from the word semantic vector table;
performing weighted vector addition operation on semantic vectors of words included in the source language phrase to obtain semantic vectors of the source language phrase, wherein the semantic vectors of the source language phrase in the phrase table form a source language phrase semantic vector table in a source language semantic space;
for each target language phrase in a phrase table, retrieving a semantic vector of each word contained in the phrase table from the word semantic vector table;
and performing weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, wherein the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
Preferably, the processing a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space to obtain a translation model includes:
training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space;
and mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, wherein the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
The method preferably further includes translating the text to be translated by using the translation model.
Preferably, the translating the text to be translated by using the translation model includes:
carrying out phrase segmentation on sentences of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
extracting phrases in the phrase sequence in sequence, and for the extracted phrases, retrieving aligned phrases corresponding to the extracted phrases from the rule alignment table, wherein the aligned phrases comprise source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, and N is a natural number not less than 1;
searching phrase semantic vectors corresponding to the source language phrases and the N candidate target language phrases from the translation model respectively;
and respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vectors, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
Preferably, in the above method, the formula for calculating the semantic similarity is as follows:
\mathrm{Sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_i (a_i \times b_i)}{\sqrt{\sum_i a_i^2 \times \sum_i b_i^2}}

wherein Sim(u, v) represents the semantic similarity of the phrases corresponding to the semantic vectors u and v, and a_i and b_i respectively represent the value of the i-th dimension of u and v.
A translation model building system comprising:
the system comprises an acquisition module, a translation module and a translation module, wherein the acquisition module is used for acquiring a bilingual parallel corpus which comprises a contrast translation from a source language sentence to a target language sentence;
the first generation module is used for generating a rule alignment table, a word semantic vector table and a phrase table by utilizing the bilingual parallel corpus, wherein the rule alignment table comprises bilingual comparison hierarchical phrase rules, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
the second generation module is used for generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and the processing module is used for processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
In the above system, preferably, the first generating module includes:
the preprocessing unit is used for preprocessing the bilingual parallel corpus to obtain word alignment information, and the word alignment information comprises bilingual matched word words;
a first generating unit, configured to generate a rule alignment table according to the word alignment information, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, where X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
a second generating unit, configured to generate a word semantic vector table according to the word alignment information, where the word semantic vector table includes word semantic vectors for bilingual comparison, and the word semantic vectors are obtained through PMI calculation:
\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word;
and the third generating unit is used for generating a phrase table according to the word alignment information.
In the above system, preferably, the second generating module includes:
the first retrieval unit is used for retrieving semantic vectors of all words contained in the source language phrases from the word semantic vector table for each source language phrase in the phrase table;
the first calculating unit is used for performing weighted vector addition operation on semantic vectors of all words included in the source language phrase to obtain semantic vectors of the source language phrase, and the semantic vectors of all the source language phrases in the phrase table form a source language phrase semantic vector table in a source language semantic space;
a second retrieval unit, which is used for retrieving the semantic vector of each word contained in each target language phrase in the phrase table from the word semantic vector table;
and the second calculation unit is used for carrying out weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, and the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
In the above system, preferably, the processing module includes:
the intermediate module generating unit is used for training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space;
and the translation model generating unit is used for mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, and the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
Preferably, the system further includes a translation module, and the translation module is configured to translate the text to be translated by using the translation model.
The above system, preferably, the translation module includes:
the segmentation unit is used for performing phrase segmentation on the sentences of the text to be translated to obtain a phrase sequence corresponding to the text to be translated;
an aligned phrase retrieval unit, configured to sequentially extract phrases in the phrase sequence, and retrieve, for the extracted phrases, aligned phrases corresponding to the extracted phrases from the rule alignment table, where the aligned phrases include source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, where N is a natural number not less than 1;
a semantic vector retrieval unit, configured to retrieve, from the translation model, phrase semantic vectors corresponding to the source language phrase and the N candidate target language phrases, respectively;
and the similarity calculation unit is used for respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vector, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
In summary, starting from the consideration of integrating semantic information into statistical machine translation, the invention uses the alignment information of a bilingual parallel corpus to generate a rule alignment table, a word semantic vector table and a phrase table, then uses the word semantic vector table and the phrase table to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space, and finally trains on the phrase semantic vector tables of the two semantic spaces to generate a translation model that incorporates semantic information. The invention thus integrates phrase semantic information into statistical machine translation. The applicant has found through research that the semantic information of a word or phrase reflects its correlation with context words or phrases, and that a translation model incorporating phrase semantic information achieves higher translation quality than a conventional word-based or phrase-based translation model, so the invention further improves the translation performance of statistical machine translation compared with the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for constructing a translation model according to an embodiment of the present invention;
FIG. 2 is another flowchart of a translation model building method disclosed in the second embodiment of the present invention;
FIG. 3 is a flowchart illustrating a translation of a text to be translated by using a translation model according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a translation model building system according to a third embodiment of the present invention;
FIG. 5 is another schematic structural diagram of a translation model building system according to a third embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a translation module disclosed in the third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The applicant has found through research that the semantic information of a word or phrase can reflect its correlation with context words or phrases, and that, compared with conventional word-based or phrase-based translation methods, a translation model that incorporates phrase semantic information achieves higher translation quality, thereby further improving the translation performance of statistical machine translation. The purpose of the invention is therefore to integrate phrase semantic information into a statistical machine translation system and to realize a translation process based on phrase semantic information.
To this end, this embodiment discloses a translation model building method fusing phrase semantic information, and with reference to fig. 1, the method may include the following steps:
s101: a bilingual parallel corpus is obtained that includes a comparison translation of a source language sentence to a target language sentence.
This step collects a bilingual parallel corpus to provide the raw corpus support for training and generating the translation model. The collected bilingual parallel corpus is a sentence-aligned bilingual corpus containing the aligned translations from source language sentences to target language sentences.
S102: and generating a rule alignment table, a word semantic vector table and a phrase table by using the bilingual parallel corpus, wherein the rule alignment table comprises a bilingual comparison hierarchical phrase rule, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information.
Specifically, the sentence-aligned bilingual parallel corpus is first preprocessed to obtain a bilingual corpus containing word alignment information, where the word alignment information comprises word pairs aligned between the two languages.
On the basis of the preprocessing, a bilingually aligned rule alignment table is generated according to the word alignment information. The rule alignment table specifically refers to a table structure formed by source language hierarchical phrase rules and target language hierarchical phrase rules, and the expression form of a hierarchical phrase rule is shown in formula (1):
X → <γ, α, ~>    (1)
In formula (1), X is a non-terminal symbol, γ and α are character strings composed of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α.
In this embodiment, a hierarchical phrase rule contains at most 2 non-terminals, each non-terminal must cover at least 2 words, the two non-terminals at the source language side of a rule cannot be adjacent to each other, and each rule in the rule table must contain at least one word alignment link; an illustrative rule is given below.
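For illustration only (this example does not appear in the patent; the Chinese/English wording is hypothetical), a hierarchical phrase rule of the form in formula (1) extracted from Chinese-English data could look like:

X → <X_1 的 X_2 , the X_2 of X_1>

where the co-indexation "~" pairs X_1 on the source side with X_1 on the target side and X_2 with X_2, so that the sub-phrases they cover are translated recursively and reordered around "的" / "of". This example respects the constraints above: it contains exactly 2 non-terminals, they are not adjacent on the source side, and it carries word alignment information ("的" aligned to "of").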
And then, continuing to train and generate corresponding word semantic vectors at the source language end and the target language end by using the bilingual parallel corpus with the word alignment information to obtain a bilingual aligned word semantic vector table.
The word semantic vectors can be obtained by calculating pointwise mutual information (PMI); that is, in this embodiment each component of a word semantic vector is a PMI value, and the PMI is calculated specifically as:
\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}    (2)

In formula (2), t denotes the target word and c_i denotes a related word adjacent to t within the context window; this embodiment sets the length of the context window (one word corresponds to one basic length unit) to 5 in advance. freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word.
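As a hedged illustration only (this code is not part of the patent; the function and variable names are hypothetical), formula (2) could be computed over a tokenized corpus roughly as follows, using a context window of 5 as in this embodiment:

    from collections import Counter
    from math import log

    def build_pmi_vectors(sentences, window=5):
        """Build PMI-based word semantic vectors from tokenized sentences.
        Each target word t is mapped to {context word c_i: pmi(c_i, t)},
        following formula (2)."""
        freq = Counter()     # freq_t / freq_{c_i}: occurrences of each word
        co_freq = Counter()  # freq_{c_i t}: co-occurrences inside the window
        total = 0            # freq_total: occurrences of all words
        for tokens in sentences:
            total += len(tokens)
            freq.update(tokens)
            for i, t in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        co_freq[(tokens[j], t)] += 1
        vectors = {}
        for (c, t), f_ct in co_freq.items():
            vectors.setdefault(t, {})[c] = log(f_ct * total / (freq[c] * freq[t]))
        return vectors

    # Usage sketch (run separately on the source-language side and the
    # target-language side of the word-aligned corpus):
    # source_vectors = build_pmi_vectors(source_sentences)

In this sketch each word's semantic vector is a sparse mapping from context words to PMI values; the patent itself does not prescribe a particular data structure.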
Next, a phrase table is generated from the bilingual parallel corpus with word alignment information. In this embodiment, the length interval of the phrases in the phrase table is [2, 5]; that is, the number of words contained in each phrase in the phrase table is between 2 and 5.
S103: and generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table.
Specifically, for each source language phrase in the phrase table, this step retrieves from the word semantic vector table the semantic vector of each word contained in the phrase, and performs a weighted vector addition over the retrieved word semantic vectors to obtain the semantic vector of the source language phrase. Finally, the semantic vectors of all source language phrases in the phrase table constitute the source language phrase semantic vector table in the source language semantic space.
Correspondingly, for each target language phrase in the phrase table, this step retrieves from the word semantic vector table the semantic vector of each word contained in the phrase, and performs a weighted vector addition over the retrieved word semantic vectors to obtain the semantic vector of the target language phrase. Finally, the semantic vectors of all target language phrases in the phrase table constitute the target language phrase semantic vector table in the target language semantic space.
Taking a phrase P consisting of two words as an example, the formula for calculating the semantic vector of the phrase by weighted vector addition can be expressed as:
\vec{p} = \alpha\vec{u} + \beta\vec{v}    (3)

In formula (3), p denotes the phrase semantic vector of the phrase P, u and v respectively denote the word semantic vectors of the two words contained in the phrase P, and α and β respectively denote the weights corresponding to u and v, where α and β can be obtained by training with the open source tool disect.
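A minimal sketch of the weighted vector addition of formula (3), assuming the word vectors are plain lists of floats and the weights (e.g. α and β) are supplied externally by the weight-training step mentioned above; the names are illustrative and not taken from the patent:

    def compose_phrase_vector(word_vectors, weights):
        """Weighted vector addition of formula (3): for a two-word phrase P,
        p = alpha * u + beta * v."""
        dim = len(word_vectors[0])
        phrase_vec = [0.0] * dim
        for vec, w in zip(word_vectors, weights):
            for i, value in enumerate(vec):
                phrase_vec[i] += w * value
        return phrase_vec

    # Usage sketch for a two-word phrase P = (w1, w2):
    # p = compose_phrase_vector([u, v], [alpha, beta])

For a two-word phrase this call corresponds directly to formula (3); longer phrases would simply pass more word vectors and weights.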
S104: and processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
Firstly, a source language phrase semantic vector under a source language semantic space and a target language phrase semantic vector under a target language semantic space are respectively used as input and output of a neural network to train and generate a neural network model with a hidden layer.
On the basis, the neural network model is utilized to map the phrase semantic vector tables in different semantic spaces to the same semantic space, and a required translation model is generated.
Specifically, the neural network model is used for mapping the source language phrase semantic vector table in the source language semantic space to the target language semantic space to obtain the source language phrase semantic vector table in the target language semantic space. The adopted mapping formula is as follows:
\vec{p} = g\left(W_2\, g(W_1\vec{x} + b_1) + b_2\right)    (4)

In formula (4), W_1 denotes the mapping matrix from the input layer to the hidden layer of the neural network model, W_2 denotes the mapping matrix from the hidden layer to the output layer, b_1 and b_2 are the bias values, and g(·) is usually chosen as a non-linear function (e.g., sigmoid or tanh).
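The mapping of formula (4) can be sketched as a single feed-forward step; the sketch below assumes NumPy arrays and already-trained parameters W1, b1, W2, b2 (all names are illustrative, not the patent's):

    import numpy as np

    def map_to_target_space(x, W1, b1, W2, b2, g=np.tanh):
        """Formula (4): map a source-language phrase semantic vector x into
        the target-language semantic space through one hidden layer."""
        hidden = g(W1 @ x + b1)      # input layer -> hidden layer
        return g(W2 @ hidden + b2)   # hidden layer -> output layer

    # Usage sketch:
    # mapped_vec = map_to_target_space(source_phrase_vec, W1, b1, W2, b2)

Applying such a function to every entry of the source language phrase semantic vector table yields the source language phrase semantic vector table in the target language semantic space.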
Finally, the source language phrase semantic vector table and the target language phrase semantic vector table in the same semantic space (i.e. the target semantic space) form a required translation model, and then, a translation process of fusing phrase semantic information can be realized based on the translation model.
In summary, starting from the consideration of integrating semantic information into statistical machine translation, the invention uses the alignment information of a bilingual parallel corpus to generate a rule alignment table, a word semantic vector table and a phrase table, then uses the word semantic vector table and the phrase table to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space, and finally trains on the phrase semantic vector tables of the two semantic spaces to generate a translation model that incorporates semantic information. The invention thus integrates phrase semantic information into statistical machine translation. Because the semantic information of a word or phrase reflects its correlation with context words or phrases, the translation model that incorporates phrase semantic information achieves higher translation quality than conventional word-based or phrase-based translation methods, so the invention further improves the translation performance of statistical machine translation compared with the prior art.
Example two
In this second embodiment, referring to fig. 2, the method may further include the following steps:
s105: and translating the text to be translated by utilizing the translation model.
Specifically, referring to fig. 3, the implementation process of translating the text to be translated in this step specifically includes:
s301: carrying out phrase segmentation on sentences of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
s302: extracting phrases in the phrase sequence in sequence, and for the extracted phrases, retrieving aligned phrases corresponding to the extracted phrases from the rule alignment table, wherein the aligned phrases comprise source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, and N is a natural number not less than 1;
s303: searching phrase semantic vectors corresponding to the source language phrases and the N candidate target language phrases from the translation model respectively;
s304: and respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vectors, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
After the translation model is generated through training, the translation model can be used for translating the text to be translated.
Specifically, for a text to be translated, each sentence of the text is first segmented into phrases in sequence; each source language phrase in the resulting phrase sequence is then matched against the rule alignment table by string matching to find its corresponding aligned phrases, so as to obtain every possible target language phrase of each source language phrase as its candidate target language phrases; then, for the source language phrase and each of its candidate target language phrases, the corresponding phrase semantic vectors are retrieved from the phrase semantic vector tables of the translation model.
On the basis, according to the retrieval and matching results, calculating the semantic similarity between the source language phrase and each candidate target language phrase, wherein the adopted semantic similarity calculation formula is as follows:
\mathrm{Sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_i (a_i \times b_i)}{\sqrt{\sum_i a_i^2 \times \sum_i b_i^2}}    (5)

In formula (5), Sim(u, v) denotes the semantic similarity of the phrases corresponding to the vectors u and v, and a_i and b_i respectively denote the value of the i-th dimension of u and v. The smaller the angle between u and v, the higher the semantic similarity of the corresponding phrases and the larger the cosine value; the larger the angle, the lower the semantic similarity and the smaller the cosine value.
And finally, selecting the candidate target language phrase corresponding to the maximum semantic similarity value as a translation result of the source language phrase to realize translation.
For example, for a source language phrase to be translated, suppose that 5 possible target language phrases are obtained by matching against the rule alignment table and that their similarities to the source language phrase are calculated to be 0.96, 0.85, 0.54, 0.38 and 0.15, respectively; this embodiment then selects the phrase with the highest similarity, i.e. the phrase corresponding to 0.96, as the target translation result, as sketched below.
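For illustration only (not the patent's implementation; names are hypothetical), formula (5) and the selection of the best candidate could be sketched as:

    import math

    def cosine_similarity(u, v):
        """Semantic similarity of two phrase semantic vectors, formula (5)."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def pick_best_candidate(source_vec, candidates):
        """candidates: list of (target_phrase, target_phrase_vector) pairs.
        Returns the candidate phrase with the highest semantic similarity."""
        return max(candidates, key=lambda c: cosine_similarity(source_vec, c[1]))[0]

    # Usage sketch:
    # best = pick_best_candidate(mapped_source_vec,
    #                            [("phrase A", vec_a), ("phrase B", vec_b)])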
EXAMPLE III
The third embodiment discloses a translation model building system, which corresponds to the translation model building method disclosed in each of the above embodiments.
First, referring to fig. 4, the system includes an obtaining module 100, a first generating module 200, a second generating module 300, and a processing module 400, corresponding to the first embodiment.
An obtaining module 100 is configured to obtain a bilingual parallel corpus, which includes a comparison translation from a source language sentence to a target language sentence.
A first generating module 200, configured to generate a rule alignment table, a word semantic vector table, and a phrase table by using the bilingual parallel corpus, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, the word semantic vector table includes word semantic vectors for bilingual comparison, and the phrase table includes phrase information for bilingual comparison.
The first generation module 200 includes a preprocessing unit, a first generation unit, a second generation unit, and a third generation unit.
The preprocessing unit is used for preprocessing the bilingual parallel corpus to obtain word alignment information, and the word alignment information comprises word pairs aligned between the two languages;
a first generating unit, configured to generate a rule alignment table according to the word alignment information, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, where X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
a second generating unit, configured to generate a word semantic vector table according to the word alignment information, where the word semantic vector table includes word semantic vectors for bilingual comparison, and the word semantic vectors are obtained through PMI calculation:
\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word.
And the third generating unit is used for generating a phrase table according to the word alignment information.
A second generating module 300, configured to generate a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space by using the word semantic vector table and the phrase table.
Specifically, the second generation module 300 includes a first retrieval unit, a first calculation unit, a second retrieval unit, and a second calculation unit.
The first retrieval unit is used for retrieving semantic vectors of all words contained in the source language phrases from the word semantic vector table for each source language phrase in the phrase table;
the first calculating unit is used for performing weighted vector addition operation on semantic vectors of all words included in the source language phrase to obtain semantic vectors of the source language phrase, and the semantic vectors of all the source language phrases in the phrase table form a source language phrase semantic vector table in a source language semantic space;
a second retrieval unit, which is used for retrieving the semantic vector of each word contained in each target language phrase in the phrase table from the word semantic vector table;
and the second calculation unit is used for carrying out weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, and the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
The processing module 400 is configured to process the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
Wherein the processing module 400 comprises an intermediate module generating unit and a translation model generating unit.
The intermediate module generating unit is used for training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space;
and the translation model generating unit is used for mapping the source language phrase semantic vector table in the source language semantic space to the target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space; the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form the translation model.
Corresponding to the second embodiment, referring to fig. 5, the system further includes a translation module 500, configured to translate the text to be translated by using the translation model.
Specifically, referring to fig. 6, the translation module includes a segmentation unit 601, an aligned phrase retrieval unit 602, a semantic vector retrieval unit 603, and a similarity calculation unit 604.
The segmentation unit 601 is configured to perform phrase segmentation on a sentence of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
an aligned phrase retrieving unit 602, configured to sequentially extract phrases in the phrase sequence, and for the extracted phrases, retrieve an aligned phrase corresponding to the extracted phrases from the rule alignment table, where the aligned phrase includes source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, where N is a natural number not less than 1;
a semantic vector retrieving unit 603, configured to retrieve, from the translation model, phrase semantic vectors corresponding to the source language phrase and the N candidate target language phrases, respectively;
and a similarity calculation unit 604, configured to calculate semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vector, and use the candidate target language phrase with the largest semantic similarity as the translation result of the extracted phrase, so as to translate the text to be translated.
For the translation model construction system disclosed in the third embodiment of the present invention, since it corresponds to the translation model construction method disclosed in each of the above embodiments, the description is relatively simple, and for the relevant similarities, please refer to the description of the translation model construction method in each of the above embodiments, and the details are not described here.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system is described as being divided into various modules or units by function, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A translation model construction method is characterized by comprising the following steps:
obtaining a bilingual parallel corpus, wherein the bilingual parallel corpus comprises a contrast translation from a source language sentence to a target language sentence;
generating a rule alignment table, a word semantic vector table and a phrase table by using the bilingual parallel corpus, wherein the rule alignment table comprises a bilingual comparison hierarchical phrase rule, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
2. The method of claim 1, wherein generating a rule alignment table, a word semantic vector table, and a phrase table using the bilingual parallel corpus comprises:
preprocessing the bilingual parallel corpus to obtain word alignment information, wherein the word alignment information comprises word pairs aligned between the two languages;
generating a rule alignment table according to the word alignment information, wherein the rule alignment table comprises bilingual hierarchical phrase rules, and a hierarchical phrase rule can be expressed as: X → <γ, α, ~>, wherein X is a non-terminal symbol, γ and α are character strings consisting of terminal and non-terminal symbols, and the symbol "~" represents a one-to-one correspondence between the non-terminals appearing in γ and the non-terminals appearing in α;
generating a word semantic vector table according to the word alignment information, wherein the word semantic vector table comprises word semantic vectors of bilingual comparison, and the word semantic vectors are obtained by calculating pointwise mutual information (PMI):

\mathrm{pmi}(c_i, t) = \log\frac{p(c_i, t)}{p(c_i)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t} \times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i} \times \mathrm{freq}_{t}}

where t denotes the target word, c_i denotes a related word adjacent to t within the context window, freq_{c_i t} denotes the number of co-occurrences of the related word c_i with the target word t, freq_total denotes the total number of occurrences of all words, freq_{c_i} denotes the number of occurrences of the context word, and freq_t denotes the number of occurrences of the target word;
and generating a phrase table according to the word alignment information.
3. The method of claim 2, wherein generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space using the word semantic vector table and the phrase table comprises:
for each source language phrase in a phrase table, retrieving a semantic vector of each word contained in the source language phrase from the word semantic vector table;
performing weighted vector addition operation on semantic vectors of words included in the source language phrase to obtain semantic vectors of the source language phrase, wherein the semantic vectors of the source language phrase in the phrase table form a source language phrase semantic vector table in a source language semantic space;
for each target language phrase in a phrase table, retrieving a semantic vector of each word contained in the phrase table from the word semantic vector table;
and performing weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, wherein the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
4. The method of claim 3, wherein the processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model comprises:
training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in the source language semantic space and a target language phrase semantic vector table in the target language semantic space;
and mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, wherein the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
5. The method of claim 4, further comprising translating text to be translated using the translation model.
6. The method of claim 5, wherein translating the text to be translated using the translation model comprises:
carrying out phrase segmentation on sentences of a text to be translated to obtain a phrase sequence corresponding to the text to be translated;
extracting phrases in the phrase sequence in sequence, and for the extracted phrases, retrieving aligned phrases corresponding to the extracted phrases from the rule alignment table, wherein the aligned phrases comprise source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, and N is a natural number not less than 1;
searching phrase semantic vectors corresponding to the source language phrases and the N candidate target language phrases from the translation model respectively;
and respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vectors, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
7. The method according to claim 6, wherein the semantic similarity is calculated by the formula:
\mathrm{Sim}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_i (a_i \times b_i)}{\sqrt{\sum_i a_i^2 \times \sum_i b_i^2}}

wherein Sim(u, v) represents the semantic similarity of the phrases corresponding to the semantic vectors u and v, and a_i and b_i respectively represent the value of the i-th dimension of u and v.
8. A translation model building system, comprising:
the system comprises an acquisition module, a translation module and a translation module, wherein the acquisition module is used for acquiring a bilingual parallel corpus which comprises a contrast translation from a source language sentence to a target language sentence;
the first generation module is used for generating a rule alignment table, a word semantic vector table and a phrase table by utilizing the bilingual parallel corpus, wherein the rule alignment table comprises bilingual comparison hierarchical phrase rules, the word semantic vector table comprises bilingual comparison word semantic vectors, and the phrase table comprises bilingual comparison phrase information;
the second generation module is used for generating a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space by using the word semantic vector table and the phrase table;
and the processing module is used for processing the source language phrase semantic vector table in the source language semantic space and the target language phrase semantic vector table in the target language semantic space to obtain a translation model.
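A minimal, assumed skeleton showing how the four modules of claim 8 could be wired together; the class name, argument names, and the dictionary returned are illustrative only and are not prescribed by the patent.

```python
# Assumed structure (not the patented code) mirroring claim 8's modules:
# acquisition, first generation (rule alignment / word vectors / phrase
# table), second generation (phrase vectors per language), and processing
# (mapping network, i.e. the translation model).
class TranslationModelBuilder:
    def __init__(self, acquire, generate_tables, generate_phrase_vectors, train_mapping):
        # Each argument is a callable implementing one module.
        self.acquire = acquire
        self.generate_tables = generate_tables
        self.generate_phrase_vectors = generate_phrase_vectors
        self.train_mapping = train_mapping

    def build(self, corpus_path):
        corpus = self.acquire(corpus_path)                       # bilingual parallel corpus
        rules, word_vecs, phrases = self.generate_tables(corpus)  # rule alignment, word vectors, phrase table
        src_vecs, tgt_vecs = self.generate_phrase_vectors(word_vecs, phrases)
        mapping = self.train_mapping(src_vecs, tgt_vecs)          # neural mapping network
        return {"rules": rules, "src_vecs": src_vecs,
                "tgt_vecs": tgt_vecs, "mapping": mapping}
```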
9. The system of claim 8, wherein the first generation module comprises:
the preprocessing unit is used for preprocessing the bilingual parallel corpus to obtain word alignment information, wherein the word alignment information comprises bilingually aligned word pairs;
a first generating unit, configured to generate a rule alignment table according to the word alignment information, where the rule alignment table includes hierarchical phrase rules for bilingual comparison, and a hierarchical phrase rule may be expressed as X → ⟨γ, α, ~⟩, wherein X is a non-terminal symbol, γ and α are strings consisting of terminal and non-terminal symbols, and the symbol "~" denotes a one-to-one correspondence between the non-terminals appearing in γ and those appearing in α;
a second generating unit, configured to generate a word semantic vector table according to the word alignment information, where the word semantic vector table includes word semantic vectors for bilingual comparison, and the word semantic vectors are obtained through PMI calculation:
$$\mathrm{pmi}(c,t) = \log\frac{p(c,t)}{p(c)\,p(t)} = \log\frac{\mathrm{freq}_{c_i t}\times \mathrm{freq}_{total}}{\mathrm{freq}_{c_i}\times \mathrm{freq}_{t}}$$
where $t$ denotes the target word, $c_i$ denotes a context word adjacent to $t$ within the context window, $\mathrm{freq}_{c_i t}$ denotes the number of co-occurrences of the context word $c_i$ with the target word $t$, $\mathrm{freq}_{total}$ denotes the total number of word occurrences, $\mathrm{freq}_{c_i}$ denotes the number of occurrences of the context word, and $\mathrm{freq}_{t}$ denotes the number of occurrences of the target word (a minimal illustrative sketch of this computation is given after this claim);
and the third generating unit is used for generating a phrase table according to the word alignment information.
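As referenced in the second generating unit above, the following is a minimal illustrative sketch of the PMI-based word semantic vector computation; the symmetric context window, the sparse dictionary representation, and all names are assumptions rather than details fixed by the claim.

```python
# Minimal sketch: build sparse PMI word vectors from tokenized sentences,
# following the pmi(c, t) formula above. Window size is an assumption.
import math
from collections import Counter

def pmi_word_vectors(sentences, window=2):
    """sentences: list of tokenized sentences (lists of words).
    Returns {target word: {context word: pmi value}} as a sparse vector table."""
    freq = Counter()   # freq_t / freq_ci: individual word counts
    cooc = Counter()   # freq_{ci,t}: (context word, target word) co-occurrence counts
    for sent in sentences:
        freq.update(sent)
        for i, t in enumerate(sent):
            for c in sent[max(0, i - window): i] + sent[i + 1: i + 1 + window]:
                cooc[(c, t)] += 1
    freq_total = sum(freq.values())
    vectors = {}
    for (c, t), n_ct in cooc.items():
        pmi = math.log((n_ct * freq_total) / (freq[c] * freq[t]))
        vectors.setdefault(t, {})[c] = pmi
    return vectors
```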
10. The system of claim 9, wherein the second generation module comprises:
the first retrieval unit is used for retrieving semantic vectors of all words contained in the source language phrases from the word semantic vector table for each source language phrase in the phrase table;
the first calculating unit is used for performing a weighted vector addition operation on the semantic vectors of all words included in a source language phrase to obtain the semantic vector of the source language phrase, and the semantic vectors of all the source language phrases in the phrase table form a source language phrase semantic vector table in a source language semantic space;
a second retrieval unit, which is used for retrieving the semantic vector of each word contained in each target language phrase in the phrase table from the word semantic vector table;
and the second calculation unit is used for carrying out weighted vector addition operation on semantic vectors of all words included in the target language phrase to obtain the semantic vector of the target language phrase, and the semantic vector of each target language phrase in the phrase table forms a target language phrase semantic vector table in a target language semantic space.
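A minimal sketch of the weighted vector addition described in claim 10, assuming dense NumPy word vectors; the uniform default weighting is an assumption, since the claim does not fix a weighting scheme.

```python
# Minimal sketch: compose a phrase semantic vector as a weighted sum of
# the semantic vectors of the words it contains.
import numpy as np

def phrase_vector(phrase, word_vec_table, weights=None):
    """phrase: list of words; word_vec_table: {word: np.ndarray}."""
    vecs = [word_vec_table[w] for w in phrase if w in word_vec_table]
    if not vecs:
        raise KeyError("no word vectors found for phrase: %r" % (phrase,))
    if weights is None:
        weights = [1.0 / len(vecs)] * len(vecs)   # uniform weights by default (assumption)
    return sum(w * v for w, v in zip(weights, vecs))

def phrase_vector_table(phrase_table, word_vec_table):
    """Build {phrase string: semantic vector} for one language side."""
    return {" ".join(p): phrase_vector(p, word_vec_table) for p in phrase_table}
```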
11. The system of claim 10, wherein the processing module comprises:
the intermediate module generating unit is used for training and generating a neural network model containing a hidden layer through a source language phrase semantic vector table in a source language semantic space and a target language phrase semantic vector table in a target language semantic space;
and the translation model generating unit is used for mapping the source language phrase semantic vector table in the source language semantic space to a target language semantic space by using the neural network model to obtain a source language phrase semantic vector table in the target language semantic space, and the source language phrase semantic vector table and the target language phrase semantic vector table in the target language semantic space form a translation model.
12. The system of claim 11, further comprising a translation module configured to translate text to be translated using the translation model.
13. The system of claim 12, wherein the translation module comprises:
the segmentation unit is used for performing phrase segmentation on the sentences of the text to be translated to obtain a phrase sequence corresponding to the text to be translated;
an aligned phrase retrieval unit, configured to sequentially extract phrases in the phrase sequence, and retrieve, for the extracted phrases, aligned phrases corresponding to the extracted phrases from the rule alignment table, where the aligned phrases include source language phrases of the extracted phrases and N candidate target language phrases corresponding to the source language phrases, where N is a natural number not less than 1;
a semantic vector retrieval unit, configured to retrieve, from the translation model, phrase semantic vectors corresponding to the source language phrase and the N candidate target language phrases, respectively;
and the similarity calculation unit is used for respectively calculating the semantic similarity between each candidate target language phrase and the source language phrase based on the retrieved phrase semantic vector, and taking the candidate target language phrase with the maximum semantic similarity as the translation result of the extracted phrase so as to translate the text to be translated.
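Finally, a minimal end-to-end sketch of the translation module of claims 12 and 13, combining the mapping network and the cosine similarity sketched above in a greedy phrase-by-phrase loop; the table layouts and the pass-through handling of unknown phrases are illustrative assumptions, not requirements of the claims.

```python
# Minimal sketch: greedy phrase-by-phrase translation with the assumed
# table layouts {source phrase: [candidate target phrases]} and
# {phrase: semantic vector}.
import numpy as np

def _cos(u, v):
    d = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / d) if d else 0.0

def translate(text_phrases, rule_alignment, src_phrase_vecs,
              tgt_phrase_vecs, map_to_target_space):
    """text_phrases: phrase sequence obtained by segmenting the input text.
    map_to_target_space: maps a source-space vector into the target space."""
    output = []
    for phrase in text_phrases:
        candidates = rule_alignment.get(phrase, [])
        if not candidates or phrase not in src_phrase_vecs:
            output.append(phrase)   # pass unknown phrases through unchanged
            continue
        mapped = map_to_target_space(src_phrase_vecs[phrase])
        scored = [(c, tgt_phrase_vecs[c]) for c in candidates if c in tgt_phrase_vecs]
        if scored:
            # Pick the candidate with maximum semantic similarity.
            output.append(max(scored, key=lambda cv: _cos(mapped, cv[1]))[0])
        else:
            output.append(candidates[0])
    return " ".join(output)
```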
CN201410797926.8A 2014-12-18 2014-12-18 Translation model establishing method and system Pending CN104391842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410797926.8A CN104391842A (en) 2014-12-18 2014-12-18 Translation model establishing method and system

Publications (1)

Publication Number Publication Date
CN104391842A true CN104391842A (en) 2015-03-04

Family

ID=52609748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410797926.8A Pending CN104391842A (en) 2014-12-18 2014-12-18 Translation model establishing method and system

Country Status (1)

Country Link
CN (1) CN104391842A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120123766A1 (en) * 2007-03-22 2012-05-17 Konstantin Anisimovich Indicating and Correcting Errors in Machine Translation Systems
CN103314369A (en) * 2010-12-17 2013-09-18 北京交通大学 Method and device for machine translation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Chaochao et al.: "Translation Similarity Model Based on Bilingual Compositional Semantics", Journal of Peking University (Natural Science Edition) *

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107430599A (en) * 2015-05-18 2017-12-01 谷歌公司 For providing the technology for the visual translation card for including context-sensitive definition and example
CN105183720B (en) * 2015-08-05 2019-07-09 百度在线网络技术(北京)有限公司 Machine translation method and device based on RNN model
CN105183720A (en) * 2015-08-05 2015-12-23 百度在线网络技术(北京)有限公司 Machine translation method and apparatus based on RNN model
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN106484681B (en) * 2015-08-25 2019-07-09 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment generating candidate translation
US10268685B2 (en) 2015-08-25 2019-04-23 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
US10255275B2 (en) 2015-08-25 2019-04-09 Alibaba Group Holding Limited Method and system for generation of candidate translations
US10810379B2 (en) 2015-08-25 2020-10-20 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
US10860808B2 (en) 2015-08-25 2020-12-08 Alibaba Group Holding Limited Method and system for generation of candidate translations
CN106484682B (en) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics
CN106484681A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 A kind of method generating candidate's translation, device and electronic equipment
CN105808530B (en) * 2016-03-23 2019-11-08 苏州大学 Interpretation method and device in a kind of statistical machine translation
CN105808530A (en) * 2016-03-23 2016-07-27 苏州大学 Translation method and device in statistical machine translation
CN105912533A (en) * 2016-04-12 2016-08-31 苏州大学 Method and device for long statement segmentation aiming at neural machine translation
CN105912533B (en) * 2016-04-12 2019-02-12 苏州大学 Long sentence cutting method and device towards neural machine translation
CN106066851A (en) * 2016-06-06 2016-11-02 清华大学 A kind of neural network training method considering evaluation index and device
WO2018032765A1 (en) * 2016-08-19 2018-02-22 华为技术有限公司 Sequence conversion method and apparatus
US11288458B2 (en) 2016-08-19 2022-03-29 Huawei Technologies Co., Ltd. Sequence conversion method and apparatus in natural language processing based on adjusting a weight associated with each word
US11132516B2 (en) 2016-11-04 2021-09-28 Huawei Technologies Co., Ltd. Sequence translation probability adjustment
CN106649289A (en) * 2016-12-16 2017-05-10 中国科学院自动化研究所 Realization method and realization system for simultaneously identifying bilingual terms and word alignment
CN106708811A (en) * 2016-12-19 2017-05-24 新译信息科技(深圳)有限公司 Data processing method and data processing device
CN106776586A (en) * 2016-12-19 2017-05-31 新译信息科技(深圳)有限公司 Machine translation method and device
US11403520B2 (en) 2017-02-03 2022-08-02 Baidu Online Network Technology (Beijing) Co., Ltd. Neural network machine translation method and apparatus
CN108388561A (en) * 2017-02-03 2018-08-10 百度在线网络技术(北京)有限公司 Neural network machine interpretation method and device
CN107193800B (en) * 2017-05-18 2023-09-01 苏州黑云智能科技有限公司 Semantic fitness evaluation method and device for third-party language text
CN107193800A (en) * 2017-05-18 2017-09-22 苏州黑云信息科技有限公司 A kind of semantic goodness of fit evaluating method and device towards third party's language text
WO2019119852A1 (en) * 2017-12-23 2019-06-27 华为技术有限公司 Language processing method and device
CN109960812A (en) * 2017-12-23 2019-07-02 华为技术有限公司 Language processing method and equipment
US11704505B2 (en) 2017-12-23 2023-07-18 Huawei Technologies Co., Ltd. Language processing method and device
CN109960812B (en) * 2017-12-23 2021-05-04 华为技术有限公司 Language processing method and device
WO2019144906A1 (en) * 2018-01-25 2019-08-01 腾讯科技(深圳)有限公司 Information conversion method and device, storage medium and electronic device
US11880667B2 (en) 2018-01-25 2024-01-23 Tencent Technology (Shenzhen) Company Limited Information conversion method and apparatus, storage medium, and electronic apparatus
CN108415901A (en) * 2018-02-07 2018-08-17 大连理工大学 A kind of short text topic model of word-based vector sum contextual information
CN108363704A (en) * 2018-03-02 2018-08-03 北京理工大学 A kind of neural network machine translation corpus expansion method based on statistics phrase table
CN108874786A (en) * 2018-06-12 2018-11-23 深圳市译家智能科技有限公司 Machine translation method and device
CN108804427A (en) * 2018-06-12 2018-11-13 深圳市译家智能科技有限公司 Speech robot interpretation method and device
CN108874786B (en) * 2018-06-12 2022-05-31 深圳市译家智能科技有限公司 Machine translation method and device
CN110874537B (en) * 2018-08-31 2023-06-27 阿里巴巴集团控股有限公司 Method for generating multilingual translation model, translation method and equipment
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN111274813B (en) * 2018-12-05 2023-05-02 阿里巴巴集团控股有限公司 Language sequence labeling method, device storage medium and computer equipment
CN111274813A (en) * 2018-12-05 2020-06-12 阿里巴巴集团控股有限公司 Language sequence marking method, device storage medium and computer equipment
CN109902090A (en) * 2019-02-19 2019-06-18 北京明略软件系统有限公司 Field name acquisition methods and device
CN109902090B (en) * 2019-02-19 2022-06-07 北京明略软件系统有限公司 Method and device for acquiring field name
CN110210041A (en) * 2019-05-23 2019-09-06 北京百度网讯科技有限公司 The neat method, device and equipment of intertranslation sentence pair
CN110210041B (en) * 2019-05-23 2023-04-18 北京百度网讯科技有限公司 Inter-translation sentence alignment method, device and equipment
CN110705273A (en) * 2019-09-02 2020-01-17 腾讯科技(深圳)有限公司 Information processing method and device based on neural network, medium and electronic equipment
WO2021184769A1 (en) * 2020-03-17 2021-09-23 江苏省舜禹信息技术有限公司 Operation method and apparatus for neural network text translation model, and device and medium
CN111444730A (en) * 2020-03-27 2020-07-24 新疆大学 Data enhancement Weihan machine translation system training method and device based on Transformer model
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN113515952A (en) * 2021-08-18 2021-10-19 内蒙古工业大学 Mongolian dialogue model combined modeling method, system and equipment
CN113515952B (en) * 2021-08-18 2023-09-12 内蒙古工业大学 Combined modeling method, system and equipment for Mongolian dialogue model
CN117195922A (en) * 2023-11-07 2023-12-08 四川语言桥信息技术有限公司 Human-in-loop neural machine translation method, system and readable storage medium
CN117195922B (en) * 2023-11-07 2024-01-26 四川语言桥信息技术有限公司 Human-in-loop neural machine translation method, system and readable storage medium

Similar Documents

Publication Publication Date Title
CN104391842A (en) Translation model establishing method and system
US11775760B2 (en) Man-machine conversation method, electronic device, and computer-readable medium
Liu et al. Unsupervised paraphrasing by simulated annealing
CN107291693B (en) Semantic calculation method for improved word vector model
CN107766324B (en) Text consistency analysis method based on deep neural network
CN106919646B (en) Chinese text abstract generating system and method
CN100527125C (en) On-line translation model selection method of statistic machine translation
CN107844469A (en) The text method for simplifying of word-based vector query model
CN103870000B (en) The method and device that candidate item caused by a kind of pair of input method is ranked up
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
US20070174040A1 (en) Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN105068997B (en) The construction method and device of parallel corpora
CN110134946A (en) A kind of machine reading understanding method for complex data
CN106611041A (en) New text similarity solution method
CN110909116B (en) Entity set expansion method and system for social media
CN104156349A (en) Unlisted word discovering and segmenting system and method based on statistical dictionary model
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN110929022A (en) Text abstract generation method and system
CN104699797A (en) Webpage data structured analytic method and device
CN111813923A (en) Text summarization method, electronic device and storage medium
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN113988012B (en) Unsupervised social media abstract method integrating social context and multi-granularity relationship
CN117271736A (en) Question-answer pair generation method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150304