CN107066455B

CN107066455B - Multi-language intelligent preprocessing real-time statistics machine translation system

Info

Publication number: CN107066455B
Application number: CN201710203439.8A
Authority: CN
Inventors: 张昱琪; 唐亮
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-03-30
Filing date: 2017-03-30
Publication date: 2020-07-28
Anticipated expiration: 2037-03-30
Also published as: CN107066455A

Abstract

The invention discloses a multilingual intelligent preprocessing real-time statistical machine translation system comprising a receiving module, a preprocessing module, a machine translation module and a post-processing module. The receiving module comprises a text language receiving module and a voice recognition result receiving module; the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module; the machine translation module is used for learning phrase-to-phrase translations, finding the corresponding translated phrases for the phrases processed by the preprocessing module, and connecting them into a complete sentence; and the post-processing module is used for word and punctuation standardization, case standardization and format standardization of the translation result, so that the result is closer to the expression habits of the target language and is output as the final result. The invention can translate both text and speech, and improves the translation accuracy of low-probability words and phrases.

Description

Multi-language intelligent preprocessing real-time statistics machine translation system

Technical Field

The invention relates to the technical field of artificial intelligence machine translation, in particular to a multi-language intelligent preprocessing real-time statistical machine translation system.

Background

Machine translation is the use of a computer to automatically convert text in one human natural language into another natural language, such that the two are equivalent in meaning.

At present, the relatively mature and mainstream approach to machine translation is the statistics-based method. Its advantage is that translation rules hardly need to be written by hand: all translation information is learned automatically from corpora, which makes full use of the computer's high-speed operation and greatly reduces labor cost.

Statistical model-based machine translation techniques learn phrase translations from one language A to another language B from a parallel corpus. When translating a new sentence, the input sentence in language A is decomposed into a plurality of phrases and translated into a sentence in language B according to the learned co-occurrence probabilities of language A phrases and language B phrases. The whole learning and translation process is based entirely on a statistical model.

However, machine translation based on co-occurrence frequency handles low-probability phrases (e.g., proper nouns) poorly. How to incorporate syntactic and semantic information into the statistical model, so that the generated translation better matches human expression habits, is also a problem that current machine translation technology needs to solve.

Disclosure of Invention

In view of the above technical problems in the related art, the present invention provides a multilingual intelligent preprocessing real-time statistical machine translation system, which can overcome the above disadvantages in the prior art.

In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:

a multi-language intelligent pre-processing real-time statistics machine translation system, comprising:

the receiving module is used for checking that the system input is well-formed, and comprises a text language receiving module and a voice recognition result receiving module, wherein the text language receiving module is used for performing sentence segmentation and format conversion on text input, and the voice recognition result receiving module is used for performing segmentation, noise elimination and format conversion on voice input;

the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module, wherein the text preprocessing module is used for performing word normalization, category recognition labeling and language block word order adjustment on text input; the voice recognition result preprocessing module is used for performing word normalization and punctuation prediction on voice input;

the machine translation module is used for learning the translation of phrases to phrases, finding out corresponding translation phrases for the phrases processed by the preprocessing module and generating complete sentences;

and the post-processing module is used for carrying out word punctuation standardization, case standardization and format standardization processing on the translation result so as to enable the translation result to be closer to the expression habit of the target language and output as a final result.

Further, the text language receiving module comprises a sentence segmentation module and a format conversion module, wherein the sentence segmentation module is used for breaking the input text at the punctuation mark so that the basic unit translated by the subsequent machine translation module is a sentence; the format conversion module is used for converting different formats of language texts into formats supported by the machine translation module during translation.

Preferably, the format supported by the machine translation module during translation is a plain text format or an XML format.

Furthermore, the voice recognition result receiving module comprises a sentence segmentation module and a noise elimination module, wherein the sentence segmentation module is used for segmenting the input speech text stream according to the pauses between words; the noise elimination module is used for removing adjacent repeated segments from the input speech text stream.

The text preprocessing module comprises a word normalization module, a category identification marking module and a language block word order adjusting module. The word normalization module is used for bringing the language to be translated closer to the target language at the word level. The category identification marking module is used for marking numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translating the content of each category into the target language in advance. The language block word order adjusting module is used for performing grammatical analysis on sentences of the language to be translated, and then adjusting the order of their language blocks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.

Furthermore, the voice recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, wherein the word normalization module is used for bringing the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module is used for determining the positions of periods in the speech recognition output according to the context and the pauses between words; the voice recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.

Further, the machine translation module comprises a training module and a translation module, wherein the training module learns phrase-to-phrase translations from a large-scale balanced corpus by using the GIZA++ toolkit; the translation module is used for dividing each input sentence into phrase segments and translating each phrase segment according to the training result of the training module, wherein the translation process of the translation module is a search process, namely, an optimal translation combination is found among the translation combinations formed from the translation results of the translation sub-models, and this optimal translation combination is the final translation result.

Preferably, the translation submodels include a phrase translation model, a language model, a word order change model, a part-of-speech based language model, a bilingual language model and a domain adaptive model.

Furthermore, the post-processing module comprises a word punctuation standardization module, a case conversion module and a format conversion module, wherein the word punctuation standardization module is used for standardizing words and punctuation in the machine translation result into the expression form of the target language; the case conversion module is used when a Western language is the target language of the translation; the format conversion module is used for making the format of the translated target language consistent with the format of the language to be translated.

Preferably, the case conversion module is used for capitalizing the first letter of a sentence and the letters of proper nouns in the target language.

The beneficial effects of the invention are as follows: the machine translation system can translate sentences and passages of one language into another language in real time, producing complete sentences with correct expression; it can translate punctuated text, and it can also translate speech that is unsegmented, possibly incomplete, without punctuation and noisy; it improves the translation accuracy of low-probability words and phrases, in that low-probability items such as numbers, dates, times and URLs are labeled by category and translated preferentially; the preprocessing module standardizes the input sentences, and the post-processing module improves the fluency of the translation result.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a translation flow diagram of a multilingual intelligent-preprocessing real-time statistical machine translation system according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a text receiving module of the multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a speech recognition result receiving module of the multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a text pre-processing module of the multilingual intelligent pre-processing real-time statistics machine translation system according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a speech recognition result preprocessing module of the multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a machine translation module of the multilingual intelligent pre-processing real-time statistics machine translation system according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating a post-processing module of the multilingual intelligent pre-processing real-time statistics machine translation system according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

Referring to FIGS. 1-7, a multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention includes a receiving module, a preprocessing module, a machine translation module and a post-processing module.

In a specific embodiment, the text language receiving module comprises a sentence segmentation module and a format conversion module, wherein the sentence segmentation module is used for breaking the input text at the punctuation mark so that the basic unit translated by the subsequent machine translation module is a sentence; the format conversion module is used for converting different formats of language texts into formats supported by the machine translation module during translation.

In one embodiment, the format supported by the machine translation module for translation is a plain text format or an XML format.

In one embodiment, the speech recognition result receiving module comprises a sentence segmentation module and a noise elimination module, wherein the sentence segmentation module is used for segmenting the input speech text stream according to the pauses between words; the noise elimination module is used for removing adjacent repeated segments from the input speech text stream.
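As an illustration of removing adjacent repeated segments, the following is a minimal sketch and not the patented implementation. It assumes the speech text stream has already been tokenized into words; the function name remove_adjacent_repeats and the max_len limit are hypothetical choices made for this example.

```python
def remove_adjacent_repeats(tokens, max_len=3):
    """Collapse a segment of up to max_len tokens that is immediately repeated.

    Assumption: a repetition such as "i want i want" is recognition noise
    (a speaker restart). Legitimate repeats like "very very" are collapsed too.
    """
    out = []
    i = 0
    while i < len(tokens):
        collapsed = False
        # Try the longest candidate segment first.
        for n in range(max_len, 0, -1):
            seg = tokens[i:i + n]
            if len(seg) == n and tokens[i + n:i + 2 * n] == seg:
                # Skip the first copy; the second copy is examined next.
                i += n
                collapsed = True
                break
        if not collapsed:
            out.append(tokens[i])
            i += 1
    return out


print(remove_adjacent_repeats("i want i want to to go".split()))
```

Because the second copy of a repeated segment is examined again, a segment repeated three or more times is also reduced to a single copy.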

In a specific embodiment, the text preprocessing module comprises a word normalization module, a category identification and labeling module and a language block word order adjusting module. The word normalization module is used for bringing the language to be translated closer to the target language at the word level. The category identification and labeling module is used for labeling numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translating the content of each category into the target language in advance. The language block word order adjusting module is used for performing grammatical analysis on sentences of the language to be translated, and then adjusting the order of their language blocks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
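The category labeling step can be pictured with the sketch below. It is only an approximation under stated assumptions: the regular expressions are hypothetical hand-written patterns (the description learns category boundaries semi-automatically from word alignment rather than from fixed patterns), and the function name tag_categories is invented for this example. Only the tag names $number, $date, $hour and $www come from the text.

```python
import re

# Hypothetical patterns; more specific categories are listed before $number
# so that a date or a time is not labeled as plain numbers.
PATTERNS = [
    ("$www", r"(?:https?://|www\.)\S+"),
    ("$date", r"\d{4}-\d{2}-\d{2}"),
    ("$hour", r"\d{1,2}:\d{2}"),
    ("$number", r"\d+(?:\.\d+)?"),
]
_COMBINED = re.compile(
    "|".join(f"(?P<g{i}>{pattern})" for i, (_, pattern) in enumerate(PATTERNS))
)


def tag_categories(text):
    """Replace category instances with their tag and return the originals.

    The returned list keeps (tag, original text) pairs in order, so that the
    category content can be translated separately and restored afterwards.
    """
    found = []

    def replace(match):
        tag = PATTERNS[int(match.lastgroup[1:])][0]
        found.append((tag, match.group(0)))
        return tag

    return _COMBINED.sub(replace, text), found


tagged, items = tag_categories("Meet at 10:30 on 2017-03-30 , see www.example.com for 25 details")
print(tagged)
print(items)
```

Keeping the original strings next to their tags is one way to let the translation module treat each category as a single token while the category content is translated by rule.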

In a specific embodiment, the speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, wherein the word normalization module is used for bringing the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module is used for determining the positions of periods in the speech recognition output according to the context and the pauses between words; the speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.

In one embodiment, the machine translation module comprises a training module and a translation module, wherein the training module learns phrase-to-phrase translations from a large-scale balanced corpus using the GIZA++ toolkit; the translation module is used for dividing each input sentence into phrase segments and translating each phrase segment according to the training result of the training module, wherein the translation process of the translation module is a search process, namely, an optimal translation combination is found among the translation combinations formed from the translation results of the translation sub-models, and this optimal translation combination is the final translation result.

In one embodiment, the translation submodels include a phrase translation model, a language model, a word order change model, a part-of-speech based language model, a bilingual language model, and a domain adaptation model.

In a specific embodiment, the post-processing module comprises a word punctuation normalization module, a case conversion module and a format conversion module, wherein the word punctuation normalization module is used for normalizing words and punctuation in the machine translation result into the representation form of the target language; the case conversion module is used when a Western language is the target language of the translation; the format conversion module is used for making the format of the translated target language consistent with the format of the language to be translated.

In one embodiment, the case conversion module is configured to capitalize the first letter of a sentence and the letters of proper nouns in the target language.

To facilitate understanding of the above technical solutions of the present invention, they are described in detail below in terms of specific usage.

When the system is used specifically, the multilingual intelligent preprocessing real-time statistical machine translation system comprises a receiving module, a preprocessing module, a translation module and a post-processing module;

The receiving module checks that the system input is well-formed, and includes the text language receiving module and the speech recognition result receiving module. The text language receiving module is mainly composed of two parts, as shown in FIG. 2 of the accompanying drawings: a sentence segmentation module and a format conversion module. The A.1 sentence segmentation module breaks the input text at periods, question marks and exclamation marks, so that the basic unit translated by the subsequent machine translation module is a sentence. When the input text contains html tags, the content between a pair of html tags forms a sentence on its own, to ensure that it is translated as a complete sentence and not as part of the text outside the tags. The subsequent modules of the flow support the translation of plain text and of text in XML format. When the input text is in another format, such as PDF or pictures, the A.2 format conversion module converts it into plain text or XML format. The speech recognition result receiving module is also mainly composed of two parts, as shown in FIG. 3 of the accompanying drawings: a sentence segmentation module and a noise elimination module. The A.3 sentence segmentation module segments the input speech text stream according to the pauses between words. The noise elimination module removes adjacent repeated segments from the input speech text stream.
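The sentence segmentation behavior described for text input (breaking at periods, question marks and exclamation marks, and treating the content of a pair of html tags as its own sentence) can be sketched as follows. This is a simplified illustration under assumptions: the function names are hypothetical, nested tags of the same name are not handled, and periods inside decimals or abbreviations would be split incorrectly.

```python
import re

# An element with a matching close tag, e.g. <b>...</b>.
_TAGGED = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.S)
# A run of text followed by any sentence-final punctuation attached to it.
_SENT = re.compile(r"[^.?!]+[.?!]*")


def _split_plain(chunk):
    """Split untagged text at periods, question marks and exclamation marks."""
    return [s.strip() for s in _SENT.findall(chunk) if s.strip()]


def split_sentences(text):
    """Return sentences; each tagged element is kept whole as one sentence."""
    sentences = []
    pos = 0
    for match in _TAGGED.finditer(text):
        sentences += _split_plain(text[pos:match.start()])
        # The tags are kept with their content so the format can be restored.
        sentences.append(match.group(0))
        pos = match.end()
    sentences += _split_plain(text[pos:])
    return sentences


print(split_sentences("Hello world. <b>Read this now</b> Is it ok? Yes!"))
```

Keeping the tags attached to the isolated sentence is a design choice of this sketch; it makes the later format restoration step straightforward.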

The preprocessing module performs operations on the input language A to bring it closer to the target language B, so that the subsequent machine translation module achieves better translation quality. The preprocessing module includes a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module is composed of three parts, as shown in FIG. 4 of the accompanying drawings. The B.1 word normalization module brings the source language A closer to the target language B at the word level. For example, in Chinese-to-English translation, the Chinese is segmented into words and spaces are inserted between the words; in German-to-English translation, German compound words are split, which increases the one-to-one correspondence between the words of the German and English sentences. The B.2 category recognition labeling module labels numbers, dates, times and URLs in the source language A as the corresponding categories $number, $date, $hour and $www, and translates the content of each category into the target language B in advance by rule. The B.3 language block (chunk) word order adjusting module performs grammatical analysis on the source sentence and adjusts the order of its language blocks according to automatically learned rules. The speech recognition result preprocessing module, as shown in FIG. 5 of the accompanying drawings, includes a word normalization module and a punctuation prediction module.
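The German compound splitting mentioned for word normalization can be illustrated with a small dictionary-based sketch. The description does not say how compounds are split, so everything here is an assumption: the function split_compound, the min_part limit and the toy vocabulary are hypothetical, and German linking elements (such as a connecting "s") are ignored.

```python
def split_compound(word, vocab, min_part=3):
    """Split a compound into known parts, or return it unchanged.

    `vocab` is an assumed set of known lowercase words. The result is
    lowercased, which discards German noun capitalization.
    """
    lowered = word.lower()
    if lowered in vocab:
        return [lowered]
    # Prefer the longest known first part.
    for i in range(len(lowered) - min_part, min_part - 1, -1):
        head, tail = lowered[:i], lowered[i:]
        if head in vocab:
            rest = split_compound(tail, vocab, min_part)
            if all(part in vocab for part in rest):
                return [head] + rest
    return [lowered]


vocab = {"wort", "folge", "regel"}
print(split_compound("Wortfolge", vocab))
print(split_compound("Wortfolgeregel", vocab))
```

After splitting, each German part can align to one English word, which is the kind of one-to-one correspondence the word normalization step aims for.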

The B.2 category recognition labeling is based on bilingual semi-automatic category recognition and translation. In the semi-automatic method, the categories to be recognized are manually defined in one language of the bilingual pair (the source language); the required categories and their translations in the other language are then learned automatically from the balanced corpus and word alignment. Taking English-Chinese translation as an example, the categories $number, $date, $hour and $www to be recognized are first defined in English. All numbers are then identified in Chinese and labeled $bnumber, and words associated with the World Wide Web, such as www, http and .com, are labeled $bww. Here, $bnumber and $bww are the cores of the Chinese categories. On the basis of a core, the Chinese category corresponding to the English category is formed by including some preceding and following words. Which preceding and following words to include is extracted automatically through word alignment: the Chinese words aligned to the boundary words of the English category are taken as the boundary words of the Chinese category. Once the boundary words of the Chinese category are determined, the extracted Chinese category content implies the Chinese translation of the corresponding English category. From this, translation rules from English categories to Chinese categories are learned, for example:

$ number 2 → $ number 2 }

$ number 2 worth → $ number 20% }

$ number 2- → $ number 2nd }

The rules extracted by the method better conform to the actual situation of data, errors generated in actual application by manually defined rules are reduced, and compared with the traditional method of respectively defining categories and rules on two languages, the method improves the efficiency; nor does it require the rule-maker to be familiar with both languages at the same time; the rate of mismatch of rules in the two languages is also greatly reduced, thereby improving machine translation quality.

The B.3 language block word order adjusting method adds grammatical constraints to word order adjustment in a statistical translation system. When one language is translated into another, the order in which words are expressed often differs because of differences in grammar and expression conventions. In addition to translating each word or phrase into the other language, the translated phrases must also be placed in the correct position. In a statistical translation system, the basic unit, the phrase, is an arbitrary word string and is not required to conform to any grammatical structure. As a result, ill-formed chunks are often joined together and produce strange translations. The invention introduces information about grammatically well-formed phrases through shallow syntactic analysis in the preprocessing stage. In the subsequent phrase position moving step, only phrases that satisfy the grammatical constraints are moved, which improves the correctness and fluency of the translation result. The specific steps are as follows:

First, shallow syntactic analysis is performed on the source language to generate grammatical information such as NP (noun phrase), VP (verb phrase) and PP (prepositional phrase).

Next, word order adjustment rules, together with the probability of each rule, are learned through word alignment. Examples of learned rules:

DNP NP VP –>DNP NP VP (0.89)

DNP NP VP –>NP DNP VP (0.11)

That is, the probability that the phrase sequence DNP NP VP keeps its order is 0.89, and the probability that it becomes NP DNP VP is 0.11. These rules are then applied to the source language input sentence. Applying different combinations of rules produces different phrase order variations. All of these variations are represented in the form of a word lattice, and the probability of each path in the lattice is calculated from the probabilities of the rules. The optimal path, or the entire word lattice, serves as the new input for the subsequent machine translation module.
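The application of reordering rules and the scoring of the resulting paths can be sketched as below. This is a simplified illustration, not the method of the invention: paths are enumerated explicitly instead of being stored as a word lattice, rules are matched greedily from left to right without overlap, and the names RULES and reorder_paths are hypothetical. The two rule probabilities are the ones given in the example above; each rule's right-hand side is written as a permutation of positions.

```python
# Each left-hand side maps to (permutation of positions, probability) options.
RULES = {
    ("DNP", "NP", "VP"): [((0, 1, 2), 0.89), ((1, 0, 2), 0.11)],
}


def reorder_paths(chunks, rules=RULES):
    """Return every reordering licensed by `rules`, most probable first.

    `chunks` is a list of (label, text) pairs from shallow parsing. A path's
    probability is the product of the probabilities of the rules it used.
    """
    labels = [label for label, _ in chunks]
    paths = [([], 1.0)]  # (chunks emitted so far, probability so far)
    i = 0
    while i < len(chunks):
        for lhs, options in rules.items():
            if tuple(labels[i:i + len(lhs)]) == lhs:
                window = chunks[i:i + len(lhs)]
                paths = [
                    (done + [window[k] for k in perm], p * q)
                    for done, p in paths
                    for perm, q in options
                ]
                i += len(lhs)
                break
        else:
            # No rule matches here: the chunk keeps its position.
            paths = [(done + [chunks[i]], p) for done, p in paths]
            i += 1
    return sorted(paths, key=lambda path: path[1], reverse=True)


sentence = [("DNP", "d"), ("NP", "n"), ("VP", "v"), ("PP", "p")]
for ordered, prob in reorder_paths(sentence):
    print([label for label, _ in ordered], prob)
```

Passing only the first returned path downstream corresponds to using the optimal path; passing all of them corresponds, loosely, to handing over the whole lattice.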

The translation process is essentially a search process: the optimal combination is found among the different ways of concatenating phrase translations, and this combination is the final translation result. During the search, a number of submodels are applied to help find the optimal path. The necessary submodels are the phrase translation model (Translation Model) and the language model (Language Model). Other submodels, such as the word order change model, the part-of-speech (POS) based language model, the bilingual language model and the domain adaptive model, may be switched on or off as the actual case requires.
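One common way to combine several submodels when choosing among candidate translations is a weighted sum of their log-probabilities. The description does not specify how the submodel scores are combined, so the sketch below is an assumption for illustration only; the function best_translation, the feature names, the weights and the scores are all hypothetical, and a real decoder would search with pruning rather than score a fixed candidate list.

```python
def best_translation(candidates, weights):
    """Pick the candidate with the highest weighted sum of submodel scores.

    `candidates` is a list of (text, {submodel name: log-probability}) pairs.
    A submodel is effectively switched off by giving it weight 0.
    """
    def score(features):
        return sum(weights[name] * value for name, value in features.items())

    return max(candidates, key=lambda candidate: score(candidate[1]))[0]


candidates = [
    ("the house small", {"translation_model": -1.0, "language_model": -5.0}),
    ("the small house", {"translation_model": -1.5, "language_model": -2.0}),
]
print(best_translation(candidates, {"translation_model": 1.0, "language_model": 1.0}))
```

In this toy example the language model outweighs the slightly better translation model score of the first candidate, so the more fluent word order is selected.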

The post-processing module further processes the translation result so that it is closer to the expression habits of the target language, and outputs it as the final result. As shown in FIG. 7 of the drawings, this further processing mainly includes three modules. The D.1 word punctuation normalization module normalizes the words and punctuation in the machine translation result into the common representation of the target language. For example, in the result of an English-to-Chinese translation, the spaces between Chinese words are removed; in a Western-language translation result, the spaces between a period or comma and the word before it are removed. The D.2 case conversion module is mainly applicable to translation with a Western language as the target language. For example, the first letter of an English sentence is capitalized, and some terms, such as USA, are written in capitals. This sub-module converts the corresponding lower case letters in the translation result into upper case letters. The D.3 format conversion module is the inverse operation of the A.2 format conversion module, i.e. it ensures that the format of the output is consistent with the format of the input.
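The punctuation and case normalization described above can be sketched as follows, assuming the translation result arrives as space-separated tokens. The function names and the UPPERCASE_TERMS set are hypothetical; a real system would need a much larger term list and language-specific rules.

```python
import re

# Hypothetical list of terms that are written entirely in capitals.
UPPERCASE_TERMS = {"usa"}


def postprocess_western(text, uppercase_terms=UPPERCASE_TERMS):
    """Normalize spacing and case for a Western-language translation result."""
    words = [w.upper() if w.lower() in uppercase_terms else w for w in text.split()]
    text = " ".join(words)
    # Remove the space between a word and the period or comma that follows it.
    text = re.sub(r"\s+([.,])", r"\1", text)
    # Capitalize the first letter of the sentence.
    return text[:1].upper() + text[1:]


def postprocess_chinese(text):
    """Remove spaces between Chinese characters (CJK Unified Ideographs)."""
    return re.sub(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])", "", text)


print(postprocess_western("he lives in the usa , not here ."))
print(postprocess_chinese("我 喜欢 机器 翻译"))
```

The two functions are kept separate because which normalization applies depends on the target language of the translation.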

In conclusion, the machine translation system can translate sentences and passages of one language into another language in real time, producing complete sentences with correct expression; it can translate punctuated text, and it can also translate speech that is unsegmented, possibly incomplete, without punctuation and noisy; it improves the translation accuracy of low-probability words and phrases, in that low-probability items such as numbers, dates, times and URLs are labeled by category and translated preferentially; the preprocessing module standardizes the input sentences, and the post-processing module improves the fluency of the translation result.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A multi-language intelligent preprocessing real-time statistics machine translation system is characterized by comprising:

the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module, wherein the text preprocessing module is used for performing word standardization, category identification marking and language block word order adjustment on text input; the text preprocessing module comprises a word standardization module, a category identification marking module and a language block word order adjustment module; the word standardization module is used for bringing the language to be translated closer to the target language at the word level; the category identification marking module is used for marking numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translating the content of each category into the target language in advance; the language block word order adjustment module is used for performing grammatical analysis on sentences of the language to be translated, and then adjusting the order of their language blocks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language;

the machine translation module is used for learning phrase-to-phrase translations, finding the corresponding translated phrases for the phrases processed by the preprocessing module, and connecting them into a complete sentence;

2. The system of claim 1, wherein the text language receiving module comprises a sentence segmentation module and a format conversion module, the sentence segmentation module is configured to break the input text at punctuation marks, such that the basic units translated by the subsequent machine translation module are a sentence; the format conversion module is used for converting different formats of language texts into formats supported by the machine translation module during translation.

3. The system of claim 2, wherein the format supported by the machine translation module for translation is plain text format or XML format.

4. The system of claim 1, wherein the speech recognition result receiving module comprises a sentence segmentation module and a noise elimination module, the sentence segmentation module is configured to segment the input speech text stream according to the pauses between words; the noise elimination module is configured to remove adjacent repeated segments from the input speech text stream.

5. The system of claim 1, wherein the speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, the word normalization module is used for bringing the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module is used for determining the positions of periods in the speech recognition output according to the context and the pauses between words, and the speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.

6. The system of claim 1, wherein the machine translation module comprises a training module and a translation module, and the training module learns phrase-to-phrase translations from a large-scale balanced corpus using the GIZA++ toolkit; the translation module is used for dividing each input sentence into phrase segments and translating each phrase segment according to the training result of the training module, wherein the translation process of the translation module is a search process, namely, an optimal translation combination is found among the translation combinations formed from the translation results of the translation sub-models, and this optimal translation combination is the final translation result.

7. The system of claim 6, wherein the translation sub-models comprise a phrase translation model, a language model, a word order change model, a part-of-speech based language model, a bilingual language model, and a domain adaptive model.

8. The system of claim 1, wherein the post-processing module comprises a word punctuation normalization module, a case conversion module and a format conversion module, the word punctuation normalization module is used for normalizing words and punctuation in the machine translation result into the expression form of the target language; the case conversion module is used when a Western language is the target language of the translation; the format conversion module is used for making the format of the translated target language consistent with the format of the language to be translated.

9. The system of claim 8, wherein the case conversion module is configured to change the initials and proper nouns in the target language to capitalized form.