CN107066455B - Multi-language intelligent preprocessing real-time statistics machine translation system - Google Patents

Publication number
CN107066455B
Authority
CN
China
Prior art keywords
module
language
translation
text
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710203439.8A
Other languages
Chinese (zh)
Other versions
CN107066455A (en)
Inventor
张昱琪
唐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201710203439.8A
Publication of CN107066455A
Application granted
Publication of CN107066455B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multilingual intelligent preprocessing real-time statistical machine translation system comprising a receiving module, a preprocessing module, a machine translation module and a post-processing module. The receiving module comprises a text language receiving module and a voice recognition result receiving module; the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module; the machine translation module learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase produced by the preprocessing module, and joins the translated phrases into a complete sentence; and the post-processing module performs word and punctuation normalization, case normalization and format normalization on the translation result, so that the translation result is closer to the expression habits of the target language, and outputs it as the final result. The invention can translate both text and speech, and improves the translation accuracy of low-probability words and phrases.

Description

Multi-language intelligent preprocessing real-time statistics machine translation system
Technical Field
The invention relates to the technical field of artificial intelligence machine translation, in particular to a multi-language intelligent preprocessing real-time statistical machine translation system.
Background
Machine translation is a technique for using a computer to automatically convert one human natural language into another, such that the two texts are equivalent in meaning.
At present, the relatively mature and mainstream approach to machine translation is the statistical method. Its advantage is that almost no translation rules need to be written by hand: all translation knowledge is learned automatically from corpora, which makes full use of the computer's high-speed processing and greatly reduces labor cost.
Machine translation based on statistical models learns phrase translations from one language A to another language B from a parallel corpus. When a new sentence is translated, the input sentence in language A is decomposed into phrases and translated into a sentence in language B according to the learned co-occurrence probabilities of language A phrases and language B phrases. The whole learning and translation process is based entirely on statistical models.
However, machine translation based on co-occurrence frequency performs poorly on low-probability phrases (e.g., proper nouns). How to add syntactic and semantic information to the statistical model, so that the generated translation better matches human expression habits, is also a problem that current machine translation technology needs to solve.
Disclosure of Invention
In view of the above technical problems in the related art, the present invention provides a multilingual intelligent preprocessing real-time statistical machine translation system, which can overcome the above disadvantages in the prior art.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
a multi-language intelligent pre-processing real-time statistics machine translation system, comprising:
the receiving module is used for checking that the system input is well-formed and comprises a text language receiving module and a voice recognition result receiving module; the text language receiving module is used for carrying out sentence segmentation and format conversion on text input, and the voice recognition result receiving module is used for carrying out segmentation, noise elimination and format conversion on speech recognition output;
the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module, wherein the text preprocessing module is used for performing word normalization, category identification and labeling, and language block word order adjustment on text input; the voice recognition result preprocessing module is used for performing word normalization and punctuation prediction on speech recognition output;
the machine translation module is used for learning phrase-to-phrase translations, finding the corresponding translated phrase for each phrase produced by the preprocessing module, and generating complete sentences;
and the post-processing module is used for carrying out word and punctuation normalization, case normalization and format normalization on the translation result, so that the translation result is closer to the expression habits of the target language, and outputting it as the final result.
Further, the text language receiving module comprises a sentence segmentation module and a format conversion module, wherein the sentence segmentation module is used for breaking the input text at the punctuation mark so that the basic unit translated by the subsequent machine translation module is a sentence; the format conversion module is used for converting different formats of language texts into formats supported by the machine translation module during translation.
Preferably, the format supported by the machine translation module during translation is plain text or XML.
Furthermore, the voice recognition result receiving module comprises a sentence segmentation module and a noise elimination module, wherein the sentence segmentation module is used for segmenting input voice text streams according to pause between words; the noise cancellation module is configured to remove adjacent repeated segments from the stream of spoken text in the input.
Further, the text preprocessing module comprises a word normalization module, a category identification and labeling module and a language block word order adjusting module. The word normalization module is used for bringing the language to be translated closer to the target language at the word level. The category identification and labeling module is used for labeling numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and for translating the content of each category into the target language in advance. The language block word order adjusting module is used for parsing sentences of the language to be translated and then adjusting the order of their language blocks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
Furthermore, the voice recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, wherein the word normalization module is used for bringing the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module is used for determining the positions of periods in the speech recognition output according to the context and the pauses between words; the voice recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.
Further, the machine translation module comprises a training module and a translation module, wherein the training module learns phrase-to-phrase translations from a large-scale balanced corpus using the GIZA++ toolkit; the translation module is used for dividing each input sentence into phrase segments and translating each phrase segment according to the training result of the training module. The translation process of the translation module is a search process: the optimal translation combination is found among the candidate combinations scored by the translation submodels, and that optimal combination is the final translation result.
Preferably, the translation submodels include a phrase translation model, a language model, a word order change model, a part-of-speech based language model, a bilingual language model and a domain adaptive model.
Furthermore, the post-processing module comprises a word punctuation normalization module, a case conversion module and a format conversion module, wherein the word punctuation normalization module is used for normalizing words and punctuation in the machine translation result into the representation used by the target language; the case conversion module applies to translation in which the target language is a Western language; the format conversion module is used for making the format of the translated output consistent with the format of the input to be translated.
Preferably, the case conversion module is used for capitalizing the first letter of each sentence and the letters of proper nouns in the target language.
The machine translation system has the following advantages. It can translate sentences and passages of one language into another language in real time, producing complete and correctly expressed sentences. It can translate punctuated text as well as speech that is unsegmented, unpunctuated, possibly made up of incomplete sentences, and noisy. It improves the translation accuracy of low-probability words and phrases: low-probability items such as numbers, dates, times and URLs are labeled by category and translated first. The preprocessing module normalizes the input sentences, and the post-processing module improves the fluency of the translation results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a translation flow diagram of a multilingual intelligent-preprocessing real-time statistical machine translation system according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a text receiving module of the multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a speech recognition result receiving module of the multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a text pre-processing module of the multilingual intelligent pre-processing real-time statistics machine translation system according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a speech recognition result preprocessing module of the multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a machine translation module of the multilingual intelligent pre-processing real-time statistics machine translation system according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a post-processing module of the multilingual intelligent pre-processing real-time statistics machine translation system according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
Referring to fig. 1-7, a real-time statistical machine translation system with intelligent preprocessing for multiple languages according to an embodiment of the present invention includes:
the receiving module is used for checking that the system input is well-formed and comprises a text language receiving module and a voice recognition result receiving module; the text language receiving module is used for carrying out sentence segmentation and format conversion on text input, and the voice recognition result receiving module is used for carrying out segmentation, noise elimination and format conversion on speech recognition output;
the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module, wherein the text preprocessing module is used for performing word normalization, category identification and labeling, and language block word order adjustment on text input; the voice recognition result preprocessing module is used for performing word normalization and punctuation prediction on speech recognition output;
the machine translation module is used for learning phrase-to-phrase translations, finding the corresponding translated phrase for each phrase produced by the preprocessing module, and generating complete sentences;
and the post-processing module is used for carrying out word and punctuation normalization, case normalization and format normalization on the translation result, so that the translation result is closer to the expression habits of the target language, and outputting it as the final result.
In a specific embodiment, the text language receiving module comprises a sentence segmentation module and a format conversion module, wherein the sentence segmentation module is used for breaking the input text at the punctuation mark so that the basic unit translated by the subsequent machine translation module is a sentence; the format conversion module is used for converting different formats of language texts into formats supported by the machine translation module during translation.
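The sentence segmentation step can be pictured with a minimal sketch. It assumes that breaking at punctuation means breaking after periods, question marks and exclamation marks (Western or Chinese); the function name `split_sentences` and the punctuation set are illustrative assumptions, not part of the patent.

```python
import re

# Assumed set of sentence-final punctuation (Western and Chinese).
_SENT_END = re.compile(r'([.!?。！？]+)')

def split_sentences(text):
    """Break text after each run of sentence-final punctuation."""
    # With a capturing group, re.split returns: text, punct, text, punct, ..., tail
    parts = _SENT_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    tail = parts[-1].strip()  # trailing text without final punctuation
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Hello world. How are you? Fine"))
```

A rule this simple would also split at decimal points and abbreviations, so a practical splitter would need extra handling for those cases.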
In one embodiment, the format supported by the machine translation module during translation is plain text or XML.
In one embodiment, the speech recognition result receiving module comprises a sentence segmentation module and a noise elimination module, wherein the sentence segmentation module is used for segmenting the input speech text stream according to pause between words; the noise cancellation module is configured to remove adjacent repeated segments from the stream of spoken text in the input.
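As an illustration of removing adjacent repeated segments, the following sketch collapses immediately repeated word sequences in a recognized token stream. The function name `remove_adjacent_repeats` and the limit `max_len` on the repeated segment length are assumptions made for the example; the patent does not specify how repeats are detected.

```python
def remove_adjacent_repeats(tokens, max_len=3):
    """Collapse immediately repeated n-grams (n <= max_len) in a token list."""
    out = []
    for tok in tokens:
        out.append(tok)
        changed = True
        while changed:
            changed = False
            for n in range(1, max_len + 1):
                # If the last n tokens repeat the n tokens before them, drop the copy.
                if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                    del out[-n:]
                    changed = True
                    break
    return out

print(remove_adjacent_repeats("i i want to want to go home".split()))
```

Note that this would also collapse intentional repetitions such as "very very", so a real system would probably need a more selective criterion.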
In a specific embodiment, the text preprocessing module comprises a word normalization module, a category identification and labeling module and a language block word order adjusting module. The word normalization module is used for bringing the language to be translated closer to the target language at the word level. The category identification and labeling module is used for labeling numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and for translating the content of each category into the target language in advance. The language block word order adjusting module is used for parsing sentences of the language to be translated and then adjusting the order of their language blocks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
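A minimal sketch of category labeling is given below. It uses hand-written regular expressions as a stand-in for the identification step; the patterns, the accepted date and time formats, and the function name `tag_categories` are assumptions for illustration only. The original strings are returned alongside the labeled sentence so that a later step could translate the category content in advance.

```python
import re

# Alternatives are tried in order at each position, so the more specific
# categories (URL, date, time) come before the generic number pattern.
_CATEGORY_RE = re.compile(
    r'(?P<www>(?:https?://|www\.)\S+)'
    r'|(?P<date>\d{4}-\d{2}-\d{2})'
    r'|(?P<hour>\d{1,2}:\d{2})'
    r'|(?P<number>\d+(?:\.\d+)?)'
)

def tag_categories(sentence):
    """Replace URLs, dates, times and numbers with $www, $date, $hour, $number.

    Returns the labeled sentence and a list of (label, original_text) pairs.
    """
    found = []

    def _label(match):
        label = "$" + match.lastgroup  # name of the alternative that matched
        found.append((label, match.group(0)))
        return label

    return _CATEGORY_RE.sub(_label, sentence), found

tagged, items = tag_categories("Meet at 10:30 on 2017-03-30 with 25 people, see www.example.com")
print(tagged)
print(items)
```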
In a specific embodiment, the speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, wherein the word normalization module is used for bringing the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module is used for determining the positions of periods in the speech recognition output according to the context and the pauses between words; the speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.
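The pause-based part of punctuation prediction could look roughly like the sketch below. It assumes the recognizer supplies, for each word, the length of the pause that follows it, and it places a period wherever that pause reaches a threshold. The input format, the 0.5-second threshold and the name `insert_periods` are hypothetical, and the context-based part of the prediction is not modeled here.

```python
def insert_periods(timed_words, pause_threshold=0.5):
    """Split a recognized word stream into sentences at long pauses.

    timed_words: list of (word, pause_after_in_seconds) pairs (assumed format).
    """
    sentences, current = [], []
    for word, pause in timed_words:
        current.append(word)
        if pause >= pause_threshold:
            sentences.append(" ".join(current) + " .")
            current = []
    if current:  # close the last sentence even without a final long pause
        sentences.append(" ".join(current) + " .")
    return sentences

stream = [("how", 0.1), ("are", 0.05), ("you", 0.8), ("i", 0.1), ("am", 0.1), ("fine", 0.0)]
print(insert_periods(stream))
```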
In one embodiment, the machine translation module comprises a training module and a translation module, wherein the training module learns phrase-to-phrase translations from a large-scale balanced corpus using the GIZA++ toolkit; the translation module is used for dividing each input sentence into phrase segments and translating each phrase segment according to the training result of the training module. The translation process of the translation module is a search process: the optimal translation combination is found among the candidate combinations scored by the translation submodels, and that optimal combination is the final translation result.
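To make the idea of translation as search concrete, here is a heavily simplified sketch: a monotone dynamic-programming search over phrase segmentations that uses only a phrase translation model. The toy phrase table, its probabilities and the function name `translate` are invented for illustration; a real decoder would also apply a language model, reordering and the other submodels.

```python
import math

def translate(words, phrase_table, max_phrase_len=3):
    """Monotone phrase-based search.

    best[i] holds the best (log-probability, output phrases) covering words[:i].
    phrase_table maps a source phrase to a list of (target phrase, probability).
    """
    n = len(words)
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_phrase_len), end):
            if best[start] is None:
                continue  # words[:start] cannot be covered
            source = " ".join(words[start:end])
            for target, prob in phrase_table.get(source, []):
                score = best[start][0] + math.log(prob)
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [target])
    return None if best[n] is None else " ".join(best[n][1])

# Hypothetical German-to-English phrase table.
table = {
    "das": [("the", 0.7), ("that", 0.3)],
    "haus": [("house", 0.9)],
    "das haus": [("the house", 0.8)],
    "ist": [("is", 1.0)],
    "klein": [("small", 0.8), ("little", 0.2)],
}
print(translate("das haus ist klein".split(), table))
```

The function returns `None` when some part of the input has no entry in the phrase table.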
In one embodiment, the translation submodels include a phrase translation model, a language model, a word order change model, a part-of-speech based language model, a bilingual language model, and a domain adaptation model.
In a specific embodiment, the post-processing module comprises a word punctuation normalization module, a case conversion module and a format conversion module, wherein the word punctuation normalization module is used for normalizing words and punctuation in the machine translation result into the representation used by the target language; the case conversion module applies to translation in which the target language is a Western language; the format conversion module is used for making the format of the translated output consistent with the format of the input to be translated.
In one embodiment, the case conversion module is configured to capitalize the first letter of each sentence and the letters of proper nouns in the target language.
In order to facilitate understanding of the above-described technical aspects of the present invention, the above-described technical aspects of the present invention will be described in detail below in terms of specific usage.
When the system is used specifically, the multilingual intelligent preprocessing real-time statistical machine translation system comprises a receiving module, a preprocessing module, a translation module and a post-processing module;
The receiving module checks that the system input is well-formed and includes the text language receiving module and the speech recognition result receiving module. The text language receiving module is mainly composed of two parts, as shown in FIG. 2 of the drawings: a sentence segmentation module and a format conversion module. The A.1 sentence segmentation module breaks the input text at periods, question marks and exclamation marks, so that the basic unit translated by the subsequent machine translation module is a sentence. When the input text contains HTML tags, the content between a pair of HTML tags forms a sentence of its own, to ensure that it is translated as a complete sentence rather than as part of the text outside the tags. The subsequent modules of the flow support the translation of plain text and of text in XML format. When the input text is in another format, such as PDF or an image, the A.2 format conversion module converts it into plain text or XML format.
The speech recognition result receiving module is also mainly composed of two parts, as shown in FIG. 3 of the drawings: a sentence segmentation module and a noise elimination module. The A.3 sentence segmentation module segments the input speech text stream according to the pauses between words, and the noise elimination module removes adjacent repeated segments from the speech text stream.
The preprocessing module operates on the input language A to bring it closer to the target language B, so that the subsequent machine translation module achieves better translation quality. The preprocessing module includes a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module is composed of three parts, as shown in FIG. 4 of the drawings. The B.1 word normalization module brings the source language A closer to the target language B at the word level: for Chinese-to-English translation, the Chinese text is segmented into words and spaces are inserted between the words; for German-to-English translation, German compound words are split, which increases the one-to-one correspondence between the words of the German sentence and English words. The B.2 category identification and labeling module labels numbers, dates, times and URLs in the source language A as the corresponding categories $number, $date, $hour and $www, treats each category as a single word, and translates the category content into the target language B in advance according to rules. The B.3 language block word order adjusting module parses the source sentence and adjusts the order of its language blocks according to automatically learned rules. The speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, which determines the positions of periods in the speech recognition output according to the context and the pauses between words.
The B.2 category identification and labeling is based on bilingual semi-automatic category identification and translation. In the semi-automatic method, the categories to be identified are first defined manually in one language of the bilingual pair; the corresponding categories in the other language, together with their translations, are then learned automatically from the balanced corpus and its word alignment. Taking translation between English and Chinese as an example, the categories $number, $date, $hour and $www to be identified are first defined in English. All numbers in the Chinese text are then identified and labeled as $bnumber, and words associated with the World Wide Web, such as www, http and .com, are labeled as $bww. Here $bnumber and $bww are the cores of the Chinese categories. The Chinese category corresponding to an English category is formed by extending a core with the words before and after it, and which neighboring words to include is determined automatically through word alignment: the Chinese words aligned to the boundary words of an English category are taken as the boundary words of the Chinese category. Once the boundary words of the Chinese category are determined, the extracted Chinese category content implies the Chinese translation of the corresponding English category. Translation rules between the English and Chinese categories are learned from it, for example:
$ number 2 → $ number 2 }
$ number 2 worth → $ number 20% }
$ number 2- → $ number 2nd }
The rules extracted by the method better conform to the actual situation of data, errors generated in actual application by manually defined rules are reduced, and compared with the traditional method of respectively defining categories and rules on two languages, the method improves the efficiency; nor does it require the rule-maker to be familiar with both languages at the same time; the rate of mismatch of rules in the two languages is also greatly reduced, thereby improving machine translation quality.
The B.3 language block word order adjustment adds grammatical constraints to word order adjustment in a statistical translation system. When one language is translated into another, the order in which words are expressed often differs because of differences in grammar and expression conventions. Besides translating each word or phrase into the other language, the system must also put the translated phrases in the right positions. In a statistical translation system the basic unit, the phrase, is an arbitrary word string that is not required to conform to any grammatical structure. As a result, fragments that are not well-formed language blocks are often moved and rejoined, producing strange translations. The invention introduces information about grammatically well-formed phrases through shallow syntactic analysis in the preprocessing stage. In the subsequent phrase reordering step, only phrases that satisfy the grammatical constraints are moved, which improves the correctness and fluency of the translation result. The specific steps are as follows:
Shallow syntactic analysis is performed on the source language to generate grammatical information such as NP (noun phrase), VP (verb phrase) and PP (prepositional phrase).
Word order adjustment rules, and the probability of each rule, are learned through word alignment. Examples of learned rules:
DNP NP VP –>DNP NP VP (0.89)
DNP NP VP –>NP DNP VP (0.11)
That is, the phrase sequence DNP NP VP keeps its order with probability 0.89 and becomes NP DNP VP with probability 0.11. These rules are applied to the source-language input sentence. Applying different combinations of rules produces different phrase orders, all of which are represented in the form of a word lattice. The probability of each path in the word lattice is calculated from the probabilities of the rules. The optimal path, or the entire word lattice, serves as the new input for the subsequent machine translation module.
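The way rule probabilities combine into path probabilities can be sketched as follows. The sketch enumerates the alternative orderings explicitly instead of building a lattice data structure, and represents each rule as a tag pattern with a list of (permutation, probability) alternatives. The representation, the function name `reorder_paths` and the example chunks are assumptions for illustration; only the two DNP NP VP probabilities come from the text.

```python
# Tag pattern -> list of (permutation of the matched chunks, probability).
RULES = {
    ("DNP", "NP", "VP"): [((0, 1, 2), 0.89), ((1, 0, 2), 0.11)],
}

def reorder_paths(chunks, rules):
    """chunks: list of (tag, text). Returns [(probability, [text, ...]), ...], best first."""
    paths = [(1.0, [])]
    i = 0
    while i < len(chunks):
        matched = False
        for pattern, alternatives in rules.items():
            window = chunks[i:i + len(pattern)]
            if tuple(tag for tag, _ in window) == pattern:
                # Branch every existing path on each alternative ordering.
                paths = [
                    (p * prob, out + [window[k][1] for k in order])
                    for p, out in paths
                    for order, prob in alternatives
                ]
                i += len(pattern)
                matched = True
                break
        if not matched:
            # No rule applies here: copy the chunk unchanged with probability 1.
            paths = [(p, out + [chunks[i][1]]) for p, out in paths]
            i += 1
    return sorted(paths, key=lambda path: path[0], reverse=True)

chunks = [("DNP", "d"), ("NP", "n"), ("VP", "v"), ("PP", "p")]
for prob, order in reorder_paths(chunks, RULES):
    print(prob, order)
```

Passing only the first returned path corresponds to using the optimal path as the new input; passing all of them corresponds to handing over the whole lattice.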
The translation process is essentially a search process: the optimal combination is found among the different possible concatenations of translated phrases, and this combination is the final translation result. During the search, many submodels are applied to help find the optimal path. The necessary submodels are the phrase translation model (Translation Model) and the language model (Language Model). Other submodels, such as the word order change model, the part-of-speech based language model (POS Model), the bilingual language model and the domain adaptation model, may be switched on or off according to the actual language pair.
The post-processing module further processes the translation result so that it is closer to the expression habits of the target language, and outputs it as the final result. As shown in FIG. 7 of the drawings, this further processing mainly includes three modules. The D.1 word punctuation normalization module normalizes the words and punctuation in the machine translation result into the usual representation of the target language. For example, the spaces between Chinese words are removed from the result of an English-to-Chinese translation, and in a Western-language translation result the space between a period or comma and the word before it is removed. The D.2 case conversion module mainly applies to translation with a Western language as the target language. For example, the first letter of an English sentence is capitalized, and some terms, such as USA, are also written in capitals. This submodule converts the corresponding lowercase letters in the translation result into uppercase letters. The D.3 format conversion module is the inverse operation of the A.2 format conversion module, that is, it ensures that the format of the output matches the format of the input.
In conclusion, the machine translation system can translate sentences and passages of one language into another language in real time, producing complete and correctly expressed sentences. It can translate punctuated text as well as speech that is unsegmented, unpunctuated, possibly made up of incomplete sentences, and noisy. It improves the translation accuracy of low-probability words and phrases: low-probability items such as numbers, dates, times and URLs are labeled by category and translated first. The preprocessing module normalizes the input sentences, and the post-processing module improves the fluency of the translation results.
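The word punctuation normalization and case conversion steps might be sketched as below. The function names, the punctuation set and the small list of fully capitalized terms are hypothetical; a real system would need a much richer list or a trained truecasing model.

```python
import re

def postprocess_western(tokens, capital_terms=("usa",)):
    """Join tokens, remove spaces before punctuation, and fix capitalization."""
    upper = {term.lower(): term.upper() for term in capital_terms}
    words = [upper.get(tok.lower(), tok) for tok in tokens]  # e.g. usa -> USA
    text = " ".join(words)
    text = re.sub(r'\s+([.,!?;:])', r'\1', text)  # drop the space before punctuation
    return text[:1].upper() + text[1:]  # capitalize the first letter of the sentence

def postprocess_chinese(tokens):
    """Remove the spaces between Chinese words in the translation result."""
    return "".join(tokens)

print(postprocess_western("he lives in the usa , not here .".split()))
print(postprocess_chinese(["我", "爱", "北京"]))
```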
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A multi-language intelligent preprocessing real-time statistics machine translation system is characterized by comprising:
the receiving module is used for checking the normalization of system input and comprises a text language receiving module and a voice recognition result receiving module, wherein the text language receiving module is used for carrying out sentence segmentation and format conversion on a text language, and the voice recognition result receiving module is used for carrying out segmentation, noise elimination and format conversion on a voice;
the preprocessing module comprises a text preprocessing module and a voice recognition result preprocessing module, wherein the text preprocessing module is used for carrying out word standardization, category identification marking and language block word order adjustment on a language input as text; the text preprocessing module comprises a word standardization module, a category identification marking module and a language block word order adjustment module; the word standardization module is used for making the language to be translated closer to the target language at the word level; the category identification marking module is used for marking numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and for translating the contents of these categories into the target language in advance; the language block word order adjustment module is used for carrying out grammar analysis on the sentences of the language to be translated and then adjusting the order of the language blocks according to automatically learned rules, so that the word order of the language to be translated is closer to the word order of the target language;
the machine translation module is used for learning phrase-to-phrase translations, finding the corresponding translated phrases for the phrases processed by the preprocessing module, and connecting them into a complete sentence;
and the post-processing module is used for carrying out word punctuation standardization, case standardization and format standardization processing on the translation result so as to enable the translation result to be closer to the expression habit of the target language and output as a final result.
2. The system of claim 1, wherein the text language receiving module comprises a sentence segmentation module and a format conversion module, the sentence segmentation module is configured to break the input text at punctuation marks, so that the basic unit translated by the subsequent machine translation module is a sentence; the format conversion module is used for converting the different formats of language texts into a format supported by the machine translation module during translation.
3. The system of claim 2, wherein the format supported by the machine translation module for translation is plain text format or XML format.
4. The system of claim 1, wherein the speech recognition result receiving module comprises a sentence segmentation module and a noise elimination module, the sentence segmentation module is configured to segment the input speech text stream according to word-to-word pauses; the noise cancellation module is configured to remove adjacent repeated segments from the stream of spoken text in the input.
5. The system of claim 1, wherein the speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module, the word normalization module is used for enabling word particles in the language to be translated to be closer to words in the target language; the punctuation prediction module is used for judging the position of a period in the speech recognition output according to the context and the pause between words; and the forms of speech recognition result that the speech recognition result preprocessing module can accept are plain text and a confusion network.
6. The system of claim 1, wherein the machine translation module comprises a training module and a translation module, and the training module learns phrase-to-phrase translations in a large-scale balanced corpus using the GIZA++ toolkit; the translation module is used for dividing each input sentence into phrase segments and translating each phrase segment according to the training result of the training module, wherein the translation process of the translation module is a search process, namely, an optimal translation combination is found among the translation combinations formed by the translation results of each translation sub-model, and the optimal translation combination is the final translation result.
7. The system of claim 6, wherein the translation sub-models comprise a phrase translation model, a language model, a word order change model, a part-of-speech based language model, a bilingual language model, and a domain adaptive model.
8. The system of claim 1, wherein the post-processing module comprises a word punctuation normalization module, a case conversion module and a format conversion module, the word punctuation normalization module is used for normalizing words and punctuation in the machine translation result into an expression form of the target language; the case conversion module is used for translation with a Western language as the target language; the format conversion module is used for making the format of the translated target language consistent with the format of the language to be translated.
9. The system of claim 8, wherein the case conversion module is configured to capitalize sentence-initial letters and proper nouns in the target language.
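The noise elimination recited in claim 4 (removing adjacent repeated segments from the input speech text stream) can be illustrated with the following sketch. The function name, the token-based representation and the limit max_len on the length of a repeated segment are assumptions made for illustration; the claims do not specify how repeats are detected.

```python
def remove_adjacent_repeats(tokens, max_len=3):
    """Collapse immediately repeated segments of up to max_len tokens.

    After each token is appended, the end of the output is checked for a
    segment that is identical to the segment directly before it; if one
    is found, the second copy is dropped.
    """
    out = []
    for tok in tokens:
        out.append(tok)
        for n in range(max_len, 0, -1):
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]
                break
    return out


print(" ".join(remove_adjacent_repeats("i want i want to to go".split())))
```

A rule of this kind also removes intentional repetitions such as "very very", so a practical system would likely need additional conditions before deleting a segment.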
CN201710203439.8A 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system Expired - Fee Related CN107066455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710203439.8A CN107066455B (en) 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system

Publications (2)

Publication Number Publication Date
CN107066455A CN107066455A (en) 2017-08-18
CN107066455B true CN107066455B (en) 2020-07-28

Family

ID=59601701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710203439.8A Expired - Fee Related CN107066455B (en) 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system

Country Status (1)

Country Link
CN (1) CN107066455B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783968B (en) * 2017-11-23 2021-04-02 浪潮金融信息技术有限公司 Language conversion method, device, readable medium and storage controller
CN108519963B (en) * 2018-03-02 2021-12-03 山东科技大学 Method for automatically converting process model into multi-language text
CN108563644A (en) * 2018-03-29 2018-09-21 河南工学院 A kind of English Translation electronic system
CN108647267A (en) * 2018-04-28 2018-10-12 广东金贝贝智能机器人研究院有限公司 One kind being based on internet big data robot Internet of things system
CN109213851B (en) * 2018-07-04 2021-05-25 中国科学院自动化研究所 Cross-language migration method for spoken language understanding in dialog system
CN110858268B (en) * 2018-08-20 2024-03-08 北京紫冬认知科技有限公司 Method and system for detecting unsmooth phenomenon in voice translation system
CN115455988A (en) * 2018-12-29 2022-12-09 苏州七星天专利运营管理有限责任公司 High-risk statement processing method and system
CN110032934A (en) * 2019-03-07 2019-07-19 永德利硅橡胶科技(深圳)有限公司 The implementation method and Related product of Quan Yutong based on picture
CN112584252B (en) * 2019-09-29 2022-02-22 深圳市万普拉斯科技有限公司 Instant translation display method and device, mobile terminal and computer storage medium
CN111401052A (en) * 2020-04-24 2020-07-10 南京莱科智能工程研究院有限公司 Semantic understanding-based multilingual text matching method and system
CN111654658B (en) * 2020-06-17 2022-04-15 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN113706977A (en) * 2020-08-13 2021-11-26 苏州韵果莘莘影视科技有限公司 Playing method and system based on intelligent sign language translation software
CN112764535A (en) * 2021-01-08 2021-05-07 温州职业技术学院 System for realizing multi-language information exchange
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN116050420B (en) * 2022-11-12 2023-09-22 武汉大学 Chinese and French voice semantic recognition method and device based on preposition sentence
CN116453132B (en) * 2023-06-14 2023-09-05 成都锦城学院 Japanese kana and Chinese character recognition method, equipment and memory based on machine translation
CN116822517B (en) * 2023-08-29 2023-11-10 百舜信息技术有限公司 Multi-language translation term identification method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361064A (en) * 2005-12-16 2009-02-04 Emil有限公司 A text editing apparatus and method
CN102650987A (en) * 2011-02-25 2012-08-29 北京百度网讯科技有限公司 Machine translation method and device both based on source language repeat resource
CN103116578A (en) * 2013-02-07 2013-05-22 北京赛迪翻译技术有限公司 Translation method integrating syntactic tree and statistical machine translation technology and translation device
CN103164399A (en) * 2013-02-26 2013-06-19 北京捷通华声语音技术有限公司 Punctuation addition method and device in speech recognition
CN103956162A (en) * 2014-04-04 2014-07-30 上海元趣信息技术有限公司 Voice recognition method and device oriented towards child
CN104391839A (en) * 2014-11-13 2015-03-04 百度在线网络技术(北京)有限公司 Method and device for machine translation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697201B2 (en) * 2014-11-24 2017-07-04 Microsoft Technology Licensing, Llc Adapting machine translation data using damaging channel model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
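Research on Corpus Processing and Evaluation Techniques for Statistical Machine Translation; Yao Shujie; China Master's Theses Full-text Database, Information Science and Technology; China Academic Journal (CD Edition) Electronic Publishing House; 2013-04-15 (No. 04); pp. 1, 7-8, 16-17, Fig. 2.1, Table 3.1 *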

Also Published As

Publication number Publication date
CN107066455A (en) 2017-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200728

Termination date: 20210330