CN110096715A - Chinese-Vietnamese statistical machine translation method fusing pronunciation features - Google Patents

Publication number
CN110096715A
Authority
CN
China
Prior art keywords
chinese
vietnamese
tone
syllable
vowel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910382004.3A
Other languages
Chinese (zh)
Inventor
史树敏
罗丹
黄河燕
陈友英
苏超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Application filed by Beijing Institute of Technology BIT
Priority to CN201910382004.3A
Publication of CN110096715A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features, belonging to the technical field of machine translation and feature fusion applications. Using a Chinese-Vietnamese parallel corpus, the method statistically obtains the correlations between Chinese pinyin initials and Vietnamese consonants, between pinyin finals and Vietnamese vowels, and between the tones of the two languages. The Chinese corpus, originally pure Chinese characters, is converted into a format of Chinese characters supplemented with pinyin, initial, final and tone; the Vietnamese corpus, originally pure syllables, is converted into a format of syllables supplemented with consonant, vowel and tone. The formatted corpora are then input into the machine translation model for training, making full use of the linguistic information specific to the Chinese-Vietnamese language pair. The method reduces the dependence of low-resource statistical machine translation on large-scale corpora, overcomes the inability of traditional phrase-based statistical machine translation to incorporate pronunciation features, and improves machine translation performance between low-resource languages.

Description

Chinese-Vietnamese statistical machine translation method fusing pronunciation features
Technical field
The present invention relates to a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features, and more particularly to a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features on the basis of the factored translation model (Factored Translation Model, FTM). It belongs to the technical field of machine translation and feature fusion applications.
Background
In recent years, machine translation (Machine Translation, MT) has achieved marked improvements in multiple translation evaluation tasks. Statistical machine translation is regarded as the most classic method in machine translation: it first models the translation process of the entire source sentence mathematically, forming a probabilistic model from the source language to the target language, and then searches for the path with the highest probability to form the optimal translation. However, statistical machine translation between low-resource languages yields very poor translation quality because available training corpora are scarce.
Chinese-Vietnamese is a low-resource language pair: high-quality, large-scale parallel corpora and the related preprocessing tools are extremely scarce, which makes the quality of Chinese-Vietnamese statistical machine translation poor. 65% of the words in Vietnamese are Sino-Vietnamese words, which originate from Chinese and are pronounced similarly to Chinese. Japanese, Korean and other languages share this feature. How to exploit the similarity between Chinese and Vietnamese pronunciation in order to reduce the dependence of machine translation on large-scale parallel corpora is a problem worth attention.
The traditional way to address the limited translation quality of low-resource languages is to introduce a pivot language. Applying this method to Chinese-Vietnamese statistical machine translation, however, requires large-scale pivot corpora based on Vietnamese, which cannot be obtained at present. Within statistical machine translation, phrase-based statistical machine translation is regarded as the most advanced method, but its defect is that linguistic knowledge such as morphology, syntax and semantics cannot be directly incorporated into the translation system. Other methods fuse the syntactic or morphological information of the two languages into the statistical translation model to address the quality limits of low-resource translation, but their results are still poor.
Summary of the invention
The purpose of the present invention is to overcome the technical defect that limited Chinese-Vietnamese machine translation resources lead to poor translation quality, and to this end it proposes a Chinese-Vietnamese statistical machine translation method fusing pronunciation features.
The Chinese-Vietnamese pronunciation correlations and concepts underlying the invention are as follows:
1) Like Chinese, Vietnamese is a tonal language without tense or conjugation; its syllable structure resembles Chinese pinyin and consists of consonants, vowels and tones;
2) Vietnamese and Chinese are both isolating languages, with no gaps between words;
3) Chinese pinyin comprises 23 initials, 36 finals and four tones; Vietnamese comprises 23 consonants, 16 vowels and five tones;
4) A Vietnamese pronunciation corresponds to a unique word, whereas the corresponding pinyin pronunciation corresponds to multiple Chinese characters;
The definitions used in the invention are as follows:
Definition 1: pronunciation correlation, which comprises initial correlation, final correlation and tone correlation;
Here, initial correlation refers to the degree of association between Chinese pinyin initials and Vietnamese consonants; final correlation refers to the degree of association between Chinese pinyin finals and Vietnamese vowels; tone correlation refers to the degree of association between Chinese pinyin tones and Vietnamese tones;
Definition 2: factor, the unit over which the translation probability between source language and target language is calculated when the factor-based statistical machine translation model generates a language model;
In phrase-based statistical machine translation, complete source and target sentences are first split into phrases, and the translation probability from source language to target language is then calculated over these phrases;
In factor-based statistical machine translation, the translation process is no longer based on phrases but on factors; in this application the factors are the initial, the final and the tone;
The factor-based statistical machine translation model, i.e. the Factored Translation Model, is abbreviated as FTM;
Definition 3: Chinese-Vietnamese bilingual corpus, a collection of aligned Chinese-Vietnamese bilingual documents; for every Chinese sentence in the Chinese corpus there is a semantically identical Vietnamese sentence in the Vietnamese corpus;
Definition 4: translation process, the process of generating the Chinese-Vietnamese language model;
Definition 5: generating process, the use of the language model produced by the translation process to translate from source language to target language, i.e. to generate the target language;
Definition 6: BLEU value, a general translation quality evaluation metric in the machine translation field; the larger the BLEU value, the better the translation.
The translation process and the generating process are the two processes that make up statistical machine translation;
A Chinese-Vietnamese statistical machine translation method fusing pronunciation features comprises the following steps:
Step 1: calculate the Chinese-Vietnamese initial correlation from the Chinese-Vietnamese bilingual corpus;
The initial correlation is calculated as follows: select the Chinese words that are similar to Vietnamese words in both pronunciation and meaning, extract the pinyin initials of these words, and calculate the proportion that each extracted pinyin initial accounts for among all extracted pinyin initials; this proportion is taken as the initial correlation between that pinyin initial and the Vietnamese consonant;
The initial correlation between a Chinese pinyin initial and a Vietnamese consonant is calculated by formula (1);
Here, n is the number of distinct pinyin initials extracted from the Chinese corpus that are related to a given Vietnamese consonant, i is the index of these pinyin initials, j is the index of the different Chinese words sharing the same pinyin initial, and m_i is the number of Chinese words represented by the i-th pinyin initial. The numerator of formula (1) is the count of the i-th pinyin initial related to the Vietnamese consonant, obtained by summing over the j-th Chinese words of that initial; the denominator is the number of all Chinese words related to that Vietnamese consonant;
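Formula (1) itself is not reproduced in this text. A reconstruction consistent with the definitions above and with the worked values of Embodiment 2 (for example 67/100 = 67%) might read as follows. The symbol p_i for the initial correlation of the i-th pinyin initial is taken from Embodiment 2; writing the numerator simply as m_i and using k as the summation index are assumptions, since the original symbols for the summed quantities are not recoverable.

```latex
% Assumed reconstruction of formula (1): share of the i-th pinyin initial
% among all Chinese words related to one Vietnamese sound.
\[
  p_i \;=\; \frac{m_i}{\sum_{k=1}^{n} m_k}, \qquad i = 1, \dots, n
\]
```

Under this reading the values p_1, ..., p_n for one Vietnamese sound sum to 1, and a larger p_i indicates a stronger association.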
Step 2: obtain the Chinese-Vietnamese final correlation from the Chinese-Vietnamese bilingual corpus;
Select the Chinese words that are similar to Vietnamese words in both pronunciation and meaning, extract the pinyin finals of these words, and calculate the proportion that each pinyin final accounts for among all extracted pinyin finals; this proportion is taken as the final correlation between the Vietnamese vowel and the pinyin final;
The final correlation between a Vietnamese vowel and a Chinese pinyin final is calculated by formula (2);
Here, n is the number of pinyin finals extracted from the Chinese corpus that are related to a given Vietnamese vowel, t is the index of these pinyin finals, k is the index of the different Chinese words sharing the same pinyin final, and m_t is the number of Chinese words of the t-th pinyin final. The numerator of formula (2) is the count of the t-th pinyin final related to the Vietnamese vowel, obtained by summing over the k-th Chinese words of that final; the denominator is the number of all Chinese words related to that Vietnamese vowel;
Step 3: obtain the Chinese-Vietnamese tone correlation directly from the Chinese-Vietnamese bilingual corpus. Specifically, the four Chinese tones 'ˉ', 'ˊ', 'ˇ' and 'ˋ' correspond to the Vietnamese huyền, sắc, hỏi and nặng tones respectively, and the Vietnamese ngã tone corresponds to the pinyin neutral tone;
The reason for Step 3 is that tones are few: pinyin has 5 tones including the neutral tone, and Vietnamese has 5 tones. Tones fall into far fewer categories than initials and finals, so the association between tones is simpler and clearer than the association between initials and finals;
Step 4: replace the tones of the Chinese corpus and the Vietnamese corpus with digits and separate the pronunciation features, comprising the following sub-steps:
Step 4.1: according to the tone correlation obtained in Step 3, replace the tones in the Chinese corpus and the Vietnamese corpus with consecutive digits, specifically:
1) the Chinese tone 'ˉ' and the Vietnamese huyền tone are replaced with the digit 1;
2) the Chinese tone 'ˊ' and the Vietnamese sắc tone are replaced with the digit 2;
3) the Chinese tone 'ˇ' and the Vietnamese hỏi tone are replaced with the digit 3;
4) the Chinese tone 'ˋ' and the Vietnamese nặng tone are replaced with the digit 4;
5) the Chinese neutral tone and the Vietnamese ngã tone are replaced with the digit 0;
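The digit substitution of step 4.1 can be sketched for the Chinese side as follows. This is a minimal illustration, not the patent's implementation: it assumes the pinyin is written with tone diacritics, and the function name `pinyin_tone_to_digit` and the table `TONE_MARKS` are hypothetical. A syllable without a tone mark is treated as neutral and receives 0, as in item 5 above.

```python
import unicodedata

# Combining diacritics for the four pinyin tones, mapped to the digits
# of step 4.1 (hypothetical helper table, not from the patent).
TONE_MARKS = {
    "\u0304": "1",  # macron, first tone
    "\u0301": "2",  # acute, second tone
    "\u030c": "3",  # caron, third tone
    "\u0300": "4",  # grave, fourth tone
}


def pinyin_tone_to_digit(syllable: str) -> str:
    """Strip the tone diacritic from one pinyin syllable and append its digit."""
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "0"  # neutral tone when no tone mark is present
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # keeps letters and the diaeresis of "ü"
    return unicodedata.normalize("NFC", "".join(kept)) + tone


sentence = "xué xí shǐ rén jìn bù"
numbered = " ".join(pinyin_tone_to_digit(s) for s in sentence.split())
print(numbered)
```

Decomposing to NFD before filtering keeps the diaeresis of "ü" intact while removing only the tone mark; the Vietnamese side would need its own mark table, which is not attempted here.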
Step 4.2: separate the pronunciation features of the Chinese corpus. Chinese sentences in pure Chinese-character form are converted into a text of initials, finals and tones, i.e. the converted text. For each token of the converted text, if the token is a number it is written in the form token|token|token; if the token is a pinyin syllable it is written in the form initial|final|tone;
Step 4.3: separate the pronunciation features of the Vietnamese corpus. The Vietnamese corpus in pure syllable form is converted into a text of consonants, vowels and tones, i.e. the converted text. For each token of the converted text, if the token is a number it is written in the form token|token|token; if the token is a syllable it is written in the form consonant|vowel|tone;
At this point, steps 4.1, 4.2 and 4.3 have produced the Chinese-Vietnamese bilingual corpus with separated pronunciation features;
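The separation of step 4.2 can be sketched as below for a pinyin syllable whose tone is already a trailing digit. The initial inventory is an assumption: the text gives only the count of 23 initials, so the list here is the common set including y and w. The names `split_pinyin` and `factor_token` are hypothetical, and a syllable without an initial is marked "0", following the convention mentioned in Embodiment 2.

```python
# Assumed inventory of 23 pinyin initials; two-letter initials are listed
# first so that "sh" is matched before "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_pinyin(numbered: str) -> tuple:
    """Split e.g. 'xue2' into (initial, final, tone)."""
    body, tone = numbered[:-1], numbered[-1]
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "0", body, tone  # zero-initial syllable such as 'an4'


def factor_token(char: str, numbered: str) -> str:
    """Build a character|pinyin|initial|final token for one Chinese character."""
    initial, final, _tone = split_pinyin(numbered)
    return f"{char}|{numbered}|{initial}|{final}"


print("|".join(split_pinyin("xue2")))
print(factor_token("广", "guang3"))
```

The first call reproduces the initial|final|tone form of step 4.2; the second reproduces the character|pinyin|initial|final tokens that appear in the worked examples of Embodiment 2.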
Step 5: extract the factors of the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4, specifically:
In the Chinese corpus, the Chinese character, the pronunciation PRc, the pinyin initial IN, the pinyin final FI and the pinyin tone TOc are extracted as the CF factors;
In the Vietnamese corpus, the Vietnamese syllable, the pronunciation PRv, the Vietnamese consonant CO, the Vietnamese vowel VO and the Vietnamese tone TOv are extracted as the VF factors;
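The factor extraction of Step 5 can be pictured with a small record type for the Chinese side. This is a sketch under assumptions: the token layout character|pinyin|initial|final is taken from the worked example in Embodiment 2, the tone is assumed to be the trailing digit of the pronunciation, and the names `ChineseFactors` and `parse_cf` are hypothetical. The factor names PRc, IN, FI and TOc follow the text.

```python
from typing import NamedTuple


class ChineseFactors(NamedTuple):
    """CF factors of one Chinese character (field names follow Step 5)."""
    word: str  # the Chinese character itself
    PRc: str   # pronunciation, pinyin with tone digit
    IN: str    # pinyin initial
    FI: str    # pinyin final
    TOc: str   # pinyin tone digit


def parse_cf(token: str) -> ChineseFactors:
    """Parse a token of the assumed form character|pinyin|initial|final."""
    word, pron, initial, final = token.split("|")
    return ChineseFactors(word, pron, initial, final, pron[-1])


cf = parse_cf("充|chong1|ch|ong")
print(cf)
```

A matching record for the VF factors (PRv, CO, VO, TOv) would have the same shape on the Vietnamese side.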
Step 6: set the correspondence between the CF factors and the VF factors and use the FTM to generate the Chinese-Vietnamese language model, with the following specific steps:
Step 6.1: set the correspondence between the CF factors and the VF factors, specifically:
The Chinese character in the Chinese corpus corresponds to the syllable in the Vietnamese corpus, the pinyin initial IN corresponds to the Vietnamese consonant CO, the pinyin final FI corresponds to the Vietnamese vowel VO, and the pinyin tone TOc corresponds to the Vietnamese tone TOv. The specific correspondences between a single pinyin initial IN and a single Vietnamese consonant CO, between a single pinyin final FI and a single Vietnamese vowel VO, and between a single pinyin tone TOc and a single Vietnamese tone TOv are set according to the Chinese-Vietnamese initial correlation, final correlation and tone correlation calculated in Step 1, Step 2 and Step 3;
Step 6.2: feed the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4 into the FTM, which calculates translation probabilities based on the CF factors and VF factors extracted in Step 5;
Step 6.3: with Chinese as the source language and Vietnamese as the target language, the FTM generates a Chinese-Vietnamese language model; with Vietnamese as the source language and Chinese as the target language, the FTM generates a Vietnamese-Chinese language model;
At this point, steps 6.1, 6.2 and 6.3 constitute the translation process;
Step 7: complete the translation using the language models obtained in step 6.3. When translating Chinese into Vietnamese, the language model generates Vietnamese in syllable-consonant-vowel-tone form; when translating Vietnamese into Chinese, the language model generates Chinese in character-initial-final-tone form;
Step 7 is the generating process;
Step 8: convert the Vietnamese in syllable-consonant-vowel-tone form generated in Step 7 into pure-syllable Vietnamese, and convert the generated Chinese in character-initial-final-tone form into pure-character Chinese.
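The conversion of Step 8 amounts to keeping only the surface form of each factored token. A minimal sketch, assuming tokens are separated by spaces and that the surface form is the first "|"-separated field, as in the Chinese examples of Embodiment 2; the function name `strip_factors` and the `sep` parameter are hypothetical.

```python
def strip_factors(factored: str, sep: str = "") -> str:
    """Keep the first field of every space-separated factored token.

    sep is the string placed between surface forms: "" suits Chinese
    characters, " " suits Vietnamese syllables.
    """
    surfaces = [token.split("|", 1)[0] for token in factored.split()]
    return sep.join(surfaces)


chinese = strip_factors("广|guang3|g|uang 告|gao4|g|ao")
print(chinese)
```

For Vietnamese output the same function would be called with `sep=" "`, so that syllables stay separated by spaces.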
Beneficial effects
Compared with the prior art, the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the invention has the following beneficial effects:
The method extracts Chinese-Vietnamese pronunciation features, which is a new approach in the field of Chinese-Vietnamese statistical machine translation; it reduces the dependence of statistical machine translation on large-scale parallel corpora and improves Chinese-Vietnamese translation quality.
Brief description of the drawings
Fig. 1 is a schematic diagram of the implementation process of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the invention.
Detailed description of the embodiments
The method of the invention is further described below with reference to the drawing and the embodiments.
Embodiment 1
Fig. 1 is the flow chart of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the invention and of this embodiment.
As can be seen from Fig. 1, the invention comprises the following steps:
Step A: calculate the pronunciation correlation;
Specifically, calculate the initial correlation, the final correlation and the tone correlation;
In this embodiment, the acquired bilingual corpus is used to calculate the initial correlation between each Vietnamese consonant and the pinyin initials, as in Step 1; to calculate the final correlation between each Vietnamese vowel and the pinyin finals, as in Step 2; and to calculate the tone correlation between each Vietnamese tone and the pinyin tones, as in Step 3;
Step B: separate the pronunciation features;
In this embodiment this is identical to steps 4.1, 4.2 and 4.3;
Step C: extract the CF factors and the VF factors;
In this embodiment, factors are extracted from the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step B, and serve as the units over which the FTM calculates translation probabilities in the translation process, as in Step 5;
Step D: input the Chinese-Vietnamese bilingual corpus into the FTM;
In this embodiment this is identical to step 6.2;
Step E: generate the language model;
In this embodiment this is identical to step 6.3;
Step F: generate the translation;
In this embodiment this is identical to Step 7 and Step 8;
At this point, steps A to F complete the Chinese-Vietnamese statistical machine translation method fusing pronunciation features.
Embodiment 2
This embodiment uses the Vietnamese consonant b, the Chinese sentence "Learning makes people progress" (学习使人进步) and the Chinese sentence "A journey full of hope gives people more joy than arriving at the destination" (充满希望的跋涉比到达目的地更能给人乐趣) as examples to describe in detail the concrete operating steps of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the invention.
The processing flow of the method is shown in Fig. 1. As can be seen from Fig. 1, the Chinese-Vietnamese statistical machine translation method fusing pronunciation features comprises the following steps:
Step A1: calculate the pronunciation correlation
From the Chinese-Vietnamese bilingual corpus, extract 100 Chinese words that are similar in pronunciation and meaning to Vietnamese words with the consonant b;
Chinese-Vietnamese is a low-resource language pair, and the bilingual corpus is as defined in Definition 3.
In the implementation, the corpus is collected from public sources on the Internet and covers multiple domains such as news, film subtitles and everyday expressions. The Chinese-Vietnamese material downloaded from the various websites is organized into a large-scale Chinese-Vietnamese bilingual corpus of 550,000 items, which is used to calculate the pronunciation correlation.
The calculation of the initial correlation between the Vietnamese consonant b and the pinyin initials is taken as an example to describe in detail how the initial correlation is calculated;
For this embodiment, the correlation results between the Vietnamese consonant b and the pinyin initials are shown in Table 1:
Table 1. Initial correlation calculation results of Step 1
From the 100 Chinese words similar in pronunciation and meaning to Vietnamese words with the consonant b, four pinyin initials are extracted: b, f, m and p; Chinese words without an initial are marked with 0. There are 67 Chinese words with initial b, 1 with initial f, 1 with initial m, and 30 with initial p. Let the initial correlations of the pinyin initials b, f, m and p with the Vietnamese consonant b be p1, p2, p3 and p4. Formula (1) gives p1 = 67/100 = 67%, p2 = p3 = 1/100 = 1% and p4 = 30/100 = 30%. A larger probability value indicates a stronger association between the pinyin initial and the Vietnamese consonant. Judging by the size of the values p_i, the pinyin initials related to the Vietnamese consonant b in this example are b and p. The initial correlations of the other Chinese-Vietnamese pairs are calculated in the same way as for the Vietnamese consonant b. Because the amount of calculation is very large, the calculation of the initial correlation between every Vietnamese consonant and the pinyin initials is not enumerated here;
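The arithmetic of this worked example can be reproduced with a few lines. This is only a sketch: the function name `correlations` is hypothetical, the counts are the four stated in the text, and the total of 100 is passed explicitly because the four stated counts alone add up to 99.

```python
from typing import Dict, Optional


def correlations(counts: Dict[str, int],
                 total: Optional[int] = None) -> Dict[str, float]:
    """Share of each pinyin initial among the Chinese words related to one
    Vietnamese sound; total defaults to the sum of the counts."""
    if total is None:
        total = sum(counts.values())
    return {initial: n / total for initial, n in counts.items()}


# Counts stated in the text for Vietnamese b (100 Chinese words in total).
counts_b = {"b": 67, "f": 1, "m": 1, "p": 30}
p = correlations(counts_b, total=100)

# The two initials with the largest shares.
strongest = sorted(p, key=p.get, reverse=True)[:2]
print(p, strongest)
```

Ranking by share selects b and p, which matches the conclusion drawn in the text.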
The pinyin finals related to each Vietnamese vowel are counted with the same method used in Step A1 to calculate the initial correlation between the Vietnamese consonant b and the pinyin initials, so no further example is given here;
The correspondence between tones is shown in Table 2.
Table 2. Tone correlation results of Step 3
Pinyin has four tones, 'ˉ', 'ˊ', 'ˇ' and 'ˋ'. The first to fourth tones correspond to the Vietnamese huyền, sắc, hỏi and nặng tones, and the Vietnamese ngã tone corresponds to the pinyin neutral tone;
Step B1: separate the pronunciation features. For the example "Learning makes people progress" (学习使人进步), the separation proceeds as follows:
Following the tone correlation listed in Table 2, the first to fourth pinyin tones are replaced with the consecutive digits 1, 2, 3 and 4, and the Vietnamese huyền, sắc, hỏi and nặng tones are replaced with 1, 2, 3 and 4. The neutral tone and the ngã tone are replaced with 0.
The Chinese sentence "Learning makes people progress" is first converted into the pinyin form "xué xí shǐ rén jìn bù", and then into the converted text "x|ue|2 x|i|2 sh|i|3 r|en|2 j|in|4 b|u|4".
The Vietnamese sentence corresponding to the Chinese "Learning makes people progress" is [Vietnamese text not preserved]; it is likewise converted into the converted text [Vietnamese text not preserved].
Step C1: extract the CF factors and the VF factors;
From the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4, the pronunciation features are extracted as factors. For the example "A journey full of hope gives people more joy than arriving at the destination" (充满希望的跋涉比到达目的地更能给人乐趣), the factor extraction results are as follows:
First, for the Chinese sentence "充满希望的跋涉比到达目的地更能给人乐趣" in the Chinese-Vietnamese parallel corpus, the Chinese sentence after pronunciation feature separation in Step 4 is "充|chong1|ch|ong 满|man3|m|an 希|xi1|x|i 望|wang4|w|ang 的|de5|d|e 跋|ba2|b|a 涉|she4|sh|e 比|bi3|b|i 到|dao4|d|ao 达|da2|d|a 目|mu4|m|u 的|de5|d|e 地|di4|d|i 更|geng4|g|eng 能|neng2|n|eng 给|gei3|g|ei 人|ren2|r|en 乐|le4|l|e 趣|qu4|q|u";
The pinyin pronunciation (PRc), initial (IN), final (FI) and tone (TOc) are extracted as the CF factors;
The Vietnamese sentence corresponding to this Chinese sentence is [Vietnamese text not preserved]; after pronunciation feature separation in Step 4 it becomes [Vietnamese text not preserved];
The Vietnamese pronunciation (PRv), consonant (CO), vowel (VO) and tone (TOv) are extracted as the VF factors;
Step D1: input the Chinese-Vietnamese bilingual corpus into the FTM. For the example "advertisement" (广告), the process is as follows:
Each factor of the Chinese pinyin is put in correspondence with a factor of the Vietnamese. Specifically, "广告" in Chinese corresponds to the Vietnamese [Vietnamese text not preserved];
After pronunciation feature separation in steps 4.1 and 4.2, the character "广" has the format "广|guang3|g|uang"; after pronunciation feature separation in steps 4.1 and 4.3, the Vietnamese [Vietnamese text not preserved] has the format [Vietnamese text not preserved]. "广" corresponds to [Vietnamese text not preserved], guang3 corresponds to [Vietnamese text not preserved], g corresponds to q, and uang corresponds to [Vietnamese text not preserved];
The specific correspondence rules are determined by the pronunciation correlations calculated in Step 1 and Step 2.
The Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4 is input into the FTM, and the FTM calculates translation probabilities based on the factor correspondence of step 6.1;
Step E1: generate the language model;
In this embodiment, the FTM generates the Chinese-Vietnamese language model by training on the bilingual corpus with separated pronunciation features;
Step F1: generate the translation;
Using the language model obtained in Step E1, the translation for the example "advertisement" (广告) is generated as follows: when translating from Chinese to Vietnamese, Vietnamese with separated pronunciation features of the form [Vietnamese text not preserved] is generated; when translating from Vietnamese to Chinese, Chinese with separated pronunciation features of the form "广|guang3|g|uang 告|gao4|g|ao" is generated;
The generated Vietnamese of the form [Vietnamese text not preserved] is converted into pure-syllable Vietnamese [Vietnamese text not preserved], and the Chinese of the form "广|guang3|g|uang 告|gao4|g|ao" is converted into the Chinese "广告" without pinyin;
Embodiment 3
It is effective in order to further verify a kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method of the present invention Property, the present embodiment gets over bilingual corpora using 550,000 Chinese used in embodiment 2, does not merge to fusion pronunciation character and pronunciation spy The statistical machine translation model of sign is tested, meanwhile, in order between the factor that is arranged in verification step 6.1 corresponded manner it is effective Property, bilingual corpora equally is got over based on 550,000 Chinese used in embodiment 2, it, will be upper provided with the corresponded manner between other factors The result for telling that experiment obtains compares and analyzes,
Comparing result is as shown in table 3.
3 contrast and experiment of table
The BLEU of table 3 is determined by defining 6, is tested 1- not fusion factor, is by the Chinese data of pure Chinese and pure tone section Vietnamese corpus for translation model training, in the experiment 2 of fusion factor, in experiment 3 and experiment 4, experiment 2 is provided with this The corresponded manner of the factor in invention, i.e. initial consonant-vowel, simple or compound vowel of a Chinese syllable-consonant, tone-tone, experiment 3 are provided with initial consonant-vowel It is corresponding, it is not provided with simple or compound vowel of a Chinese syllable-consonant and tone-tone correspondence, experiment 4 are provided with simple or compound vowel of a Chinese syllable-consonant correspondence, do not set Set initial consonant-vowel and tone-tone correspondence.From the results shown in Table 3, it is experiment 2 that BLEU value is highest, based on this Invent the fusion pronunciation character proposed and setting initial consonant-vowel, simple or compound vowel of a Chinese syllable-consonant, tone-tone factor corresponding method, experiment 1 The experimental result BLEU for not merging pronunciation character is minimum, and experiment 3 and 4 factor corresponded manners of experiment are different from experiment 2, experiment 3 And the BLEU result of experiment 4 is lower than the BLEU value that experiment 2 obtains.
Table 3 thus shows that the Chinese-Vietnamese statistical machine translation method fusing pronunciation features improves translation quality over the conventional method that does not fuse pronunciation features, and that the factor correspondence proposed by the invention improves Chinese-Vietnamese translation quality further than the other factor correspondences tested.
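The experiments above are ranked by BLEU, which the text only names as a general translation quality metric. As a minimal sketch of how such a score is commonly computed, the Python below implements a single-reference, sentence-level BLEU with clipped n-gram precision and a brevity penalty. The function names, the whitespace tokenization and the absence of smoothing are assumptions of this sketch; it is not the evaluation script behind Table 3.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Corpus-level BLEU, as normally reported in tables of this kind, pools the n-gram counts over all sentences before taking the geometric mean, so per-sentence scores from this sketch should not be averaged to reproduce a reported value.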
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention, which is defined by the appended claims and their equivalents.

Claims (5)

1. A Chinese-Vietnamese statistical machine translation method fusing pronunciation features, characterized in that the Chinese-Vietnamese pronunciation correlations and concepts involved are as follows:
1) Vietnamese, like Chinese, is a tonal language without tense or conjugation; its syllables are built much like Chinese pinyin, from vowels, consonants and tones;
2) Vietnamese and Chinese are both isolating languages, with no gaps between words;
3) Chinese pinyin comprises 23 initials, 36 finals and four tones; Vietnamese comprises 23 vowels, 16 consonants and five tones;
4) a Vietnamese pronunciation corresponds to a unique word, whereas the corresponding Chinese pinyin pronunciation corresponds to multiple Chinese characters;
The definitions used in the invention are as follows:
Definition 1: pronunciation correlation, comprising initial correlation, final correlation and tone correlation;
Wherein initial correlation is the degree of association between a Chinese pinyin initial and a Vietnamese vowel; final correlation is the degree of association between a Chinese pinyin final and a Vietnamese consonant; tone correlation is the degree of association between a Chinese pinyin tone and a Vietnamese tone;
Definition 2: factor, the unit over which the translation probability between the source language and the target language is computed when a factor-based statistical machine translation model generates a language model;
In phrase-based statistical machine translation, the complete sentences of the source and target languages are first split into phrases, and the translation probability from the source language to the target language is then computed over these phrases;
In factor-based statistical machine translation, the translation process is based on factors rather than phrases;
Wherein the factor-based statistical machine translation model, i.e. the Factored Translation Model, is abbreviated as FTM;
Definition 3: Chinese-Vietnamese bilingual corpus, a Chinese-Vietnamese parallel bilingual document in which every Chinese sentence in the Chinese corpus has a semantically identical Vietnamese sentence in the Vietnamese corpus;
Definition 4: translation process, the process of generating the Chinese-Vietnamese language model;
Definition 5: generating process, the process of translating the source language into the target language, i.e. generating the target language, using the language model produced by the translation process;
Definition 6: BLEU value, the translation quality evaluation metric commonly used in the machine translation field;
The translation process and the generating process are the two processes that statistical machine translation comprises;
The Chinese-Vietnamese statistical machine translation method comprises the following steps:
Step 1: compute the Chinese-Vietnamese initial correlation from the Chinese-Vietnamese bilingual corpus;
The initial correlation between a Chinese pinyin initial and a Vietnamese vowel is computed by formula (1);
Wherein n is the number of distinct Chinese pinyin initials related to a given Vietnamese vowel that are extracted from the Chinese corpus, i is the index of these pinyin initials, j is the index of the distinct Chinese words sharing the same pinyin initial, and $m_i$ is the number of Chinese words represented by the i-th pinyin initial; the remaining terms of formula (1) denote, respectively, the count of the i-th pinyin initial related to the Vietnamese vowel, the number of Chinese words related to the Vietnamese vowel, and the j-th Chinese word of the i-th pinyin initial;
Step 2: obtain the Chinese-Vietnamese final correlation from the Chinese-Vietnamese bilingual corpus;
Wherein the final correlation between a Vietnamese consonant and a Chinese pinyin final is computed by formula (2);
Wherein n is the number of Chinese pinyin finals related to a given Vietnamese consonant that are extracted from the Chinese corpus, t is the index of these pinyin finals, k is the index of the distinct Chinese words sharing the same pinyin final, and $m_t$ is the number of Chinese words of the t-th pinyin final; the remaining terms of formula (2) denote, respectively, the count of the t-th pinyin final related to the Vietnamese consonant, the number of Chinese words related to the Vietnamese consonant, and the k-th Chinese word of the t-th pinyin final;
Step 3: obtain the Chinese-Vietnamese tone correlation directly from the Chinese-Vietnamese bilingual corpus;
Step 4: replace the tones of the Chinese corpus and the Vietnamese corpus with digits and separate the pronunciation features, comprising the following sub-steps:
Step 4.1: according to the tone correlation obtained in Step 3, replace the tones in the Chinese corpus and the Vietnamese corpus with consecutive digits;
Step 4.2: separate the pronunciation features of the Chinese corpus: convert the Chinese sentences in pure Chinese-character form into text made up of initials, finals and tones, i.e. the converted text; each word of the converted text is written in the form word|word|word if it is a number, and in the form initial|final|tone if it is pinyin;
Step 4.3: separate the pronunciation features of the Vietnamese corpus: convert the pure-syllable Vietnamese corpus into text made up of vowels, consonants and tones, i.e. the converted text; each word of the converted text is written in the form word|word|word if it is a number, and in the form consonant|vowel|tone if it is a syllable;
Thus, Steps 4.1, 4.2 and 4.3 yield the pronunciation-feature-separated Chinese-Vietnamese bilingual corpus;
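The Chinese side of the separation in Step 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes each token is already a pinyin syllable whose tone has been replaced by a trailing digit (Step 4.1) or is a plain number, it uses a 23-entry initial inventory that counts y and w as initials, and it writes a hypothetical placeholder `_` when a syllable has no initial. The name `separate_pinyin` is likewise invented for illustration. Vietnamese syllable splitting (Step 4.3) is not covered.

```python
# 23 pinyin initials; two-letter initials come first so they match before z, c, s.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def separate_pinyin(token, empty="_"):
    """Turn one tone-numbered pinyin token into initial|final|tone.

    Numbers are repeated in every field (word|word|word), as Step 4.2 describes.
    `empty` is an assumed placeholder for a missing initial or final.
    """
    if token.isdigit():
        return "|".join([token] * 3)
    syllable, tone = token[:-1], token[-1]  # assumes the last character is the tone digit
    initial = next((i for i in INITIALS if syllable.startswith(i)), "")
    final = syllable[len(initial):]
    return "|".join([initial or empty, final or empty, tone])


def separate_sentence(sentence):
    """Apply the separation to every whitespace-delimited token."""
    return " ".join(separate_pinyin(tok) for tok in sentence.split())
```

For example, `separate_sentence("ni3 hao3 2019")` yields `n|i|3 h|ao|3 2019|2019|2019`. Converting Chinese characters to pinyin in the first place needs a pronunciation lexicon, which the sketch leaves out.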
Step 5: extract the factors of the pronunciation-feature-separated Chinese-Vietnamese bilingual corpus obtained in Step 4, specifically:
In the Chinese corpus, the Chinese word, pronunciation PRc, pinyin initial IN, pinyin final FI and pinyin tone TOc are extracted as the CF factors;
In the Vietnamese corpus, the Vietnamese word, pronunciation PRv, Vietnamese vowel CO, Vietnamese consonant VO and Vietnamese tone TOv are extracted as the VF factors;
Step 6: set the correspondence between the CF factors and the VF factors and generate the Chinese-Vietnamese language model using the FTM, with the following sub-steps:
Step 6.1: set the correspondence between the CF factors and the VF factors, specifically:
The Chinese word in the Chinese corpus corresponds to the syllable in the Vietnamese corpus, the pinyin initial IN corresponds to the Vietnamese vowel CO, the pinyin final FI corresponds to the Vietnamese consonant VO, and the pinyin tone TOc corresponds to the Vietnamese tone TOv; the specific correspondences between an individual pinyin initial IN and an individual Vietnamese vowel CO, between an individual pinyin final FI and an individual Vietnamese consonant VO, and between an individual pinyin tone TOc and an individual Vietnamese tone TOv are set according to the Chinese-Vietnamese initial correlation, final correlation and tone correlation obtained in Steps 1, 2 and 3;
Step 6.2: feed the pronunciation-feature-separated Chinese-Vietnamese bilingual corpus obtained in Step 4 into the FTM, which computes the translation probabilities based on the CF factors and VF factors extracted in Step 5;
Step 6.3: with Chinese as the source language and Vietnamese as the target language, the FTM generates a Chinese-Vietnamese language model; with Vietnamese as the source language and Chinese as the target language, the FTM generates a Vietnamese-Chinese language model;
Thus, Steps 6.1, 6.2 and 6.3 constitute the translation process;
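The factor correspondence of Step 6.1 can be pictured as a small lookup table. The sketch below is only an illustration: the factor names come from the claim, but the dictionary layout, the function names `pair_factors` and `restrict`, and the idea of representing a token as a Python dict are assumptions. It does not show how any particular FTM toolkit is configured.

```python
# CF factor name -> VF factor name, as stated in Step 6.1.
FACTOR_CORRESPONDENCE = {
    "word": "syllable",  # Chinese word    <-> Vietnamese syllable
    "IN": "CO",          # pinyin initial  <-> Vietnamese vowel
    "FI": "VO",          # pinyin final    <-> Vietnamese consonant
    "TOc": "TOv",        # pinyin tone     <-> Vietnamese tone
}


def pair_factors(cf_token, vf_token, correspondence=FACTOR_CORRESPONDENCE):
    """Return the (CF value, VF value) pair for every configured correspondence."""
    return {
        (cf_name, vf_name): (cf_token[cf_name], vf_token[vf_name])
        for cf_name, vf_name in correspondence.items()
    }


def restrict(cf_names):
    """Keep only the listed correspondences, e.g. to try a reduced setting."""
    return {k: v for k, v in FACTOR_CORRESPONDENCE.items() if k in cf_names}
```

A reduced table such as `restrict({"IN"})` is one hypothetical way to express a setting in which only the initial-vowel correspondence is active.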
Step 7: complete the translation using the language models obtained in Step 6.3: when translating Chinese into Vietnamese, the language model generates Vietnamese in syllable-vowel-consonant-tone form; when translating Vietnamese into Chinese, the language model generates Chinese in character-initial-final-tone form;
Step 7 is the generating process;
Step 8: convert the Vietnamese in syllable-vowel-consonant-tone form generated in Step 7 into pure-syllable Vietnamese, and convert the generated Chinese in character-initial-final-tone form into Chinese in pure Chinese characters.
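Step 8 amounts to dropping the pronunciation factors from the generated output. A minimal sketch, assuming the factors of each token are joined with `|` and that the first factor is already the plain surface form to be kept; both the delimiter handling and the name `to_surface` are assumptions, and the factor values in the usage note are illustrative placeholders.

```python
def to_surface(factored_sentence, joiner=" "):
    """Keep only the first factor of every whitespace-delimited token.

    Use joiner="" for Chinese output, which is written without spaces.
    """
    return joiner.join(
        token.split("|", 1)[0] for token in factored_sentence.split()
    )
```

For instance, `to_surface("你|n|i|3 好|h|ao|3", joiner="")` returns `你好`.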
2. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that the initial correlation in Step 1 is computed as follows: select the Chinese words that are similar to Vietnamese in both pronunciation and meaning, extract the pinyin initials of these words, and compute the proportion of each extracted pinyin initial among all extracted pinyin initials; this proportion is taken as the initial correlation between that pinyin initial and the Vietnamese vowel.
3. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that Step 2 is specifically: select the Chinese words that are similar to Vietnamese in both pronunciation and meaning, extract the pinyin finals from these words, and compute the proportion of each pinyin final among all extracted pinyin finals; this proportion is taken as the final correlation between the Vietnamese consonant and that pinyin final.
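The proportion described in claims 2 and 3 can be sketched as a relative-frequency count. The code below is a hedged illustration rather than formulas (1) and (2) themselves: it assumes the similar-sounding, similar-meaning word pairs have already been selected and reduced to (Vietnamese unit, pinyin unit) pairs, and the function name `unit_correlation` is hypothetical. The same function serves for initials against vowels and for finals against consonants.

```python
from collections import Counter, defaultdict


def unit_correlation(pairs):
    """Share of each pinyin unit among all pinyin units seen with a Vietnamese unit.

    pairs: iterable of (vietnamese_unit, pinyin_unit), one per selected word pair.
    Returns {vietnamese_unit: {pinyin_unit: proportion}}.
    """
    counts = defaultdict(Counter)
    for vi_unit, zh_unit in pairs:
        counts[vi_unit][zh_unit] += 1
    return {
        vi_unit: {zh: c / sum(ctr.values()) for zh, c in ctr.items()}
        for vi_unit, ctr in counts.items()
    }
```

With placeholder data such as `[("V1", "b"), ("V1", "b"), ("V1", "p")]`, the pinyin unit `b` receives a proportion of 2/3 for `V1` and `p` receives 1/3.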
4. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that Step 4.1 is specifically:
1) the Chinese tone 'ˉ' and the Vietnamese huyền tone are replaced with the digit 1;
2) the Chinese tone 'ˊ' and the Vietnamese sắc tone are replaced with the digit 2;
3) the Chinese tone 'ˇ' and the Vietnamese hỏi tone are replaced with the digit 3;
4) the Chinese tone 'ˋ' and the Vietnamese nặng tone are replaced with the digit 4;
5) the Chinese neutral tone and the Vietnamese ngã tone are replaced with the digit 0.
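The digit replacement listed in this claim can be sketched with Unicode decomposition: each syllable is split into base letters and combining marks, the tone mark is looked up, and the remaining marks (which carry vowel quality, such as ê or ă) are recomposed. Several points are assumptions of this sketch: reading the five Vietnamese tone names as huyền, sắc, hỏi, nặng and ngã and identifying them with the grave, acute, hook-above, dot-below and tilde marks; treating an unmarked pinyin syllable as neutral tone; and returning `None` for an unmarked Vietnamese syllable, a case the claim does not address. The function names are hypothetical.

```python
import unicodedata

# Combining tone marks -> digits, following the five replacements in the claim.
PINYIN_TONES = {
    "\u0304": 1,  # macron (first tone)
    "\u0301": 2,  # acute (second tone)
    "\u030c": 3,  # caron (third tone)
    "\u0300": 4,  # grave (fourth tone)
}
VIETNAMESE_TONES = {
    "\u0300": 1,  # grave (huyen)
    "\u0301": 2,  # acute (sac)
    "\u0309": 3,  # hook above (hoi)
    "\u0323": 4,  # dot below (nang)
    "\u0303": 0,  # tilde (nga)
}


def replace_tone(syllable, tone_marks, unmarked):
    """Return (syllable without its tone mark, tone digit)."""
    tone = unmarked
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in tone_marks:
            tone = tone_marks[ch]
        else:
            kept.append(ch)  # base letters and vowel-quality marks are kept
    return unicodedata.normalize("NFC", "".join(kept)), tone


def pinyin_tone(syllable):
    # Assumption: a pinyin syllable without a tone mark is neutral tone (digit 0).
    return replace_tone(syllable, PINYIN_TONES, 0)


def vietnamese_tone(syllable):
    # Assumption: unmarked Vietnamese syllables are left without a digit (None).
    return replace_tone(syllable, VIETNAMESE_TONES, None)
```

Keeping the two tables separate matters because the same mark means different things on each side: the grave accent is digit 4 in pinyin but digit 1 in Vietnamese under this mapping.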
5. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that Step 3 is specifically: the four Chinese tones 'ˉ', 'ˊ', 'ˇ' and 'ˋ' correspond respectively to the Vietnamese huyền, sắc, hỏi and nặng tones, and the Vietnamese ngã tone corresponds to the pinyin neutral tone; the reason for Step 3 is that tones are few in number (pinyin has five tones including the neutral tone, and Vietnamese has five tones), tone categories are far fewer than initial and final categories, and the association between tones is simpler and clearer than the association between initials and finals.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910382004.3A CN110096715A (en) 2019-05-06 2019-05-06 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910382004.3A CN110096715A (en) 2019-05-06 2019-05-06 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method

Publications (1)

Publication Number Publication Date
CN110096715A true CN110096715A (en) 2019-08-06

Family

ID=67447432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910382004.3A Pending CN110096715A (en) 2019-05-06 2019-05-06 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method

Country Status (1)

Country Link
CN (1) CN110096715A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195372A1 (en) * 2007-02-14 2008-08-14 Jeffrey Chin Machine Translation Feedback
CN104978311A (en) * 2015-07-15 2015-10-14 昆明理工大学 Vietnamese word segmentation method based on conditional random fields
CN105740235A (en) * 2016-01-29 2016-07-06 昆明理工大学 Phrase tree to dependency tree transformation method capable of combining Vietnamese grammatical features
CN106202037A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese tree of phrases construction method based on chunk
CN106372241A (en) * 2016-09-18 2017-02-01 广西财经学院 Inter-word weighting associating mode-based Vietnamese-to-English cross-language text retrieval method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUU ANH TRAN et al.: "Integrating pronunciation into Chinese-Vietnamese statistical machine translation", Tsinghua Science and Technology *
TRAN HUU-ANH et al.: "Preordering for Chinese-Vietnamese Statistical Machine Translation", IEICE Transactions on Information and Systems *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506559A (en) * 2021-07-21 2021-10-15 成都启英泰伦科技有限公司 Method for generating pronunciation dictionary according to Vietnamese written text
CN113506559B (en) * 2021-07-21 2023-06-09 成都启英泰伦科技有限公司 Method for generating pronunciation dictionary according to Vietnam written text
CN113743053A (en) * 2021-08-17 2021-12-03 上海明略人工智能(集团)有限公司 Alphabet vector calculation method, system, storage medium and electronic device
CN113743053B (en) * 2021-08-17 2024-03-12 上海明略人工智能(集团)有限公司 Letter vector calculation method, system, storage medium and electronic equipment
CN113688283A (en) * 2021-08-27 2021-11-23 北京奇艺世纪科技有限公司 Method and device for determining matching degree of video subtitles and electronic equipment
CN113688283B (en) * 2021-08-27 2023-09-05 北京奇艺世纪科技有限公司 Method and device for determining video subtitle matching degree and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190806