CN110096715A - Chinese-Vietnamese statistical machine translation method fusing pronunciation features - Google Patents
- Publication number: CN110096715A (application number CN201910382004.3A)
- Authority: CN (China)
- Prior art keywords: chinese, vietnamese, tone, syllable, vowel
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features, belonging to the technical field of machine translation and feature fusion applications. Using a Chinese-Vietnamese parallel corpus, the method statistically obtains the correlations between Chinese pinyin initials and Vietnamese vowels, between pinyin finals and Vietnamese consonants, and between the tones of the two languages. The Chinese corpus, originally pure Chinese characters, is converted into a format in which each character is supplemented with pinyin, initial, final and tone; the Vietnamese corpus, originally pure syllables, is converted into a format in which each syllable is supplemented with vowel, consonant and tone. The formatted corpora are then fed into the machine translation model for training, making full use of the linguistic information specific to the Chinese-Vietnamese pair. The method reduces the dependence of low-resource statistical machine translation on large-scale corpora, overcomes the inability of traditional phrase-based statistical machine translation to incorporate pronunciation features, and improves machine translation performance between low-resource languages.
Description
Technical field
The present invention relates to a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features, and more particularly to a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features on the basis of the Factored Translation Model (FTM). It belongs to the technical field of machine translation and feature fusion applications.
Background art
In recent years, machine translation (MT) has improved markedly on many translation evaluation tasks. Statistical machine translation is regarded as the most classic approach in machine translation: it first builds a mathematical model of the translation of an entire source sentence, forming a probabilistic model from the source language to the target language, and then searches for the highest-probability path to produce the best translation. Between low-resource languages, however, statistical machine translation quality is very poor because usable training corpora are scarce.
Chinese-Vietnamese is a low-resource language pair: high-quality, large-scale parallel corpora and the related preprocessing tools are extremely scarce, which makes the quality of Chinese-Vietnamese statistical machine translation poor. About 65% of Vietnamese vocabulary consists of Sino-Vietnamese words, which originate from Chinese and are pronounced similarly to their Chinese counterparts. Japanese and Korean, among other languages, share this characteristic. How to exploit the similarity between Chinese and Vietnamese pronunciation in order to reduce the dependence of machine translation on large-scale parallel corpora is a question that deserves attention.
The traditional way to address the limited translation quality of low-resource languages is to introduce a pivot language. Applying this approach to Chinese-Vietnamese statistical machine translation, however, requires a large-scale pivot corpus involving Vietnamese, a requirement that cannot currently be met. Within statistical machine translation, the phrase-based approach is regarded as the state of the art, but its drawback is that linguistic knowledge such as morphology, syntax and semantics cannot be integrated directly into the translation system. Other methods integrate the syntactic or morphological information of the two languages into the statistical translation model to address the quality limits of low-resource translation, but their results remain unsatisfactory.
Summary of the invention
The purpose of the present invention is to overcome the technical defect that limited Chinese-Vietnamese machine translation resources lead to poor translation quality, by proposing a Chinese-Vietnamese statistical machine translation method that fuses pronunciation features.
The Chinese-Vietnamese pronunciation correlations and concepts underlying the present invention are as follows:
1) Like Chinese, Vietnamese is a tonal language without tense or conjugation; its syllable structure resembles Chinese pinyin and consists of vowels, consonants and tones;
2) Vietnamese and Chinese are both isolating languages, with no delimiters between words;
3) Chinese pinyin comprises 23 initials, 36 finals and four tones; Vietnamese comprises 23 vowels, 16 consonants and five tones;
4) A Vietnamese pronunciation corresponds to a unique word, whereas the matching Chinese pinyin pronunciation corresponds to multiple Chinese characters;
The definitions used in the present invention are as follows:
Definition 1: pronunciation correlation, comprising initial correlation, final correlation and tone correlation;
Here, initial correlation is the degree of association between Chinese pinyin initials and Vietnamese vowels; final correlation is the degree of association between Chinese pinyin finals and Vietnamese consonants; tone correlation is the degree of association between Chinese pinyin tones and Vietnamese tones;
Definition 2: factor, the unit over which the factor-based statistical machine translation model computes source-to-target translation probabilities when it generates the language model;
In phrase-based statistical machine translation, complete source and target sentences are first segmented into phrases, and the source-to-target translation probabilities are then computed over these phrases;
In factor-based statistical machine translation, translation is no longer based on phrases but on factors; in this application the factors are initials, finals and tones;
The factor-based statistical machine translation model, i.e. the Factored Translation Model, is abbreviated as FTM;
Definition 3: Chinese-Vietnamese bilingual corpus, a collection of aligned Chinese-Vietnamese bilingual documents; for every Chinese sentence in the Chinese corpus there is a semantically identical Vietnamese sentence in the Vietnamese corpus;
Definition 4: translation process, the process of generating the Chinese-Vietnamese language model;
Definition 5: generating process, the use of the language model produced by the translation process to translate from the source language to the target language, i.e. to generate the target language;
Definition 6: BLEU value, the general-purpose translation quality metric in the machine translation field; the larger the BLEU value, the better the translation.
The translation process and the generating process are the two processes that make up statistical machine translation;
A Chinese-Vietnamese statistical machine translation method fusing pronunciation features comprises the following steps:
Step 1: compute the Chinese-Vietnamese initial correlation from the Chinese-Vietnamese bilingual corpus;
The initial correlation is computed as follows: select the Chinese words that are similar to Vietnamese in both pronunciation and meaning, extract the pinyin initials of these words, and compute the proportion of each extracted pinyin initial among all extracted pinyin initials; this proportion is taken as the initial correlation between that pinyin initial and the Vietnamese vowel;
The initial correlation between Chinese pinyin initials and a Vietnamese vowel is computed by formula (1);
where n is the number of distinct Chinese pinyin initials, extracted from the Chinese corpus, that are related to a given Vietnamese vowel; i is the index of these pinyin initials; j is the index of the different Chinese words sharing the same pinyin initial; and m_i is the number of Chinese words represented by the i-th pinyin initial. The remaining terms of the formula (their symbols are not reproduced in this text) denote, in order: the count of the i-th pinyin initial related to the Vietnamese vowel; the number of Chinese words related to the Vietnamese vowel; and the j-th Chinese word of the i-th pinyin initial;
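The image for formula (1) is not reproduced in this text. Based on the symbol definitions above and on the worked example in Embodiment 2 (67 of 100 words giving 67%), one consistent reading is the relative frequency sketched below. The symbol w_{ij}, an indicator equal to 1 for the j-th Chinese word of the i-th pinyin initial, and the primed summation index are assumed names for illustration, not the patent's own notation.

```latex
p_i \;=\; \frac{\sum_{j=1}^{m_i} w_{ij}}{\sum_{i'=1}^{n} \sum_{j=1}^{m_{i'}} w_{i'j}}
    \;=\; \frac{m_i}{\sum_{i'=1}^{n} m_{i'}},
\qquad i = 1, \dots, n
```

Under this reading the values p_1, ..., p_n for one Vietnamese vowel sum to 1.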
Step 2: obtain the Chinese-Vietnamese final correlation from the Chinese-Vietnamese bilingual corpus;
Select the Chinese words that are similar to Vietnamese in both pronunciation and meaning, extract the pinyin finals from these words, and compute the proportion of each pinyin final among all extracted pinyin finals; this proportion is taken as the final correlation between the Vietnamese consonant and the pinyin final;
The final correlation between a Vietnamese consonant and Chinese pinyin finals is computed by formula (2);
where n is the number of Chinese pinyin finals, extracted from the Chinese corpus, that are related to a given Vietnamese consonant; t is the index of these pinyin finals; k is the index of the different Chinese words sharing the same pinyin final; and m_t is the number of Chinese words with the t-th pinyin final. The remaining terms of the formula (their symbols are not reproduced in this text) denote, in order: the count of the t-th pinyin final related to the Vietnamese consonant; the number of Chinese words related to the Vietnamese consonant; and the k-th Chinese word of the t-th pinyin final;
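Steps 1 and 2 both reduce to a relative-frequency count over the units (initials or finals) extracted from the selected Chinese words. A minimal Python sketch follows; the function name is hypothetical, and the sample data reuse the counts reported in Embodiment 2 for the Vietnamese unit b. Those counts sum to 99, so the sketch assumes the one remaining word is the no-initial case that the text marks with 0.

```python
from collections import Counter


def pronunciation_correlation(units):
    """Return {unit: share of that unit among all extracted units}."""
    counts = Counter(units)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {unit: count / total for unit, count in counts.items()}


# Pinyin initials of the 100 Chinese words related to the Vietnamese unit "b".
# The single "0" entry (word without an initial) is an assumption made so that
# the counts add up to 100.
initials = ["b"] * 67 + ["f"] + ["m"] + ["p"] * 30 + ["0"]
correlation = pronunciation_correlation(initials)
```

The same function can be applied unchanged to a list of pinyin finals to obtain the final correlation of Step 2.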
Step 3: obtain the Chinese-Vietnamese tone correlation directly from the Chinese-Vietnamese bilingual corpus. Specifically, the four Chinese tones '-', 'ˊ', 'ˇ', 'ˋ' correspond respectively to the Vietnamese huyền, sắc, hỏi and nặng tones; the Vietnamese ngã tone corresponds to the pinyin neutral tone;
The reason for Step 3 is that the number of tones is small: Chinese pinyin has 5 tones when the neutral tone is included, and Vietnamese has 5 tones. There are far fewer tone categories than initial or final categories, so the association between tones is simpler and clearer than the associations between initials and finals;
Step 4: replace the tones in the Chinese corpus and the Vietnamese corpus with digits and separate the pronunciation features, comprising the following sub-steps:
Step 4.1: according to the tone correlation obtained in Step 3, replace the tones in the Chinese corpus and the Vietnamese corpus with consecutive digits, specifically:
1) the Chinese tone '-' and the Vietnamese huyền tone are replaced with the digit 1;
2) the Chinese tone 'ˊ' and the Vietnamese sắc tone are replaced with the digit 2;
3) the Chinese tone 'ˇ' and the Vietnamese hỏi tone are replaced with the digit 3;
4) the Chinese tone 'ˋ' and the Vietnamese nặng tone are replaced with the digit 4;
5) the Chinese neutral tone and the Vietnamese ngã tone are replaced with the digit 0;
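The Vietnamese side of Step 4.1 can be sketched in Python by decomposing each syllable with Unicode normalization and looking up its combining tone mark. This is a minimal sketch, not the patent's implementation: the function and table names are hypothetical, the sample syllables are ordinary Vietnamese words chosen for illustration, and returning None for a syllable with no tone mark is an assumption, since the text does not say how unmarked syllables are numbered.

```python
import unicodedata

# Combining tone marks found in decomposed (NFD) Vietnamese text, mapped to
# the digits assigned in Step 4.1.
VIETNAMESE_TONE_DIGITS = {
    "\u0300": 1,  # grave accent: huyền
    "\u0301": 2,  # acute accent: sắc
    "\u0309": 3,  # hook above: hỏi
    "\u0323": 4,  # dot below: nặng
    "\u0303": 0,  # tilde: ngã
}


def vietnamese_tone_digit(syllable):
    """Return the Step 4.1 digit for a syllable, or None if it has no tone mark."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in VIETNAMESE_TONE_DIGITS:
            return VIETNAMESE_TONE_DIGITS[ch]
    return None  # assumption: unmarked syllables are left unnumbered


samples = ["là", "cáo", "quảng", "mẹ", "ngã", "ma"]
digits = [vietnamese_tone_digit(s) for s in samples]
```

Vowel-quality marks such as the circumflex, breve and horn are not in the table, so they are ignored and only the tone mark decides the digit.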
Step 4.2: separate the pronunciation features of the Chinese corpus: convert the Chinese sentences in pure Chinese-character form into text of initials, finals and tones, i.e. the converted text. For each word of the converted text, if the word is a number it is converted to the form word|word|word; if the word is pinyin it is converted to the form consonant|vowel|tone;
Step 4.3: separate the pronunciation features of the Vietnamese corpus: convert the pure-syllable Vietnamese corpus into text of vowels, consonants and tones, i.e. the converted text. For each word of the converted text, if the word is a number it is converted to the form word|word|word; if the word is a syllable it is converted to the form consonant|vowel|tone;
At this point, Steps 4.1, 4.2 and 4.3 yield the Chinese-Vietnamese bilingual corpus with separated pronunciation features;
Step 5: extract the factors of the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4, specifically:
In the Chinese corpus, extract the Chinese character, the pronunciation PRc, the pinyin initial IN, the pinyin final FI and the pinyin tone TOc as the CF factors;
In the Vietnamese corpus, extract the Vietnamese syllable, the pronunciation PRv, the Vietnamese vowel CO, the Vietnamese consonant VO and the Vietnamese tone TOv as the VF factors;
Step 6: set the correspondence between the CF factors and the VF factors and use the FTM to generate the Chinese-Vietnamese language model, as follows;
Step 6.1: set the correspondence between the CF factors and the VF factors, specifically:
Chinese characters in the Chinese corpus correspond to syllables in the Vietnamese corpus; the pinyin initial IN corresponds to the Vietnamese vowel CO; the pinyin final FI corresponds to the Vietnamese consonant VO; the pinyin tone TOc corresponds to the Vietnamese tone TOv. The specific correspondences between an individual pinyin initial IN and an individual Vietnamese vowel CO, between an individual pinyin final FI and an individual Vietnamese consonant VO, and between an individual pinyin tone TOc and an individual Vietnamese tone TOv are set according to the Chinese-Vietnamese initial correlation, final correlation and tone correlation computed in Steps 1, 2 and 3;
Step 6.2: feed the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4 into the FTM; the FTM computes the translation probabilities based on the CF factors and VF factors extracted in Step 5;
Step 6.3: with Chinese as the source language and Vietnamese as the target language, the FTM generates a Chinese-Vietnamese language model; with Vietnamese as the source language and Chinese as the target language, the FTM generates a Vietnamese-Chinese language model;
At this point, Steps 6.1, 6.2 and 6.3 constitute the translation process;
Step 7: translate using the language models obtained in Step 6.3. When translating Chinese into Vietnamese, the language model generates Vietnamese in syllable-vowel-consonant-tone form; when translating Vietnamese into Chinese, the language model generates Chinese in character-initial-final-tone form;
Step 7 is the generating process;
Step 8: convert the Vietnamese in syllable-vowel-consonant-tone form generated in Step 7 into pure-syllable Vietnamese, and convert the generated Chinese in character-initial-final-tone form into pure-character Chinese.
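Step 8 amounts to keeping only the surface form (the first "|"-separated field) of every generated token. A minimal Python sketch follows, assuming tokens are separated by whitespace and factors by "|" as in the factored examples of Embodiment 2; the function name is hypothetical, and the Chinese sample uses the characters 广 and 告 ("advertisement", guang3 gao4).

```python
def strip_factors(factored_sentence, joiner=""):
    """Keep only the surface form (first field) of each factored token."""
    surfaces = [token.split("|", 1)[0] for token in factored_sentence.split()]
    return joiner.join(surfaces)


# Chinese output is joined without spaces.
chinese = strip_factors("广|guang3|g|uang 告|gao4|g|ao")
```

For Vietnamese output, passing joiner=" " keeps a space between syllables.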
Beneficial effects
Compared with the prior art, the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the present invention has the following beneficial effects:
The method extracts Chinese-Vietnamese pronunciation features, which is a new approach in the field of Chinese-Vietnamese statistical machine translation; it reduces the dependence of statistical machine translation on large-scale parallel corpora and improves Chinese-Vietnamese translation quality.
Brief description of the drawings
Fig. 1 is a schematic diagram of the implementation process of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the present invention.
Detailed description of the embodiments
The method of the present invention is further described below with reference to the drawing and the embodiments.
Embodiment 1
Fig. 1 is the flow chart of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the present invention and of this embodiment.
As Fig. 1 shows, the present invention comprises the following steps:
Step A: compute the pronunciation correlations;
Specifically, compute the initial correlation, the final correlation and the tone correlation;
In this embodiment, using the acquired bilingual corpus, the initial correlation between each Vietnamese vowel and the pinyin initials is computed as in Step 1; the final correlation between each Vietnamese consonant and the pinyin finals is computed as in Step 2; the tone correlation between each Vietnamese tone and the pinyin tones is computed as in Step 3;
Step B: separate the pronunciation features;
In this embodiment, this is identical to Steps 4.1, 4.2 and 4.3;
Step C: extract the CF factors and the VF factors;
In this embodiment, factors are extracted from the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step B, to serve as the units over which the FTM computes translation probabilities in the translation process; this is identical to Step 5;
Step D: input the Chinese-Vietnamese bilingual corpus into the FTM;
In this embodiment, this is identical to Step 6.2;
Step E: generate the language model;
In this embodiment, this is identical to Step 6.3;
Step F: generate the translation;
In this embodiment, this is identical to Steps 7 and 8;
At this point, Steps A to F complete the Chinese-Vietnamese statistical machine translation method fusing pronunciation features.
Embodiment 2
This embodiment uses the Vietnamese vowel b, the Chinese sentence "学习使人进步" ("learning makes one progress") and the Chinese sentence "充满希望的跋涉比到达目的地更能给人乐趣" ("a journey full of hope gives more pleasure than arriving at the destination") as examples to describe in detail the concrete operating steps of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the present invention.
The processing flow of the method is shown in Fig. 1. As Fig. 1 shows, the method comprises the following steps:
Step A1: compute the pronunciation correlations
Extract from the Chinese-Vietnamese bilingual corpus 100 Chinese words that are similar to the Vietnamese vowel b in pronunciation and in meaning;
Chinese-Vietnamese is a low-resource language pair, and the bilingual corpus is as defined in Definition 3. In practice, the corpus was collected from public online sources and covers multiple domains such as news, subtitles and everyday expressions; the Chinese-Vietnamese material downloaded from the various websites was compiled into a large-scale Chinese-Vietnamese bilingual corpus of 550,000 entries, which is used to compute the pronunciation correlations.
The calculation of the initial correlation between the Vietnamese vowel b and the Chinese pinyin initials is used as an example to describe the calculation method in detail;
For this embodiment, the correlation results between the Vietnamese vowel b and the Chinese pinyin initials are shown in Table 1:
Table 1: Initial correlation results of Step 1
From the 100 Chinese words similar to the Vietnamese vowel b in pronunciation and meaning, four Chinese pinyin initials were extracted: b, f, m and p; here 0 denotes Chinese words without an initial. 67 words have the initial b, the initials f and m have 1 word each, and 30 words have the initial p. Let the initial correlations of the pinyin initials b, f, m, p with the Vietnamese vowel b be p1, p2, p3, p4. By formula (1), p1 = 67/100 = 67%, p2 = p3 = 1/100 = 1%, and p4 = 30/100 = 30%; a larger probability value indicates a stronger association between the pinyin initial and the Vietnamese vowel. From the sizes of the values p_i it can be concluded that, in this example, the Chinese pinyin initials related to the Vietnamese initial b are b and p. The other Chinese-Vietnamese initial correlations are computed with the same method used for the Vietnamese initial b. Because the amount of computation is very large, the calculation of the initial correlation between every Vietnamese vowel and the pinyin initials is not enumerated here;
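The conclusion that b and p are the related initials can be read off the probabilities by keeping only the values that stand out. The text does not state a cut-off, so the Python sketch below assumes a hypothetical threshold of 0.05; the function name is likewise illustrative.

```python
def related_units(correlation, threshold=0.05):
    """Return units whose correlation is at least `threshold`, strongest first.

    The threshold is an assumed value; the source only says the conclusion
    is drawn from the size of the probabilities.
    """
    ranked = sorted(correlation.items(), key=lambda item: item[1], reverse=True)
    return [unit for unit, p in ranked if p >= threshold]


# Probabilities p1..p4 from the worked example for the Vietnamese unit "b".
p = {"b": 0.67, "f": 0.01, "m": 0.01, "p": 0.30}
related = related_units(p)
```

With the example values, any threshold between 0.01 and 0.30 selects the same two initials.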
The Chinese pinyin finals related to each Vietnamese consonant are counted with the same method used in Step A for the initial correlation between the Vietnamese initial b and the pinyin initials; no further example is given here;
The correspondence between tones is shown in Table 2.
Table 2: Tone correlation results of Step 3
Pinyin has four tones, '-', 'ˊ', 'ˇ', 'ˋ'; tones one to four correspond to the Vietnamese huyền, sắc, hỏi and nặng tones, and the Vietnamese ngã tone corresponds to the pinyin neutral tone;
Step B1: separate the pronunciation features. For the example "学习使人进步" ("learning makes one progress"), the separation process is as follows:
According to the tone correlation listed in Table 2, tones one to four of Chinese pinyin are replaced with the consecutive digits 1, 2, 3, 4, and the Vietnamese huyền, sắc, hỏi and nặng tones are replaced with 1, 2, 3, 4. The neutral tone and the ngã tone are replaced with 0.
The Chinese sentence "学习使人进步" is first converted into the pinyin form "xué xí shǐ rén jìn bù", and then into the converted text "x|ue|2 x|i|2 sh|i|3 r|en|2 j|in|4 b|u|4".
The Vietnamese sentence corresponding to the Chinese sentence "学习使人进步" is [Vietnamese text not reproduced], which is converted into the converted text [Vietnamese text not reproduced].
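The pinyin half of Step B1 can be sketched in Python: strip the tone mark from each syllable, turn it into a digit, and split the remaining letters into initial and final. This is a minimal sketch under stated assumptions: the function names and the list of 23 initials are illustrative, syllables are assumed to be separated by spaces, and a syllable without an initial is marked with 0, following the convention mentioned for Table 1.

```python
import unicodedata

# Combining tone marks of pinyin mapped to the digits of Step 4.1
# (no mark means neutral tone, digit 0).
PINYIN_TONE_DIGITS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

# Assumed list of 23 pinyin initials; two-letter initials come first so that
# "sh" is matched before "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def separate_syllable(syllable):
    """Convert one tone-marked pinyin syllable to 'initial|final|tone'."""
    tone = 0
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in PINYIN_TONE_DIGITS:
            tone = PINYIN_TONE_DIGITS[ch]
        else:
            letters.append(ch)
    # Recompose so that letters such as "ü" stay a single character.
    plain = unicodedata.normalize("NFC", "".join(letters))
    initial = next((i for i in INITIALS if plain.startswith(i)), "0")
    final = plain if initial == "0" else plain[len(initial):]
    return f"{initial}|{final}|{tone}"


def separate_sentence(pinyin_sentence):
    """Apply separate_syllable to every space-separated syllable."""
    return " ".join(separate_syllable(s) for s in pinyin_sentence.split())


converted = separate_sentence("xué xí shǐ rén jìn bù")
```

Applied to the example sentence, the sketch yields the same converted text as the one given in Step B1.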
Step C1: extract the CF factors and the VF factors;
From the Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4, the pronunciation features are extracted as factors. For the example "充满希望的跋涉比到达目的地更能给人乐趣" ("a journey full of hope gives more pleasure than arriving at the destination"), the factor extraction result is as follows:
First, for the Chinese sentence "充满希望的跋涉比到达目的地更能给人乐趣" in the Chinese-Vietnamese parallel corpus, the Chinese sentence after Step 4 has separated the pronunciation features is "充|chong1|ch|ong 满|man3|m|an 希|xi1|x|i 望|wang4|w|ang 的|de5|d|e 跋|ba2|b|a 涉|she4|sh|e 比|bi3|b|i 到|dao4|d|ao 达|da2|d|a 目|mu4|m|u 的|de5|d|e 地|di4|d|i 更|geng4|g|eng 能|neng2|n|eng 给|gei3|g|ei 人|ren2|r|en 乐|le4|l|e 趣|qu4|q|u";
The pinyin pronunciation (PRc), initial (IN), final (FI) and tone (TOc) are extracted as the CF factors;
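Reading the CF factors out of such a factored sentence is a matter of splitting each token on "|". The Python sketch below assumes the four-field token layout character|pinyin-with-tone-digit|initial|final seen in the example and takes the tone TOc from the trailing digit of the pinyin field; the class and function names are hypothetical.

```python
from typing import List, NamedTuple


class ChineseFactors(NamedTuple):
    surface: str  # Chinese character
    prc: str      # pronunciation PRc, pinyin with a trailing tone digit
    initial: str  # pinyin initial IN
    final: str    # pinyin final FI
    toc: int      # pinyin tone TOc, read from the trailing digit of prc


def extract_cf(factored_sentence: str) -> List[ChineseFactors]:
    """Parse 'char|pinyinN|initial|final' tokens into factor records."""
    rows = []
    for token in factored_sentence.split():
        surface, prc, initial, final = token.split("|")
        rows.append(ChineseFactors(surface, prc, initial, final, int(prc[-1])))
    return rows


rows = extract_cf("充|chong1|ch|ong 满|man3|m|an")
```

Each record then exposes one column per factor, for example rows[0].initial for IN.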
The Vietnamese sentence corresponding to the Chinese sentence "充满希望的跋涉比到达目的地更能给人乐趣" is [Vietnamese text not reproduced]. The Vietnamese sentence after Step 4 has separated the pronunciation features is [Vietnamese text not reproduced].
The Vietnamese pronunciation (PRv), initial (CO), final (VO) and tone (TOv) are extracted as the VF factors;
Step D1: input the Chinese-Vietnamese bilingual corpus into the FTM. For the example "广告" ("advertisement"), the process is as follows:
Each factor in the Chinese pinyin corresponds to a factor in Vietnamese. Specifically, "广告" in Chinese corresponds to the Vietnamese [Vietnamese text not reproduced].
The character "广", after pronunciation feature separation by Steps 4.1 and 4.2, has the format "广|guang3|g|uang"; the Vietnamese [Vietnamese text not reproduced], after pronunciation feature separation by Steps 4.1 and 4.3, has the format [Vietnamese text not reproduced].
广 corresponds to [Vietnamese text not reproduced], guang3 corresponds to [Vietnamese text not reproduced], g corresponds to Q, and uang corresponds to [Vietnamese text not reproduced].
The specific correspondence rules are determined by the pronunciation correlations computed in Steps 1 and 2.
The Chinese-Vietnamese bilingual corpus with separated pronunciation features obtained in Step 4 is input into the FTM, which computes the translation probabilities based on the factor correspondences of Step 6.1;
Step E1: generate the language model;
In this embodiment, the FTM generates the Chinese-Vietnamese language model by training on the bilingual corpus with separated pronunciation features;
Step F1: generate the translation;
Using the language model obtained in Step E1, the translation for the example "广告" ("advertisement") is generated as follows: when translating Chinese into Vietnamese, Vietnamese with separated pronunciation features is generated, of the form [Vietnamese text not reproduced]; when translating Vietnamese into Chinese, Chinese with separated pronunciation features is generated, of the form "广|guang3|g|uang 告|gao4|g|ao";
The generated Vietnamese of the form [Vietnamese text not reproduced] is converted into the pure-syllable Vietnamese [Vietnamese text not reproduced]; the Chinese of the form "广|guang3|g|uang 告|gao4|g|ao" is converted into the pinyin-free Chinese "广告";
Embodiment 3
To further verify the effectiveness of the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the present invention, this embodiment uses the 550,000-entry Chinese-Vietnamese bilingual corpus from Embodiment 2 to test statistical machine translation models with and without fused pronunciation features. In addition, to verify the effectiveness of the factor correspondence set in Step 6.1, other factor correspondences were configured on the same 550,000-entry corpus, and the results of these experiments are compared and analyzed.
The comparison is shown in Table 3.
Table 3: Comparative experimental results
The BLEU value in Table 3 is as defined in Definition 6. Experiment 1 fuses no factors: the pure-character Chinese corpus and the pure-syllable Vietnamese corpus are used to train the translation model. Experiments 2, 3 and 4 fuse factors. Experiment 2 sets the factor correspondence of the present invention, i.e. initial-vowel, final-consonant and tone-tone. Experiment 3 sets the initial-vowel correspondence only, without the final-consonant and tone-tone correspondences. Experiment 4 sets the final-consonant correspondence only, without the initial-vowel and tone-tone correspondences. As Table 3 shows, Experiment 2, which fuses pronunciation features and sets the initial-vowel, final-consonant and tone-tone factor correspondence proposed by the present invention, has the highest BLEU value; Experiment 1, which does not fuse pronunciation features, has the lowest BLEU value; Experiments 3 and 4, whose factor correspondences differ from that of Experiment 2, have BLEU values lower than that of Experiment 2.
As Table 3 shows, the Chinese-Vietnamese statistical machine translation method fusing pronunciation features of the present invention improves translation quality compared with the traditional method that does not fuse pronunciation features, and the factor correspondence proposed by the present invention further improves Chinese-Vietnamese translation quality compared with the other factor correspondences.
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention, which is defined by the appended claims and their equivalents.
Claims (5)
1. A Chinese-Vietnamese statistical machine translation method fusing pronunciation features, characterized in that the Chinese-Vietnamese pronunciation correlations and concepts involved are as follows:
1) Like Chinese, Vietnamese is a tonal language without tense or conjugation; its syllables are composed similarly to Chinese pinyin and consist of vowels, consonants, and tones;
2) Vietnamese and Chinese are both isolating languages, with no delimiters between words;
3) Chinese pinyin includes 23 initials, 36 finals, and four tones; Vietnamese includes 23 vowels, 16 consonants, and five tones;
4) A Vietnamese pronunciation corresponds to a unique word, whereas the corresponding Chinese pinyin pronunciation corresponds to multiple Chinese characters;
The definitions used in the present invention are as follows:
Definition 1: pronunciation correlation, which includes initial correlation, final correlation, and tone correlation;
where initial correlation is the degree of association between Chinese pinyin initials and Vietnamese vowels; final correlation is the degree of association between Chinese pinyin finals and Vietnamese consonants; and tone correlation is the degree of association between Chinese pinyin tones and Vietnamese tones;
Definition 2: factor, which is the unit over which the source-to-target translation probability is calculated when a factor-based statistical machine translation model generates the language model;
In phrase-based statistical machine translation, the complete sentences of the source and target languages are first split into phrases, and the source-to-target translation probability is then calculated over these phrases;
In factor-based statistical machine translation, the translation process is no longer based on phrases but on factors;
The statistical machine translation model here is the Factored Translation Model, abbreviated FTM;
Definition 3: Chinese-Vietnamese bilingual corpus, which is a set of Chinese-Vietnamese parallel documents; for each Chinese sentence in the Chinese corpus there is a semantically identical Vietnamese sentence in the Vietnamese corpus;
Definition 4: translation process, which is the process of generating the Chinese-Vietnamese language model;
Definition 5: generation process, which uses the language model produced by the translation process to translate the source language into the target language, i.e. to generate the target language;
Definition 6: BLEU value, which is the general translation quality metric used in the machine translation field;
The translation process and the generation process are the two processes that make up statistical machine translation;
The Chinese-Vietnamese statistical machine translation method comprises the following steps:
Step 1: Using the Chinese-Vietnamese bilingual corpus, calculate the Chinese-Vietnamese initial correlation;
The initial correlation between Chinese pinyin initials and Vietnamese vowels is calculated by formula (1);
where n is the number of distinct Chinese pinyin initials related to a given Vietnamese vowel that are extracted from the Chinese corpus, i is the index of these pinyin initials, j is the index of the different Chinese words sharing the same pinyin initial, and m_i is the number of Chinese words represented by the i-th pinyin initial; the remaining terms of the formula denote the number of occurrences of the i-th pinyin initial related to the Vietnamese vowel, the number of Chinese words related to the Vietnamese vowel, and the j-th Chinese word of the i-th pinyin initial;
Step 2: Using the Chinese-Vietnamese bilingual corpus, obtain the Chinese-Vietnamese final correlation;
The final correlation between Vietnamese consonants and Chinese pinyin finals is calculated by formula (2);
where n is the number of Chinese pinyin finals related to a given Vietnamese consonant that are extracted from the Chinese corpus, t is the index of these pinyin finals, k is the index of the different Chinese words sharing the same pinyin final, and m_t is the number of Chinese words with the t-th pinyin final; the remaining terms of the formula denote the number of occurrences of the t-th pinyin final related to the Vietnamese consonant, the number of Chinese words related to the Vietnamese consonant, and the k-th Chinese word of the t-th pinyin final;
Step 3: Using the Chinese-Vietnamese bilingual corpus, obtain the Chinese-Vietnamese tone correlation directly;
Step 4: Replace the tones of the Chinese corpus and the Vietnamese corpus with digits and separate the pronunciation features, including the following sub-steps:
Step 4.1: According to the tone correlation obtained in Step 3, replace the tones in the Chinese corpus and the Vietnamese corpus with consecutive digits;
Step 4.2: Separate the pronunciation features of the Chinese corpus: convert each Chinese sentence in pure Chinese-character form into text made up of initials, finals, and tones, i.e. the converted text; then, for each token of the converted text, if the token is a number it is converted into the form word|word|word, and if the token is pinyin it is converted into the form consonant|vowel|tone;
Step 4.3: Separate the pronunciation features of the Vietnamese corpus: convert the Vietnamese corpus of pure syllables into text made up of vowels, consonants, and tones, i.e. the converted text; then, for each token of the converted text, if the token is a number it is converted into the form word|word|word, and if the token is a syllable it is converted into the form consonant|vowel|tone;
At this point, Steps 4.1, 4.2, and 4.3 have produced the pronunciation-separated Chinese-Vietnamese bilingual corpus;
Step 5: Extract the factors of the pronunciation-separated Chinese-Vietnamese bilingual corpus obtained in Step 4, specifically:
In the Chinese corpus, extract the Chinese characters, the pronunciation PRc, the pinyin initial IN, the pinyin final FI, and the pinyin tone TOc as the CF factors;
In the Vietnamese corpus, extract the Vietnamese words, the pronunciation PRv, the Vietnamese vowel CO, the Vietnamese consonant VO, and the Vietnamese tone TOv as the VF factors;
Step 6: Set the correspondence between the CF factors and the VF factors and generate the Chinese-Vietnamese language model using FTM, with the following specific steps:
Step 6.1: Set the correspondence between the CF factors and the VF factors, specifically:
Chinese characters in the Chinese corpus correspond to syllables in the Vietnamese corpus, the pinyin initial IN corresponds to the Vietnamese vowel CO, the pinyin final FI corresponds to the Vietnamese consonant VO, and the pinyin tone TOc corresponds to the Vietnamese tone TOv; the correspondence between an individual pinyin initial IN and an individual Vietnamese vowel CO, between an individual pinyin final FI and an individual Vietnamese consonant VO, and between an individual pinyin tone TOc and an individual Vietnamese tone TOv is set according to the Chinese-Vietnamese initial correlation, final correlation, and tone correlation obtained in Steps 1, 2, and 3;
Step 6.2: Feed the pronunciation-separated Chinese-Vietnamese bilingual corpus obtained in Step 4 into FTM; FTM calculates translation probabilities based on the CF factors and VF factors extracted in Step 5;
Step 6.3: With Chinese as the source language and Vietnamese as the target language, FTM generates a Chinese-Vietnamese language model; with Vietnamese as the source language and Chinese as the target language, FTM generates a Vietnamese-Chinese language model;
At this point, Steps 6.1, 6.2, and 6.3 constitute the translation process;
Step 7: Complete the translation using the language models obtained in Step 6.3; when translating Chinese into Vietnamese, the language model generates Vietnamese in syllable-vowel-consonant-tone form, and when translating Vietnamese into Chinese, the language model generates Chinese in character-initial-final-tone form;
Step 7 is the generation process;
Step 8: Convert the Vietnamese in syllable-vowel-consonant-tone form generated in Step 7 into Vietnamese of pure syllables, and convert the generated Chinese in character-initial-final-tone form into Chinese of pure Chinese characters.
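The token formatting of Step 4.2 and the clean-up of Step 8 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the initial list (which counts y and w so that it has 23 entries), the `_` placeholder for a syllable with no initial, the field order, and the function names.

```python
# Pinyin initials, two-letter ones first so that "zh" is matched before "z".
# Assumption of this sketch: y and w are counted as initials (23 entries).
PINYIN_INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                   "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
EMPTY = "_"  # hypothetical placeholder for a syllable that has no initial


def split_pinyin(syllable: str) -> tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in PINYIN_INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    return EMPTY, syllable


def pronunciation_fields(token: str, tone: str) -> str:
    """Three pipe-separated fields: numbers repeat the token, pinyin is split."""
    if token.isdigit():
        return "|".join([token] * 3)
    initial, final = split_pinyin(token)
    return f"{initial}|{final}|{tone}"


def factored_token(surface: str, pinyin: str, tone: str) -> str:
    """Prefix the surface form (e.g. the Chinese character) to the three fields."""
    return f"{surface}|{pronunciation_fields(pinyin, tone)}"


def surface_only(sentence: str) -> str:
    """Step 8 direction: keep only the first field of every factored token."""
    return " ".join(token.split("|")[0] for token in sentence.split())
```

For example, `factored_token("中", "zhong", "1")` builds `中|zh|ong|1`, and `surface_only` strips the pronunciation fields again. The sketch keeps tokens space-separated after stripping; joining them back into unsegmented text would be a separate step.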
2. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that the initial correlation in Step 1 is calculated as follows: select the Chinese words whose pronunciation and meaning are both similar to the Vietnamese, extract the pinyin initials of these words, and calculate the proportion of each extracted pinyin initial among all extracted pinyin initials; this proportion is taken as the initial correlation between each pinyin initial and the Vietnamese vowel.
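The ratio described in this claim can be sketched in a few lines of Python. The input format (a list of pairs, each holding a Chinese pinyin unit and the Vietnamese unit it was matched with) and the function name `correlation_table` are assumptions made for illustration; how the similar word pairs are selected from the corpus is not shown.

```python
from collections import Counter, defaultdict


def correlation_table(pairs):
    """For each Vietnamese unit, return the share of each Chinese unit paired with it.

    `pairs` is an iterable of (chinese_unit, vietnamese_unit) tuples, one per
    matched word pair (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for chinese_unit, vietnamese_unit in pairs:
        counts[vietnamese_unit][chinese_unit] += 1

    table = {}
    for vietnamese_unit, counter in counts.items():
        total = sum(counter.values())
        # Ratio of each extracted Chinese unit among all units extracted
        # for this Vietnamese unit; the ratios sum to 1.
        table[vietnamese_unit] = {c: n / total for c, n in counter.items()}
    return table
```

The same function would serve for finals by passing (final, Vietnamese unit) pairs instead of (initial, Vietnamese unit) pairs.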
3. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that Step 2 is specifically: select the Chinese words whose pronunciation and meaning are both similar to the Vietnamese, extract the pinyin finals from these words, and calculate the proportion of each pinyin final among all extracted pinyin finals; this proportion is taken as the final correlation between the Vietnamese consonant and the pinyin final.
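Formula (2) itself is not reproduced in this text. One reading that is consistent with the index descriptions in Step 2 (n, t, k, m_t) and with the ratio in this claim is sketched below. The symbols R, f_t, v, h_{t,k} and the count function c are hypothetical names introduced here, not the patent's notation.

```latex
% v        : a given Vietnamese consonant
% f_t      : the t-th pinyin final related to v  (t = 1, ..., n)
% h_{t,k}  : the k-th Chinese word with final f_t (k = 1, ..., m_t)
% c(h, v)  : number of times word h is matched with v in the corpus
R(f_t, v) \;=\;
  \frac{\sum_{k=1}^{m_t} c\!\left(h_{t,k},\, v\right)}
       {\sum_{t'=1}^{n} \sum_{k=1}^{m_{t'}} c\!\left(h_{t',k},\, v\right)}
```

Under this reading the values R(f_1, v), ..., R(f_n, v) sum to 1 for each Vietnamese consonant v, and the initial correlation of formula (1) would have the same shape with indices i, j and m_i.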
4. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that Step 4.1 is specifically:
1) the Chinese tone 'ˉ' and the Vietnamese huyền tone are replaced with the digit 1;
2) the Chinese tone 'ˊ' and the Vietnamese sắc tone are replaced with the digit 2;
3) the Chinese tone 'ˇ' and the Vietnamese hỏi tone are replaced with the digit 3;
4) the Chinese tone 'ˋ' and the Vietnamese nặng tone are replaced with the digit 4;
5) the Chinese neutral tone and the Vietnamese ngã tone are replaced with the digit 0.
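The digit replacement listed in this claim can be sketched with Python's standard `unicodedata` module by decomposing each syllable and looking up its combining tone mark. The table and function names are assumptions made for illustration. The claim does not assign a digit to an unmarked Vietnamese syllable, so the sketch simply returns whatever default the caller passes.

```python
import unicodedata
from typing import Optional

# Combining marks -> digits, following the pairing in the claim.
PINYIN_TONE_DIGIT = {"\u0304": "1",  # macron
                     "\u0301": "2",  # acute
                     "\u030c": "3",  # caron
                     "\u0300": "4"}  # grave
VIET_TONE_DIGIT = {"\u0300": "1",    # huyền (grave)
                   "\u0301": "2",    # sắc (acute)
                   "\u0309": "3",    # hỏi (hook above)
                   "\u0323": "4",    # nặng (dot below)
                   "\u0303": "0"}    # ngã (tilde)


def split_tone(syllable: str, tone_table: dict,
               default: Optional[str]) -> tuple[str, Optional[str]]:
    """Return (syllable without its tone mark, tone digit)."""
    digit = default
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in tone_table:
            digit = tone_table[ch]   # tone mark found: record its digit, drop it
        else:
            kept.append(ch)          # base letters and non-tone marks (ê, ơ, ü)
    return unicodedata.normalize("NFC", "".join(kept)), digit
```

For pinyin the default would be "0" (neutral tone), for example `split_tone("hǎo", PINYIN_TONE_DIGIT, "0")`. Note that the grave accent maps to 4 in the pinyin table but to 1 in the Vietnamese table, which is why the two languages need separate tables.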
5. The Chinese-Vietnamese statistical machine translation method fusing pronunciation features according to claim 1, characterized in that Step 3 is specifically: the four Chinese tones 'ˉ', 'ˊ', 'ˇ', and 'ˋ' correspond respectively to the Vietnamese huyền, sắc, hỏi, and nặng tones, and the Vietnamese ngã tone corresponds to the pinyin neutral tone. The reason for Step 3 is that the number of tones is small: the Chinese pinyin tones together with the neutral tone number 5, and Vietnamese has 5 tones; tones have fewer categories than initials and finals, and the association between tones is simpler and clearer than the association between initials and finals.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910382004.3A CN110096715A (en) | 2019-05-06 | 2019-05-06 | A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110096715A true CN110096715A (en) | 2019-08-06 |
Family
ID=67447432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910382004.3A Pending CN110096715A (en) | 2019-05-06 | 2019-05-06 | A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110096715A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080195372A1 (en) * | 2007-02-14 | 2008-08-14 | Jeffrey Chin | Machine Translation Feedback |
CN104978311A (en) * | 2015-07-15 | 2015-10-14 | 昆明理工大学 | Vietnamese word segmentation method based on conditional random fields |
CN105740235A (en) * | 2016-01-29 | 2016-07-06 | 昆明理工大学 | Phrase tree to dependency tree transformation method capable of combining Vietnamese grammatical features |
CN106202037A (en) * | 2016-06-30 | 2016-12-07 | 昆明理工大学 | Vietnamese tree of phrases construction method based on chunk |
CN106372241A (en) * | 2016-09-18 | 2017-02-01 | 广西财经学院 | Inter-word weighting associating mode-based Vietnamese-to-English cross-language text retrieval method and system |
Non-Patent Citations (2)
Title |
---|
HUU ANH TRAN 等: "Integrating pronunciation into Chinese-Vietnamese statistical machine translation", 《TSINGHUA SCIENCE AND TECHNOLOGY》 * |
TRAN HUU-ANH 等: "Preordering for Chinese-Vietnamese Statistical Machine Translation", 《IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506559A (en) * | 2021-07-21 | 2021-10-15 | 成都启英泰伦科技有限公司 | Method for generating pronunciation dictionary according to Vietnamese written text |
CN113506559B (en) * | 2021-07-21 | 2023-06-09 | 成都启英泰伦科技有限公司 | Method for generating pronunciation dictionary according to Vietnam written text |
CN113743053A (en) * | 2021-08-17 | 2021-12-03 | 上海明略人工智能(集团)有限公司 | Alphabet vector calculation method, system, storage medium and electronic device |
CN113743053B (en) * | 2021-08-17 | 2024-03-12 | 上海明略人工智能(集团)有限公司 | Letter vector calculation method, system, storage medium and electronic equipment |
CN113688283A (en) * | 2021-08-27 | 2021-11-23 | 北京奇艺世纪科技有限公司 | Method and device for determining matching degree of video subtitles and electronic equipment |
CN113688283B (en) * | 2021-08-27 | 2023-09-05 | 北京奇艺世纪科技有限公司 | Method and device for determining video subtitle matching degree and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190806 |