CN102750267A

CN102750267A - Chinese Pinyin and character conversion method and system as well as distinguishing dictionary building method

Info

Publication number: CN102750267A
Application number: CN2012102024711A
Authority: CN
Inventors: 张劲松; 李伟; 解焱陆; 曹文
Original assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Current assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date: 2012-06-15
Filing date: 2012-06-15
Publication date: 2012-10-24
Anticipated expiration: 2032-06-15
Also published as: CN102750267B

Abstract

The embodiment of the invention provides a Chinese Pinyin and character conversion method and system as well as a distinguishing dictionary building method. The Chinese Pinyin and character conversion method comprises the following steps: generating, according to an input Pinyin string and a prebuilt distinguishing dictionary, a word grid corresponding to the Pinyin string, wherein the distinguishing dictionary is built based on the mutual information of the text and the Pinyin; and decoding the word grid according to a statistical language model to obtain the conversion path with the highest probability, so as to realize the conversion of Chinese Pinyin and character. By virtue of the embodiment of the invention, the accuracy of the Pinyin and character conversion can be further improved.

Description

Chinese Pinyin-to-character conversion method and system, and method for building a distinguishing dictionary

Technical field

The present invention relates to the technical field of Pinyin-to-character conversion, and in particular to a Chinese Pinyin-to-character conversion method and system based on a distinguishing dictionary, and to a method for building the distinguishing dictionary.

Background technology

Pinyin is the phonetic transcription of Chinese characters. Conversion from Pinyin to Chinese characters is a key component of many systems, such as Chinese keyboard input and Chinese speech recognition. Chinese has only about 410 toneless Pinyin syllables, whereas about 6,700 Chinese characters correspond to them, so how to select the correct character among those sharing the same Pinyin has become a current research topic.

At present, one of best bet that addresses this problem is to utilize statistical language model to eliminate the ambiguity that the unisonance words is brought.The structure of statistical language model needs to solve two important problem: 1. the selection of dictionary; 2. the optimization of model parameter.With the most frequently used ternary statistical model is example, and can be divided into the selection of dictionary has supervision and two types of non-supervision.The structure that dictionary in the method for supervision is arranged is mainly through manual establishment.Yet Chinese is the dictionary of unified standard not, and perhaps the linguist can reach an agreement to up to ten thousand entries, but remaining words can cause very big dispute.For this reason, a large amount of non-supervision dictionary construction methods are suggested, comprising the structure of maximum likelihood method dictionary, based on structure of mutual information dictionary etc.Compare with manual dictionary, these method proofs make up method in concrete the application based on the dictionary of data-driven, have same feasibility, and more practice thrift cost.

For the parameter optimization of language models, researchers over the past decades have mainly relied on the maximum likelihood or minimum perplexity criterion. In recent years, to improve the accuracy of Chinese speech recognition, some researchers have proposed optimizing the language model by discriminative training. The core idea is that, in Pinyin-to-character conversion, the relative probabilities of candidate words contribute more to disambiguating homophones than their absolute probability scores. During language model training, the model parameters are adjusted continually according to the results of Pinyin-to-character conversion.

However, in the course of realizing the present invention, the inventors found the following defect in the prior art: in the traditional methods above, the dictionary is mainly compiled manually or obtained directly from text. Its construction does not take the information of the Pinyin string into account, so the accuracy of Pinyin-to-character conversion cannot be improved further.

Documents and conventional techniques useful for understanding the present invention are listed below; they are incorporated herein by reference as if fully set forth herein.

[Reference 1] Jianfeng Gao, Hai-Feng Wang, Mingjing Li, and Kai-Fu Lee, "A Unified Approach to Statistical Language Modeling for Chinese", IEEE ICASSP 2000, Istanbul, Turkey, June 5-9, 2000.

[Reference 2] Lingyun Pan and Changsheng Yang, "An Auto-system For Converting HANYUPINYIN to Chinese Characters", Journal of Computer, 13(4): 271-275.

[Reference 3] Ruiqiang Zhang, Zuoying Wang, and Jianping Zhang, "Chinese Pinyin-to-Text Translation Technique with Error Correction Used for Continuous Speech Recognition", Journal of Tsinghua University (Sci & Tech), 37(10): 9-11, 1997.

[Reference 4] Ando, R. and Lee, "Mostly-unsupervised Statistical Segmentation of Japanese: Application to Kanji", ANLP-NAACL, 2000.

[Reference 5] Fuchun Peng and Dale Schuurmans, "Self-Supervised Chinese Word Segmentation", Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, pp. 238-247, September 13-15, 2001.

[Reference 6] Zheng Chen, Kai-Fu Lee, and Ming-jing Li, "Discriminative training on language model", in Proc. ISCSLP 2000, Beijing, China, Oct. 2000.

[Reference 7] Hong-Kwang Jeff Kuo, et al., "Discriminative Training of Language Models for Speech Recognition", IEEE ICASSP 2002, Orlando, Florida.

[Reference 8] Jinsong Zhang, Wei Li, Yuxia Hou, Wen Cao, and Ziyu Xiong, "A Study On Functional Loads of Phonetic Contrasts Under Context Based On Mutual Information of Chinese Text And Phonemes", The 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tainan, Nov. 2010.

[Reference 9] http://www.speech.sri.com/projects/srilm/

Summary of the invention

The embodiments of the invention provide a Chinese Pinyin-to-character conversion method and system and a method for building a distinguishing dictionary, with the aim of further improving the accuracy of Pinyin-to-character conversion.

According to one aspect of the embodiments of the invention, a Chinese Pinyin-to-character conversion method based on a distinguishing dictionary is provided, the method comprising:

generating, according to an input Pinyin string and a pre-built distinguishing dictionary, a word grid corresponding to the Pinyin string, wherein the distinguishing dictionary is built based on the mutual information between text and Pinyin;

decoding the word grid according to a statistical language model to obtain the conversion path with the highest probability, thereby realizing Chinese Pinyin-to-character conversion.

According to another aspect of the embodiments of the invention, a method for building a distinguishing dictionary is provided, the method comprising:

building a word grid according to a training Pinyin string and an initial dictionary, and decoding the word grid with a statistical language model to obtain different Pinyin conversion modes;

determining, from the different Pinyin conversion modes, the Pinyin conversion mode with the maximum mutual information;

segmenting the text corresponding to the training Pinyin string according to the Pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

According to another aspect of the embodiments of the invention, a Chinese Pinyin-to-character conversion system based on a distinguishing dictionary is provided, the system comprising:

a first generation unit, which generates, according to an input Pinyin string and a pre-built distinguishing dictionary, a word grid corresponding to the Pinyin string, wherein the distinguishing dictionary is built based on the mutual information between text and Pinyin;

a path obtaining unit, which decodes the word grid according to a statistical language model to obtain the conversion path with the highest probability, thereby realizing Chinese Pinyin-to-character conversion.

The beneficial effect of the embodiments of the invention is that performing Chinese Pinyin-to-character conversion with a distinguishing dictionary built from the mutual information between text and Pinyin can further improve the accuracy of the conversion.

Description of drawings

The accompanying drawings described herein are provided for further understanding of the present invention and constitute a part of this application; they do not limit the present invention. In the drawings:

Fig. 1 is a flowchart of the conversion method of an embodiment of the invention;

Fig. 2 is a schematic diagram of the text-Pinyin-text mode of an embodiment of the invention;

Fig. 3 is an example of the decoding process of an embodiment of the invention;

Fig. 4 is a schematic flowchart of building the distinguishing dictionary in an embodiment of the invention;

Fig. 5 is another schematic flowchart of building the distinguishing dictionary in an embodiment of the invention;

Fig. 6 is a schematic diagram of experimental results of an embodiment of the invention;

Fig. 7 is another schematic diagram of experimental results of an embodiment of the invention;

Fig. 8 is a schematic diagram of the composition of the Chinese Pinyin-to-character conversion system of an embodiment of the invention;

Fig. 9 is a schematic diagram of the composition of the dictionary building unit of an embodiment of the invention.

Embodiment

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings. The illustrative embodiments and their descriptions are used to explain the present invention and are not intended to limit it.

Embodiment 1

An embodiment of the invention provides a Chinese Pinyin-to-character conversion method based on a distinguishing dictionary. Fig. 1 is a flowchart of the conversion method. As shown in Fig. 1, the method comprises:

Step 101: generating, according to an input Pinyin string and a pre-built distinguishing dictionary, a word grid corresponding to the Pinyin string, wherein the distinguishing dictionary is built based on the mutual information between text and Pinyin;

Step 102: decoding the word grid according to a statistical language model to obtain the conversion path with the highest probability, thereby realizing Chinese Pinyin-to-character conversion.

In this embodiment, a dictionary with stronger distinguishing power can be obtained automatically by continually maximizing the mutual information between the Pinyin string and its corresponding text. During Pinyin-to-character conversion, the word grid corresponding to the given Pinyin string is first generated from the Pinyin string and the dictionary; the grid is then dynamically decoded according to the statistical language model to obtain the conversion path with the highest probability.
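The grid-generation step can be illustrated with a minimal sketch. It assumes a toy dictionary that maps each word to a single space-separated Pinyin reading; the function name `build_word_grid`, the edge representation `(start, end, word)`, and the dictionary entries are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def build_word_grid(syllables, dictionary):
    """Return edges (start, end, word) for every dictionary entry whose
    Pinyin reading equals syllables[start:end]."""
    # Index the dictionary by Pinyin syllable tuple (assumption: one reading per word).
    by_pinyin = defaultdict(list)
    for word, pinyin in dictionary.items():
        by_pinyin[tuple(pinyin.split())].append(word)
    max_len = max((len(key) for key in by_pinyin), default=0)

    edges = []
    for start in range(len(syllables)):
        last = min(start + max_len, len(syllables))
        for end in range(start + 1, last + 1):
            for word in by_pinyin.get(tuple(syllables[start:end]), []):
                edges.append((start, end, word))
    return edges


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {
    "中": "zhong", "种": "zhong", "国": "guo", "中国": "zhong guo",
    "人": "ren", "民": "min", "人民": "ren min",
}
grid = build_word_grid("zhong guo ren min".split(), toy_dictionary)
```

Each edge covers a span of syllables, so one Pinyin string gives rise to several alternative segmentations in the same grid; which entries the dictionary contains decides which spans exist at all.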

In this embodiment, taking the information between Pinyin and text into account while building the dictionary helps improve the Pinyin-to-character conversion rate. For example, given the Pinyin string "xiang wo men zhe yang de nian qing ren" and a bigram statistical language model, a system with a traditional dictionary outputs "toward us such young people", whereas the correct result is "young people such as us". The present invention avoids this kind of error by adding the entry "as us" to the traditional dictionary. Entries are added fully automatically in a data-driven manner, and the criterion for adding an entry is the mutual information between the Pinyin and its corresponding text.

Fig. 2 is a schematic diagram of the text-Pinyin-text mode of an embodiment of the invention, which formalizes Pinyin-to-character conversion as an information decoding process. As shown in Fig. 2, W represents a language expressed in text form, and F represents the corresponding Pinyin string. The conversion from F to W describes the conversion from a Pinyin sequence to a Chinese character sequence; the high-level knowledge needed for this conversion includes the dictionary and the statistical language model.

Given a Pinyin string, decoding it with different dictionaries and statistical language models may yield different Chinese character strings $W_1, W_2, \ldots$. The optimal dictionary makes $W = W_i$. The choice of dictionary determines the optimal segmentation of the Pinyin string, which can be described by the following formula:

$$\underset{i}{\arg\max}\; I(W, F_i), \quad \text{where } i = 1, 2, \ldots \tag{1}$$

The mutual information between $W$ and $F_i$ is defined as $I(W, F_i)$:

$$I(W, F_i) = H(W) - H(W \mid F_i) \tag{2}$$

$H(W)$ is the information entropy of the text $W$, where $W$ is represented by the word sequence $\{w_1, w_2, w_3, \ldots, w_n\}$. $H(W)$ is obtained by computing the average information entropy per word:

$$H(W) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1, w_2, \cdots, w_n) \tag{3}$$

where

$$p(W) = p(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \cdots, w_{i-1}) \tag{4}$$
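As a rough illustration of formulas (3) and (4), the per-word entropy of a finite word sequence can be estimated by summing conditional log-probabilities. The sketch below makes two assumptions that the source does not state: the history in formula (4) is truncated to one preceding word (a bigram approximation), and logarithms are taken to base 2. The function name and the toy probabilities are hypothetical.

```python
import math


def per_word_entropy(words, bigram_prob):
    """Estimate H(W) as -(1/n) * log2 p(w_1..w_n), with p factored by the
    chain rule and each history truncated to the previous word."""
    log_p = 0.0
    prev = "<s>"  # sentence-start marker
    for word in words:
        log_p += math.log2(bigram_prob[(prev, word)])
        prev = word
    return -log_p / len(words)


# Hypothetical probabilities: log2 values are -1, -2, -1, summing to -4.
toy_bigrams = {("<s>", "a"): 0.5, ("a", "b"): 0.25, ("b", "c"): 0.5}
h = per_word_entropy(["a", "b", "c"], toy_bigrams)  # 4/3 bits per word
```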

The variable $I(W, F_i)$ measures the relation between the sentence $W$ and a possible segmentation mode $F_i$ of its Pinyin string. Given different Pinyin string segmentation modes $F_i$, and with the other conditions such as the dictionary and the language model held unchanged, the larger $I(W, F_i)$ is, the tighter the relation between the text $W$ and the segmentation mode $F_i$. It follows that the ambiguity of converting from $F_i$ to $W$ is smaller, which ensures the accuracy of Pinyin-to-character conversion.

Calculating and simplifying formula (2) yields the following formula:

$$I(W, F_i) = -\log \sum_{\text{all } W_j \text{ with } F_i} P(W_j) \tag{5}$$

The variable $W_j$ ranges over all candidate word strings that share the Pinyin string segmentation mode $F_i$. For further details about mutual information, refer to the references listed above.
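One possible route from formula (2) to formula (5) is sketched here. It assumes, beyond what the source states, that the quantities are evaluated pointwise for a single sentence $W$ and that $W$ determines its segmented Pinyin string $F_i$ uniquely, so that $P(W, F_i) = P(W)$:

$$
\begin{aligned}
I(W, F_i) &= H(W) - H(W \mid F_i) = -\log P(W) + \log P(W \mid F_i) \\
&= \log \frac{P(W, F_i)}{P(W)\, P(F_i)} = -\log P(F_i), \qquad P(F_i) = \sum_{\text{all } W_j \text{ with } F_i} P(W_j).
\end{aligned}
$$

Under these assumptions, a segmentation mode shared by few and improbable candidate strings has a small $P(F_i)$ and therefore a large mutual information.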

In this embodiment, the decoding process from Pinyin string to text can be described with a word grid, in which the same Pinyin string yields different segmentation modes according to the dictionary.

Fig. 3 is an example of the decoding process of an embodiment of the invention. As shown in Fig. 3, it displays part of the word grid for the Pinyin string "zhong guo ren min sheng huo". The nodes <s> and </s> represent the beginning and end of all word sequences, and the remaining nodes are all the possible candidate words corresponding to the Pinyin string.

For example, the candidate words for the syllable "zhong" include "kind", "in", "heavy", and so on. Owing to space limitations, Fig. 3 lists only part of the word grid. According to formula (5), a dynamic programming algorithm can find the path with the highest probability from the beginning of the sentence to its end. This path corresponds to the optimal segmentation mode of the Pinyin string.
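The dynamic-programming search over the word grid can be sketched as a Viterbi-style pass. This is a minimal illustration rather than the patent's implementation: it assumes a bigram language model stored as a plain dictionary, a hypothetical probability floor for unseen bigrams, and grid edges given as `(start, end, word)` tuples; all names and numbers are invented for the example.

```python
import math


def decode(n, edges, bigram_prob, floor=1e-8):
    """Return the highest-probability word path through a word grid that
    spans syllable positions 0..n."""
    by_start = {}
    for start, end, word in edges:
        by_start.setdefault(start, []).append((end, word))

    # best[(position, last_word)] = (log-probability, backpointer)
    best = {(0, "<s>"): (0.0, None)}
    for pos in range(n):
        # Edges only go forward, so every state ending at pos is final here.
        for (p, prev), (score, _) in list(best.items()):
            if p != pos:
                continue
            for end, word in by_start.get(pos, []):
                s = score + math.log(bigram_prob.get((prev, word), floor))
                if (end, word) not in best or s > best[(end, word)][0]:
                    best[(end, word)] = (s, (p, prev))

    # Close every complete path with the sentence-end marker.
    finals = [
        (score + math.log(bigram_prob.get((w, "</s>"), floor)), (p, w))
        for (p, w), (score, _) in best.items() if p == n
    ]
    if not finals:
        return []
    _, state = max(finals)
    path = []
    while state != (0, "<s>"):
        path.append(state[1])
        state = best[state][1]
    return path[::-1]


# Hypothetical grid for "zhong guo" and hypothetical bigram probabilities.
toy_edges = [(0, 1, "中"), (0, 1, "种"), (1, 2, "国"), (0, 2, "中国")]
toy_lm = {
    ("<s>", "中国"): 0.5, ("中国", "</s>"): 0.5,
    ("<s>", "中"): 0.1, ("中", "国"): 0.1, ("国", "</s>"): 0.1,
}
best_path = decode(2, toy_edges, toy_lm)
```

With these toy numbers the single-entry path scores 0.25 against 0.001 for the two-character path, so the two-syllable entry wins; the chosen path also fixes how the Pinyin string is segmented.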

How to build the distinguishing dictionary is described in detail below. The present invention can build the distinguishing dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and Pinyin.

Fig. 4 is a schematic flowchart of building the distinguishing dictionary in an embodiment of the invention. As shown in Fig. 4, building the distinguishing dictionary may comprise:

Step 401: building a word grid according to a training Pinyin string and an initial dictionary, and decoding the word grid with a statistical language model to obtain different Pinyin conversion modes;

Step 402: determining, from the different Pinyin conversion modes, the Pinyin conversion mode with the maximum mutual information;

Step 403: segmenting the text corresponding to the training Pinyin string according to the Pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

Fig. 4 illustrates one iteration of building the distinguishing dictionary; in a specific implementation, multiple iterations may be performed. By simultaneously optimizing the dictionary, the language model, and the Pinyin string segmentation mode, the mutual information between the text W and the Pinyin F is increased continually, and after several iterations the process can be stopped according to a threshold.
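Steps 402 and 403 can be sketched as follows, using formula (5) to score each segmentation mode. The sketch assumes one Chinese character per Pinyin syllable and represents a segmentation mode as a tuple of word lengths; these representations, the function names, and the candidate probabilities are hypothetical choices for illustration.

```python
import math
from collections import Counter


def mutual_information(candidate_probs):
    """Formula (5): I(W, F_i) = -log of the summed probability of all
    candidate word strings that share segmentation mode F_i."""
    return -math.log(sum(candidate_probs))


def pick_segmentation(modes):
    """modes maps a segmentation (tuple of word lengths) to the probabilities
    of the candidate strings sharing it; return the max-MI segmentation."""
    return max(modes, key=lambda seg: mutual_information(modes[seg]))


def segment_text(chars, lengths):
    """Cut the text into words of the given lengths (one character per syllable)."""
    words, pos = [], 0
    for k in lengths:
        words.append(chars[pos:pos + k])
        pos += k
    return words


# Hypothetical candidate probabilities for "zhong guo ren min".
toy_modes = {(2, 2): [0.02, 0.001], (1, 1, 2): [0.2, 0.1]}
best_mode = pick_segmentation(toy_modes)

# Counting the words of the re-segmented text yields the new dictionary.
new_dictionary = Counter(segment_text("中国人民", best_mode))
```

In practice the counts would be accumulated over the whole training corpus, and the language model would be re-estimated from the re-segmented text before the next iteration.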

Fig. 5 is another schematic flowchart of building the distinguishing dictionary in an embodiment of the invention. As shown in Fig. 5, building the distinguishing dictionary may comprise:

Step 501: fully segmenting a news corpus and training an initial language model, where the maximum word length may be 4.

Step 502: building a word grid according to a training Pinyin string and the dictionary, and decoding the word grid with a statistical language model to obtain different Pinyin conversion modes.

Step 503: determining, from the different Pinyin conversion modes, the Pinyin conversion mode with the maximum mutual information.

Step 504: segmenting the text corresponding to the training Pinyin string according to the Pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

Step 505: evaluating the mutual information between the training Pinyin string and the text; if the change in the evaluated mutual information exceeds a preset threshold, executing step 502; if it does not exceed the preset threshold, executing step 506.

In this embodiment, if the mutual information between Pinyin and text does not rise significantly, a new training Pinyin string may be selected and steps 502 to 504 executed iteratively to train the new dictionary.

Step 506: obtaining the final dictionary and language model.
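The control flow of steps 502 to 506 can be sketched as a loop that stops once the change in mutual information no longer exceeds the threshold. The callable `run_iteration`, which stands in for steps 502 to 504 plus the evaluation in step 505, and the `max_iters` safety cap are hypothetical and do not appear in the source.

```python
def build_dictionary(dictionary, run_iteration, threshold, max_iters=20):
    """Repeat steps 502-504 while the change in mutual information exceeds
    `threshold` (step 505); return the final dictionary (step 506)."""
    previous = None
    history = []
    for _ in range(max_iters):
        # run_iteration returns (new_dictionary, mutual_information).
        dictionary, mi = run_iteration(dictionary)
        history.append(mi)
        if previous is not None and abs(mi - previous) <= threshold:
            break
        previous = mi
    return dictionary, history


def make_stub(values):
    """Hypothetical stand-in: adds one entry per call and reports scripted MI values."""
    scripted = iter(values)

    def run_iteration(dictionary):
        mi = next(scripted)
        return dictionary | {f"entry{len(dictionary)}"}, mi

    return run_iteration


final_dictionary, mi_history = build_dictionary(
    set(), make_stub([1.0, 1.5, 1.8, 1.85, 1.86]), threshold=0.1
)
```

Here the loop stops at the fourth iteration, because the mutual information moves only from 1.8 to 1.85, which is within the threshold of 0.1.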

In this embodiment, a preliminary experiment was carried out on the People's Daily corpus: Pinyin strings were decoded with three different dictionaries and language models, and the effects of the three dictionary building methods on the Pinyin-to-character conversion rate were compared:

Baseline I: a bigram statistical language model with a manually compiled dictionary containing 46,856 entries of common words, obtained by Peking University by ranking words according to word-frequency statistics.

Baseline II: a bigram statistical language model whose dictionary is learned from the corpus under the minimum language model perplexity criterion.

Optimized SLM: a bigram statistical language model whose dictionary and language model are optimized automatically according to the mutual information between text and Pinyin.

Optimized SLM uses the method of the present invention. The corpus was drawn at random from the most recent five years of the People's Daily; the training corpus contains 1,030,000 sentences in total, and the test corpus contains 50,000. The data are shown in Table 1:

Table 1

Fig. 6 is a schematic diagram of experimental results of an embodiment of the invention, showing the relation between the number of iterations of the EM algorithm and the perplexity on the training corpus. Fig. 7 is another schematic diagram of experimental results, showing how the mutual information between text and Pinyin changes as the number of iterations increases.

Tables 2 and 3 compare the Pinyin-to-character conversion rates, contrasting Baseline I and Baseline II, respectively, with the Optimized SLM method.

Table 2

Table 3

As shown in Fig. 6, the perplexity of both the Baseline II and Optimized SLM models decreases as the number of iterations increases, and both reach a local optimum after six iterations. Baseline II has a lower language model perplexity than the method of the present invention. As shown in Fig. 7, the mutual information between text and Pinyin increases gradually with the number of iterations and converges after eight iterations.

Compared with Baseline I, the method of the present invention shows its superiority on both the training set and the test set: the error rate of Pinyin-to-character conversion is reduced, in relative terms, by 87.04% and 19.72%, respectively. Compared with Baseline II, the error rates obtained by the system of the present invention on the training set and the test set are reduced by 82.8% and 10.3%, respectively.
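For readers unfamiliar with relative reductions, the sketch below shows how such a figure is computed from two error rates. The error rates used are hypothetical placeholders chosen for easy arithmetic; they are not values from Tables 2 and 3.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error-rate reduction: the share of the baseline's errors
    that the new system removes."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: a drop from 10.0% to 8.0% is a 20% relative reduction.
example = relative_reduction(10.0, 8.0)
```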

The experimental results show that the proposed method achieves the best results in terms of Pinyin-to-character conversion accuracy, with particularly high accuracy on the training set. Compared with Baseline II, which optimizes the language model under the traditional perplexity criterion, the method of the present invention has a higher perplexity yet a higher accuracy. This indicates that language model perplexity does not characterize system performance well.

After about 8 iterations, the final dictionary contains 147,784 entries, of which about 36,000 coincide with the traditional dictionary; the remaining entries are data-driven, obtained automatically by optimizing the mutual information between text and Pinyin.

The new entries fall into two classes: (1) adjacent words with a very high co-occurrence rate, such as "he", "will come", and "as us". Under Chinese word-formation rules these entries are usually considered illegitimate, so they are not admitted into standard dictionaries. (2) Newly discovered words and terms, such as "domain name", "Quanjude", and "Beijing Capital Iron and Steel". Adding these words reduces the uncertainty in converting Pinyin to Chinese characters and thus improves the accuracy of Pinyin-to-character conversion.

As can be seen from the above embodiment, performing Chinese Pinyin-to-character conversion with a distinguishing dictionary built from the mutual information between text and Pinyin can further improve the accuracy of the conversion.

Embodiment 2

An embodiment of the invention provides a Chinese Pinyin-to-character conversion system based on a distinguishing dictionary, corresponding to the Chinese Pinyin-to-character conversion method of Embodiment 1; content identical to Embodiment 1 is not repeated.

Fig. 8 is a schematic diagram of the composition of the Chinese Pinyin-to-character conversion system of an embodiment of the invention. As shown in Fig. 8, the Chinese Pinyin-to-character conversion system 800 comprises a first generation unit 801 and a path obtaining unit 802; for the other parts of the system 800, reference may be made to the prior art.

The first generation unit 801 generates, according to an input Pinyin string and a pre-built distinguishing dictionary, a word grid corresponding to the Pinyin string, wherein the distinguishing dictionary is built based on the mutual information between text and Pinyin. The path obtaining unit 802 decodes the word grid according to a statistical language model to obtain the conversion path with the highest probability, thereby realizing Chinese Pinyin-to-character conversion.

As shown in Fig. 8, the Chinese Pinyin-to-character conversion system 800 may further comprise a dictionary building unit 803, which builds the distinguishing dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and Pinyin.

Fig. 9 is a schematic diagram of the composition of the dictionary building unit of an embodiment of the invention. As shown in Fig. 9, the dictionary building unit 803 may comprise a second generation unit 901, a mode determining unit 902, and a text segmentation unit 903.

The second generation unit 901 builds a word grid according to a training Pinyin string and an initial dictionary, and decodes the word grid with a statistical language model to obtain different Pinyin conversion modes. The mode determining unit 902 determines, from the different Pinyin conversion modes, the Pinyin conversion mode with the maximum mutual information. The text segmentation unit 903 segments the text corresponding to the training Pinyin string according to the Pinyin conversion mode with the maximum mutual information, and collects statistics on the segmented text to obtain a new dictionary.

As shown in Fig. 9, the dictionary building unit 803 may further comprise an information evaluation unit 904 and an iteration judging unit 905. The information evaluation unit 904 evaluates the mutual information between the training Pinyin string and the text. When the change in the evaluated mutual information exceeds a preset threshold, the iteration judging unit 905 selects a new training Pinyin string to iteratively train the new dictionary.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are carried out in hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The embodiments described above explain the objects, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A Chinese Pinyin-to-character conversion method based on a distinguishing dictionary, characterized in that the Chinese Pinyin-to-character conversion method comprises:

2. The Chinese Pinyin-to-character conversion method according to claim 1, wherein the Chinese Pinyin-to-character conversion method further comprises:

building the distinguishing dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and Pinyin.

3. The Chinese Pinyin-to-character conversion method according to claim 2, wherein building the distinguishing dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and Pinyin specifically comprises:

segmenting the text corresponding to the training Pinyin string according to the Pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

4. The Chinese Pinyin-to-character conversion method according to claim 3, wherein building the distinguishing dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and Pinyin specifically further comprises:

evaluating the mutual information between the training Pinyin string and the text;

if the change in the evaluated mutual information exceeds a preset threshold, selecting a new training Pinyin string to iteratively train the new dictionary.

5. A method for building a distinguishing dictionary, characterized in that the building method comprises:

6. The building method according to claim 5, wherein the building method further comprises:

evaluating the mutual information between the training Pinyin string and the text;

7. A Chinese Pinyin-to-character conversion system based on a distinguishing dictionary, characterized in that the Chinese Pinyin-to-character conversion system comprises:

8. The Chinese Pinyin-to-character conversion system according to claim 7, wherein the Chinese Pinyin-to-character conversion system further comprises:

a dictionary building unit, which builds the distinguishing dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and Pinyin.

9. The Chinese Pinyin-to-character conversion system according to claim 7, wherein the dictionary building unit specifically comprises:

a second generation unit, which builds a word grid according to a training Pinyin string and an initial dictionary, and decodes the word grid with a statistical language model to obtain different Pinyin conversion modes;

a mode determining unit, which determines, from the different Pinyin conversion modes, the Pinyin conversion mode with the maximum mutual information;

a text segmentation unit, which segments the text corresponding to the training Pinyin string according to the Pinyin conversion mode with the maximum mutual information, and collects statistics on the segmented text to obtain a new dictionary.

10. The Chinese Pinyin-to-character conversion system according to claim 7, wherein the dictionary building unit specifically further comprises:

an information evaluation unit, which evaluates the mutual information between the training Pinyin string and the text;

an iteration judging unit, which, if the change in the evaluated mutual information exceeds a preset threshold, selects a new training Pinyin string to iteratively train the new dictionary.