CN102750267B - Chinese Pinyin and character conversion method and system as well as distinguishing dictionary building method

Info

Publication number: CN102750267B
Application number: CN201210202471.1A
Authority: CN
Inventors: 张劲松; 李伟; 解焱陆; 曹文
Original assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Current assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date: 2012-06-15
Filing date: 2012-06-15
Publication date: 2015-02-25
Anticipated expiration: 2032-06-15
Also published as: CN102750267A

Abstract

An embodiment of the invention provides a Chinese pinyin-to-character conversion method and system, and a method for building a distinction dictionary. The conversion method comprises: generating, from an input pinyin string and a pre-built distinction dictionary, a word lattice corresponding to the pinyin string, wherein the distinction dictionary is built based on the mutual information between text and pinyin; and decoding the word lattice with a statistical language model to obtain the conversion path with the highest probability, thereby converting the pinyin into Chinese characters. The embodiment can further improve the accuracy of pinyin-to-character conversion.

Description

Chinese pinyin-to-character conversion method and system, and method for building a distinction dictionary

Technical field

The present invention relates to the technical field of pinyin-to-character conversion, and in particular to a Chinese pinyin-to-character conversion method and system based on a distinction dictionary, and to a method for building the distinction dictionary.

Background technology

Pinyin is the phonetic transcription of Chinese characters. Pinyin-to-character conversion is a key component of many systems, such as Chinese keyboard input and Chinese speech recognition. Because Chinese has only about 410 toneless pinyin syllables while about 6,700 Chinese characters correspond to them, selecting the correct character for a given pinyin syllable has become an active research topic.

At present, one of best bet solving this problem is the ambiguity utilizing statistical language model to bring to eliminate homophone word.The structure of statistical language model, needs the problem that solution two is important: 1. the selection of dictionary; 2. the optimization of model parameter.For the most frequently used ternary statistical model, the selection of dictionary can be divided into and have supervision and non-supervisory two classes.The structure having a dictionary in the method for supervision is mainly by hand weaving.But the dictionary that Chinese is not sought unity of standard, perhaps linguist can reach an agreement to up to ten thousand entries, but remaining words then can cause very large dispute.For this reason, a large amount of non-supervisory dictionary creation methods is suggested, and which includes the structure of maximum likelihood method dictionary, based on the structure etc. of mutual information dictionary.Compared with manual dictionary, these methods prove based on data-driven dictionary creation method in a particular application, there is same feasibility, and more cost-saving.

For the parameter optimization of language models, the main criterion used by researchers over the past decades has been maximum likelihood or minimum perplexity. In recent years, to improve the accuracy of Chinese speech recognition, some researchers have proposed discriminative training to optimize the language model. The core idea is that, in pinyin-to-character conversion, the relative probabilities of candidate words contribute more to homophone disambiguation than their absolute probability scores. During language model training, the model parameters are adjusted continually according to the results of pinyin-to-character conversion.

However, in the course of making the present invention, the inventors found the following defect in the prior art: in the traditional methods above, the dictionary is mainly compiled by hand or obtained directly from text, and its construction does not take the information in the pinyin string into account, so the accuracy of pinyin-to-character conversion cannot be improved further.

The following documents are useful for understanding the present invention and the conventional art, and are incorporated herein by reference as if fully set forth herein.

[Reference 1] Jianfeng Gao, Hai-Feng Wang, Mingjing Li, and Kai-Fu Lee, "A Unified Approach to Statistical Language Modeling for Chinese", IEEE ICASSP 2000, Istanbul, Turkey, June 5-9, 2000.

[Reference 2] Lingyun Pan and Changsheng Yang, "An Auto-system For Converting HANYUPINYIN to Chinese Characters", Journal of Computer, 13(4): 271-275.

[Reference 3] Ruiqiang Zhang, Zuoying Wang and Jianping Zhang, "Chinese Pinyin-to-Text Translation Technique with Error Correction Used for Continuous Speech Recognition", Journal of Tsinghua University (Sci & Tech), 37(10): 9-11, 1997.

[Reference 4] Ando, R. and Lee, "Mostly-unsupervised Statistical Segmentation of Japanese: Application to Kanji", ANLP-NAACL, 2000.

[Reference 5] Fuchun Peng, Dale Schuurmans, "Self-Supervised Chinese Word Segmentation", Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, pp. 238-247, September 13-15, 2001.

[Reference 6] Zheng Chen, Kai-Fu Lee, Ming-jing Li, "Discriminative training on language model", in Proc. ISCSLP 2000, Beijing, China, Oct. 2000.

[Reference 7] Hong-Kwang Jeff Kuo, et al., "Discriminative Training of Language Models for Speech Recognition", IEEE ICASSP 2002, Orlando, Florida.

[Reference 8] Jinsong Zhang, Wei Li, Yuxia Hou, Wen Cao, Ziyu Xiong, "A Study On Functional Loads of Phonetic Contrasts Under Context Based On Mutual Information of Chinese Text And Phonemes", The 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tainan, Nov. 2010.

[Reference 9] http://www.speech.sri.com/projects/srilm/

Summary of the invention

Embodiments of the present invention provide a Chinese pinyin-to-character conversion method and system and a method for building a distinction dictionary, with the aim of further improving the accuracy of pinyin-to-character conversion.

According to one aspect of the embodiments of the present invention, a Chinese pinyin-to-character conversion method based on a distinction dictionary is provided. The conversion method comprises:

generating, according to an input pinyin string and a pre-built distinction dictionary, a word lattice corresponding to the pinyin string, wherein the distinction dictionary is built based on the mutual information between text and pinyin; and

decoding the word lattice according to a statistical language model to obtain the conversion path with the maximum probability, thereby realizing pinyin-to-character conversion.

According to another aspect of the embodiments of the present invention, a method for building a distinction dictionary is provided. The method comprises:

generating a word lattice from a training pinyin string and an initial dictionary, and decoding the word lattice with a statistical language model to obtain different pinyin conversion modes;

determining, from the different pinyin conversion modes, the pinyin conversion mode with the maximum mutual information; and

segmenting the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

According to another aspect of the embodiments of the present invention, a Chinese pinyin-to-character conversion system based on a distinction dictionary is provided. The conversion system comprises:

a first generation unit, which generates, according to an input pinyin string and a pre-built distinction dictionary, a word lattice corresponding to the pinyin string, wherein the distinction dictionary is built based on the mutual information between text and pinyin; and

a path obtaining unit, which decodes the word lattice according to a statistical language model to obtain the conversion path with the maximum probability, thereby realizing pinyin-to-character conversion.

The beneficial effect of the embodiments of the present invention is that performing pinyin-to-character conversion with a distinction dictionary built from the mutual information between text and pinyin can further improve the accuracy of the conversion.

Brief description of the drawings

The drawings described here provide a further understanding of the present invention and form a part of this application; they do not limit the invention. In the drawings:

Fig. 1 is a flow chart of the conversion method of the embodiment of the present invention;

Fig. 2 is a schematic diagram of the text-pinyin-text conversion model of the embodiment of the present invention;

Fig. 3 is an example of the decoding process of the embodiment of the present invention;

Fig. 4 is a schematic flow chart of building the distinction dictionary in the embodiment of the present invention;

Fig. 5 is another schematic flow chart of building the distinction dictionary in the embodiment of the present invention;

Fig. 6 is a schematic diagram of experimental results of the embodiment of the present invention;

Fig. 7 is another schematic diagram of experimental results of the embodiment of the present invention;

Fig. 8 is a schematic diagram of the composition of the Chinese pinyin-to-character conversion system of the embodiment of the present invention;

Fig. 9 is a schematic diagram of the composition of the dictionary creation unit of the embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings. The illustrative embodiments and their descriptions serve to explain the present invention and do not limit it.

Embodiment 1

The embodiment of the present invention provides a Chinese pinyin-to-character conversion method based on a distinction dictionary. Fig. 1 is a flow chart of the conversion method of the embodiment of the present invention. As shown in Fig. 1, the conversion method comprises:

Step 101, generating, according to an input pinyin string and a pre-built distinction dictionary, a word lattice corresponding to the pinyin string, wherein the distinction dictionary is built based on the mutual information between text and pinyin;

Step 102, decoding the word lattice according to a statistical language model to obtain the conversion path with the maximum probability, thereby realizing pinyin-to-character conversion.

In this embodiment, a dictionary with stronger discriminative power can be obtained automatically by repeatedly maximizing the mutual information between the pinyin string and its corresponding text. During pinyin-to-character conversion, a word lattice is first generated from the given pinyin string and the dictionary, and the lattice is then decoded dynamically with the statistical language model to obtain the conversion path with the maximum probability.
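As an illustration of the lattice-generation step, the following minimal Python sketch enumerates every dictionary entry that matches a span of the input syllables. The toy dictionary `TOY_DICT`, the function name `build_lattice`, the edge representation, and the `max_len` limit are assumptions made for this example and are not specified by the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: tuple of pinyin syllables -> candidate words.
TOY_DICT: Dict[Tuple[str, ...], List[str]] = {
    ("zhong",): ["中", "种", "重"],
    ("guo",): ["国", "过"],
    ("zhong", "guo"): ["中国"],
    ("ren",): ["人"],
    ("min",): ["民"],
    ("ren", "min"): ["人民"],
}

Edge = Tuple[int, int, str]  # (start syllable index, end syllable index, word)


def build_lattice(syllables: List[str],
                  dictionary: Dict[Tuple[str, ...], List[str]],
                  max_len: int = 4) -> List[Edge]:
    """List every dictionary word that spans syllables[start:end]."""
    edges: List[Edge] = []
    n = len(syllables)
    for start in range(n):
        # Try every span of at most max_len syllables beginning at `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            key = tuple(syllables[start:end])
            for word in dictionary.get(key, []):
                edges.append((start, end, word))
    return edges


lattice = build_lattice("zhong guo ren min".split(), TOY_DICT)
for edge in lattice:
    print(edge)
```

Each edge records which syllables a candidate word covers, so a path of edges from position 0 to the last syllable is one possible conversion of the whole pinyin string.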

In this embodiment, taking the information between pinyin and text into account while building the dictionary helps to raise the pinyin-to-character conversion rate. For example, given the pinyin string "xiang wo men zhe yang de nian qing ren", a system using a bigram statistical language model and a conventional dictionary produces the result "to the young man that we are such", whereas the correct result is "young men as us". The present invention avoids this kind of error by adding the entry "as us" to the conventional dictionary. Entries are added fully automatically in a data-driven way, and the criterion for adding an entry is the mutual information between the pinyin and its corresponding text.

Fig. 2 is a schematic diagram of the text-pinyin-text conversion model of the embodiment of the present invention, in which pinyin-to-character conversion is formalized as an information decoding process. As shown in Fig. 2, W denotes a piece of language expressed as text and F denotes the corresponding pinyin string. The conversion from F to W is the conversion from a pinyin sequence to a Chinese character sequence, and the higher-level knowledge this conversion requires comprises a dictionary and a statistical language model.

Given a pinyin string, decoding it with different dictionaries and statistical language models may yield different Chinese character strings W_1, W_2, and so on. The optimal dictionary makes W = W_i. Because the choice of dictionary determines the segmentation of the pinyin string, the optimal pinyin segmentation can be described by the following formula:

\underset{i}{\arg\max}\; I(W, F_i), \quad \text{where } i = 1, 2, \ldots \qquad (1)

The mutual information I(W, F_i) between W and F_i is defined as:

I(W, F_i) = H(W) - H(W \mid F_i) \qquad (2)

H(W) is the information entropy of the text W, where W is represented by the word sequence {w_1, w_2, w_3, ..., w_n}. H(W) is obtained by computing the average information entropy per word:

H(W) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1, w_2, \cdots, w_n) \qquad (3)

where

p(W) = p(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \cdots, w_{i-1}) \qquad (4)
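Formulas (3) and (4) can be illustrated numerically. The sketch below is a minimal example that assumes a bigram approximation of formula (4), that is, p(w_i | w_1, ..., w_{i-1}) is replaced by p(w_i | w_{i-1}), and a base-2 logarithm; the patent does not state the base. The function names and the toy probabilities in `TOY_BIGRAMS` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def sentence_log2prob(words: List[str],
                      bigram_prob: Dict[Tuple[str, str], float]) -> float:
    """log2 p(W) under a bigram approximation of the chain rule in formula (4)."""
    logp, prev = 0.0, "<s>"
    for word in words:
        logp += math.log2(bigram_prob[(prev, word)])
        prev = word
    return logp


def per_word_entropy(words: List[str],
                     bigram_prob: Dict[Tuple[str, str], float]) -> float:
    """Finite-sample estimate of formula (3): -(1/n) * log2 p(w_1, ..., w_n)."""
    return -sentence_log2prob(words, bigram_prob) / len(words)


# Hypothetical probabilities, chosen only to keep the arithmetic easy to follow.
TOY_BIGRAMS = {("<s>", "中国"): 0.5, ("中国", "人民"): 0.25}
sentence = ["中国", "人民"]
print(sentence_log2prob(sentence, TOY_BIGRAMS))
print(per_word_entropy(sentence, TOY_BIGRAMS))
```

With these toy values the sentence probability is 0.5 × 0.25 = 0.125, so the log-probability is -3 bits and the per-word estimate is 1.5 bits.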

The variable I(W, F_i) measures the relationship between the sentence W and a possible segmentation F_i of its pinyin string. Given different pinyin segmentations F_i, with the other conditions such as the dictionary and the language model unchanged, a larger I(W, F_i) indicates a closer relationship between the text W and the segmentation F_i. It follows that the conversion from F_i to W is less ambiguous, which ensures the accuracy of pinyin-to-character conversion.

Evaluating and simplifying formula (2) gives the following formula:

I(W, F_i) = -\log \sum_{\text{all } W_j \text{ with } F_i} P(W_j) \qquad (5)

The variable W_j ranges over all candidate word strings that share the pinyin segmentation F_i. Further details on mutual information can be found in the references listed above.
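To make formulas (1) and (5) concrete, the following minimal sketch scores each candidate segmentation by the negative log of the total probability of the word strings that share it, then picks the segmentation with the largest score. The function names, the segmentation labels, and all probabilities are hypothetical values for illustration only.

```python
import math
from typing import Dict, List


def mutual_information(candidate_probs: List[float]) -> float:
    """Formula (5): I(W, F_i) = -log of the summed P(W_j) sharing segmentation F_i."""
    return -math.log(sum(candidate_probs))


def best_segmentation(segmentations: Dict[str, List[float]]) -> str:
    """Formula (1): return the segmentation whose I(W, F_i) is largest."""
    return max(segmentations,
               key=lambda name: mutual_information(segmentations[name]))


# Hypothetical language-model probabilities of the candidate word strings
# that share each segmentation of the same pinyin string.
TOY_SEGMENTATIONS = {
    "zhong | guo": [0.02, 0.01, 0.005],  # several competing word strings
    "zhongguo": [0.01],                  # a single word string
}

for name, probs in TOY_SEGMENTATIONS.items():
    print(name, round(mutual_information(probs), 3))
print(best_segmentation(TOY_SEGMENTATIONS))
```

As formula (5) is written, a segmentation whose candidate word strings carry less total probability mass receives a higher score, so in this toy case the single-entry segmentation is selected.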

In this embodiment, the decoding process from a pinyin string to text can be described with a word lattice, in which the same pinyin string yields different segmentations according to the dictionary.

Fig. 3 is an example of the decoding process of the embodiment of the present invention. As shown in Fig. 3, it illustrates part of the word lattice for the pinyin string "zhongguo ren min sheng huo". The nodes <s> and </s> mark the beginning and end of every word sequence, and the remaining nodes are all possible candidate words corresponding to the pinyin string.

For example, the candidate words for the syllable "zhong" include "kind", "in", "heavy", and so on; because of limited space, Fig. 3 shows only part of the word lattice. According to formula (5), the path with the maximum probability from the beginning to the end of the sentence can be found by dynamic programming. This path corresponds to the optimal segmentation of the pinyin string.
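The dynamic-programming search over the lattice can be sketched as a Viterbi pass. The example below is a minimal illustration that assumes a bigram language model given as log-probabilities, with a fixed `floor` score for unseen bigrams; the edge list, the scores, and the function name `viterbi` are hypothetical and do not come from the patent.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str]   # (start syllable, end syllable, candidate word)
State = Tuple[int, str]       # (syllable position reached, last word on the path)


def viterbi(edges: List[Edge], n: int,
            bigram_logp: Dict[Tuple[str, str], float],
            floor: float = -20.0) -> Tuple[float, List[str]]:
    """Return the best log-probability and word sequence through the lattice."""
    by_start = defaultdict(list)
    for start, end, word in edges:
        by_start[start].append((end, word))

    # best[state] = (score of the best path reaching state, previous state)
    best: Dict[State, Tuple[float, Optional[State]]] = {(0, "<s>"): (0.0, None)}
    for pos in range(n):
        reached = [(s, v[0]) for s, v in best.items() if s[0] == pos]
        for state, score in reached:
            for end, word in by_start[pos]:
                cand = score + bigram_logp.get((state[1], word), floor)
                if (end, word) not in best or cand > best[(end, word)][0]:
                    best[(end, word)] = (cand, state)

    # Close every complete path with the sentence-end symbol </s>.
    final_score, final_state = -math.inf, None
    for state, (score, _) in best.items():
        if state[0] == n:
            total = score + bigram_logp.get((state[1], "</s>"), floor)
            if total > final_score:
                final_score, final_state = total, state

    # Follow the back-pointers from the sentence end to <s>.
    words: List[str] = []
    state = final_state
    while state is not None and state[1] != "<s>":
        words.append(state[1])
        state = best[state][1]
    return final_score, words[::-1]


# Hypothetical lattice for the four syllables "zhong guo ren min".
EDGES = [(0, 1, "中"), (0, 1, "种"), (1, 2, "国"), (0, 2, "中国"),
         (2, 3, "人"), (3, 4, "民"), (2, 4, "人民")]
# Hypothetical bigram log-probabilities; unseen pairs fall back to `floor`.
BIGRAM_LOGP = {("<s>", "中国"): -1.0, ("中国", "人民"): -0.5, ("人民", "</s>"): -0.7,
               ("<s>", "中"): -2.0, ("中", "国"): -1.5, ("国", "人民"): -2.0}

score, words = viterbi(EDGES, 4, BIGRAM_LOGP)
print(words, round(score, 2))
```

Keeping the last word in the search state is what lets a bigram score be applied to each new edge; a higher-order model would need a longer history in the state.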

How the distinction dictionary is built is described in detail below. The present invention can adjust all possible word boundaries in a sentence according to the mutual information between text and pinyin, and thereby build the distinction dictionary.

Fig. 4 is a schematic flow chart of building the distinction dictionary in the embodiment of the present invention. As shown in Fig. 4, building the distinction dictionary can comprise:

Step 401, generating a word lattice from a training pinyin string and an initial dictionary, and decoding the word lattice with a statistical language model to obtain different pinyin conversion modes;

Step 402, determining, from the different pinyin conversion modes, the pinyin conversion mode with the maximum mutual information;

Step 403, segmenting the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

Fig. 4 illustrates one iteration of building the distinction dictionary; multiple iterations can be performed in a specific implementation. By optimizing the dictionary, the language model, and the pinyin segmentation simultaneously, the mutual information between the text W and the pinyin F is increased continually, and after several iterations the process can be stopped according to a threshold.
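The last step of one iteration, cutting the text along the chosen segmentation and counting the resulting words, can be sketched as follows. The sketch assumes that each Chinese character corresponds to exactly one pinyin syllable, so a segmentation given as syllable counts per word applies directly to the character string. The function names and the two-sentence corpus are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cut_text(text: str, seg_lengths: List[int]) -> List[str]:
    """Split `text` into words whose lengths follow the chosen segmentation."""
    if sum(seg_lengths) != len(text):
        raise ValueError("segmentation does not cover the text exactly")
    words, pos = [], 0
    for length in seg_lengths:
        words.append(text[pos:pos + length])
        pos += length
    return words


def rebuild_dictionary(corpus: Iterable[Tuple[str, List[int]]]) -> Counter:
    """Count the words produced by cutting every sentence; the keys form the new dictionary."""
    counts: Counter = Counter()
    for text, seg_lengths in corpus:
        counts.update(cut_text(text, seg_lengths))
    return counts


# Hypothetical training sentences paired with their maximum-mutual-information
# segmentation, expressed as the number of syllables in each word.
TOY_CORPUS = [("中国人民", [2, 2]), ("人民生活", [2, 2])]
new_dictionary = rebuild_dictionary(TOY_CORPUS)
print(new_dictionary.most_common())
```

The counts can also serve as the raw statistics for re-estimating the language model in the next iteration, although how the patent's implementation does this is not specified here.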

Fig. 5 is another schematic flow chart of building the distinction dictionary in the embodiment of the present invention. As shown in Fig. 5, building the distinction dictionary can comprise:

Step 501, training an initial language model on a fully segmented news corpus, where the maximum word length can be 4.

Step 502, generating a word lattice from a training pinyin string and the dictionary, and decoding the word lattice with the statistical language model to obtain different pinyin conversion modes.

Step 503, determining, from the different pinyin conversion modes, the pinyin conversion mode with the maximum mutual information.

Step 504, segmenting the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

Step 505, evaluating the mutual information between the training pinyin string and the text; if the change in the evaluated mutual information exceeds a predetermined threshold, step 502 is performed again; if the change does not exceed the predetermined threshold, step 506 is performed.

In this embodiment, as long as the change in the mutual information between the pinyin and the text exceeds the threshold, a new training pinyin string can be selected and steps 502 to 504 repeated, so that the new dictionary is trained iteratively.

Step 506, obtaining the final dictionary and language model.
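The stopping rule of steps 502 to 506 can be sketched as a simple loop. In the minimal example below, `step` stands for one pass of steps 502 to 504 followed by the evaluation in step 505 and returns the current mutual information; the function name, the `max_iters` safeguard, and the sequence of values are hypothetical.

```python
from typing import Callable, List


def iterate_until_converged(step: Callable[[], float],
                            threshold: float,
                            max_iters: int = 50) -> List[float]:
    """Repeat `step` until the gain in mutual information is not above `threshold`."""
    history: List[float] = []
    previous = None
    for _ in range(max_iters):
        current = step()
        history.append(current)
        # Stop once the change no longer exceeds the threshold (step 505 -> 506).
        if previous is not None and current - previous <= threshold:
            break
        previous = current
    return history


# Hypothetical mutual-information values returned by successive iterations.
fake_values = iter([1.0, 1.6, 1.9, 1.95, 1.96])
history = iterate_until_converged(lambda: next(fake_values), threshold=0.1)
print(history)
```

In this toy run the loop stops at the fourth value, because the gain from 1.9 to 1.95 is below the threshold of 0.1.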

In this embodiment, a preliminary experiment was carried out on a People's Daily corpus. Pinyin strings were decoded with three different dictionaries and language models, and the effects of the three dictionary construction methods on the pinyin-to-character conversion rate were compared:

Baseline I: a bigram statistical language model with a manually compiled dictionary of 46,856 common entries, obtained by Peking University through word-frequency statistics and ranking.

Baseline II: a bigram statistical language model whose dictionary is learned from the corpus under the criterion of minimum language-model perplexity.

Optimized SLM: a bigram statistical language model in which the dictionary and the language model are optimized automatically according to the mutual information between text and pinyin.

Optimized SLM uses the method of the present invention. The corpus was drawn at random from the People's Daily of the most recent five years; the training corpus contains 1,030,000 sentences in total and the test corpus 50,000. The data are shown in Table 1:

Table 1

Fig. 6 is a schematic diagram of experimental results of the embodiment of the present invention, showing the relation between the number of EM iterations and the corpus perplexity. Fig. 7 is another schematic diagram of experimental results of the embodiment of the present invention, showing how the mutual information between text and pinyin changes as the number of iterations increases.

Tables 2 and 3 compare the pinyin-to-character conversion rates, contrasting the Optimized SLM method with Baseline I and Baseline II respectively.

Table 2

Table 3

As shown in Fig. 6, the perplexity of both language models, Baseline II and Optimized SLM, decreases as the number of iterations increases, and both reach a local optimum after six iterations. Baseline II has a lower language-model perplexity than the method of the present invention. As shown in Fig. 7, the mutual information between text and pinyin increases gradually with the number of iterations and converges after eight iterations.
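For readers unfamiliar with the measure plotted in Fig. 6, the sketch below computes perplexity from per-word log-probabilities, using the standard definition of 2 raised to the negative average log2 probability. It assumes that the quantity reported in the experiment is this standard perplexity; the function name and the input values are hypothetical and are not taken from the experiment.

```python
from typing import List


def perplexity(log2_probs: List[float]) -> float:
    """Perplexity = 2 ** (-(1/N) * sum of per-word log2 probabilities)."""
    return 2 ** (-sum(log2_probs) / len(log2_probs))


# Hypothetical per-word log2 probabilities assigned by a language model.
print(perplexity([-1.0, -2.0, -3.0]))
```

A lower value means the model finds the corpus more predictable, which is why a perplexity-optimized dictionary can score better on this measure without converting pinyin more accurately.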

Compared with Baseline I, the method of the present invention shows its advantage on both the training set and the test set: the error rate of pinyin-to-character conversion is reduced by a relative 87.04% and 19.72% respectively. Compared with Baseline II, the error rate obtained by the system of the present invention is reduced by 82.8% on the training set and 10.3% on the test set.
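The relative reductions quoted above follow the usual formula (baseline error minus new error) divided by baseline error. The sketch below shows the arithmetic with hypothetical error rates; the underlying error rates from Tables 2 and 3 are not reproduced in this text, so the inputs here are illustrative only.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent, not values from the patent's tables.
print(relative_error_reduction(10.0, 8.0))
```

Because the measure is relative, a large percentage on the training set can correspond to a small absolute change when the baseline error is already low.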

The experimental results show that the proposed method achieves the best pinyin-to-character conversion accuracy, with particularly high accuracy on the training set. Compared with Baseline II, which optimizes the language model by perplexity in the traditional way, the method of the present invention has higher perplexity but also higher accuracy. This indicates that language-model perplexity does not describe system performance well.

After about 8 iterations, the final dictionary contains 147,784 entries, of which about 36,000 coincide with the conventional dictionary; the remaining entries are data-driven, acquired automatically by optimizing the mutual information between text and pinyin.

The new entries fall into two classes: (1) combinations of adjacent words with a very high co-occurrence rate, such as "he", "will come", and "as us". Under Chinese word-formation rules such entries are usually not considered legitimate words, so they are not admitted into a standard dictionary. (2) Newly discovered words and terms, such as "domain name", "Quanjude", and "Beijing Capital Iron and Steel". Adding these words reduces the uncertainty in converting pinyin to Chinese characters and thereby improves the accuracy of pinyin-to-character conversion.

As the above embodiment shows, performing pinyin-to-character conversion with a distinction dictionary built from the mutual information between text and pinyin can further improve the accuracy of the conversion.

Embodiment 2

The embodiment of the present invention provides a Chinese pinyin-to-character conversion system based on a distinction dictionary, corresponding to the conversion method of Embodiment 1; content identical to Embodiment 1 is not repeated.

Fig. 8 is a schematic diagram of the composition of the Chinese pinyin-to-character conversion system of the embodiment of the present invention. As shown in Fig. 8, the conversion system 800 comprises a first generation unit 801 and a path obtaining unit 802; for the other parts of the conversion system 800, reference can be made to the prior art.

The first generation unit 801 generates, according to an input pinyin string and a pre-built distinction dictionary, a word lattice corresponding to the pinyin string, wherein the distinction dictionary is built based on the mutual information between text and pinyin. The path obtaining unit 802 decodes the word lattice according to a statistical language model to obtain the conversion path with the maximum probability, thereby realizing pinyin-to-character conversion.

As shown in Fig. 8, the conversion system 800 can further comprise a dictionary creation unit 803, which adjusts all possible word boundaries in a sentence according to the mutual information between text and pinyin to build the distinction dictionary.

Fig. 9 is a schematic diagram of the composition of the dictionary creation unit of the embodiment of the present invention. As shown in Fig. 9, the dictionary creation unit 803 can comprise a second generation unit 901, a mode determining unit 902, and a text cutting unit 903.

The second generation unit 901 generates a word lattice from a training pinyin string and an initial dictionary, and decodes the word lattice with a statistical language model to obtain different pinyin conversion modes. The mode determining unit 902 determines, from the different pinyin conversion modes, the one with the maximum mutual information. The text cutting unit 903 segments the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collects statistics on the segmented text to obtain a new dictionary.

As shown in Fig. 9, the dictionary creation unit 803 can further comprise an information evaluation unit 904 and an iteration judging unit 905. The information evaluation unit 904 evaluates the mutual information between the training pinyin string and the text. The iteration judging unit 905, when the change in the evaluated mutual information exceeds a predetermined threshold, selects a new training pinyin string to iteratively train the new dictionary.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The embodiments above further describe the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A Chinese pinyin-to-character conversion method based on a distinction dictionary, characterized in that the conversion method comprises:

decoding the word lattice according to a statistical language model to obtain the conversion path with the maximum probability, thereby realizing pinyin-to-character conversion;

wherein the conversion method further comprises:

adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin to build the distinction dictionary, which specifically comprises:

segmenting the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

2. The Chinese pinyin-to-character conversion method according to claim 1, wherein adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin to build the distinction dictionary further comprises:

evaluating the mutual information between the training pinyin string and the text;

if the change in the evaluated mutual information exceeds a predetermined threshold, selecting a new training pinyin string to iteratively train the new dictionary.

3. A method for building a distinction dictionary, characterized in that the method comprises:

4. The method according to claim 3, wherein the method further comprises:

5. A Chinese pinyin-to-character conversion system based on a distinction dictionary, characterized in that the conversion system comprises:

a path obtaining unit, which decodes the word lattice according to a statistical language model to obtain the conversion path with the maximum probability, thereby realizing pinyin-to-character conversion;

a dictionary creation unit, which adjusts all possible word boundaries in a sentence according to the mutual information between text and pinyin to build the distinction dictionary;

wherein the dictionary creation unit specifically comprises:

a second generation unit, which generates a word lattice from a training pinyin string and an initial dictionary, and decodes the word lattice with a statistical language model to obtain different pinyin conversion modes;

a mode determining unit, which determines, from the different pinyin conversion modes, the pinyin conversion mode with the maximum mutual information;

a text cutting unit, which segments the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collects statistics on the segmented text to obtain a new dictionary.

6. The Chinese pinyin-to-character conversion system according to claim 5, wherein the dictionary creation unit further comprises:

an information evaluation unit, which evaluates the mutual information between the training pinyin string and the text;

an iteration judging unit, which, if the change in the evaluated mutual information exceeds a predetermined threshold, selects a new training pinyin string to iteratively train the new dictionary.