Background technology
With the rapid development of science and technology, computers have gradually entered every corner of society, and their widespread use has become an inevitable trend in the development of modern society. However, because computers were invented and first applied mainly in the West, promoting their use in China inevitably meets obstacles, the foremost of which is the language barrier. Computers generally display and operate in the English alphabet, so most Chinese users find it very difficult to operate a computer proficiently in English. The use and popularization of computers in China has therefore been limited by the bottleneck of Chinese character input.
To remove this obstacle, many input schemes have been designed in China since the 1970s. According to magazine reports, there are seven to eight hundred of them, including shape codes, sound codes, shape-sound codes, and numeric codes, such as the Wubi (five-stroke) input method (State Patent Office patent No. CN85100837A). These coded input methods have two prominent shortcomings. First, what is input is a code rather than a "word", so a conversion between code and character is required; operators must learn the coding before they can work, which makes the methods hard to popularize. Second, coded input enters one Chinese character at a time rather than whole words, which makes it a low-level input mode.
To address these problems, the state has promoted input methods based on the "Scheme for the Chinese Phonetic Alphabet" (Hanyu Pinyin), such as double pinyin (State Patent Office patent No. CN87100313A). Because the input consists of letters rather than codes, no conversion between code and character is needed. Although its input speed may be lower than that of some coding schemes, as an input mode it is more scientific than the coding schemes.
However, pinyin input methods have shortcomings of their own. Although an orthography has been compiled after ten years of experiment and promotion, it is still far from complete; the rate of duplicate codes during input is too high, and words are difficult to enter. Spelling correction techniques have been proposed to address this problem.
Spelling correction is an indispensable function in application software that handles textual data on general-purpose computers. Besides word processors, such software includes databases, spreadsheets, and so on; spelling correction is used to reduce input errors in the Chinese text of manuscripts and databases.
Spelling correction is widely applied in search engines, mainly to correct input errors and guide the user toward a correct query. Current implementations are mainly based on pinyin correction. For example, when a homophone of "apple" with the same pinyin (pingguo) is entered on Baidu, the query page prompts "Did you mean: apple".
Another application of spelling correction is in pinyin input methods: when the user types a pinyin string that does not exist, some possible words can be recommended.
However, the above spelling correction techniques can only recommend words with the same pronunciation; they cannot recommend words with easily confused pronunciations. For example, they can recommend "apple (pingguo)" for a homophone whose pinyin is "pingguo", but not for a near-homophone whose pinyin is "pinguo". Because many regional dialects exist, pronunciation is often inaccurate, which produces a large number of easily confused pronunciations; in the Zhejiang area, for instance, speakers often cannot distinguish retroflex from flat-tongue initials, or front nasal from back nasal finals. In such cases input errors still occur, and the correction is neither intelligent nor user-friendly.
Summary of the invention
The object of the present invention is to provide a pinyin check system and method with confusable-pronunciation recognition, so as to solve the technical problem that the prior art cannot use similarity of pronunciation to correct errors in a user's Chinese input, so that confusion between regional dialects and Mandarin makes input error-prone.
A pinyin check system with confusable-pronunciation recognition comprises a file storage space and a pinyin check processing unit. The file storage space comprises a dictionary storage unit, a Chinese-character pinyin storage unit, and a Chinese-character confusable-pinyin storage unit. The pinyin check processing unit comprises a Chinese-character pinyin index processing subunit, a dictionary pinyin index processing subunit, and a Chinese-character confusable-pinyin index processing subunit.
The system further comprises an index storage space, which comprises:
a Chinese-character pinyin index file, for storing the index structure used to obtain the pronunciation of a Chinese character from the Chinese-character pinyin storage unit;
a Chinese-character confusable-pinyin index file, for storing the index structure used to find, from a pinyin, its corresponding confusable pinyin in the Chinese-character confusable-pinyin storage unit;
a dictionary pinyin index file, for storing the index structure used to find, from a pinyin, all corresponding words in the dictionary storage unit.
In particular, the words in the dictionary storage unit are arranged in ascending or descending order of the hash value of their pronunciation.
The dictionary pinyin index file further comprises a pinyin hash index subfile and a list address index subfile, wherein:
the pinyin hash index subfile stores, in ascending or descending order of the pinyin hash values, the list address in the list address index subfile that corresponds to each hash value;
the list address index subfile stores, for each list address, the number of words having the same pinyin and the storage address information of those words in the dictionary storage unit.
The dictionary pinyin index processing subunit further comprises:
a hash calculation subunit, for calculating the hash value of a word's pinyin;
a hash index processing subunit, for finding the list address corresponding to the calculated hash value in the pinyin hash index subfile;
a list address processing subunit, for finding, in the list address index subfile, the number of words corresponding to the list address and the storage address information of each word in the dictionary storage unit;
a dictionary processing subunit, for finding the corresponding words in the dictionary storage unit according to the storage address information found by the list address processing subunit.
Based on this system, a pinyin check method with confusable-pronunciation recognition is proposed, comprising the steps of:
(1) setting up a dictionary storage unit that stores words, a Chinese-character pinyin storage unit that stores the pinyin of Chinese characters, and a Chinese-character confusable-pinyin storage unit that stores confusable pinyin;
(2) receiving a keyword input by the user and looking up the corresponding pinyin in the Chinese-character pinyin storage unit;
(3) receiving the pinyin obtained from the Chinese-character pinyin storage unit and looking up the corresponding confusable pinyin in the Chinese-character confusable-pinyin storage unit;
(4) receiving the pinyin provided by step (2) and by step (3), respectively, and searching the dictionary storage unit for the corresponding words.
In step (1), setting up the dictionary storage unit further comprises sorting the words in the dictionary storage unit in ascending or descending order of the hash value of their pronunciation.
Step (1) further comprises:
setting up a pinyin hash index subfile, which stores, in ascending or descending order of the pinyin hash values, the list address in the list address index subfile corresponding to each hash value;
setting up a list address index subfile, which stores, for each list address, the number of words having the same pinyin and the storage address information of those words in the dictionary storage unit.
In step (4), searching the dictionary storage unit for the corresponding words further comprises:
calculating the hash value of each word pinyin;
finding the list address corresponding to the calculated hash value in the pinyin hash index subfile;
finding, in the list address index subfile, the number of words corresponding to the list address and the storage address information of each word in the dictionary storage unit;
finding the corresponding words in the dictionary storage unit according to the storage address information thus found.
Preferably, setting up the Chinese-character pinyin storage unit in step (1) further comprises:
using each Chinese character as a key of a binary tree and its pinyin as the value; for a polyphonic character, one record is added to the binary tree for each reading;
and setting up the Chinese-character confusable-pinyin storage unit in step (1) further comprises:
using each pinyin as a key of a binary tree and its confusable pinyin as the value; if a pinyin has several confusable pinyin, one record is added to the binary tree for each of them.
The beneficial effect of the invention is that, by introducing recognition of confusable pronunciations, it addresses confusion between regional dialects and Mandarin. It uses similarity of pronunciation, such as retroflex versus flat-tongue initials and front versus back nasal finals, to correct errors in the user's Chinese input, which makes spelling correction more intelligent and user-friendly and improves the accuracy of Chinese character input.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a structural diagram of the pinyin check system with confusable-pronunciation recognition of the present invention. The system comprises a file storage space 100 and a pinyin check processing unit 200. The file storage space 100 mainly stores the pinyin and the confusable pinyin corresponding to each Chinese character of an input keyword. The pinyin check processing unit 200 mainly annotates the input keyword with pinyin, looks up its confusable pinyin, and obtains the corresponding words.
The file storage space 100 is generally a memory, or a storage area allocated within a memory. Functionally, it mainly comprises a Chinese-character pinyin storage unit 110, a Chinese-character confusable-pinyin storage unit 120, and a dictionary storage unit 130.
The Chinese-character pinyin storage unit 110 stores the standard pinyin of each Chinese character. Chinese characters and their standard pinyin are stored in the unit 110 in a fixed format, generally "Chinese character:pinyin"; if a character is polyphonic, a "," is placed between its readings. Examples are "苹:ping" (苹 being the first character of the word for "apple") and "盛:sheng,cheng" (盛, "to contain"). Each storage entry holds only one Chinese character and its pinyin.
The Chinese-character pinyin storage unit 110 may store the pinyin of each Chinese character sequentially in dictionary order; when a character is polyphonic, an additional storage entry holds its other reading. Because lookup is slow when characters are stored this way, in the embodiment of the present invention the unit 110 stores Chinese characters and their pinyin in a binary tree. That is, the Chinese character is the key of the binary tree and its pinyin is the value. A polyphonic character is inserted once per reading, so a character with two readings has two records. Stored this way, the pinyin of a Chinese character can be retrieved faster.
The Chinese-character confusable-pinyin storage unit 120 stores the confusable pinyin corresponding to each pinyin in the Chinese-character pinyin storage unit 110. Confusable pinyin are stored in the unit 120 in a fixed format, generally "pinyin:confusable pinyin"; if a pinyin has several confusable pinyin, a "," is placed between them. The confusions caused by regional dialects generally fall into two types: retroflex versus flat-tongue initials, and front nasal versus back nasal finals. Most of the entries stored in the unit 120 are therefore of these two types, for example "ping:pin" and "sheng:shen,seng".
The Chinese-character confusable-pinyin storage unit 120 may store the confusable pinyin of each pinyin sequentially in a fixed order; when a pinyin has several confusable pinyin, an additional storage entry holds each further one. Because lookup is slow when confusable pinyin are stored this way, in the embodiment of the present invention the unit 120 stores each pinyin and its confusable pinyin in a binary tree. That is, the pinyin is the key of the binary tree and its confusable pinyin is the value. If a pinyin has several confusable pinyin, each is inserted once, so there are as many records as there are confusable pinyin. Stored this way, the confusable pinyin of a pinyin can be retrieved faster. The confusable pinyin in the unit 120 can be freely configured by the user according to actual needs.
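The "pinyin:confusable pinyin" record format and the one-record-per-confusable rule can be sketched as follows. This is only a minimal illustration under stated assumptions: a Python dict of lists stands in for the binary tree, and the function names `load_confusable_records` and `confusables` are hypothetical, not part of the described system.

```python
from collections import defaultdict

def load_confusable_records(lines):
    """Parse 'pinyin:confusable1,confusable2' records into a multi-valued map.

    A dict of lists stands in for the binary tree (an assumption of this sketch).
    """
    table = defaultdict(list)
    for line in lines:
        key, _, values = line.strip().partition(":")
        key = key.strip()
        for value in values.split(","):
            value = value.strip()
            # One entry per confusable pinyin, like one tree record per insert.
            if key and value and value not in table[key]:
                table[key].append(value)
    return table

def confusables(table, pinyin):
    """Return the confusable pinyin for one syllable (empty list if none)."""
    return list(table.get(pinyin, []))

# Example records in the storage format described above; a user could add more.
records = ["ping:pin", "pin:ping", "sheng:shen,seng"]
table = load_confusable_records(records)
print(confusables(table, "sheng"))
```

Because the table is built from plain text records, a user can extend it for another dialect simply by adding lines, which matches the free configuration described above.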
The dictionary storage unit 130 stores the words that serve as candidates; it is essentially the set of all candidate words. The unit 130 stores the words in a certain order, which may be dictionary order or some other order. To facilitate lookup, the storage address information of each word can be recorded in advance, for example as an absolute memory address. The present invention may instead record the offset between the address at which a word is stored and the first address of the dictionary storage unit 130; in this way, once the storage address information of a word is obtained, the word can be found quickly and the read speed is improved.
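Offset addressing of this kind might look like the sketch below. The record layout (a 2-byte length prefix followed by UTF-8 bytes) and the names `build_dictionary` and `read_word` are assumptions made for illustration; the text above only states that an offset from the first address is recorded.

```python
import struct

def build_dictionary(words):
    """Pack words into one buffer and record each word's offset from the start."""
    buf = bytearray()
    offsets = []
    for word in words:
        data = word.encode("utf-8")
        offsets.append(len(buf))             # offset relative to the first address
        buf += struct.pack("<H", len(data))  # 2-byte length prefix (assumed layout)
        buf += data
    return bytes(buf), offsets

def read_word(buf, offset):
    """Read one word directly from its offset, without scanning the buffer."""
    (length,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    return buf[start:start + length].decode("utf-8")

buf, offsets = build_dictionary(["apple", "banana", "Zhejiang"])
print(offsets, read_word(buf, offsets[1]))
```

An offset stays valid wherever the buffer is loaded, whereas an absolute address would have to be recomputed each time the dictionary is mapped into memory.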
The pinyin check processing unit 200 is mainly used to perform the spell check operation on the input keyword. It is usually a processor programmed to carry out the spell check. Logically, the unit 200 can be further divided into a Chinese-character pinyin index processing subunit 210, a Chinese-character confusable-pinyin index processing subunit 220, and a dictionary pinyin index processing subunit 230.
The Chinese-character pinyin index processing subunit 210 receives the keyword input by the user and looks up the corresponding pinyin in the Chinese-character pinyin storage unit 110. The subunit 210 could search the unit 110 sequentially, but that is too slow. When the unit 110 stores the correspondence between Chinese characters and pinyin in a binary tree, the subunit 210 can use a multimap for the lookup. multimap is a container of the C++ standard library (namespace std), typically implemented as a balanced binary tree ordered by key, so the value corresponding to a key can be obtained quickly. Equal keys are allowed.
Specifically, the Chinese character is stored as the key of the multimap and its pinyin as the value. A polyphonic character is inserted once per reading. For example, 苹 (the first character of the word for "apple") has one record in the multimap, <苹, ping>, whereas 盛 ("to contain") is polyphonic and has two records, <盛, sheng> and <盛, cheng>.
In operation, the subunit 210 first obtains the input keyword, converts each of its characters into a multimap key, and then looks up the values in the balanced binary tree of the Chinese-character pinyin storage unit 110 through the multimap, obtaining the pinyin corresponding to the keyword. This whole process is called pinyin annotation.
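The pinyin annotation process can be sketched as below. The text describes a C++ std::multimap; here a Python dict of lists is swapped in as a stand-in with the same "several values per key" behaviour, and the characters 苹, 果 and 盛 with their readings are illustrative assumptions, as is the function name `annotate`.

```python
from collections import defaultdict

# Multi-valued map: one (character, reading) insert per record, as in a multimap.
pinyin_map = defaultdict(list)
for char, reading in [("苹", "ping"), ("果", "guo"), ("盛", "sheng"), ("盛", "cheng")]:
    pinyin_map[char].append(reading)

def annotate(keyword):
    """Return, for each character of the keyword, the list of its readings.

    Characters missing from the map yield an empty list.
    """
    return [list(pinyin_map.get(ch, [])) for ch in keyword]

print(annotate("苹果"))  # one reading per character
print(annotate("盛"))    # polyphonic character: two readings
```

A hash-based dict gives no key ordering, unlike a tree-based multimap, but ordering is not needed for the point lookups performed here.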
The Chinese-character confusable-pinyin index processing subunit 220 uses the pinyin provided by the Chinese-character pinyin index processing subunit 210 to look up the confusable pinyin in the Chinese-character confusable-pinyin storage unit 120. The confusable pronunciations include retroflex/flat-tongue initials and front/back nasal finals. The subunit 220 works on the same principle as the subunit 210, so it is not described again here.
Fig. 2 is a structural diagram of the dictionary pinyin index processing subunit of the present invention.
The dictionary pinyin index processing subunit 230 receives the pinyin provided by the subunit 210 and by the subunit 220 and searches the dictionary storage unit 130 for the corresponding words. It is described in detail later and is passed over for now.
The system of the present invention further comprises an index storage space 300 for storing index information. The index storage space 300 comprises:
Chinese-character pinyin index file 310: stores the index information used to obtain the pronunciation of a Chinese character from the Chinese-character pinyin storage unit 110. It usually holds the index rule for finding a pronunciation in the unit 110, the storage address information of the unit 110, and so on. The index rule generally specifies the order in which the search proceeds. The file 310 may be stored in a storage area allocated in memory, or may reside in the Chinese-character pinyin index processing subunit 210; in other words, the file 310 can logically be omitted.
Chinese-character confusable-pinyin index file 320: stores the index information used to find, from a pinyin, the corresponding confusable pinyin in the Chinese-character confusable-pinyin storage unit 120. The index information comprises the index rule and the address information of the unit 120. Likewise, the file 320 may be stored in a storage area allocated in memory, or may reside in the Chinese-character confusable-pinyin index processing subunit 220.
Dictionary pinyin index file 330: stores the index information used to find, from a pinyin, all corresponding words in the dictionary storage unit 130. The dictionary pinyin index file 330 is described in detail below; it is only a preferred embodiment and does not limit the present invention.
The words in the dictionary storage unit 130 may be arranged in ascending or descending order of the hash value of their pronunciation.
The dictionary pinyin index file 330 further comprises a pinyin hash index subfile 410 and a list address index subfile 420, wherein:
Pinyin hash index subfile 410: stores, in ascending or descending order of the pinyin hash values, the list address information in the list address index subfile 420 corresponding to each hash value;
List address index subfile 420: stores, for each list address, the number of words having the same pinyin and the storage address information of those words in the dictionary storage unit 130.
An application example of the dictionary pinyin index file 330 is given below.
Fig. 4 is a schematic diagram of an application example of the dictionary pinyin index file 330. The pinyin hash index subfile 410 stores the correspondence between hash values and list addresses. Words whose calculated hash values are identical have the same list address; that is, a list address can be found from a hash value. The list address information may be the absolute storage address of the list, an offset address, or some other form of address.
The list address index subfile 420 stores the number of words sharing a given hash value and the storage address information of each such word in the dictionary storage unit 130.
For the above dictionary pinyin index file 330, the dictionary pinyin index processing subunit 230 further comprises a hash calculation subunit 231, a hash index processing subunit 232, a list address processing subunit 233, and a dictionary processing subunit 234, wherein:
Hash calculation subunit 231: calculates the hash value of a word's pinyin. The hash value of each word pinyin serves as the basic information of that pinyin. The subunit 231 uses a hash algorithm to obtain the hash value of each word pinyin.
Hash index processing subunit 232: finds the list address corresponding to the calculated hash value in the pinyin hash index subfile 410.
List address processing subunit 233: finds, in the list address index subfile 420, the number of words corresponding to the list address and the storage address information of each word in the dictionary storage unit 130;
Dictionary processing subunit 234: finds the corresponding words in the dictionary storage unit 130 according to the storage address information found by the list address processing subunit 233.
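The chain of subunits 231 to 233 (hash, then list address, then word offsets) can be sketched as a two-level index. Several details here are assumptions rather than statements of the source: `zlib.crc32` stands in for the unnamed hash algorithm, binary search via `bisect` is one way to exploit the sorted hash order, and the names `build_index` and `lookup` are hypothetical.

```python
import bisect
import zlib

def pinyin_hash(pinyin):
    # Stand-in hash; the source does not say which hash algorithm is used.
    return zlib.crc32(pinyin.encode("ascii"))

def build_index(entries):
    """entries: list of (word pinyin, word offset in the dictionary storage unit)."""
    groups = {}
    for pinyin, offset in entries:
        groups.setdefault(pinyin_hash(pinyin), []).append(offset)
    hashes = sorted(groups)   # pinyin hash index: hash values in ascending order
    list_addrs = []           # parallel to hashes: list address for each hash
    lists = []                # list address index: (word count, word offsets)
    for h in hashes:
        list_addrs.append(len(lists))
        lists.append((len(groups[h]), groups[h]))
    return hashes, list_addrs, lists

def lookup(index, pinyin):
    """Return the dictionary offsets of all words stored under this pinyin."""
    hashes, list_addrs, lists = index
    h = pinyin_hash(pinyin)
    i = bisect.bisect_left(hashes, h)       # binary search over sorted hashes
    if i == len(hashes) or hashes[i] != h:
        return []
    count, offsets = lists[list_addrs[i]]
    return offsets[:count]

index = build_index([("ping guo", 20), ("pin guo", 25),
                     ("xiang jiao", 30), ("xiang jiao", 35), ("zhe jiang", 40)])
print(lookup(index, "xiang jiao"))
```

Note that two different pinyin strings with the same hash would share one list in this sketch; a real implementation would need to compare the stored pinyin as well.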
Based on the above pinyin check system with confusable-pronunciation recognition, the present invention proposes a pinyin check method with confusable-pronunciation recognition. Referring to Fig. 5, the method comprises:
S1: Set up a dictionary storage unit that stores words, a Chinese-character pinyin storage unit that stores the pinyin of Chinese characters, and a Chinese-character confusable-pinyin storage unit that stores confusable pinyin.
Setting up the dictionary storage unit in step S1 further comprises sorting the words in the dictionary storage unit in ascending or descending order of the hash value of their pronunciation.
Setting up the Chinese-character pinyin storage unit further comprises:
using each Chinese character as a key of a binary tree and its pinyin as the value; for a polyphonic character, one record is added to the binary tree for each reading.
Setting up the Chinese-character confusable-pinyin storage unit in step S1 further comprises:
using each pinyin as a key of a binary tree and its confusable pinyin as the value; if a pinyin has several confusable pinyin, one record is added to the binary tree for each of them.
Step S1 further comprises:
setting up a pinyin hash index subfile, which stores, in ascending or descending order of the pinyin hash values, the list address in the list address index subfile corresponding to each hash value;
setting up a list address index subfile, which stores, for each list address, the number of words having the same pinyin and the storage address information of those words in the dictionary storage unit.
S2: Receive the keyword input by the user and look up the corresponding pinyin in the Chinese-character pinyin storage unit. Using a multimap, each character of the input keyword is converted into a multimap key, and the values in the balanced binary tree of the Chinese-character pinyin storage unit are looked up through the multimap to obtain the pinyin corresponding to the keyword. If there are several pinyin, they are separated by spaces.
S3: Receive the pinyin obtained from the Chinese-character pinyin storage unit and look up the corresponding confusable pinyin in the Chinese-character confusable-pinyin storage unit. The confusable pronunciations include retroflex/flat-tongue initials and front/back nasal finals. Using a multimap, each pinyin provided by the Chinese-character pinyin index processing subunit is used as a multimap key to look up the values in the balanced binary tree of the Chinese-character confusable-pinyin storage unit, obtaining the confusable pinyin corresponding to those pinyin.
S4: Receive the pinyin provided by step S2 and by step S3, respectively, and search the dictionary storage unit for the corresponding words.
In step S4, searching the dictionary storage unit for the corresponding words further comprises:
calculating the hash value of each word pinyin;
finding the list address corresponding to the calculated hash value in the pinyin hash index subfile;
finding, in the list address index subfile, the number of words corresponding to the list address and the storage address information of each word in the dictionary storage unit;
finding the corresponding words in the dictionary storage unit according to the storage address information found by the list address processing subunit. The storage address information is the offset of the word's address relative to the first address.
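Steps S2 and S3 together yield, for every character of the keyword, its readings plus their confusable pinyin; step S4 then needs whole-word pinyin strings to hash. One plausible way to combine them, assumed here rather than stated in the source, is to take every combination of per-character choices and join the syllables with spaces. The function name `candidate_pinyin` is hypothetical.

```python
from itertools import product

def candidate_pinyin(readings, confusable):
    """Build the space-separated word pinyin strings to look up in step S4.

    readings:   per-character lists of pinyin from step S2
    confusable: dict mapping a pinyin to its confusable pinyin from step S3
    """
    options = []
    for char_readings in readings:
        choices = []
        for reading in char_readings:
            for p in [reading] + confusable.get(reading, []):
                if p not in choices:   # keep order, drop duplicates
                    choices.append(p)
        options.append(choices)
    # Every combination of one choice per character is a candidate word pinyin.
    return [" ".join(combo) for combo in product(*options)]

print(candidate_pinyin([["pin"], ["guo"]], {"pin": ["ping"]}))
```

The number of candidates grows multiplicatively with keyword length, so a practical system might cap it; the source does not discuss this.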
The above flow is illustrated below with a concrete example.
Fig. 6 is a schematic diagram of the dictionary pinyin index file structure as applied in the pinyin check method with confusable-pronunciation recognition of the present invention.
Suppose the dictionary storage unit 130 stores the words "apple" (ping guo), "article Guo" (pin guo), "rubber" (xiang jiao), "banana" (xiang jiao), and "Zhejiang" (zhe jiang), and that their storage address information is offset address information. For instance, relative to the first address PBase of the dictionary storage unit 130, the offsets of "apple", "article Guo", "rubber", "banana", and "Zhejiang" are "20", "25", "30", "35", and "40", respectively.
The pinyin hash index subfile 410 stores, for hash(ping guo), hash(pin guo), hash(xiang jiao), and hash(zhe jiang), the corresponding address information in the list address index subfile 420. This address information is the offset relative to the first address of the list address index subfile 420; the offsets corresponding to hash(ping guo), hash(pin guo), hash(xiang jiao), and hash(zhe jiang) are "10", "12", "14", and "17", respectively.
In the list address index subfile 420, the storage entry at offset "10" stores the number of words with the pinyin "ping guo", which is 1, and the storage address information of that word in the dictionary storage unit 130 (offset 20). The entry at offset "12" stores the number of words with the pinyin "pin guo", which is 1, and the storage address information of that word (offset 25). The entry at offset "14" stores the number of words with the pinyin "xiang jiao", which is 2, and the storage address information of each of those words (offsets 30 and 35). The entry at offset "17" stores the number of words with the pinyin "zhe jiang", which is 1, and the storage address information of that word (offset 40).
When the Chinese-character confusable-pinyin storage unit is set up, "ping" is included among the confusable pinyin of "pin".
Suppose the user intends to input "apple" but, because of inaccurate pronunciation, inputs "article Guo" instead. First, the Chinese-character pinyin storage unit is searched and the corresponding pinyin "pin" and "guo" are found. Then the Chinese-character confusable-pinyin storage unit is searched and the confusable pinyin "ping" corresponding to "pin" is found. Next, the hash values of "ping guo" and "pin guo" are calculated and looked up in the pinyin hash index subfile 410, yielding the corresponding address information (offsets 10 and 12, respectively). The list address index subfile 420 is then searched to obtain the corresponding address information in the dictionary storage unit 130 (offsets 20 and 25). Finally, the corresponding words "apple" and "article Guo" are found in the dictionary storage unit 130, and the user is asked whether one of these words was intended, thereby reducing spelling errors.
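The worked example can be traced end to end with the sketch below, which uses the word offsets (20, 25, 30, 35, 40) and list addresses (10, 12, 14, 17) given for the dictionary storage unit 130 and the list address index subfile 420. Plain dicts stand in for the files, `zlib.crc32` stands in for the unnamed hash algorithm, the English glosses stand in for the stored Chinese words, and the function name `suggest` is hypothetical.

```python
import zlib
from itertools import product

def h(pinyin):
    # Stand-in hash; the source does not name the hash algorithm.
    return zlib.crc32(pinyin.encode("ascii"))

# Dictionary storage unit 130: offset from PBase -> word (English glosses).
WORDS = {20: "apple", 25: "article Guo", 30: "rubber", 35: "banana", 40: "Zhejiang"}
# Pinyin hash index subfile 410: hash value -> list address.
HASH_INDEX = {h("ping guo"): 10, h("pin guo"): 12, h("xiang jiao"): 14, h("zhe jiang"): 17}
# List address index subfile 420: list address -> (word count, word offsets).
LIST_INDEX = {10: (1, [20]), 12: (1, [25]), 14: (2, [30, 35]), 17: (1, [40])}
# Confusable-pinyin storage unit 120.
CONFUSABLE = {"pin": ["ping"]}

def suggest(syllables):
    """Return candidate words for the annotated syllables of an input keyword."""
    options = [[s] + CONFUSABLE.get(s, []) for s in syllables]
    offsets = set()
    for combo in product(*options):
        list_addr = HASH_INDEX.get(h(" ".join(combo)))
        if list_addr is None:
            continue  # no word is stored under this pinyin
        count, word_offsets = LIST_INDEX[list_addr]
        offsets.update(word_offsets[:count])
    return [WORDS[o] for o in sorted(offsets)]

print(suggest(["pin", "guo"]))
```

For the input syllables "pin" and "guo", the sketch visits list addresses 12 and 10 and returns both "apple" and "article Guo", so the user can be asked which word was intended.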
The above discloses only several specific embodiments of the present invention, but the present invention is not limited thereto; any variation that a person skilled in the art can conceive shall fall within the protection scope of the present invention.