CN101388012B - Phonetic check system and method with easy confusion tone recognition - Google Patents


Publication number
CN101388012B
Authority
CN
China
Prior art keywords
phonetic
storage unit
index
speech
chinese character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2007101494831A
Other languages
Chinese (zh)
Other versions
CN101388012A (en)
Inventor
孙海涛
施行向
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Network Technology Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN2007101494831A
Publication of CN101388012A
Priority to HK09108175.1A (HK1128541A1)
Application granted
Publication of CN101388012B
Legal status: Active

Abstract

The invention discloses a Pinyin check system with identification of confusable spellings. The system comprises a file storage space, an index storage space, and a Pinyin check processing unit. The file storage space comprises a word stock storage unit, a Chinese character Pinyin storage unit, and a Chinese character confusable spelling storage unit. The Pinyin check processing unit comprises a Chinese character Pinyin index processing sub-unit, a word stock Pinyin index processing sub-unit, and a Chinese character confusable spelling index processing sub-unit. The invention further provides a method for checking Pinyin, comprising the steps of: (1) setting up the word stock storage unit, the Chinese character Pinyin storage unit, and the Chinese character confusable spelling storage unit; (2) receiving keywords input by a user and searching for the corresponding spelling in the Chinese character Pinyin storage unit; (3) receiving the spelling sent by the Chinese character Pinyin storage unit and searching for the corresponding confusable spelling in the Chinese character confusable spelling storage unit; (4) receiving the spellings provided by step (2) and step (3) respectively and searching for the corresponding words in the word stock storage unit. The invention improves the accuracy of Chinese character input.

Description

Pinyin check system and method with recognition of easily confused pronunciations
Technical field
The present invention relates to pinyin checking technology, and in particular to pinyin checking technology that recognizes easily confused pronunciations.
Background
With the rapid development of science and technology, computers have gradually entered every corner of society, and their widespread use has become an inevitable trend of modern social development. However, because computers were invented and are mainly applied in the West, promoting their use in China inevitably runs into obstacles, the chief of which is language and script. Computers are generally displayed and operated with the English alphabet, so most Chinese users find it very difficult to operate a computer proficiently in English. The use and popularization of computers in China has therefore been limited by the bottleneck of Chinese character input.
To remove this obstacle, many input schemes have been designed in China since the 1970s; according to magazine reports there are already seven or eight hundred of them. They include shape-based codes, sound-based codes, combined shape-sound codes, and numeric codes, for example the five-stroke (Wubi) input method (State Patent Office patent No. CN85100837A). These coded input methods have two prominent shortcomings. First, what is input is a code rather than a "word", so a conversion between code and character is needed; operators must learn the coding before they can work, which makes the methods hard to popularize. Second, coded input enters one Chinese character at a time rather than a meaningful word, which makes it a low-level input mode.
To address these problems, the state promoted input methods based on the "Scheme for the Chinese Phonetic Alphabet", such as double pinyin (State Patent Office patent No. CN87100313A). Because the input consists of letters rather than codes, no conversion between code and character is involved. Although its input speed may be lower than that of some coding schemes, it is the more scientific approach as an input mode.
However, input methods based on the "Scheme for the Chinese Phonetic Alphabet" have shortcomings of their own. Although an orthography has been compiled after some ten years of experiment and promotion, it remains far from complete: the rate of duplicate codes during computer input is too high, and vocabulary is difficult to enter. Spelling correction technology was proposed to address this problem.
Spelling correction is an indispensable function of general-purpose application software that handles text. Besides word processors, such software includes databases and spreadsheets; spelling correction is used to reduce input errors in written manuscripts and in the Chinese text data of databases.
Spelling correction is widely used in search engines, mainly to correct input errors and guide the user toward a correct query. Current implementations are mostly based on pinyin correction. For example, when a homophonic misspelling of "apple" (pinyin "pingguo") is entered on Baidu, the Baidu query page prompts "Did you mean: apple".
Another application of spelling correction is in pinyin input methods: when the user enters a pinyin string that does not exist, some likely words can be recommended.
However, the above spelling correction techniques can only recommend words with the same pronunciation and cannot recommend words with an easily confused pronunciation. For example, they can recommend "apple (pingguo)" for a misspelling that is also pronounced "pingguo", but not for a misspelling pronounced "pinguo". Because many regional dialects exist and pronunciation is often inaccurate, easily confused pronunciations are common; in the Zhejiang region, for instance, speakers often fail to distinguish retroflex from flat-tongue initials (zh/ch/sh versus z/c/s) and front from back nasal finals (-n versus -ng). In such cases input errors still occur, and the correction is neither intelligent nor user-friendly.
Summary of the invention
The object of the present invention is to provide a pinyin check system and method with recognition of easily confused pronunciations, to solve the technical problem that the prior art cannot use similarity of pronunciation to correct mistakes in a user's Chinese input, so that confusion between regional dialects and Mandarin leads to input errors.
A pinyin check system with recognition of easily confused pronunciations comprises a file storage and a pinyin check processing unit. The file storage comprises a dictionary storage unit, a Chinese character pinyin storage unit, and a Chinese character confusable-pinyin storage unit. The pinyin check processing unit comprises a Chinese character pinyin index processing subunit, a dictionary pinyin index processing subunit, and a Chinese character confusable-pinyin index processing subunit.
The system further comprises an index storage space, which comprises:
Chinese character pinyin index file: stores the index structure used to obtain a pronunciation from the Chinese character pinyin storage unit for a given Chinese character;
Chinese character confusable-pinyin index file: stores the index structure used to find, in the Chinese character confusable-pinyin storage unit, the confusable pinyin corresponding to a given pinyin;
Dictionary pinyin index file: stores the index structure used to find, in the dictionary storage unit, all words matching a given pinyin.
In particular, the dictionary storage unit is arranged in ascending or descending order of the hash value of each word's pronunciation;
The dictionary pinyin index file further comprises a pinyin hash value index sub-file and a list address index sub-file, wherein:
Pinyin hash value index sub-file: stores, in ascending or descending order of pinyin hash value, the list address in the list address index sub-file that corresponds to each hash value;
List address index sub-file: stores, for each list address, the number of words that share the same pinyin and the storage address information of those words in the dictionary storage unit.
The dictionary pinyin index processing subunit further comprises:
Hash calculation subunit: calculates the hash value of a word's pinyin;
Hash value index processing subunit: uses the calculated hash value to find the corresponding list address in the pinyin hash value index sub-file;
List address processing subunit: uses the list address to find, in the list address index sub-file, the corresponding number of words and the storage address information of each word in the dictionary storage unit;
Dictionary processing subunit: uses the storage address information found by the list address processing subunit to find the corresponding words in the dictionary storage unit.
Based on this system, a pinyin check method with recognition of easily confused pronunciations is proposed, comprising the steps of:
(1) setting up a dictionary storage unit that stores words, a Chinese character pinyin storage unit that stores the pinyin of Chinese characters, and a Chinese character confusable-pinyin storage unit that stores confusable pinyin;
(2) receiving the keyword input by the user and looking up the corresponding pinyin in the Chinese character pinyin storage unit;
(3) receiving the pinyin sent by the Chinese character pinyin storage unit and looking up the corresponding confusable pinyin in the Chinese character confusable-pinyin storage unit;
(4) receiving the pinyin provided by step (2) and step (3) respectively and searching the dictionary storage unit for the corresponding words.
In step (1), setting up the dictionary storage unit further comprises sorting the words in the dictionary storage unit in ascending or descending order of the hash value of each word's pronunciation.
Step (1) also comprises:
Setting up the pinyin hash value index sub-file: storing, in ascending or descending order of pinyin hash value, the list address in the list address index sub-file that corresponds to each hash value;
Setting up the list address index sub-file: storing, for each list address, the number of words that share the same pinyin and the storage address information of those words in the dictionary storage unit.
In step (4), searching the dictionary storage unit for the corresponding words further comprises:
Calculating the hash value of each word pinyin;
Using the calculated hash value to find the corresponding list address in the pinyin hash value index sub-file;
Using the list address to find, in the list address index sub-file, the corresponding number of words and the storage address information of each word in the dictionary storage unit;
Using the storage address information of those words to find the corresponding words in the dictionary storage unit.
Preferably, setting up the Chinese character pinyin storage unit in step (1) further comprises:
Using the Chinese character as the key of a binary tree and its pinyin as the value; for a polyphonic character, adding one record to the binary tree for each pronunciation;
Setting up the Chinese character confusable-pinyin storage unit in step (1) further comprises:
Using each pinyin as the key of a binary tree and its confusable pinyin as the value; if a pinyin has several confusable pinyin, adding one record to the binary tree for each of them.
The beneficial effect of the invention is that, by introducing recognition of easily confused pronunciations, it addresses confusion between regional dialects and Mandarin. It uses similarity of pronunciation, such as retroflex versus flat-tongue initials and front versus back nasal finals, to correct mistakes in a user's Chinese input, which makes spelling correction more intelligent and user-friendly and improves the accuracy of Chinese character input.
Description of drawings
Fig. 1 is a structural diagram of a first pinyin check system with recognition of easily confused pronunciations according to the present invention;
Fig. 2 is a structural diagram of the dictionary pinyin index processing subunit of the present invention;
Fig. 3 is a structural diagram of a second pinyin check system with recognition of easily confused pronunciations according to the present invention;
Fig. 4 is a structural diagram of the dictionary pinyin index sub-files used by the pinyin check method with recognition of easily confused pronunciations;
Fig. 5 is a flowchart of a pinyin check method with recognition of easily confused pronunciations according to the present invention;
Fig. 6 is an application diagram of the dictionary pinyin index sub-file structure used by the pinyin check method with recognition of easily confused pronunciations.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a structural diagram of a first pinyin check system with recognition of easily confused pronunciations according to the present invention. The system comprises a file storage 100 and a pinyin check processing unit 200. The file storage 100 mainly stores the keywords and, for each Chinese character, the corresponding pinyin and easily confused pronunciations. The pinyin check processing unit 200 mainly annotates the input keyword with pinyin, looks up its easily confused pronunciations, and derives the corresponding words.
The file storage 100 is generally a memory, or a storage area allocated within a memory. Functionally, it mainly comprises the Chinese character pinyin storage unit 110, the Chinese character confusable-pinyin storage unit 120, and the dictionary storage unit 130.
The Chinese character pinyin storage unit 110 stores the standard pinyin of each Chinese character. Characters and their standard pinyin are stored in a fixed format, generally "Chinese character: pinyin"; if a character is polyphonic, its readings are separated by ",". Examples are "苹: ping" and "盛: sheng, cheng". Each storage entry holds only one Chinese character and its corresponding pinyin.
The Chinese character pinyin storage unit 110 could store the pinyin of each character sequentially in dictionary order, using an additional storage entry for each further reading of a polyphonic character. Pinyin lookup is slow when characters are stored this way. In embodiments of the present invention, the unit 110 therefore stores characters and their pinyin in a binary tree: the Chinese character is the key and its pinyin is the value. A polyphonic character is inserted once per reading, so a character with two readings has two records. Storing the data this way makes retrieving the pinyin of a character faster.
The Chinese character confusable-pinyin storage unit 120 stores the confusable pinyin corresponding to each pinyin in the Chinese character pinyin storage unit. Confusable pinyin are stored in a fixed format, generally "pinyin: confusable pinyin"; if a pinyin has several confusable pinyin, they are separated by ",". The pronunciation confusions caused by regional dialects generally fall into two types: retroflex versus flat-tongue initials, and front versus back nasal finals. Most of what unit 120 stores is therefore confusions of these two types, such as "ping: pin" and "sheng: shen, seng, shen".
The Chinese character confusable-pinyin storage unit 120 could store the confusable pinyin of each pinyin sequentially in some fixed order, using an additional storage entry for each further confusable pinyin. Lookup is slow when confusable pinyin are stored this way. In embodiments of the present invention, the unit 120 therefore stores each pinyin and its confusable pinyin in a binary tree: the pinyin is the key and its confusable pinyin is the value. If a pinyin has several confusable pinyin, each is inserted once, so there is one record per confusable pinyin. Storing the data this way makes retrieving the confusable pinyin of a given pinyin faster. The contents of the confusable-pinyin storage unit can be freely configured by the user according to actual needs.
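The two confusion types named above can be sketched as simple string rules. This is only an illustration under stated assumptions: the patent stores confusable pinyin in a user-configurable table and does not describe generating them by rule, and the function names `swapInitial`, `swapFinal`, and `confusables` are hypothetical. The rules cover zh/ch/sh versus z/c/s and an/en/in versus ang/eng/ing, and they may produce strings that are not valid syllables.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: toggle retroflex and flat-tongue initials.
static std::string swapInitial(const std::string& s) {
    if (s.size() >= 2 && s[1] == 'h' && (s[0] == 'z' || s[0] == 'c' || s[0] == 's'))
        return s.substr(0, 1) + s.substr(2);        // zh/ch/sh -> z/c/s
    if (!s.empty() && (s[0] == 'z' || s[0] == 'c' || s[0] == 's'))
        return s.substr(0, 1) + "h" + s.substr(1);  // z/c/s -> zh/ch/sh
    return s;
}

// Hypothetical helper: toggle front and back nasal finals.
static std::string swapFinal(const std::string& s) {
    const std::size_t n = s.size();
    auto isAEI = [](char c) { return c == 'a' || c == 'e' || c == 'i'; };
    if (n >= 3 && s.compare(n - 2, 2, "ng") == 0 && isAEI(s[n - 3]))
        return s.substr(0, n - 1);                  // ang/eng/ing -> an/en/in
    if (n >= 2 && s[n - 1] == 'n' && isAEI(s[n - 2]))
        return s + "g";                             // an/en/in -> ang/eng/ing
    return s;
}

// Every variant reachable by swapping the initial, the final, or both.
std::set<std::string> confusables(const std::string& syllable) {
    std::set<std::string> out;
    const std::string flipped = swapInitial(syllable);
    out.insert(flipped);
    out.insert(swapFinal(syllable));
    out.insert(swapFinal(flipped));
    out.erase(syllable);  // a syllable is not its own confusable
    return out;
}
```

Under these rules "ping" yields "pin", which matches the stored example. A table generated this way would still need manual review before being loaded into unit 120.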
The dictionary storage unit 130 stores the words that serve as candidates; it is essentially the set of all candidate words. The words are stored in a definite order, which may be dictionary order or some other order. To ease lookup, the storage address of each word can be recorded in advance, for example as an absolute memory address. The address information may also be the offset between the word's storage address and the first address of the dictionary storage unit 130. Given the storage address information of a word, the word can then be located quickly, which improves read speed.
The pinyin check processing unit 200 mainly performs the spell check on the input keyword. It is normally implemented by a programmed processor. Logically, it can be further divided into the Chinese character pinyin index processing subunit 210, the Chinese character confusable-pinyin index processing subunit 220, and the dictionary pinyin index processing subunit 230.
The Chinese character pinyin index processing subunit 210 receives the keyword input by the user and looks up the corresponding pinyin in the Chinese character pinyin storage unit 110. The subunit 210 could scan the storage unit 110 sequentially, but such a search is too slow. When the storage unit 110 stores the correspondence between characters and pinyin in a binary tree, the subunit 210 can search with a multimap. The multimap is a C++ standard library (std) container that organizes its keys as a balanced binary tree, so the value for a key can be obtained quickly. It allows several entries to have equal keys.
Specifically, when the data is saved, the Chinese character is the key of the multimap and its pinyin is the value. A polyphonic character is inserted once per reading. For example, "苹" has one record in the multimap, <苹, ping>, while the polyphonic "盛" has two: <盛, sheng> and <盛, cheng>.
In operation, the Chinese character pinyin index processing subunit 210 first obtains the input keyword and converts it into multimap keys. It then looks up the values in the balanced binary tree of the Chinese character pinyin storage unit 110 through the multimap, obtaining the pinyin of the keyword. This whole procedure is called pinyin annotation.
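The multimap lookup described above can be sketched with `std::multimap::equal_range`. The type alias `PinyinTable` and the functions `readingsOf` and `makeSampleTable` are hypothetical names, and the sample characters 苹 and 盛 are an assumed reading of the translated examples; characters are held as UTF-8 strings so that no decoding is needed.

```cpp
#include <map>
#include <string>
#include <vector>

// Key: one Chinese character (UTF-8); value: one pinyin reading.
using PinyinTable = std::multimap<std::string, std::string>;

// Sample data only; a real table would be loaded from storage unit 110.
PinyinTable makeSampleTable() {
    PinyinTable table;
    table.insert({"苹", "ping"});
    table.insert({"果", "guo"});
    table.insert({"盛", "sheng"});  // polyphonic: one record per reading
    table.insert({"盛", "cheng"});
    return table;
}

// Return every reading stored for one character (empty if unknown).
std::vector<std::string> readingsOf(const PinyinTable& table, const std::string& ch) {
    std::vector<std::string> readings;
    const auto range = table.equal_range(ch);  // all records with this key
    for (auto it = range.first; it != range.second; ++it)
        readings.push_back(it->second);
    return readings;
}
```

Since C++11, records with equal keys keep their insertion order, so the readings of a polyphonic character come back in the order they were stored. The same container shape would serve the confusable-pinyin table, with a pinyin as the key.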
The Chinese character confusable-pinyin index processing subunit 220 takes the pinyin provided by the Chinese character pinyin index processing subunit 210 and looks up its easily confused pronunciations in the Chinese character confusable-pinyin storage unit 120. The easily confused pronunciations include retroflex versus flat-tongue initials and front versus back nasal finals. The subunit 220 works on the same principle as the subunit 210, so the details are not repeated here.
Fig. 2 is a structural diagram of the dictionary pinyin index processing subunit of the present invention.
The dictionary pinyin index processing subunit 230 receives the pinyin provided by the Chinese character pinyin index processing subunit 210 and by the Chinese character confusable-pinyin index processing subunit 220, and searches the dictionary storage unit for the corresponding words. It is described in detail later and is only outlined here.
The system of the present invention also comprises an index storage space 300 for storing index information. The index storage space 300 comprises:
Chinese character pinyin index file 310: stores the index information used to obtain a pronunciation from the Chinese character pinyin storage unit 110 for a given Chinese character. It usually holds the index rule for finding a pronunciation in the storage unit 110, the storage address information of the storage unit 110, and so on. The index rule typically specifies the order in which the search proceeds. The index file 310 can be kept in a storage area allocated in memory, or it can be placed in the Chinese character pinyin index processing subunit 210; in other words, the index file 310 is logically optional.
Chinese character confusable-pinyin index file 320: stores the index information used to find the easily confused pronunciations of a given pinyin in the Chinese character confusable-pinyin storage unit 120. The index information comprises the index rule and the address information of the storage unit 120. Likewise, the index file 320 can be kept in a storage area allocated in memory, or it can be placed in the Chinese character confusable-pinyin index processing subunit 220.
Dictionary pinyin index file 330: stores the index information used to find, in the dictionary storage unit, all words matching a given pinyin. The dictionary pinyin index file 330 is described in detail below; it is only a preferred embodiment and does not limit the present invention.
The dictionary storage unit 130 can be sorted in ascending or descending order of the hash value of each word's pronunciation.
The dictionary pinyin index file 330 further comprises a pinyin hash value index sub-file 410 and a list address index sub-file 420, wherein:
Pinyin hash value index sub-file 410: stores, in ascending or descending order of pinyin hash value, the list address information in the list address index sub-file 420 that corresponds to each hash value;
List address index sub-file 420: stores, for each list address, the number of words that share the same pinyin and the storage address information of those words in the dictionary storage unit 130.
An application example of the dictionary pinyin index file 330 is given below.
Fig. 4 is a diagram of an application example of the dictionary pinyin index file 330. The pinyin hash value index sub-file 410 stores the correspondence between hash values and list addresses. Words whose computed hash values are identical share the same list address; in other words, the list address can be found from the hash value. The list address information may be an absolute memory address of the list, an offset address, or another form of address.
The list address index sub-file stores the number of words that share the hash value and the storage address information of each of those words in the dictionary storage unit 130.
For the above dictionary pinyin index file 330, the dictionary pinyin index processing subunit further comprises a hash calculation subunit 231, a hash value index processing subunit 232, a list address processing subunit 233, and a dictionary processing subunit 234, wherein:
Hash calculation subunit 231: calculates the hash value of a word's pinyin. The hash values form the basic information of each word's pinyin. The hash calculation subunit 231 uses a hash algorithm to obtain the hash value of each word's pinyin.
Hash value index processing subunit 232: uses the calculated hash value to find the corresponding list address.
List address processing subunit 233: uses the list address to find, in the list address index sub-file 420, the corresponding number of words and the storage address information of each word in the dictionary storage unit 130;
Dictionary processing subunit 234: uses the storage address information found by the list address processing subunit to find the corresponding words in the dictionary storage unit 130.
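The first two subunits, hashing a pinyin string and finding its list address in an index sorted by hash value, can be sketched as a binary search. The patent does not name a hash algorithm; 32-bit FNV-1a is used here purely as an assumption. The names `HashEntry`, `pinyinHash`, `buildHashIndex`, and `findListAddress` are hypothetical, and -1 is an arbitrary "not found" marker.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One record of the pinyin hash value index: hash -> list address.
struct HashEntry {
    std::uint32_t hash;
    long listAddress;
};

// 32-bit FNV-1a, an assumed stand-in for the unspecified hash algorithm.
std::uint32_t pinyinHash(const std::string& pinyin) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : pinyin) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Build the index and sort it in ascending order of hash value.
std::vector<HashEntry> buildHashIndex(
        const std::vector<std::pair<std::string, long>>& items) {
    std::vector<HashEntry> index;
    for (const auto& item : items)
        index.push_back({pinyinHash(item.first), item.second});
    std::sort(index.begin(), index.end(),
              [](const HashEntry& a, const HashEntry& b) { return a.hash < b.hash; });
    return index;
}

// Binary search over the sorted index; returns -1 when the hash is absent.
long findListAddress(const std::vector<HashEntry>& index, const std::string& pinyin) {
    const std::uint32_t h = pinyinHash(pinyin);
    const auto it = std::lower_bound(
        index.begin(), index.end(), h,
        [](const HashEntry& e, std::uint32_t value) { return e.hash < value; });
    return (it != index.end() && it->hash == h) ? it->listAddress : -1;
}
```

Keeping the index sorted is what allows the logarithmic search. As the text notes, words with the same hash value share one list address, so a real implementation would also need to decide how to treat two different pinyin strings that collide.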
Based on the above system, the present invention proposes a pinyin check method with recognition of easily confused pronunciations. Referring to Fig. 5, it comprises:
S1: setting up a dictionary storage unit that stores words, a Chinese character pinyin storage unit that stores the pinyin of Chinese characters, and a Chinese character confusable-pinyin storage unit that stores confusable pinyin.
In step S1, setting up the dictionary storage unit further comprises sorting the words in the dictionary storage unit in ascending or descending order of the hash value of each word's pronunciation.
Setting up the Chinese character pinyin storage unit further comprises:
Using the Chinese character as the key of a binary tree and its pinyin as the value; for a polyphonic character, adding one record to the binary tree for each pronunciation;
Setting up the Chinese character confusable-pinyin storage unit in step S1 further comprises:
Using each pinyin as the key of a binary tree and its confusable pinyin as the value; if a pinyin has several confusable pinyin, adding one record to the binary tree for each of them.
Step S1 also comprises:
Setting up the pinyin hash value index sub-file: storing, in ascending or descending order of pinyin hash value, the list address in the list address index sub-file that corresponds to each hash value;
Setting up the list address index sub-file: storing, for each list address, the number of words that share the same pinyin and the storage address information of those words in the dictionary storage unit.
S2: receive the key word of user's input, in said phonetic transcriptions of Chinese characters storage unit, search corresponding phonetic.Adopt multimap, the key word of input is transformed into the key of multimap, the value value of in the phonetic transcriptions of Chinese characters storage unit, searching balanced binary tree subsequently through the multimap of this module obtains the pairing phonetic of those key words.If there are a plurality of phonetics, then between a plurality of phonetics, cut apart with the space.
S3: receive the phonetic that said phonetic transcriptions of Chinese characters storage unit is sent, obscure searching out the corresponding phonetic of obscuring in the sound storage unit at said Chinese character.Wherein, said easy confusion tone comprises cacuminal/flat tongue consonant, pre-nasal sound/back nasal sound.Adopt multimap, each phonetic that phonetic transcriptions of Chinese characters index process subelement is provided is obscured the value value of searching balanced binary tree in the sound storage unit as the key of multimap at Chinese character, obtains that those phonetics are pairing obscures sound.
S4: Receive the pinyin provided by step S2 and step S3 respectively, and search the dictionary storage unit for the corresponding words.
Searching the dictionary storage unit for the corresponding words in step S4 further comprises:
Calculating the hash value of the pinyin of each word;
Using the calculated hash value to find the corresponding list address in the pinyin hash value index sub-file;
Using the list address to find, in the list address index sub-file, the corresponding number of words and the storage address information of each word in the dictionary storage unit;
Using the storage address information found for those words to find the corresponding words in the dictionary storage unit. The storage address information is the offset of the address relative to the first address.
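The four sub-steps of S4 can be sketched as below. Because the hash index is kept in sorted order, a binary search (`bisect`) is a natural fit, although the text itself does not prescribe the search algorithm. CRC-32 as the hash, the parallel-list layout, and the names `pinyin_hash` and `find_words` are all assumptions.

```python
import bisect
import zlib


def pinyin_hash(pinyin):
    """Stable hash of a pinyin string (CRC-32 is an assumed choice)."""
    return zlib.crc32(pinyin.encode("utf-8"))


def find_words(pinyin, hash_values, list_addresses, list_index, dictionary):
    """Look up all dictionary words that share the given pinyin.

    hash_values    - hash values sorted in ascending order
    list_addresses - list address for each hash value (parallel list)
    list_index     - flat records: [word count, offset, offset, ...]
    dictionary     - mapping offset -> word (stands in for first address + offset)
    """
    h = pinyin_hash(pinyin)                      # 1. hash the pinyin
    i = bisect.bisect_left(hash_values, h)       # 2. find it in the hash index
    if i == len(hash_values) or hash_values[i] != h:
        return []
    addr = list_addresses[i]
    count = list_index[addr]                     # 3. read the word count and offsets
    offsets = list_index[addr + 1:addr + 1 + count]
    return [dictionary[o] for o in offsets]      # 4. fetch the words by offset
```

A pinyin whose hash is absent from the index simply yields an empty result, which corresponds to "no candidate word".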
The above flow is illustrated below with a concrete example.
Referring to Fig. 6, it is a schematic diagram of the dictionary pinyin index file structure as applied when the pinyin checking method with confusable pinyin recognition of the present invention is used.
Suppose the dictionary storage unit 130 stores the words "apple" (pinyin "ping guo"), "article Guo" ("pin guo"), "rubber" ("xiang jiao"), "banana" ("xiang jiao") and "Zhejiang" ("zhe jiang"), and that the storage address information of each word is an offset address. For example, the offsets of "apple", "article Guo", "rubber", "banana" and "Zhejiang" relative to the first address PBase of the dictionary storage unit 130 are "20", "25", "30", "35" and "40" respectively.
The pinyin hash value index sub-file 410 stores, for hash(ping guo), hash(pin guo), hash(xiang jiao) and hash(zhe jiang), the corresponding address information in the list address index sub-file 420. This address information is an offset relative to the first address of the list address index sub-file 420; the offsets corresponding to hash(ping guo), hash(pin guo), hash(xiang jiao) and hash(zhe jiang) are "10", "12", "14" and "17" respectively.
In the list address index sub-file 420, the storage location at offset "10" stores the number of words with the pinyin "ping guo", which is 1, and the storage address information of that word in the dictionary storage unit 130 (offset 20); the location at offset "12" stores the number of words with the pinyin "pin guo", which is 1, and the storage address information of that word (offset 25); the location at offset "14" stores the number of words with the pinyin "xiang jiao", which is 2, and the storage address information of each word (offsets 30 and 35); and the location at offset "17" stores the number of words with the pinyin "zhe jiang", which is 1, and the storage address information of that word (offset 40).
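The record layout of the list address index sub-file 420 can be pictured with the numbers of this example, taking the dictionary offsets listed above ("rubber" at 30, "banana" at 35). The sketch assumes that every stored value occupies one address unit, which is consistent with the list addresses 10, 12, 14 and 17; the names `LIST_INDEX` and `read_record` are hypothetical.

```python
# List address index sub-file 420, modelled as offset -> stored integer.
# Each record is a word count followed by that many dictionary offsets.
LIST_INDEX = {
    10: 1, 11: 20,          # "ping guo":   1 word at offset 20
    12: 1, 13: 25,          # "pin guo":    1 word at offset 25
    14: 2, 15: 30, 16: 35,  # "xiang jiao": 2 words at offsets 30 and 35
    17: 1, 18: 40,          # "zhe jiang":  1 word at offset 40
}


def read_record(list_address):
    """Return the dictionary offsets stored in the record at list_address."""
    count = LIST_INDEX[list_address]
    return [LIST_INDEX[list_address + 1 + k] for k in range(count)]


print(read_record(14))  # [30, 35]
```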
When the Chinese character confusable pinyin storage unit is set up, "ping" is recorded among the pinyin easily confused with "pin".
Suppose the user wants to input "apple" but, because of inaccurate pronunciation, inputs "assembly Guo" (pinyin "pin guo") instead. The Chinese character pinyin storage unit is searched first, yielding the pinyin "pin" and "guo". The Chinese character confusable pinyin storage unit is then searched, yielding "ping" as the confusable pinyin of "pin". Next, the hash values of "ping guo" and "pin guo" are calculated and looked up in the pinyin hash value index sub-file 410, giving the corresponding address information (offsets 10 and 12). Looking these up in the list address index sub-file 420 gives the corresponding address information in the dictionary storage unit 130 (offsets 20 and 25), from which the words "apple" and "article Guo" are found. The user is then asked whether the intended input is one of those words, thereby reducing misspellings.
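The whole example can be traced in a few lines. This is a simplified sketch rather than the patented implementation: the hash index is keyed directly by the pinyin string instead of its hash value for readability, the trace starts from the syllables already found in step S2, and the word labels are the machine-translated names used in this example. All identifiers (`CONFUSABLE`, `HASH_INDEX`, `LIST_INDEX`, `DICTIONARY`, `suggest`) are hypothetical.

```python
from itertools import product

CONFUSABLE = {"pin": ["ping"]}

# Pinyin -> list address (keyed by pinyin rather than hash value, for readability).
HASH_INDEX = {"ping guo": 10, "pin guo": 12, "xiang jiao": 14, "zhe jiang": 17}

# List address -> offsets of the words in the dictionary storage unit.
LIST_INDEX = {10: [20], 12: [25], 14: [30, 35], 17: [40]}

# Offset relative to PBase -> word.
DICTIONARY = {20: "apple", 25: "article Guo", 30: "rubber", 35: "banana", 40: "Zhejiang"}


def suggest(syllables):
    """Return candidate words for the typed pinyin and its confusable variants."""
    options = [[s] + CONFUSABLE.get(s, []) for s in syllables]
    words = []
    for combo in product(*options):
        list_address = HASH_INDEX.get(" ".join(combo))
        if list_address is None:
            continue
        words.extend(DICTIONARY[o] for o in LIST_INDEX[list_address])
    return words


print(suggest(["pin", "guo"]))  # ['article Guo', 'apple']
```

Both the literal match and the confusable match are returned, so the user can pick the intended word.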
The above discloses only several specific embodiments of the present invention, but the present invention is not limited thereto; any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims (6)

1. A pinyin checking system with confusable pinyin recognition, used to obtain the candidate entries corresponding to an input keyword, characterized in that it comprises a file storage space and a pinyin check processing unit, wherein
the file storage space comprises:
a dictionary storage unit, used to store words as candidate objects,
a Chinese character pinyin storage unit, used to store the standard pinyin corresponding to Chinese characters, and
a Chinese character confusable pinyin storage unit, used to store the confusable pinyin corresponding to each pinyin in the Chinese character pinyin storage unit;
the pinyin check processing unit comprises:
a Chinese character pinyin index processing subunit, used to receive the keyword input by the user and look up the corresponding pinyin in the Chinese character pinyin storage unit;
a Chinese character confusable pinyin index processing subunit, used to receive the pinyin sent by the Chinese character pinyin storage unit and look up the corresponding confusable pinyin in the Chinese character confusable pinyin storage unit;
a dictionary pinyin index processing subunit, used to receive the pinyin provided by the Chinese character pinyin index processing subunit and the Chinese character confusable pinyin index processing subunit, and to search the dictionary storage unit for the corresponding words; the system further comprises an index storage space, and the index storage space comprises:
a dictionary pinyin index file, used to save the index information for finding, in the dictionary storage unit, all words corresponding to a pinyin,
the dictionary storage unit being sorted in ascending or descending order of the hash value of each word's pronunciation;
the dictionary pinyin index file further comprises a pinyin hash value index sub-file and a list address index sub-file, wherein
the pinyin hash value index sub-file is used to save, ordered by pinyin hash value in ascending or descending order, the list address in the list address index sub-file corresponding to each hash value;
the list address index sub-file is used to save, for each list address, the number of words sharing the same pinyin and the storage address information of those words in the dictionary storage unit.
2. The system of claim 1, wherein the dictionary pinyin index processing subunit further comprises:
a hash calculation subunit, used to calculate the hash value of the pinyin of a word;
a hash value index processing subunit, used to find, in the pinyin hash value index sub-file, the list address corresponding to the calculated hash value;
a list address processing subunit, used to find, in the list address index sub-file, the number of words corresponding to the list address and the storage address information of each word in the dictionary storage unit;
a dictionary processing subunit, used to find the corresponding words in the dictionary storage unit according to the storage address information found by the list address processing subunit.
3. The system of claim 1, characterized in that it further comprises:
a Chinese character pinyin index file, used to save the index information for obtaining the pronunciation of a Chinese character from the Chinese character pinyin storage unit;
a Chinese character confusable pinyin index file, used to save the index information for finding, in the Chinese character confusable pinyin storage unit, the confusable pinyin corresponding to a pinyin.
4. A pinyin checking method with confusable pinyin recognition, characterized in that it comprises the steps of:
(1) setting up a dictionary storage unit that stores words, a Chinese character pinyin storage unit that stores the pinyin of Chinese characters, and a Chinese character confusable pinyin storage unit that stores confusable pinyin; setting up the dictionary storage unit in step (1) further comprises sorting the words in the dictionary storage unit in ascending or descending order of the hash value of each word's pronunciation; and step (1) further comprises:
setting up a pinyin hash value index sub-file: for each pinyin hash value, the corresponding list address in the list address index sub-file is saved, ordered by hash value in ascending or descending order;
setting up a list address index sub-file: for each list address, the number of words sharing the same pinyin is saved together with the storage address information of those words in the dictionary storage unit;
(2) receiving the keyword input by the user and looking up the corresponding pinyin in the Chinese character pinyin storage unit;
(3) receiving the pinyin sent by the Chinese character pinyin storage unit and looking up the corresponding confusable pinyin in the Chinese character confusable pinyin storage unit;
(4) receiving the pinyin provided by step (2) and step (3) respectively, and searching the dictionary storage unit for the corresponding words, wherein searching the dictionary storage unit for the corresponding words in step (4) further comprises:
calculating the hash value of the pinyin of each word;
using the calculated hash value to find the corresponding list address in the pinyin hash value index sub-file;
using the list address to find, in the list address index sub-file, the corresponding number of words and the storage address information of each word in the dictionary storage unit;
finding the corresponding words in the dictionary storage unit according to the storage address information found for those words.
5. The method of claim 4, characterized in that the storage address information is the offset of the address relative to the first address.
6. The method of claim 4, characterized in that
setting up the Chinese character pinyin storage unit in step (1) further comprises:
using each Chinese character as a key of the binary tree and its pinyin as the value, and, for a polyphonic character, adding one record to the binary tree for each additional pronunciation;
setting up the Chinese character confusable pinyin storage unit in step (1) further comprises:
using each pinyin as a key of the binary tree and its confusable pinyin as the value, and, if a pinyin has several confusable pinyin, adding one record to the binary tree for each of them.
CN2007101494831A 2007-09-13 2007-09-13 Phonetic check system and method with easy confusion tone recognition Active CN101388012B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2007101494831A CN101388012B (en) 2007-09-13 2007-09-13 Phonetic check system and method with easy confusion tone recognition
HK09108175.1A HK1128541A1 (en) 2007-09-13 2009-09-07 System and method for pinyin checking with confusable pinyin recognition

Publications (2)

Publication Number Publication Date
CN101388012A CN101388012A (en) 2009-03-18
CN101388012B true CN101388012B (en) 2012-05-30

Family

ID=40477438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101494831A Active CN101388012B (en) 2007-09-13 2007-09-13 Phonetic check system and method with easy confusion tone recognition

Country Status (2)

Country Link
CN (1) CN101388012B (en)
HK (1) HK1128541A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1339754A (en) * 2000-08-22 2002-03-13 英业达集团(上海)电子技术有限公司 Chinese character identifying method and system with correcting function
CN1556458A (en) * 2004-01-05 2004-12-22 郑 方 Chinese whole sentence input method

Also Published As

Publication number Publication date
HK1128541A1 (en) 2009-10-30
CN101388012A (en) 2009-03-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1128541

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1128541

Country of ref document: HK

TR01 Transfer of patent right

Effective date of registration: 20211122

Address after: No. 699, Wangshang Road, Binjiang District, Hangzhou, Zhejiang

Patentee after: Alibaba (China) Network Technology Co.,Ltd.

Address before: P.O. Box 847, Fourth Floor, One Capital Place, Grand Cayman, Cayman Islands

Patentee before: ALIBABA GROUP HOLDING Ltd.