CN101770478B - Data retrieval method, data retrieval engine and embedded terminal - Google Patents

Info

Publication number
CN101770478B
Authority
CN
China
Prior art keywords
chinese character
data
participle
target data
phonetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200810240889
Other languages
Chinese (zh)
Other versions
CN101770478A (en)
Inventor
吴跃进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Autonavi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Autonavi Information Technology Co Ltd
Priority to CN 200810240889
Publication of CN101770478A
Application granted
Publication of CN101770478B
Legal status: Active
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides a data retrieval method, which comprises the following steps: 1, receiving a Chinese character input by a user, and if the Chinese character is not the first Chinese character currently input by the user but is the Nth Chinese character, entering a second step, wherein N is greater than or equal to 2; 2, reading a participle subset and target data where the participle subset belongs from a pre-acquired (N-1)th retrieval result set; 3, judging whether the Nth Chinese character is the same as a first character of participles in the participle subset or not, and if so, entering a fourth step; and 4, correspondingly storing the target data and the participle subset read in the second step into an Nth retrieval result set. Correspondingly, the invention also provides a data retrieval device for implementing the method and an embedded terminal provided with the data retrieval device. Through the data retrieval technology provided by the invention, the target data can be quickly and accurately retrieved from a mass target data set.

Description

Data retrieval method, data retrieval device and embedded terminal
Technical field
The present invention relates to the field of data retrieval technology, and in particular to a data retrieval method, a data retrieval device and an embedded terminal.
Background technology
Since the mid-to-late 1980s, as computer processing power has greatly improved and computers have become widespread, research on data retrieval technology has entered a golden period of rapid development, and new retrieval techniques and practical systems keep emerging. One example is text retrieval technology, which has refined and applied three popular data retrieval models: the Boolean model, the probabilistic model and the vector space model.
In researching and practising existing text retrieval technology, the inventor found a problem that arises when the adjacency of the Chinese characters in the sequence input by the user does not match the target data. Existing text retrieval technology retrieves target data by matching the content of the Chinese character sequence against the content of the target data, and does not analyze the retrieved target data in terms of the semantics of the Chinese character sequence. It therefore cannot guarantee that target data is retrieved quickly and accurately from a massive target data set.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a data retrieval method, a data retrieval device and an embedded terminal that can accurately retrieve target data from a massive target data set.
To solve the above technical problem, the object of the invention is achieved through the following technical solutions.
An embodiment of the invention provides a data retrieval method, comprising:
Step 1: receive a Chinese character input by the user; if the Chinese character is not the first Chinese character currently input by the user but the Nth Chinese character, N ≥ 2, go to step 2;
Step 2: read a participle subset and the target data it belongs to from the (N-1)th retrieval result set obtained in advance;
Step 3: determine whether the Nth Chinese character is the same as the first character of a participle in the participle subset; if so, go to step 4;
Step 4: save the target data and the participle subset read in step 2, in correspondence, into the Nth retrieval result set.
Correspondingly, an embodiment of the invention also provides a data retrieval device, comprising:
a Chinese character receiving unit, used to receive a Chinese character input by the user;
a Chinese character order judging unit, used to trigger the retrieval result reading unit if it determines that the Chinese character received by the Chinese character receiving unit is not the first Chinese character currently input by the user but the Nth Chinese character, N ≥ 2;
the retrieval result reading unit, used to read a participle subset and the target data it belongs to from the (N-1)th retrieval result set obtained in advance;
a first Chinese character matching unit, used to determine whether the Nth Chinese character is the same as the first character of a participle in the participle subset read by the retrieval result reading unit and, if so, to trigger the Nth retrieval result storage unit;
the Nth retrieval result storage unit, used to save the target data and the participle subset read by the retrieval result reading unit, in correspondence, into the Nth retrieval result set.
Correspondingly, an embodiment of the invention also provides an embedded terminal comprising any of the data retrieval devices described above.
The above technical solutions have the following beneficial effects:
The embodiments of the invention provide a data retrieval technique that, on receiving a Chinese character input by the user, first determines whether it is the first Chinese character currently input by the user. If it is not the first but the Nth Chinese character, the technique reads a participle subset and the target data it belongs to from the (N-1)th retrieval result set obtained in advance, and then determines whether the Nth Chinese character is the same as the first character of a participle in the participle subset. If so, the target data and the participle subset that were read are saved, in correspondence, into the Nth retrieval result set. Because the technique matches the Chinese character input by the user against the first characters of the participles in the participle subset, it obtains target data more accurately from the (N-1)th retrieval result set, and thus retrieves target data quickly and accurately from a massive target data set.
Description of drawings
Fig. 1 is a flowchart of the data retrieval method provided by the first embodiment of the invention;
Fig. 2 is a flowchart of the data retrieval method provided by the second embodiment of the invention;
Fig. 3 is a flowchart of the multi-level retrieval index construction method provided by the third embodiment of the invention;
Fig. 4 is a schematic diagram of the first-level retrieval index data cluster set provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the second-level retrieval index data set provided by an embodiment of the invention;
Fig. 6 is a schematic diagram of the composition of the data retrieval device provided by the fourth embodiment of the invention.
Embodiment
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions provided by the embodiments are described in detail below with reference to the accompanying drawings.
The first embodiment of the invention provides a data retrieval method. Referring to Fig. 1, the method comprises the following steps:
Step 101: receive a Chinese character Ch_Word input by the user; if Ch_Word is not the first Chinese character currently input by the user but the Nth Chinese character, N ≥ 2, go to step 102;
Step 102: read a participle subset and the target data it belongs to from the (N-1)th retrieval result set obtained in advance;
Step 103: determine whether the Nth Chinese character is the same as the first character of a participle in the participle subset; if so, go to step 105; if not, go to step 104;
Step 104: determine whether the Nth Chinese character is the same as the Nth Chinese character in the participle set; if so, go to step 105; if not, return to step 102;
Step 105: save the target data and the participle subset read in step 102, in correspondence, into the Nth retrieval result set.
The above is the data retrieval method provided by the first embodiment of the invention. Because the method matches the Chinese character input by the user against the first characters of the participles in the participle subset, it obtains target data more accurately from the (N-1)th retrieval result set obtained in advance, and thus retrieves target data accurately from a massive target data set. The method can be called cross-word retrieval: retrieving target data when the adjacency of the Chinese characters in the sequence input by the user does not match the target data, in other words, when the characters of the input sequence are not contiguous in the target data and span several words (or phrases) that each have their own meaning.
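Steps 102 to 105 can be sketched as a filter over the previous result set. This is a minimal illustration under assumptions, not the patented implementation: the in-memory `(participle subset, target data)` pairs, the function name `refine` and the sample data are hypothetical, and step 104 is read as comparing against the Nth character of the concatenated participle subset.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a result set entry:
# (participle subset, target data the subset belongs to)
Entry = Tuple[List[str], str]


def refine(prev_results: List[Entry], ch: str, n: int) -> List[Entry]:
    """Build the Nth result set from the (N-1)th one for the Nth input character."""
    nth_results: List[Entry] = []
    for subset, target in prev_results:                # step 102: read one entry
        first_chars = {p[0] for p in subset if p}
        if ch in first_chars:                          # step 103: first character of a participle
            nth_results.append((subset, target))       # step 105
            continue
        joined = "".join(subset)
        if len(joined) >= n and joined[n - 1] == ch:   # step 104: Nth character (assumed reading)
            nth_results.append((subset, target))       # step 105
    return nth_results


# Hypothetical first result set for the input character "北"
FIRST_RESULTS: List[Entry] = [
    (["北京市", "海淀区", "人民", "法院"], "北京市海淀区人民法院"),
    (["北海", "公园"], "北海公园"),
]
```

With this data, `refine(FIRST_RESULTS, "法", 2)` keeps only the court entry even though "法" is not adjacent to "北" in the target data, which is the cross-word behaviour described above.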
Analysis of the method provided by the first embodiment shows that, if the user inputs two Chinese characters in succession, the method of Fig. 1 is applied to the first retrieval result set to obtain the second retrieval result set. Obtaining the second retrieval result set therefore requires the first retrieval result set to be obtained beforehand. The second embodiment of the invention accordingly provides a data retrieval method for obtaining the first retrieval result set, although the first retrieval result set may also be obtained by other methods.
Referring to Fig. 2, a flowchart of the data retrieval method provided by the second embodiment of the invention, the method comprises:
Step 201: receive a Chinese character Ch_Word input by the user;
Step 202: determine whether Ch_Word is the first Chinese character currently input by the user (in the search box); if so, go to step 203; if not, go to step 102;
In practice, all data related to the retrieval process can be cleared after a retrieval is completed. Whether Ch_Word is the first Chinese character of the user's current retrieval can therefore be determined as follows:
Check whether the cache holds any data related to a retrieval process. If not, the Chinese character just input in the search box is the first Chinese character of the current input; if so, it is not the first.
Step 203: resolve the pinyin Spell_Ch of the Chinese character Ch_Word;
Step 204: extract the first letter (initial) C_En and the final (simple or compound vowel) R_En from the pinyin Spell_Ch;
Step 205: from the preset second-level retrieval index data set, or from the mapping table built from it (its notation is a formula image in the original), obtain the start position offset of the first letter C_En in the first-level retrieval index data cluster set;
Step 206: in the preset first-level retrieval index data cluster set, starting from the position given by the start position offset obtained in step 205, find the start position offset and the end position offset, within the cluster set, of the first-level retrieval index data cluster that corresponds to the combination of C_En and R_En. The end position offset of this cluster is the start position offset of the cluster for the next first-letter and final combination immediately following the combination of C_En and R_En, or that start position offset minus 1.
Step 207: starting from the position given by the cluster's start position offset found in step 206, read the first-level retrieval indexes in the preset first-level retrieval index data cluster set one by one. Each first-level retrieval index contains a participle subset and the start position offset, in the preset target data set, of the target data it belongs to;
Step 208: determine whether the first Chinese character of the participle subset read in step 207 is the same as the first Chinese character input by the user; if so, go to step 209; if not, go to step 211;
Step 209: from the first-level retrieval index read in step 207, read the start position offset of the target data in the preset target data set;
Step 210: in the target data set, read a target data item from the position given by the start position offset read in step 209, save this target data and the participle subset of the first-level retrieval index read in step 207, in correspondence, into the first retrieval result set, and go to step 211;
Step 211: add 1 to the start position offset, within the first-level retrieval index data cluster set, of the first-level retrieval index read in step 207, and compare the result with the cluster's end position offset found in step 206. If it is less than the end position offset, return to step 207; if it is greater than or equal to the end position offset, go to step 212;
Step 212: output the target data in the first retrieval result set.
The above is the data retrieval method provided by the second embodiment of the invention, which obtains the first retrieval result set from the first Chinese character currently input by the user. Once the first retrieval result set has been obtained, if the user inputs no further Chinese characters, the target data in the first retrieval result set is presented to the user as the final retrieval result. If the user goes on to input a second, a third, ..., an Nth Chinese character, the background starts the data retrieval method provided by the first embodiment and keeps presenting the target data retrieved for the second, third, ..., Nth Chinese character. The result finally presented is the target data retrieved for the last Chinese character the user input.
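The lookup in steps 205 to 211 can be sketched as follows. This is an illustrative reading only: the flat list layout, the tuple shapes, the function name `first_retrieval` and all sample data are assumptions, offsets are plain list positions, and the first letter and final of the input character are passed in already extracted. The entry for "贝壳" is included because "贝" shares the pinyin "bei" with "北", which shows why step 208 still compares characters inside a cluster.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]                    # (first letter, final), e.g. ("b", "ei")
IndexEntry = Tuple[Key, List[str], int]  # (cluster key, participle subset, target offset)
Result = Tuple[List[str], str]           # (participle subset, target data)

# Hypothetical sample data; entries of one cluster are stored contiguously.
TARGETS = ["北京市海淀区人民法院", "北海公园", "贝壳"]
CLUSTER_SET: List[IndexEntry] = [
    (("b", "ei"), ["北京市", "海淀区", "人民", "法院"], 0),
    (("b", "ei"), ["北海", "公园"], 1),
    (("b", "ei"), ["贝壳"], 2),
    (("f", "a"), ["法院"], 0),
    (("g", "ong"), ["公园"], 1),
    (("h", "ai"), ["海淀区", "人民", "法院"], 0),
    (("r", "en"), ["人民", "法院"], 0),
]
# Second-level index: first letter -> start offset in CLUSTER_SET
LETTER_OFFSETS: Dict[str, int] = {"b": 0, "f": 3, "g": 4, "h": 5, "r": 6}


def first_retrieval(ch: str, first_letter: str, final: str,
                    letter_offsets: Dict[str, int],
                    cluster_set: List[IndexEntry],
                    targets: List[str]) -> List[Result]:
    """Return the first result set for the first input character."""
    results: List[Result] = []
    key = (first_letter, final)
    pos = letter_offsets.get(first_letter)           # step 205
    if pos is None:
        return results
    # Step 206: find where the cluster for this letter and final begins and ends.
    while pos < len(cluster_set) and cluster_set[pos][0] != key:
        if cluster_set[pos][0][0] != first_letter:   # left this letter's block
            return results
        pos += 1
    start = end = pos
    while end < len(cluster_set) and cluster_set[end][0] == key:
        end += 1
    # Steps 207 to 211: scan the cluster and keep entries whose first character matches.
    for i in range(start, end):
        _, subset, offset = cluster_set[i]
        if subset[0][0] == ch:                        # step 208
            results.append((subset, targets[offset])) # steps 209 and 210
    return results
```

Here the end offset is exclusive, which corresponds to the "start position offset of the next cluster" variant mentioned in step 206.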
It should be noted that, in practice, the data retrieval method provided by the second embodiment can be triggered in various ways. In the first way, the background retrieval program starts retrieving as soon as it receives the first Chinese character currently input by the user. In the second way, the user first inputs a complete Chinese character sequence and then manually triggers the background retrieval program; in this case the method provided by the second embodiment still processes the Chinese characters of the sequence one by one.
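The two trigger modes, together with the cache check of step 202, might be wired up as in the sketch below. The class `RetrievalSession`, its method names and the two injected callables are hypothetical; the callables stand in for the first-character retrieval of the second embodiment and the refinement of the first embodiment.

```python
from typing import Callable, List


class RetrievalSession:
    """Dispatches each input character to the first retrieval or to refinement."""

    def __init__(self,
                 first_retrieval: Callable[[str], list],
                 refine: Callable[[list, str, int], list]) -> None:
        self._first = first_retrieval
        self._refine = refine
        self.result_sets: List[list] = []   # cache; empty means no retrieval in progress

    def on_char(self, ch: str) -> list:
        """First trigger mode: called once per character as the user types."""
        if not self.result_sets:                         # step 202: nothing cached
            current = self._first(ch)
        else:
            n = len(self.result_sets) + 1                # this is the Nth character
            current = self._refine(self.result_sets[-1], ch, n)
        self.result_sets.append(current)
        return current

    def on_sequence(self, text: str) -> list:
        """Second trigger mode: a whole sequence, still handled character by character."""
        current: list = []
        for ch in text:
            current = self.on_char(ch)
        return current

    def finish(self) -> None:
        """Clear all retrieval data once a retrieval is completed."""
        self.result_sets.clear()
```

Keeping every intermediate result set is a design choice of this sketch; only the most recent one is needed to process the next character.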
Analysis of the method provided by the second embodiment shows that obtaining the first retrieval result set requires a multi-level retrieval index to be built in advance. The multi-level retrieval index construction method provided by the embodiments of the invention is described in detail below with reference to the accompanying drawings.
Referring to Fig. 3, the multi-level retrieval index construction method provided by the third embodiment of the invention comprises the following steps:
Step 301: perform word segmentation on each target data item in the preset target data set to obtain the participle set of each target data item;
In practice, forward or reverse maximum matching segmentation can be applied to each target data item in the target data set to obtain a participle set without redundancy. A participle set without redundancy is one in which the participles are mutually independent and, joined end to end, reproduce the target data as it was before segmentation, with no extra Chinese characters. For example, applying forward maximum matching to "Haidian District, Beijing City people's court" (北京市海淀区人民法院) yields a participle set without redundancy containing four independent participles: "Beijing", "Haidian District", "people" and "law court".
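Forward maximum matching can be sketched in a few lines. The dictionary below is a hypothetical excerpt chosen to reproduce the example, and falling back to a single character when no dictionary word matches is an assumption of this sketch rather than something the text specifies.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Segment text greedily, always taking the longest dictionary word at each position."""
    max_len = max((len(word) for word in dictionary), default=1)
    participles: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback so the loop always advances.
            if size == 1 or piece in dictionary:
                participles.append(piece)
                i += size
                break
    return participles


# Hypothetical dictionary excerpt
DICTIONARY = {"北京", "北京市", "海淀区", "人民", "法院"}
```

Because every character of the input ends up in exactly one participle, joining the result end to end always reproduces the original text, which is the "no redundancy" property described above.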
Step 302: resolve the pinyin of each Chinese character of each target data item in the target data set;
Specifically, the target data items are read from the target data set one by one, the character code of each Chinese character of the item currently read is obtained, and the pinyin of the character is then obtained from the mapping between character codes and pinyin recorded for the simplified Chinese character set (GB2312) or the national standard extension (GBK).
For example, the character code of "北" ("north") is "B1B1", which corresponds to the pinyin "bei" in GB2312; the character code of "京" ("capital") is "BEA9", which corresponds to the pinyin "jing" in GB2312. Using the mapping between character codes and pinyin recorded for GB2312, the pinyin resolved for "北" and "京" are therefore "bei" and "jing" respectively.
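The code lookup can be illustrated with Python's built-in `gb2312` codec. The table `PINYIN_BY_CODE` is a hypothetical two-entry excerpt holding only the pairs quoted above; a real table would have to cover the whole character set and is assumed to be supplied separately.

```python
# Hypothetical excerpt of a character-code-to-pinyin table
PINYIN_BY_CODE = {
    "B1B1": "bei",   # 北
    "BEA9": "jing",  # 京
}


def hanzi_code(ch: str) -> str:
    """Return the GB2312 code of one Chinese character as upper-case hex."""
    return ch.encode("gb2312").hex().upper()


def pinyin_of(ch: str) -> str:
    """Look up the pinyin of a character through its GB2312 code."""
    return PINYIN_BY_CODE[hanzi_code(ch)]
```

For instance, `[pinyin_of(c) for c in "北京"]` resolves the two characters of the example one by one.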
Step 303: extract the first letter (initial) and the final (the simple or compound vowel) of each pinyin obtained in step 302;
A pinyin syllable takes one of three forms:
1. an initial consonant plus a final, where the initial consonant is a single letter;
2. an initial consonant plus a final, where the initial consonant is two letters;
3. a final only, with no initial consonant.
In the embodiments of the invention, the first letter extracted from a pinyin is the first letter of its initial consonant. Specifically: when the pinyin consists of an initial consonant and a final and the initial consonant is a single letter, the extracted first letter is the initial consonant itself; when the initial consonant is two letters, the extracted first letter is the first letter of the initial consonant; when the pinyin has only a final and no initial consonant, only the final can be extracted and there is no first letter.
Another important property of pinyin composition is that an initial consonant never begins with the same letter as a final. There are 23 initial consonants (b, m, d, n, g, h, q, zh, sh, z, s, y, p, f, t, l, k, j, x, ch, r, c, w) and 34 finals (such as a, o, e, ..., uang, iong, uan). Based on these properties, the process of extracting the first letter and the final from a pinyin specifically comprises:
First step: determine whether the first letter of the pinyin is the first letter of one of the 23 initial consonants (b, m, d, ..., r, c, w). If so, go to the second step. If not, the pinyin has only a final, the whole pinyin is the final, and so only the final can be extracted and there is no first letter;
Second step: determine whether the second letter of the pinyin is the character 'h'.
If so, the initial consonant is a two-letter consonant (one of ch, sh, zh): the first two letters of the pinyin are the initial consonant and all letters after them form the final. The first letter of the pinyin is extracted as its first letter, and all letters after the second letter are extracted as its final;
If not, the first letter of the pinyin is its initial consonant and all letters after it form the final. The first letter is extracted, and all letters after it are extracted as the final.
For example, the pinyin of "北" ("north") is "bei", whose first letter is "b" and whose final is "ei"; the pinyin of "市" ("city") is "shi", whose first letter is "s" and whose final is "i".
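The two-step extraction can be written directly from the rules above. The function name `split_pinyin` and the use of an empty string for "no first letter" are choices made for this sketch.

```python
from typing import Tuple

# First letters of the 23 initial consonants listed in the text (zh, ch, sh share z, c, s)
INITIAL_FIRST_LETTERS = set("bmdnghqzsypftlkjxcrw")


def split_pinyin(pinyin: str) -> Tuple[str, str]:
    """Return (first letter, final); the first letter is "" when there is no initial consonant."""
    # First step: does the pinyin start with the first letter of an initial consonant?
    if not pinyin or pinyin[0] not in INITIAL_FIRST_LETTERS:
        return "", pinyin                 # the whole pinyin is the final
    # Second step: a second letter 'h' marks a two-letter initial consonant (zh, ch, sh).
    if len(pinyin) > 1 and pinyin[1] == "h":
        return pinyin[0], pinyin[2:]
    return pinyin[0], pinyin[1:]
```

Note that under these rules "shi" and a syllable such as "si" both yield the pair ("s", "i"), so they share a first-letter and final combination.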
Step 304: from the participle set of each target data item, obtain the participle subsets of that item; the number of participle subsets equals the number of participles in the participle set;
Suppose the participle set of a target data item is <w_i, w_{i+1}, ..., w_{n-1}, w_n | 1 ≤ i ≤ n>; then the participle subset C_{W_i} is:
C_{W_i} = {<w_i, w_{i+1}, ..., w_{n-1}, w_n> | 1 ≤ i ≤ n}, where n is the number of participles in the participle set. That is, the i-th participle subset is the suffix of the participle set that begins at w_i.
Step 305: add to each of the above participle subsets the start position offset, in the target data set, of the target data it belongs to, obtaining the first-level retrieval indexes of that target data;
Step 306: save each first-level retrieval index obtained in step 305 into the first-level retrieval index data cluster that corresponds to the combination of the first letter and the final of the pinyin of the first Chinese character of that index;
Steps 304 to 306 are illustrated below using the participle set of "Haidian District, Beijing City people's court" described above.
The participle set of "Haidian District, Beijing City people's court" contains four independent participles, "Beijing", "Haidian District", "people" and "law court", from which four participle subsets are obtained: <Beijing; Haidian District; people; law court>, <Haidian District; people; law court>, <people; law court> and <law court>, where ";" separates the participles. The start position offset in the target data set of the target data "Haidian District, Beijing City people's court", to which these four participle subsets belong, is 1348. Adding 1348 to the four participle subsets yields the four first-level retrieval indexes of "Haidian District, Beijing City people's court": <Beijing; Haidian District; people; law court, 1348>, <Haidian District; people; law court, 1348>, <people; law court, 1348> and <law court, 1348>. The index <Beijing; Haidian District; people; law court, 1348> is saved in the first-level retrieval index data cluster corresponding to the combination of the first letter "b" and the final "ei" of the pinyin of "北" ("north"); the other three indexes are saved in the clusters corresponding to the first-letter and final combinations "h+ai", "r+en" and "f+a" of the pinyin of "海" ("sea"), "人" ("person") and "法" ("law") respectively.
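Steps 304 to 306 amount to generating every suffix of the participle set and filing it under the pinyin of its first character. In the sketch below the dictionary of clusters, the function names and the four-entry pinyin table are hypothetical, and the Chinese participles are assumed to be the originals behind the translated example.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Key = Tuple[str, str]                 # (first letter, final)
FirstLevelIndex = Tuple[List[str], int]  # (participle subset, target offset)

# Hypothetical excerpt: first character -> (first letter, final) of its pinyin
FIRST_LETTER_AND_FINAL: Dict[str, Key] = {
    "北": ("b", "ei"), "海": ("h", "ai"), "人": ("r", "en"), "法": ("f", "a"),
}


def participle_subsets(participles: List[str]) -> List[List[str]]:
    """Step 304: one subset per participle, each a suffix of the participle set."""
    return [participles[i:] for i in range(len(participles))]


def add_to_clusters(clusters: DefaultDict[Key, List[FirstLevelIndex]],
                    participles: List[str], target_offset: int) -> None:
    """Steps 305 and 306: attach the target offset and file each index in its cluster."""
    for subset in participle_subsets(participles):
        key = FIRST_LETTER_AND_FINAL[subset[0][0]]   # pinyin of the subset's first character
        clusters[key].append((subset, target_offset))


CLUSTERS: DefaultDict[Key, List[FirstLevelIndex]] = defaultdict(list)
add_to_clusters(CLUSTERS, ["北京市", "海淀区", "人民", "法院"], 1348)
```

After this call `CLUSTERS` holds one index in each of the four clusters named in the example, all pointing at offset 1348.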
Step 307: obtain the start position offset of each of the 26 English letters (a, b, c, ..., x, y, z) in the first-level retrieval index data cluster set (the offset symbol is a formula image in the original);
The first-level retrieval index data cluster set is the set formed by the first-level retrieval index data clusters described above;
See Fig. 4, a schematic diagram of the first-level retrieval index data cluster set.
Step 308: save the 26 English letters together with their start position offsets in the first-level retrieval index data cluster set, obtaining the second-level retrieval index data set.
Further, to speed up obtaining the start position offsets of the 26 English letters in the first-level retrieval index data cluster set, the following step can be performed after step 308:
Read all the data in the second-level retrieval index data set and build a mapping table (its notation is a formula image in the original) from a character C, which stands for one of the 26 English letters, to a variable (symbol not reproduced here) that stands for the start position offset of that letter in the first-level retrieval index data cluster set.
See Fig. 5, a schematic diagram of the second-level retrieval index data set. The above schematic diagrams are only intended to explain the method provided by the invention more clearly and should not be regarded as limiting the invention.
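Building the second-level index reduces to recording, for each letter, the position of the first entry whose cluster key starts with that letter. The sketch assumes the cluster set is stored as a flat sequence grouped by first letter and represented here only by its cluster keys; the function name and sample keys are hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (first letter, final)


def build_second_level_index(cluster_keys: List[Key]) -> Dict[str, int]:
    """Map each first letter to the offset of its first entry in the cluster set.

    cluster_keys holds the cluster key of every first-level index, in stored order.
    """
    offsets: Dict[str, int] = {}
    for pos, (first_letter, _final) in enumerate(cluster_keys):
        offsets.setdefault(first_letter, pos)   # keep only the first position seen
    return offsets


# Hypothetical stored order of a small cluster set
KEYS: List[Key] = [("b", "ei"), ("b", "ei"), ("f", "a"), ("h", "ai"), ("r", "en")]
```

A Python dictionary plays the role of the in-memory mapping table here; letters with no entries are simply absent, whereas the text stores all 26 letters.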
The implementation of the methods provided by the embodiments of the invention has been described above; the data retrieval device of the embodiments of the invention is described in detail below with reference to the accompanying drawings.
Referring to Fig. 6, a schematic diagram of the composition of the data retrieval device provided by the fourth embodiment of the invention, the data retrieval device comprises:
a Chinese character receiving unit 601, used to receive a Chinese character input by the user;
a Chinese character order judging unit 602, used to trigger the retrieval result reading unit 603 if it determines that the Chinese character received by the Chinese character receiving unit is not the first Chinese character currently input by the user but the Nth Chinese character, N ≥ 2;
a retrieval result reading unit 603, used to read a participle subset and the target data it belongs to from the (N-1)th retrieval result set obtained in advance;
a first Chinese character matching unit 604, used to determine whether the Nth Chinese character is the same as the first character of a participle in the participle subset read by the retrieval result reading unit 603; if not, it triggers the second Chinese character matching unit 605; if so, it triggers the Nth retrieval result storage unit 606;
a second Chinese character matching unit 605, used to determine whether the Nth Chinese character is the same as the Nth Chinese character in the participle subset; if so, it triggers the Nth retrieval result storage unit 606; if not, it triggers the retrieval result reading unit 603;
an Nth retrieval result storage unit 606, used to save the target data and the participle subset read by the retrieval result reading unit, in correspondence, into the Nth retrieval result set.
The above is the data retrieval device provided by the fourth embodiment of the invention. Analysis of this device shows that the first retrieval result set must be obtained before the device can be used. To this end, the device may further include: a first pinyin parsing unit, a first letter extraction unit, a first starting-position offset acquiring unit, a first-level retrieval index data cluster searching unit, a first-level retrieval index reading unit, a third Chinese character matching unit, and a first retrieval result acquiring unit.
When the Chinese character order judging unit 602 determines that the received Chinese character is the first Chinese character currently input by the user, it triggers the first pinyin parsing unit.
The first pinyin parsing unit is used to parse the pinyin of the first Chinese character.
The first letter extraction unit is used to extract the first letter and the final (the vowel part of the syllable) from the pinyin parsed by the first pinyin parsing unit.
The first starting-position offset acquiring unit is used to obtain, from the preset second-level retrieval index data set, the first starting-position offset of the first letter in the preset first-level retrieval index data cluster set.
The first-level retrieval index data cluster searching unit is used to search the preset first-level retrieval index data cluster set, starting from the position corresponding to the first starting-position offset, for the first-level retrieval index data cluster corresponding to the combination of the first letter and the final.
The first-level retrieval index reading unit is used to read a first-level retrieval index from the first-level retrieval index data cluster found by the first-level retrieval index data cluster searching unit. The first-level retrieval index includes: a word-segment subset and the second starting-position offset, within the preset target data set, of the target data to which the subset belongs.
The third Chinese character matching unit is used to judge whether the first Chinese character input by the user is the same as the first Chinese character of the word-segment subset. If they are the same, it triggers the first retrieval result acquiring unit; if they differ, it triggers the first-level retrieval index reading unit.
The first retrieval result acquiring unit is used to obtain the second starting-position offset and the word-segment subset from the first-level retrieval index read by the first-level retrieval index reading unit, read the target data from the position in the target data set corresponding to the second starting-position offset, and save the target data and the word-segment subset in correspondence in the first retrieval result set.
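The first-character lookup performed by these units can be sketched roughly as follows. This is a minimal in-memory illustration, not the patented implementation: the names `TARGET_DATA`, `LEVEL1`, `LEVEL2`, `PINYIN` and `first_result_set` are hypothetical, list positions stand in for the byte offsets described above, and the toy index is written by hand.

```python
# Hypothetical toy target data set; list positions stand in for the
# "second starting-position offsets".
TARGET_DATA = ["北京大学", "北京饭店", "大学路"]

# Hypothetical first-level index entries, sorted so that entries sharing a
# first letter and final are contiguous (one "cluster"):
# (first_letter, final, word_segment_subset, target_offset)
LEVEL1 = [
    ("b", "ei", ["北京", "大学"], 0),
    ("b", "ei", ["北京", "饭店"], 1),
    ("d", "a", ["大学"], 0),
    ("d", "a", ["大学", "路"], 2),
    ("f", "an", ["饭店"], 1),
    ("l", "u", ["路"], 2),
]

# Hypothetical second-level index: first letter -> first starting position in LEVEL1.
LEVEL2 = {"b": 0, "d": 2, "f": 4, "l": 5}

# Hypothetical pinyin table: character -> (first_letter, final).
PINYIN = {"北": ("b", "ei"), "大": ("d", "a"), "饭": ("f", "an"), "路": ("l", "u")}


def first_result_set(ch):
    """Build the first retrieval result set for the first input character."""
    if ch not in PINYIN:
        return []
    letter, final = PINYIN[ch]
    start = LEVEL2.get(letter)
    if start is None:
        return []
    results = []
    for entry_letter, entry_final, subset, offset in LEVEL1[start:]:
        if entry_letter != letter:
            break  # left the region belonging to this first letter
        if entry_final != final:
            continue  # a different cluster under the same letter
        # Homophones share a cluster, so the character itself is compared.
        if subset[0][0] == ch:
            results.append((TARGET_DATA[offset], subset))
    return results
```

Under these assumptions, `first_result_set("大")` pairs each matching target record with the word-segment subset that matched, which is the form in which later steps read the previous result set.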
As the above process of obtaining the first retrieval result set shows, a multi-level retrieval index must be built in advance. To this end, the data retrieval device may further include the following units:
The word-segmentation set acquiring unit is used to perform word segmentation on the target data in the target data set and obtain the word-segmentation set of the target data.
The target data pinyin parsing unit is used to parse the pinyin of each Chinese character that makes up the target data.
In practice, the target data pinyin parsing unit can be implemented with the following functional modules:
The Chinese character code acquiring unit is used to obtain the character code of each Chinese character that makes up the target data.
The pinyin parsing unit is used to obtain the pinyin of the Chinese character according to the mapping between Chinese character codes and pinyin recorded in the simplified Chinese character set or the GB (national standard) code table.
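A code-to-pinyin lookup of this kind might look like the sketch below. It assumes the character code is the two-byte GB2312 code, which the text does not state explicitly, and it uses a three-entry table purely for illustration; `char_code`, `PINYIN_BY_CODE` and `pinyin_of` are hypothetical names.

```python
def char_code(ch):
    """Return the two-byte GB2312 code of a Chinese character as an integer."""
    return int.from_bytes(ch.encode("gb2312"), "big")


# Hypothetical excerpt of a code -> pinyin mapping table; a real table would
# cover the whole character set.
PINYIN_BY_CODE = {
    char_code("北"): "bei",
    char_code("京"): "jing",
    char_code("中"): "zhong",
}


def pinyin_of(ch):
    """Look up the toneless pinyin of a character, or None if it is not in the table."""
    return PINYIN_BY_CODE.get(char_code(ch))
```

Characters outside GB2312 would raise `UnicodeEncodeError` in `char_code`; a fuller version would need to decide how to treat them.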
The target data letter extraction unit is used to extract the first letter and the final from the pinyin parsed by the target data pinyin parsing unit.
In practice, the target data letter extraction unit can be implemented with the following functional modules:
The first-letter type judging unit is used to judge whether the first letter of the pinyin is one of the first letters of the 23 initial consonants. If not, it triggers the first letter extraction unit; if so, it triggers the character 'h' judging unit.
The first letter extraction unit is used to extract the final from the pinyin, with the first letter left empty.
The character 'h' judging unit is used to judge whether the second letter of the pinyin is the character 'h'. If so, it triggers the second letter extraction unit; if not, it triggers the third letter extraction unit.
The second letter extraction unit is used to extract the first letter of the pinyin as its first letter, and all letters after the second letter as its final.
The third letter extraction unit is used to extract the first letter of the pinyin as its first letter, and all letters after that first letter as its final.
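The three-way decision made by these modules can be illustrated with a short sketch. It assumes toneless, lowercase pinyin and takes the 23 initial consonants to be the usual list including y and w; the function name `split_pinyin` is hypothetical.

```python
# The 23 initial consonants; only their first letters are needed for the test.
INITIALS = [
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j",
    "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w",
]
INITIAL_FIRST_LETTERS = {s[0] for s in INITIALS}


def split_pinyin(pinyin):
    """Split a pinyin syllable into (first_letter, final)."""
    if not pinyin or pinyin[0] not in INITIAL_FIRST_LETTERS:
        # The syllable has only a final, e.g. "an": the first letter is empty.
        return "", pinyin
    if len(pinyin) > 1 and pinyin[1] == "h":
        # zh / ch / sh: keep the first letter, skip the 'h'.
        return pinyin[0], pinyin[2:]
    return pinyin[0], pinyin[1:]
```

For example, `split_pinyin("zhong")` gives `("z", "ong")` and `split_pinyin("an")` gives `("", "an")`.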
The word-segment subset acquiring unit is used to obtain the word-segment subsets of the target data from the word-segmentation set of each target data item. The number of word-segment subsets equals the number of word segments in the word-segmentation set.
The first-level retrieval index data cluster acquiring unit is used to obtain the second starting-position offset of the target data in the target data set, add the second starting-position offset to the word-segment subset obtained by the word-segment subset acquiring unit to obtain a first-level retrieval index of the target data, and save the first-level retrieval index into the first-level retrieval index data cluster corresponding to the combination of the first letter and the final of the pinyin of the first Chinese character in the word-segment subset.
The second-level retrieval index data set acquiring unit is used to obtain the first starting-position offsets of the 26 English letters in the first-level retrieval index data cluster set, and to save the 26 English letters and their first starting-position offsets in correspondence in the second-level retrieval index data set.
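Taken together, the index-building units above could be sketched as follows. This is an illustrative in-memory version under stated assumptions: the target data is already segmented, each word-segment subset is taken to be a suffix of the segment list (as the claim defining the subsets suggests), list positions replace byte offsets, and only letters that actually occur receive a second-level entry. `TARGETS`, `PINYIN`, `segment_subsets` and `build_indexes` are hypothetical names.

```python
# Hypothetical pre-segmented target data; list positions stand in for offsets.
TARGETS = [["北京", "大学"], ["北京", "饭店"], ["大学", "路"]]

# Hypothetical pinyin table: character -> (first_letter, final).
PINYIN = {"北": ("b", "ei"), "大": ("d", "a"), "饭": ("f", "an"), "路": ("l", "u")}


def segment_subsets(segments):
    """Return the suffixes <w_i, ..., w_n> of the segment list, one per segment."""
    return [segments[i:] for i in range(len(segments))]


def build_indexes(targets):
    """Build (level1_entries, level2_index) for pre-segmented target data."""
    entries = []
    for offset, segments in enumerate(targets):
        for subset in segment_subsets(segments):
            # Key each entry by the pinyin of the subset's first character.
            letter, final = PINYIN[subset[0][0]]
            entries.append((letter, final, subset, offset))
    # Sorting makes entries with the same letter and final contiguous,
    # which plays the role of a "data cluster" here.
    entries.sort(key=lambda e: (e[0], e[1]))
    # Second-level index: the first position at which each letter occurs.
    level2 = {}
    for position, entry in enumerate(entries):
        level2.setdefault(entry[0], position)
    return entries, level2
```

With the toy data, `build_indexes(TARGETS)` yields six first-level entries (one per word-segment subset) and a second-level index mapping `b`, `d`, `f` and `l` to the start of their regions.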
An embodiment of the invention also provides an embedded terminal that comprises the data retrieval device described above. The embedded terminal may be a PDA, a navigation device, a mobile phone, or a similar terminal.
The data retrieval method, data retrieval device and embedded terminal provided by the invention have been described in detail above. For those of ordinary skill in the art, the specific embodiments and the scope of application may vary in accordance with the ideas of the embodiments of the invention. In summary, this description should not be construed as limiting the invention.

Claims (11)

1. A data retrieval method, characterized in that the method comprises:
Step 1: receiving a Chinese character input by the user;
if the Chinese character input by the user is the first Chinese character currently input by the user, parsing the pinyin of the first Chinese character; extracting the first letter and the final from the pinyin; obtaining, from the preset second-level retrieval index data set, the first starting-position offset of the first letter in the preset first-level retrieval index data cluster set; searching the preset first-level retrieval index data cluster set, starting from the position corresponding to the first starting-position offset, for the first-level retrieval index data cluster corresponding to the combination of the first letter and the final; reading a first-level retrieval index from the first-level retrieval index data cluster, the first-level retrieval index including: a word-segment subset and the second starting-position offset, within the preset target data set, of the target data to which the subset belongs; judging whether the first Chinese character is the same as the first Chinese character in the word-segment subset, and if so, obtaining the second starting-position offset and the word-segment subset from the first-level retrieval index; reading the target data from the position in the target data set corresponding to the second starting-position offset, and saving the target data and the word-segment subset in correspondence in the first retrieval result set;
if the Chinese character is not the first Chinese character currently input by the user but the Nth Chinese character, N ≥ 2, entering step 2;
Step 2: reading a word-segment subset and the target data to which it belongs from the previously obtained (N-1)th retrieval result set;
Step 3: judging whether the Nth Chinese character is the same as the first character of a word segment in the word-segment subset, and if so, entering step 4;
Step 4: saving the target data and the word-segment subset read in step 2 in correspondence in the Nth retrieval result set.
2. The method of claim 1, characterized in that, if the Nth Chinese character is not the same as the first character of a word segment in the word-segment subset, the method further comprises:
judging whether the Nth Chinese character is the same as the Nth Chinese character in the word-segment subset, and if so, entering step 4.
3. The method of claim 2, characterized in that the method further comprises:
performing word segmentation on the target data in the target data set to obtain the word-segmentation set of the target data;
parsing the pinyin of each Chinese character that makes up the target data, and extracting the first letter and the final from the pinyin;
obtaining the word-segment subsets of the target data from the word-segmentation set of each target data item, the number of word-segment subsets being equal to the number of word segments in the word-segmentation set;
obtaining the second starting-position offset of the target data in the target data set;
adding the second starting-position offset to the word-segment subset to obtain a first-level retrieval index of the target data, and saving the first-level retrieval index into the first-level retrieval index data cluster corresponding to the combination of the first letter and the final of the pinyin of the first Chinese character in the word-segment subset;
obtaining the first starting-position offsets of the 26 English letters in the first-level retrieval index data cluster set;
saving the 26 English letters and their first starting-position offsets in correspondence in the second-level retrieval index data set.
4. The method of claim 3, characterized in that parsing the pinyin of each Chinese character that makes up the target data specifically comprises:
obtaining the character code of each Chinese character that makes up the target data;
obtaining the pinyin of the Chinese character according to the mapping between Chinese character codes and pinyin recorded in the simplified Chinese character set or the GB (national standard) code table.
5. The method of claim 3, characterized in that extracting the first letter and the final from the pinyin specifically comprises:
judging whether the first letter of the pinyin is one of the first letters of the 23 initial consonants;
if not, the pinyin has only a final, so the final is extracted from the pinyin and the first letter is empty;
if so, judging whether the second letter of the pinyin is the character 'h'; if so, extracting the first letter of the pinyin as its first letter and all letters after the second letter as its final; if not, extracting the first letter of the pinyin as its first letter and all letters after that first letter as its final.
6. The method of claim 3, characterized in that,
if the word-segmentation set of the target data is $\langle w_i, w_{i+1}, \ldots, w_{n-1}, w_n \rangle$ with $1 \le i \le n$, and the word-segment subset is denoted $C_{w_i}$, then the word-segment subset is specifically: $C_{w_i} = \{\langle w_i, w_{i+1}, \ldots, w_{n-1}, w_n \rangle \mid 1 \le i \le n\}$.
7. A data retrieval device, characterized in that the data retrieval device comprises:
a Chinese character receiving unit, used to receive a Chinese character input by the user;
a Chinese character order judging unit, used to trigger the retrieval result reading unit if it determines that the received Chinese character is not the first Chinese character currently input by the user but the Nth Chinese character, N ≥ 2, and to trigger the first pinyin parsing unit if it determines that the received Chinese character is the first Chinese character currently input by the user;
the first pinyin parsing unit, used to parse the pinyin of the first Chinese character;
a first letter extraction unit, used to extract the first letter and the final from the pinyin parsed by the first pinyin parsing unit;
a first starting-position offset acquiring unit, used to obtain, from the preset second-level retrieval index data set, the first starting-position offset of the first letter in the preset first-level retrieval index data cluster set;
a first-level retrieval index data cluster searching unit, used to search the preset first-level retrieval index data cluster set, starting from the position corresponding to the first starting-position offset, for the first-level retrieval index data cluster corresponding to the combination of the first letter and the final;
a first-level retrieval index reading unit, used to read a first-level retrieval index from the first-level retrieval index data cluster found by the first-level retrieval index data cluster searching unit, the first-level retrieval index including: a word-segment subset and the second starting-position offset, within the preset target data set, of the target data to which the subset belongs;
a third Chinese character matching unit, used to judge whether the first Chinese character is the same as the first Chinese character of the word-segment subset, and if so, to trigger the first retrieval result acquiring unit;
the first retrieval result acquiring unit, used to obtain the second starting-position offset and the word-segment subset from the first-level retrieval index read by the first-level retrieval index reading unit, read the target data from the position in the target data set corresponding to the second starting-position offset, and save the target data and the word-segment subset in correspondence in the first retrieval result set;
the retrieval result reading unit, used to read a word-segment subset and the target data to which it belongs from the previously obtained (N-1)th retrieval result set;
a first Chinese character matching unit, used to judge whether the Nth Chinese character is the same as the first character of a word segment in the word-segment subset read by the retrieval result reading unit, and if so, to trigger the Nth retrieval result storing unit;
the Nth retrieval result storing unit, used to save the target data and the word-segment subset read by the retrieval result reading unit in correspondence in the Nth retrieval result set.
8. The data retrieval device of claim 7, characterized in that the data retrieval device further comprises a second Chinese character matching unit;
the first Chinese character matching unit triggers the second Chinese character matching unit if it determines that the Nth Chinese character is not the same as the first character of a word segment in the word-segment subset read by the retrieval result reading unit;
the second Chinese character matching unit is used to judge whether the Nth Chinese character is the same as the Nth Chinese character of the word-segmentation set, and if so, to trigger the Nth retrieval result storing unit.
9. The data retrieval device of claim 8, characterized in that the data retrieval device further comprises:
a word-segmentation set acquiring unit, used to perform word segmentation on the target data in the target data set and obtain the word-segmentation set of the target data;
a target data pinyin parsing unit, used to parse the pinyin of each Chinese character that makes up the target data;
a target data letter extraction unit, used to extract the first letter and the final from the pinyin parsed by the target data pinyin parsing unit;
a word-segment subset acquiring unit, used to obtain the word-segment subsets of the target data from the word-segmentation set of each target data item, the number of word-segment subsets being equal to the number of word segments in the word-segmentation set;
a first-level retrieval index data cluster acquiring unit, used to obtain the second starting-position offset of the target data in the target data set, add the second starting-position offset to the word-segment subset obtained by the word-segment subset acquiring unit to obtain a first-level retrieval index of the target data, and save the first-level retrieval index into the first-level retrieval index data cluster corresponding to the combination of the first letter and the final of the pinyin of the first Chinese character in the word-segment subset;
a second-level retrieval index data set acquiring unit, used to obtain the first starting-position offsets of the 26 English letters in the first-level retrieval index data cluster set, and to save the 26 English letters and their first starting-position offsets in correspondence in the second-level retrieval index data set.
10. The data retrieval device of claim 9, characterized in that the target data pinyin parsing unit specifically comprises:
a Chinese character code acquiring unit, used to obtain the character code of each Chinese character that makes up the target data;
a pinyin parsing unit, used to obtain the pinyin of the Chinese character according to the mapping between Chinese character codes and pinyin recorded in the simplified Chinese character set or the GB (national standard) code table.
11. The data retrieval device of claim 9, characterized in that the target data letter extraction unit specifically comprises:
a first-letter type judging unit, used to judge whether the first letter of the pinyin is one of the first letters of the 23 initial consonants, and if not, to trigger the first letter extraction unit, and if so, to trigger the character 'h' judging unit;
the first letter extraction unit, used to extract the final from the pinyin, with the first letter left empty;
the character 'h' judging unit, used to judge whether the second letter of the pinyin is the character 'h', and if so, to trigger the second letter extraction unit, and if not, to trigger the third letter extraction unit;
the second letter extraction unit, used to extract the first letter of the pinyin as its first letter, and all letters after the second letter as its final;
the third letter extraction unit, used to extract the first letter of the pinyin as its first letter, and all letters after that first letter as its final.
CN 200810240889 2008-12-26 2008-12-26 Data retrieval method, data retrieval engine and embedded terminal Active CN101770478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200810240889 CN101770478B (en) 2008-12-26 2008-12-26 Data retrieval method, data retrieval engine and embedded terminal

Publications (2)

Publication Number Publication Date
CN101770478A CN101770478A (en) 2010-07-07
CN101770478B true CN101770478B (en) 2013-04-24

Family

ID=42503343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200810240889 Active CN101770478B (en) 2008-12-26 2008-12-26 Data retrieval method, data retrieval engine and embedded terminal

Country Status (1)

Country Link
CN (1) CN101770478B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN111859091B (en) * 2020-07-21 2021-06-04 山东省科院易达科技咨询有限公司 Search result aggregation method and device based on artificial intelligence
CN112817966B (en) * 2020-07-24 2023-10-13 腾讯科技(深圳)有限公司 Data retrieval method, device, electronic equipment and storage medium
CN117875267B (en) * 2024-03-11 2024-05-24 江西曼荼罗软件有限公司 Method and system for converting Chinese characters into pinyin

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071420A (en) * 2007-06-22 2007-11-14 腾讯科技(深圳)有限公司 Method and system for cutting index participle
CN101246478A (en) * 2007-02-14 2008-08-20 高德软件有限公司 Information storage and retrieval method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10048478C2 (en) * 2000-09-29 2003-05-28 Siemens Ag Method of accessing a storage unit when searching for substrings

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200511

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing city Haidian District No. three Suzhou Street Daheng Technology Building South 16 floor room 2

Patentee before: AUTONAVI INFORMATION TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right