CN1645374A

CN1645374A - Digit marking character string searching technology

Info

Publication number: CN1645374A
Application number: CNA2005100233835A
Authority: CN
Inventors: 徐文新
Original assignee: Individual
Current assignee: Individual
Priority date: 2005-01-17
Filing date: 2005-01-17
Publication date: 2005-07-27
Also published as: CN101488127A; CN101488127B; WO2006074586A1

Abstract

A bit-marked character string retrieval method: the basic characters that make up character strings are divided into m groups, and the basic-character information of each string is marked by bitwise OR operations and recorded as the string's "bit value". Bit-value operations select a preliminary result set R1 from the database records, and an ordinary character-by-character comparison is then used as a secondary search to obtain the final result set R2.

Description

Bit mark character string retrieval technique

Technical field

The present invention is a fuzzy search technique for character strings whose purpose is to speed up fuzzy string search in databases. The basic characters that make up character strings are divided into m groups, and a datum W of m bits is used to mark the basic-character information of a string. If basic character C1 of string S belongs to group n, the n-th bit of W counting from the right (counting from the left is equally possible) is set to 1; W is marked in the same way according to the groups of the other basic characters C2, C3, C4, and so on. The marking can be done with the bitwise OR operation. Once all basic characters have been marked, W records the information of string S and is called the "bit value" of S. A bitwise AND is performed on the bit value Wn of a string Sn and the bit value Wt of the string T to be retrieved; the result is called Wg. If Wg equals Wt, then Wn equals or contains Wt. Because different strings can have the same bit value, bit-value operations are used to screen the database records and obtain a preliminary result set R1, and an ordinary character-by-character comparison is then used as a secondary search to obtain the final result set R2.

Background technology

At present, fuzzy string search in databases is carried out by character-by-character comparison. For example, to decide whether the string bdopfqew contains the character f, the computer compares f against bdopfqew character by character from beginning to end, which is inefficient.

On October 19, 2004, I applied for the patent "prime number replacing character string search technology", application number 200410067258.X. That method effectively improves the speed of fuzzy string search, but for long strings the prime-number product has to be recorded in integers spread over several fields, which requires more storage. To improve the speed of fuzzy string search while reducing the storage requirement, the present invention proposes marking the composition of a string with a number of bits. After the database has been marked, a bitwise AND operation is used to screen the records preliminarily, and character-by-character comparison is then applied to the preliminary results to retrieve the final result.

Summary of the invention

The present invention is a fuzzy search technique for character strings. The basic characters that make up character strings are divided into m groups, and a datum W of m bits is used to mark the basic-character information of a string. The process has two steps: first, strings are marked using the bitwise OR operation; second, retrieval is performed using the bitwise AND operation. The principle is explained below.

The 21,000 Chinese characters and other symbols in the GBK range all have internal codes. According to the internal code, all Chinese characters and other symbols are divided into 31 groups, and group n is assigned the value 2^(n-1). In binary, each group's value has a 1 in the n-th bit counting from the right and 0 in all other bits; this is called the "basic bit value".

Group	Value	Basic bit value
1	1	00000000000000000000000000000001
2	2	00000000000000000000000000000010
3	4	00000000000000000000000000000100
4	8	00000000000000000000000000001000
5	16	00000000000000000000000000010000
6	32	00000000000000000000000000100000
7	64	00000000000000000000000001000000
8	128	00000000000000000000000010000000
9	256	00000000000000000000000100000000
10	512	00000000000000000000001000000000
11	1024	00000000000000000000010000000000
12	2048	00000000000000000000100000000000
13	4096	00000000000000000001000000000000
14	8192	00000000000000000010000000000000
15	16384	00000000000000000100000000000000
16	32768	00000000000000001000000000000000
17	65536	00000000000000010000000000000000
18	131072	00000000000000100000000000000000
19	262144	00000000000001000000000000000000
20	524288	00000000000010000000000000000000
21	1048576	00000000000100000000000000000000
22	2097152	00000000001000000000000000000000
23	4194304	00000000010000000000000000000000
24	8388608	00000000100000000000000000000000
25	16777216	00000001000000000000000000000000
26	33554432	00000010000000000000000000000000
27	67108864	00000100000000000000000000000000
28	134217728	00001000000000000000000000000000
29	268435456	00010000000000000000000000000000
30	536870912	00100000000000000000000000000000
31	1073741824	01000000000000000000000000000000
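The table follows directly from the rule that group n receives the value 2^(n-1). A minimal Python sketch of that rule, in which `M`, `BASIC_BIT_VALUES`, and `to_bits` are illustrative names that do not appear in the patent:

```python
M = 31  # number of groups, as in the table

# Basic bit value of group n (1-based) is 2**(n-1):
# a single 1 in the n-th bit counting from the right.
BASIC_BIT_VALUES = [1 << (n - 1) for n in range(1, M + 1)]


def to_bits(value: int, width: int = 32) -> str:
    """Render a value as a fixed-width binary string, as the table does."""
    return format(value, f"0{width}b")


if __name__ == "__main__":
    for n in (1, 8, 31):
        print(n, BASIC_BIT_VALUES[n - 1], to_bits(BASIC_BIT_VALUES[n - 1]))
```

Python integers are unbounded, so the sign-bit restriction that limits a 32-bit long integer to 31 usable bits does not arise in this sketch.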

Take the character string "大漠孤烟直，长河落日圆" (roughly, "In the great desert a lone column of smoke rises straight; over the long river the setting sun is round") as an example:

Character	Internal code	Group	Value	Basic bit value
大 (great)	22823	8	128	00000000000000000000000010000000
漠 (desert)	28448	22	2097152	00000000001000000000000000000000
孤 (lone)	23396	23	4194304	00000000010000000000000000000000
烟 (smoke)	28895	4	8	00000000000000000000000000001000
直 (straight)	30452	11	1024	00000000000000000000010000000000
长 (long)	27265	17	65536	00000000000000010000000000000000
河 (river)	27827	21	1048576	00000000000100000000000000000000
落 (setting)	31683	2	2	00000000000000000000000000000010
日 (sun)	26085	15	16384	00000000000000000100000000000000
圆 (round)	22278	21	1048576	00000000000100000000000000000000
Sum of the values			8471690	
Bit value of the whole string			7423114	00000000011100010100010010001010

Performing a bitwise OR on the basic bit values of the ten characters 大, 漠, 孤, 烟, 直, 长, 河, 落, 日, 圆 gives the bit value of the whole string: 00000000011100010100010010001010.

Put another way, the sum of the values for the string "大漠孤烟直，长河落日圆" is 8471690; removing the one repeated value 1048576 shared by 河 and 圆 leaves a net value of 7423114, which corresponds to 00000000011100010100010010001010.

The bit value of any string can be obtained in this way. The bit value of "白云千载空悠悠" is 00100010000001001010000000010000.

The bit value of "长河落日" is 00000000000100010100000000000010.
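The marking step can be sketched in Python as follows. The grouping rule is borrowed from the VB 6.0 embodiment later in the document (`Abs(AscW(c) Mod 31)`, where `AscW` yields a signed 16-bit value); the function names `group_index` and `bit_value` are illustrative, and punctuation is left out of the sample string because the table above marks only the ten characters.

```python
def group_index(ch: str, m: int = 31) -> int:
    """0-based group of a character, mimicking Abs(AscW(ch) Mod m) in VB 6.0.

    Assumption: characters lie in the Basic Multilingual Plane, where AscW
    returns the code unit as a signed 16-bit integer.
    """
    code = ord(ch)
    if code >= 0x8000:       # AscW reports these code units as negative numbers
        code -= 0x10000
    return abs(code) % m


def bit_value(s: str, m: int = 31) -> int:
    """OR together the basic bit values of all characters of s."""
    value = 0
    for ch in s:
        value |= 1 << group_index(ch, m)
    return value


if __name__ == "__main__":
    for text in ("大漠孤烟直长河落日圆", "白云千载空悠悠", "长河落日"):
        print(text, format(bit_value(text), "032b"))
```

Because OR is idempotent, the repeated group of 河 and 圆 contributes its bit only once, which is why the bit value is 7423114 rather than the plain sum 8471690.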

To judge whether the bit value Wn of a string Sn contains or equals the bit value Wt of T, perform a bitwise AND on Wn and Wt. If the result Wg equals Wt, then Wn contains or equals Wt, and therefore the string Sn may contain or equal T. That is:

Wg = Wn and Wt

If Wg = Wt,

then Wn contains or equals Wt,

and Sn may contain or equal T.

	S1: 大漠孤烟直，长河落日圆	S2: 白云千载空悠悠
Bit value Wn	00000000011100010100010010001010 (W1)	00100010000001001010000000010000 (W2)
Bit value Wt of T: 长河落日	00000000000100010100000000000010	00000000000100010100000000000010
"and" value Wg	00000000000100010100000000000010 (Wg1)	00000000000000000000000000000000 (Wg2)
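The containment test in the table is a single AND followed by a comparison. A minimal Python sketch using the bit values quoted in the text; `may_contain` is an illustrative name:

```python
def may_contain(wn: int, wt: int) -> bool:
    """True if every bit set in wt is also set in wn, i.e. (wn AND wt) == wt."""
    return (wn & wt) == wt


W1 = 0b00000000011100010100010010001010  # bit value of S1
W2 = 0b00100010000001001010000000010000  # bit value of S2
WT = 0b00000000000100010100000000000010  # bit value of the search string T

if __name__ == "__main__":
    print(format(W1 & WT, "032b"), may_contain(W1, WT))
    print(format(W2 & WT, "032b"), may_contain(W2, WT))
```

A `True` result only means the record may contain T; a `False` result rules the record out with certainty.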

As the above shows, 河 and 圆 have the same basic bit value, and different strings can have the same bit value. In the comparison table, Wg1 equals Wt, so S1 passes the screening, while Wg2 is 0, so S2 is excluded. The purpose of bit-marked string retrieval is to use bit operations for a preliminary search of the strings in the database, obtaining R1, and then to run a secondary search over that result with the ordinary character-by-character comparison, obtaining the final result R2. Bitwise AND is considerably faster than character-by-character comparison. In practice, to improve retrieval speed, R1 should be kept as small as possible so that it approaches R2, reducing the time spent on the secondary search.
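The two-stage search, including a false positive caused by 河 and 圆 sharing a group, can be sketched in Python as below. This is an illustration under stated assumptions rather than the patent's implementation: bit values are recomputed on the fly for brevity (the patent stores them with each record), the grouping rule is the one from the VB 6.0 embodiment, and all function names are hypothetical.

```python
def group_index(ch: str, m: int = 31) -> int:
    """0-based group, mimicking Abs(AscW(ch) Mod m) for BMP characters."""
    code = ord(ch)
    if code >= 0x8000:
        code -= 0x10000
    return abs(code) % m


def bit_value(s: str, m: int = 31) -> int:
    value = 0
    for ch in s:
        value |= 1 << group_index(ch, m)
    return value


def two_stage_search(records, query):
    """Return (R1, R2): the bit-screened candidates and the confirmed matches."""
    wt = bit_value(query)
    # Stage 1: bitwise screening; may let through false positives.
    r1 = [s for s in records if (bit_value(s) & wt) == wt]
    # Stage 2: ordinary substring comparison, applied only to R1.
    r2 = [s for s in r1 if query in s]
    return r1, r2


records = ["大漠孤烟直长河落日圆", "白云千载空悠悠", "圆"]
r1, r2 = two_stage_search(records, "河")

if __name__ == "__main__":
    print("R1:", r1)
    print("R2:", r2)
```

The record "圆" reaches R1 because its bit value coincides with that of the query "河", and it is removed only by the substring comparison in stage 2.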

Notes:

1. Let the average string length be L, the number of database records be R, the length of the string to be retrieved be l, and the number of bits used for marking be m. The number of records in the preliminary result set R1 can then be roughly estimated as:

R_1 = \frac{L \times R}{m! / (l! \times (m - l)!)}

This formula does not take the probability distribution of the strings' bit values into account and is therefore inaccurate, but it gives a general picture of how the parameters affect R1.

Suppose a book-title database has 3,000,000 records with an average string length of 16, marked with 31 bits, and the user's search keyword has length 4. Then

R_1 = \frac{16 \times 3{,}000{,}000}{31! / (4! \times (31 - 4)!)} \approx 1526
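The arithmetic of the estimate can be checked with a few lines of Python. The denominator m! / (l! (m - l)!) is the binomial coefficient, available as `math.comb` in Python 3.8 and later; `estimate_r1` and its parameter names are illustrative.

```python
from math import comb


def estimate_r1(avg_len: float, records: int, m: int, query_len: int) -> float:
    """Rough size of the preliminary result set: L * R / C(m, l)."""
    return avg_len * records / comb(m, query_len)


if __name__ == "__main__":
    # 3,000,000 titles, average length 16, 31 marking bits, 4-character keyword.
    print(comb(31, 4))
    print(round(estimate_r1(16, 3_000_000, 31, 4)))
```

As the text notes, the estimate ignores how bit values are actually distributed, so it indicates only the direction and rough size of each parameter's effect.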

Thus, for ordinary Chinese words and phrases, book titles, place names, and organization names, a string can be marked effectively with the 31 bits, other than the sign bit, of a 32-bit long integer.

2. For databases with a longer average string length and more records, the 63 bits of the bigint data type in SQL Server 2000 can be used for marking, with the basic characters correspondingly divided into 63 groups. On a 32-bit processor, however, performing the bitwise AND of Wn and Wt in bigint and comparing Wg with Wt will inevitably take more time, so whether to adopt it should be decided by comparative testing. In fact, any data type in any database that conveniently supports bitwise OR and AND can be used to mark strings; a purpose-built data type without a sign bit, constructed by custom programming, would naturally be better.

3. For Chinese word databases in which the great majority of entries are two-character words, marking can instead use basic bit values that have two bits set to 1, dividing the Chinese characters into 31! / (2! * (31-2)!) groups, that is, 465 groups. In that case, however, the bit value of a two-character word usually has 4 bits set to 1, which can be resolved into 4! / (2! * (4-2)!), that is, 6, groups of Chinese characters. If the user searches a database of 100,000 words with a single Chinese character, then with two-bit marking

R_1 = \frac{(2 \times 100{,}000) \times 6}{465} \approx 2580

whereas with single-bit marking

R_1 = \frac{2 \times 100{,}000}{31} \approx 6452

Marking with basic bit values that have two bits set to 1 therefore improves performance somewhat, but the method has a limited range of application. In other words, when the string to be retrieved is a single character, bit-marked string retrieval performs worse than the prime-number-replacement string retrieval method.
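A sketch of the two-bit variant in Python is given below. The patent does not say how characters are assigned to the 465 groups; taking the character code modulo 465 is an assumption made here purely for illustration, as are the names `PAIR_MASKS` and `two_bit_basic_value`.

```python
from itertools import combinations

M = 31

# Every way of choosing 2 of the 31 bits gives one basic bit value: C(31, 2) = 465.
PAIR_MASKS = [(1 << a) | (1 << b) for a, b in combinations(range(M), 2)]


def two_bit_basic_value(ch: str) -> int:
    """Basic bit value with two bits set. Grouping by code mod 465 is hypothetical."""
    code = ord(ch)
    if code >= 0x8000:       # same signed 16-bit convention as AscW in VB 6.0
        code -= 0x10000
    return PAIR_MASKS[abs(code) % len(PAIR_MASKS)]


if __name__ == "__main__":
    print(len(PAIR_MASKS))
    print(format(two_bit_basic_value("河"), "032b"))
    # Estimates from the text for a one-character query over 100,000 two-character words.
    print(2 * 100_000 * 6 / len(PAIR_MASKS), 2 * 100_000 / M)
```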

4. For Chinese, the basic character is normally the Chinese character; for retrieval within Chinese characters it can also be the basic radical. For Hanyu Pinyin it can be the letter, initial, final, or syllable. For other languages it can be the letter, syllable, word, and so on.

5. Besides software factors such as the programming language and the database, the choice of data type for marking should also take the CPU word size into account. On a 64-bit CPU, marking strings with 64 bits should be preferred, to make full use of the CPU's performance and to improve the dispersion of the bit values.

6. If the grouping of basic characters equalizes word frequency across the groups, performance is naturally optimal. Grouping by taking the Chinese character internal code modulo the number of groups is relatively easy to implement, but it is not the optimal grouping.
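One possible way to approach frequency-balanced grouping is a greedy assignment: process characters from most to least frequent and always place the next one in the group with the smallest accumulated frequency. The patent does not prescribe any particular algorithm, so the Python sketch below is only an illustration, and the frequency table in it is hypothetical.

```python
import heapq


def balanced_groups(freq: dict, m: int) -> dict:
    """Greedily map each character to one of m groups, balancing total frequency."""
    heap = [(0, g) for g in range(m)]          # (accumulated frequency, group)
    heapq.heapify(heap)
    assignment = {}
    for ch, f in sorted(freq.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)          # currently lightest group
        assignment[ch] = g
        heapq.heappush(heap, (load + f, g))
    return assignment


def group_loads(freq: dict, assignment: dict, m: int) -> list:
    """Total frequency that ends up in each group."""
    loads = [0] * m
    for ch, g in assignment.items():
        loads[g] += freq[ch]
    return loads


# Hypothetical frequencies, for illustration only.
sample_freq = {"a": 10, "b": 9, "c": 8, "d": 1, "e": 1, "f": 1}
sample_assignment = balanced_groups(sample_freq, 3)

if __name__ == "__main__":
    print(sample_assignment)
    print(group_loads(sample_freq, sample_assignment, 3))
```

A table-driven grouping of this kind has to be stored and shared between marking and retrieval, whereas the modulo rule needs no table; that is the ease-of-implementation trade-off the note describes.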

Embodiment

The present invention has been implemented successfully for fuzzy string search in databases of Chinese vocabulary, phrases, and book titles. It is described concretely below with a database built in SQL Server 2000 and with VB 6.0 as the programming language; fuzzy string search in other programming languages and other databases can be implemented by analogy.

1. Set up the database

Let database shuku contain a table biao with a field shuming of data type nvarchar and length 40. Add a further field wei of data type "long", that is, 4 bytes or 32 bits, of which one is the sign bit and the remaining 31 bits are available.

2. Mark the database strings using the bitwise OR operation

Dim shuzu(30) As Long ' Define a Long array of 31 elements.
Dim x As Integer
Dim index As Integer
shuzu(0) = 1
For x = 1 To 30
    shuzu(x) = 2 * shuzu(x - 1)
Next
' The 31 elements now hold 1, 2, 4, 8, 16, ... up to 1073741824. In binary each has
' one bit set to 1 and all other bits 0: these are the "basic bit values".
Dim biaostr As String ' Stores the string currently being processed.
Dim weizhi As Long ' Stores the bit value of the string.
Dim weizhilin As Long ' Stores the basic bit value of one character.
biaors.MoveFirst ' Move to the first record of the database record set biaors.
Do
    weizhilin = 0
    weizhi = 0
    With biaors
        biaostr = .Fields("shuming") ' Read the string of one record into biaostr.
    End With
    For x = 1 To Len(biaostr)
        ' Take one character from biaostr, reduce its internal code modulo 31, and
        ' take the absolute value; this assigns the basic character to a group.
        index = Abs(AscW(Mid(biaostr, x, 1)) Mod 31)
        weizhilin = shuzu(index) ' One of 1, 2, 4, 8, 16, ... 1073741824.
        weizhi = weizhi Or weizhilin ' OR the character's basic bit value into weizhi.
    Next ' End of loop: weizhi now holds the bit value of the current string.
    With biaors
        .Fields("wei") = weizhi ' Store the bit value in field wei of the current record.
    End With
    biaors.Update
    biaors.MoveNext ' Process the next record.
Loop While Not biaors.EOF

3. Perform the fuzzy search of database strings using the bitwise AND operation

Dim shuzu(30) As Long
Dim x As Integer
Dim index As Integer
shuzu(0) = 1
For x = 1 To 30
    shuzu(x) = 2 * shuzu(x - 1)
Next
' Define a Long array of 31 elements holding 1, 2, 4, 8, 16, ... up to 1073741824,
' identical to the array used for marking.
Dim weizhi As Long ' Stores the bit value of a string.
Dim weizhilin As Long ' Stores the basic bit value of one character.
Dim textstr As String ' Stores the current search string.
weizhilin = 0
weizhi = 0
textstr = Text1.Text ' Text1.Text is the string to be retrieved.
For x = 1 To Len(textstr)
    index = Abs(AscW(Mid(textstr, x, 1)) Mod 31)
    weizhilin = shuzu(index)
    weizhi = weizhi Or weizhilin
Next ' weizhi is now the bit value of the search string, computed exactly as in marking.
' AND the bit value of the search string with the bit value of each record as the
' preliminary search, then apply an ordinary fuzzy string match as the secondary search
' to obtain the final result. This is a SQL Server 2000 query; other databases may
' differ slightly.
strQuery = "select * from (SELECT * FROM biao WHERE (wei & " & weizhi & ") = " & weizhi & ") DERIVEDTBL WHERE (shuming like '%" & textstr & "%')"
Adodc1.RecordSource = strQuery
Adodc1.Refresh ' Execute the retrieval.
DataList1.ListField = "shuming"
DataList1.ReFill ' Show the current search results in the list box.
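For readers without a VB 6.0 and SQL Server 2000 environment, the same marking and retrieval flow can be sketched with Python's standard `sqlite3` module. This is a substitution made for illustration only: SQLite stands in for SQL Server, the table and field names `biao`, `shuming`, and `wei` are taken from the embodiment, and `fuzzy_search` and the sample records are hypothetical.

```python
import sqlite3


def group_index(ch: str, m: int = 31) -> int:
    """0-based group, mimicking Abs(AscW(ch) Mod m) for BMP characters."""
    code = ord(ch)
    if code >= 0x8000:
        code -= 0x10000
    return abs(code) % m


def bit_value(s: str, m: int = 31) -> int:
    value = 0
    for ch in s:
        value |= 1 << group_index(ch, m)
    return value


# Step 1 and 2: create the table and mark every record with its bit value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biao (shuming TEXT, wei INTEGER)")
for title in ("大漠孤烟直长河落日圆", "白云千载空悠悠", "圆"):
    conn.execute("INSERT INTO biao (shuming, wei) VALUES (?, ?)",
                 (title, bit_value(title)))


def fuzzy_search(connection, text: str) -> list:
    """Step 3: bitwise AND screening combined with an ordinary LIKE match."""
    wt = bit_value(text)
    rows = connection.execute(
        "SELECT shuming FROM biao WHERE (wei & ?) = ? AND shuming LIKE ?",
        (wt, wt, f"%{text}%"),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(fuzzy_search(conn, "长河落日"))
    print(fuzzy_search(conn, "河"))
```

The sketch combines both conditions in one `WHERE` clause and leaves evaluation order to the query planner, whereas the embodiment uses a derived table to state the bit screening first. Search text containing `%` or `_` would need escaping before being used in `LIKE`.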

Claims

1. A fuzzy search technique for character strings, characterized in that: the basic characters that make up character strings are divided into m groups, and a datum W of m bits is used to mark the basic-character information of a string. If basic character C1 of string S belongs to group n, the n-th bit of W counting from the right (or from the left) is set to 1; W is marked in the same way according to the groups to which the other basic characters C2, C3, C4, and so on belong. Once all basic characters have been marked, W records the information of string S and is called the "bit value" of S. The bit value Wn of a string Sn is compared with the bit value Wt of the string T to be retrieved; if Wn equals or contains Wt, string Sn may equal or contain string T, thereby realizing fuzzy search of character strings.

2. The method according to claim 1, characterized in that: for marking, each group of basic characters may first be assigned a "basic bit value" in which the corresponding n-th bit is 1 and all other bits are 0; a bitwise OR over the basic bit values of all basic characters of a string then yields the bit value of that string.

3. The method according to claim 1, characterized in that: whether one of two bit values contains the other can be determined with the bitwise AND operation. A bitwise AND is performed on the bit value Wn of string Sn and the bit value Wt of string T, and the result is called Wg; if Wg equals Wt, then Wn equals or contains Wt.

4. The method according to claim 1, characterized in that: because different strings can have the same bit value, bit-value operations are used to screen the database records and obtain a preliminary result R1, and an ordinary character-by-character comparison is then used as a secondary search to obtain the final search result R2.

5. The method according to claim 1, characterized in that: for databases of ordinary words, phrases, and proper nouns, a string can be marked effectively with the 31 bits, other than the sign bit, of a 32-bit long integer. For databases with a larger average string length, the 63 bits of the bigint data type in SQL Server 2000 can be used for marking, with the basic characters correspondingly divided into 63 groups. Any data type in any database that supports bitwise OR and AND can be used to mark strings; a purpose-built data type without a sign bit, constructed by custom programming, is better still.

6. The method according to claims 1 and 5, characterized in that: the number of bits used for marking should take into account not only the average string length but also the number of records currently in the database; a database with more records should be marked with more bits, and the basic characters should correspondingly be divided into more groups.

7. The method according to claims 1, 5 and 6, characterized in that: besides software factors such as the programming language and the database, the choice of data type for marking should also take the CPU word size into account. On a 64-bit CPU, marking strings with 64 bits should be preferred, to make full use of the CPU's performance and to improve the dispersion of the bit values.

8. The method according to claim 1, characterized in that: for Chinese, the basic character is normally the Chinese character, and for retrieval within Chinese characters it can be the basic radical; for Hanyu Pinyin it can be the letter, initial, final, or syllable; for other languages it can be the letter, syllable, word, and so on.

9. The method according to claim 1, characterized in that: when the basic characters are grouped, the number of basic characters in each group need not be equal; instead, the sum of the word frequencies of the basic characters in each group, within the language or the current database, should be made as balanced as possible. High-frequency basic characters in particular should be distributed evenly across the groups, so as to optimize performance.

10. The method according to claim 1, characterized in that: for Chinese word databases in which the great majority of entries are two-character words, basic bit values with two bits set to 1 may be used for marking, dividing the Chinese characters into 31! / (2! * (31-2)!) groups, that is, 465 groups, to improve performance.