WO2006074586A1

WO2006074586A1 - Retrieval technology of character string marked with bit

Info

Publication number: WO2006074586A1
Application number: PCT/CN2005/001642
Authority: WO
Inventors: Wenxin Xu
Original assignee: Wenxin Xu
Priority date: 2005-01-17
Filing date: 2005-10-08
Publication date: 2006-07-20
Also published as: CN1645374A; CN101488127B; CN101488127A

Abstract

The present invention relates to a character string retrieval technique in which one bit corresponds to several character elements and n bits correspond to all character elements; that is, all character elements are divided into n groups. A datum W whose n bits are all 0 is used to mark the character element information that makes up a string. If a character element P1 of a string S belongs to a given group, the corresponding bit of W is set to 1; W is marked in the same way for the other character elements P2, P3, P4, ... of S. After all character elements have been marked, W carries the information of S and is called the "bit value" of S; this method is called the 1 mark. By the rules of logic algebra, a datum whose n bits are all 1 can also be used to mark the character element information of a string: if a character element P of S belongs to a given group, the corresponding bit is set to 0; this method is called the 0 mark. By comparing the bit value of Sa (Wa under the 1 mark, or its 0-mark counterpart) with the bit value of Sb (Wb, or its 0-mark counterpart), it can be determined that Sb does not contain all the character elements of the search keyword Sa, that Sb contains all of them, or that Sb may contain all of them.

Description

Bit tag string retrieval technology

The invention is a string retrieval technique intended to speed up fuzzy string retrieval. One bit corresponds to several character elements and n bits correspond to all character elements; that is, all character elements are divided into n groups. A datum whose n bits are all 0, denoted $W_F$, is used to mark the character element information that makes up a string. If a character element $P_1$ of a string S (S may stand for several strings) belongs to a given group, the corresponding bit of W is set to 1; W is marked in the same way for the groups to which the other character elements $P_2$, $P_3$, $P_4$, ... of S belong. After all character elements have been marked, W records the information of S and is called the "bit value" of S. This method is called the 1 mark. By the principles of logic algebra, a datum whose n bits are all 1, denoted $W_T$, can also be used to mark the character element information of a string: if a character element P of S belongs to a given group, the corresponding bit is set to 0. This is called the 0 mark. By comparing the bit value $W_a$ of $S_a$ (or its 0-mark counterpart) with the bit value $W_b$ of $S_b$ (or its 0-mark counterpart), it can be determined that $S_b$ "does not contain", "contains" or "may contain" all the character elements of $S_a$. For example, a bit implication operation is performed on $W_a$ and $W_b$; if the implication holds at every bit, $S_b$ contains or may contain all the character elements of $S_a$. If necessary, the usual character-by-character comparison is then used to determine whether $S_b$ contains $S_a$. Tests have shown that bit mark retrieval can significantly improve the speed of fuzzy string retrieval. Besides the speed advantage, another feature of bit mark retrieval is that a query with multiple keywords is as convenient as a query with a single keyword.
The bit mark can be used for search in the usual sense, that is, whether a database string contains the keyword, or for "reverse search", that is, whether the keyword contains a database string. Reverse search can be used in voice input, machine translation, pinyin input and Chinese word segmentation to match basic sentence patterns or words. If the number of bits n available for marking is more than twice the average string length m, a combination of several bits can be used to mark one group of character elements, which improves the filtering efficiency.

Bit mark string retrieval is a string algorithm that can be used for string lookup in various data structures.

Background

The usual fuzzy string search is performed by character-by-character comparison. To determine whether the string S = "bdopfqew" contains the character f, the computer compares the first character b of the main string S with f, then the second character d of S with f, and so on, until the fifth character of S is found to be the same as f and the match succeeds. This is the simplest case. If the substring T has a length of 2 or more characters, the simple pattern matching algorithm, i.e. the BF algorithm, compares $S_1$ with $T_1$; if they differ, it compares $S_2$ with $T_1$, and so on, until some character $S_i$ of S is the same as $T_1$. It then compares $S_{i+1}$ with $T_2$ and, if they are the same, continues downward. When a character $S_{i+n}$ differs from $T_{1+n}$, the algorithm backtracks and starts a new round by comparing $S_{i+1}$ with $T_1$. This process is repeated; if all the characters of T are matched, the match succeeds, otherwise it fails. As the length of the search keyword T increases, the complexity of character matching increases accordingly. The improved pattern matching algorithm, i.e. the KMP algorithm, avoids backtracking, which helps for alphabetic text with a small character set, but for Chinese strings, whose character set is large and whose single-character frequency is low, it brings little benefit. In short, both the BF algorithm and the KMP algorithm compare the main string and the substring character by character.

On October 19, 2004, I applied for a patent for "prime substitution string retrieval technology", application number 200410067258.X. That method effectively improves the speed of fuzzy string search, but applying it to long strings with good results requires more space to store the prime product values. To improve the speed of fuzzy string retrieval while reducing the memory required, the present invention proposes to use the n bits of a datum to mark the composition information of a string. The datum after marking is called the "bit value" of the string. The bit values of two strings are compared, and this is combined with the usual character-by-character comparison to achieve fuzzy search of strings. Tests have shown that the method is several times, or even more than ten times, faster than the usual character-by-character fuzzy search.

Summary of the invention

Basic method

In implementing bit mark string retrieval, a "bit" or a "combination of bits" may correspond to a character in the usual sense, such as a, b, A, B, ∑, #, ∞, :, 中, 国. Where needed it may correspond to a Chinese character radical, or even to a stroke. For alphabetic text, such as the English string "day and night", a bit may correspond to a word, such as day, and, night, or to a combination of letters, such as ay, ai. In Chinese pinyin input or voice input it may correspond to the initials, finals or syllables of Chinese pinyin, and for other languages to phonetic symbols, such as θ. For convenience of explanation, the character unit to which a bit or a combination of bits corresponds is called a "character element", denoted P.

Bit mark string retrieval is easy to implement for alphabetic text. Take a 32-bit datum whose bits are all 0, i.e. 0000,0000,0000,0000,0000,0000,0000,0000, denoted $W_F$. Let bits 1 to 26 of $W_F$, counted from left to right (right to left is also possible), correspond to the 26 English letters; the other 6 bits may be left unused or may correspond to punctuation. The word big contains b, g and i, so the three corresponding bits 2, 7 and 9 are set to "1", and the datum becomes 0100,0010,1000,0000,0000,0000,0000,0000. This is called the bit value of big, denoted $W_a$; the marking method used is the 1 mark.

Similarly, bigger contains b, e, g, i and r, so the five bits 2, 5, 7, 9 and 18 of $W_F$ are set to "1", and the datum becomes 0100,1010,1000,0000,0100,0000,0000,0000. This is called the bit value of bigger, denoted $W_b$.

It can be seen that every bit that is 1 in big must also be 1 in bigger; a bit that is 1 in bigger may be 1 or 0 in big, such as bits 5 and 18; and a bit that is 0 in bigger must be 0 in big, such as bits 1, 3, 4, 6, 8, 10 and so on, 21 bits in all among the letter bits. Performing the "bit implication" operation on $W_a$ and $W_b$ gives a result whose bits are all "1": 1111,1111,1111,1111,1111,1111,1111,1111, denoted $W_T$. As an unsigned long integer it is 4294967295. This can be expressed as $W_a \to W_b = W_T$: the string bigger contains big.

If the bit value of biggest is written as $W_c$, then likewise $W_a \to W_c = W_T$: the string biggest contains big.

If the bit value of BIG is written as $W_A$, performing the "bit implication" operation with the bit value $W_a$ of big likewise gives $W_a \to W_A = W_T$: the string BIG equals big. Letter case is not considered here; the point is only that the bit implication of the bit values of two identical strings yields all 1s.

If the bit value of digger is written as $W_d$ and the "bit implication" operation is performed with the bit value $W_a$ of big, the result is 1011,1111,1111,1111,1111,1111,1111,1111, which is not equal to $W_T$, i.e.:

$W_a \to W_d \ne W_T$: the string digger does not contain big.

That is to say, when a "bit implication" operation is performed on the bit values W of two strings and the result is not equal to $W_T$, the string corresponding to the latter term "does not contain" ("not greater than" and "not equal to"; the same below) the string corresponding to the former term.
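The letter example above can be sketched in a few lines of Python. This is only an illustration under the conventions stated in the text (32 bits, letters a to z on bits 1 to 26 counted from the left, case ignored); the function names `bit_value`, `implies` and `nibbles` are hypothetical and do not come from the patent.

```python
MASK32 = 0xFFFFFFFF  # W_T: a 32-bit value with every bit set


def bit_value(s: str) -> int:
    """1-mark bit value: bit p (1..26, counted from the left) is set when
    the p-th letter of the alphabet occurs in s; case is ignored."""
    w = 0  # W_F: all bits 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            p = ord(ch) - ord("a") + 1
            w |= 1 << (32 - p)
    return w


def implies(wa: int, wb: int) -> int:
    """Bitwise implication wa -> wb, computed as (NOT wa) OR wb on 32 bits."""
    return (~wa | wb) & MASK32


def nibbles(w: int) -> str:
    """Format a 32-bit value in the 4-bit groups used in the text."""
    bits = format(w, "032b")
    return ",".join(bits[i:i + 4] for i in range(0, 32, 4))


w_big, w_bigger, w_digger = map(bit_value, ("big", "bigger", "digger"))
print(nibbles(w_big))                      # bit value of big
print(implies(w_big, w_bigger) == MASK32)  # True: bigger may contain big
print(implies(w_big, w_digger) == MASK32)  # False: digger cannot contain big
```

Python integers are unbounded, so the result of `~` is masked back to 32 bits; in a language with a native unsigned 32-bit type the mask would be unnecessary.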

The above scheme does not consider letter case. Distinguishing upper and lower case requires 52 bits, which is also achievable. But it is not always possible for each character element to have a bit of its own. For example, GBK has about 21,000 Chinese characters. With one bit per Chinese character, the data storing the bit values would occupy a great deal of space, much of it meaningless blanks, and the cost of reading the data and comparing bits would increase to no purpose. In practice, one "bit" can correspond to multiple Chinese characters as needed. A simple scheme is to divide the 21,000 Chinese characters and other symbols in the GBK range into 32 groups according to their codes, and to store the bit value of a string in one long integer. Take the string "大漠孤烟直，长河落日圆" ("In the vast desert a lone column of smoke rises straight; over the long river the setting sun is round"); then:

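| Chinese character | 大 | 漠 | 孤 | 烟 | 直 | 长 | 河 | 落 | 日 | 圆 |
|---|---|---|---|---|---|---|---|---|---|---|
| Code | 46323 | 50350 | 47554 | 53708 | 54961 | 45988 | 47827 | 49892 | 51413 | 54450 |
| Group | 20 | 15 | 3 | 13 | 18 | 5 | 20 | 5 | 22 | 19 |

| Chinese character | 霜 | 冷 | 长 | 河 | 落 | 日 | 天 | 涯 |
|---|---|---|---|---|---|---|---|---|
| Code | 52138 | 49380 | 45988 | 47827 | 49892 | 51413 | 52460 | 53700 |
| Group | 11 | 5 | 5 | 20 | 5 | 22 | 13 | 5 |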

The codes in the table are obtained with the CODE function in Excel. 大 ("big") and 河 ("river") are both in group 20, and 长 ("long") and 落 ("falling") are both in group 5, so only 8 corresponding bits are set to 1, and the bit value $W_h$ of "大漠孤烟直，长河落日圆" is:

0010,1000,0000,1010,0111,0100,0000,0000

Similarly, the bit value $W_i$ of "长河落日" ("long river, setting sun") is 0000,1000,0000,0000,0001,0100,0000,0000. Performing the "bit implication" operation with $W_i$ and $W_h$ gives:

$W_i \to W_h = W_T$, and the two strings have an inclusion relationship.

In "霜冷长河" ("frost-cold long river"), 霜 ("frost") is in group 11, and 冷 ("cold") and 长 ("long") are both in group 5, so its bit value $W_j$ is 0000,1000,0010,0000,0001,0000,0000,0000. Performing the bit implication operation with $W_j$ and $W_h$ gives $W_j \to W_h \ne W_T$, and there is no inclusion relationship between the two strings. In other words, when one "bit" corresponds to multiple Chinese characters and a "bit implication" operation is performed on the bit values of two strings, a result not equal to $W_T$ means that the string corresponding to the latter term does not contain "all the character elements" of the string corresponding to the former term, and therefore certainly does not contain that string. For alphabetic text, one bit can likewise correspond to multiple words, and the same test determines whether a string does not contain, or "may contain", the character elements of another string.
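The grouped Chinese example can be reproduced with a short Python sketch. The character codes are copied from the table; the grouping rule `code % 32 + 1` is an assumption that happens to agree with every group number listed there, and the helper names are hypothetical.

```python
# GBK codes copied from the table (Excel CODE values).
GBK_CODE = {
    "大": 46323, "漠": 50350, "孤": 47554, "烟": 53708, "直": 54961,
    "长": 45988, "河": 47827, "落": 49892, "日": 51413, "圆": 54450,
    "霜": 52138, "冷": 49380,
}
MASK32 = 0xFFFFFFFF  # W_T


def group(ch: str) -> int:
    """Group 1..32. (code mod 32) + 1 is an assumed rule matching the table."""
    return GBK_CODE[ch] % 32 + 1


def bit_value(s: str) -> int:
    """1 mark: group g sets the g-th bit counted from the left."""
    w = 0
    for ch in s:
        w |= 1 << (32 - group(ch))
    return w


def may_contain(w_key: int, w_rec: int) -> bool:
    """True when the bit implication w_key -> w_rec yields W_T."""
    return ((~w_key | w_rec) & MASK32) == MASK32


def nibbles(w: int) -> str:
    bits = format(w, "032b")
    return ",".join(bits[i:i + 4] for i in range(0, 32, 4))


w_h = bit_value("大漠孤烟直长河落日圆")
print(nibbles(w_h))
print(may_contain(bit_value("长河落日"), w_h))  # True
print(may_contain(bit_value("霜冷长河"), w_h))  # False: group 11 is not set in w_h
```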

However, when a "bit implication" operation is performed on the bit values of two strings and the result is equal to $W_T$, the string corresponding to the latter term only "may contain" the string corresponding to the former term.

If the bit value of gibber is written as $W_e$ and the "bit implication" operation is performed with the bit value $W_a$ of big, then:

$W_a \to W_e = W_T$

But the string gibber does not contain big.

The bit value $W_k$ of "落日天涯" ("setting sun at the world's end") is 0000,1000,0000,1000,0000,0100,0000,0000, and the "bit implication" operation with $W_h$ gives $W_k \to W_h = W_T$. But "大漠孤烟直，长河落日圆" does not contain "落日天涯".

That is to say, bit mark string retrieval differs from the usual character-by-character comparison: even if the bit values of two strings have a bit implication relationship, the two strings do not necessarily have an inclusion relationship. If one "bit" corresponds to "one character element" and the bit values of the two strings have a bit implication relationship, the string corresponding to the latter term "contains" all the character elements of the string corresponding to the former term; but since the order of the character elements is not considered, it is not certain that the two strings have an inclusion relationship. If one "bit" corresponds to "a group of character elements" with more than one member, the string corresponding to the latter term "may" contain all the character elements of the string corresponding to the former term, and "may" not, so it is again not certain that the two strings have an inclusion relationship. Where needed, the character-by-character comparison method can then be used to determine whether there is an inclusion relationship between the two strings.
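A minimal sketch of this two-stage idea, assuming one bit per letter and hypothetical helper names: the bit comparison first discards records that cannot contain the keyword, and an ordinary substring test then confirms the survivors.

```python
def bit_value(s: str) -> int:
    """1-mark bit value with one bit per letter a..z (case ignored).
    The bit order is arbitrary as long as it is fixed."""
    w = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            w |= 1 << (ord(ch) - ord("a"))
    return w


def search(records, keyword):
    """Return (candidates after bit filtering, records that really contain keyword)."""
    w_key = bit_value(keyword)
    # Stage 1: keep a record only if every keyword bit is also set in it.
    candidates = [r for r in records if (w_key & ~bit_value(r)) == 0]
    # Stage 2: confirm with the usual character comparison.
    confirmed = [r for r in candidates if keyword in r]
    return candidates, confirmed


candidates, confirmed = search(["bigger", "digger", "gibber", "biggest"], "big")
print(candidates)  # digger is filtered out; gibber survives the bit test
print(confirmed)   # gibber is rejected by the substring test
```

In a real database the bit values of the records would be computed once and stored, not recomputed for each query as in this sketch.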

In bit mark string retrieval, the bit values on either side of the bit operation may carry the mark information of the character elements of one string, or of all the character elements of several strings; both are referred to as "the bit value of several strings S". "Several strings" covers both singular and plural, that is, S refers to one or more strings.

Suppose bit mark string retrieval is applied to a database whose string records $S_1, S_2, S_3, \dots$ have bit values $W_1, W_2, W_3, \dots$ respectively, and the bit value of the search keyword $S_t$ is $W_t$. The bit implication operation is performed on $W_t$ and $W_n$; if the result is $W_T$, the string $S_n$ contains or may contain all the character elements of $S_t$. The set of such records $S_n$ is denoted $R_1$, and the time used for this search is called $T_1$. It is usually necessary to obtain the final result set $R_2$ by character-by-character comparison within $R_1$; the time used for this is called $T_2$. The total time of the bit mark string retrieval method is $T_1 + T_2$. To keep $T_1$ low on a 32-bit processor, the bit value is best stored in 32 bits. For a given fuzzy string search $R_2$ is fixed, so the way to lower $T_2$ is to make the preliminary result set $R_1$ as small as possible. The fewer the bits n used to store the bit value, the more character elements correspond to one bit, the worse the screening effect and the larger $R_1$; the longer the average length of the strings, the larger $R_1$. One more factor affects the size of $R_1$: the longer the search keyword, the smaller $R_1$.

Let the number of bits used for marking be n, the number of "1"s in the bit value of a string be m, and the number of "1"s in the bit value of the search keyword be k. The screening probability of the bit comparison can then be calculated by the following formula; the smaller the value, the better:

$$P = \frac{C_m^k}{C_n^k} = \frac{m!\,(n-k)!}{n!\,(m-k)!}$$

When n is 32, the screening probabilities for some values of m and k are calculated as follows:

[Table: screening probability for n = 32 and selected values of m and k]

Of the three factors affecting the screening probability, the length of the keyword chosen by the user determines k and the length of the record string determines m, but neither of these lengths can be controlled. The controllable factor is the number n of bits used for marking, and the number of "1"s produced by marking a string of a given length is itself affected by n, so it is important to keep a sufficient ratio between n and m.
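The screening probability can be tabulated with a few lines of Python. The sketch assumes the formula $P = C_m^k / C_n^k$ (the chance that k keyword bits all fall among the m bits set in a record, if bits are uniformly distributed); the function name and the sample values of m are illustrative only.

```python
from math import comb


def screening_probability(n: int, m: int, k: int) -> float:
    """Probability that all k keyword bits are among the m bits set in a
    record, for n marking bits. Assumes uniformly distributed bits."""
    if k > m:
        return 0.0
    return comb(m, k) / comb(n, k)


# Illustrative table for n = 32; the chosen m values are arbitrary samples.
for m in (8, 12, 16):
    row = [f"{screening_probability(32, m, k):.6f}" for k in range(1, 6)]
    print(m, row)
```

The output shows the two effects described above: the probability falls quickly as k grows and rises as m approaches n.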

Suggestions:

1. On a 32-bit CPU, marking with a long integer is most favorable for bit value comparison. If it is an unsigned integer, W has 32 bits; if it is a signed integer, only 31 bits are convenient for marking. If the average string length is greater than 16 characters and the database has a large number of records, then in SQL Server 2000 the 63 bits of the data type bigint can be used for marking, and the character elements are accordingly divided into 63 groups. Of course, on a 32-bit processor, storing bit values in a bigint means that reading and comparing them takes more time. In fact, any data type in any database that allows bit operations can be used to mark strings, and an unsigned data type is naturally better. On a 64-bit CPU, 64 bits should of course be used to mark strings, to take full advantage of the CPU and to improve the dispersion of the bit values.

2. If the CPU is 32-bit and the strings in the database differ greatly in length, the records can be divided into two tables, a short-string table marked with 32 bits and a long-string table marked with 64 bits; the query then uses the union command to combine the results from the two tables.

3. The grouping of character elements is best when frequencies are balanced across groups, but the speed of computing the bit value must also be considered. Grouping by the remainder of a modulo operation on the Chinese character internal code is easy to implement, but it is not the optimal grouping, and the division behind the modulo operation is slow. One may consider taking some 5 bits of the Chinese character code to divide the Chinese characters into 32 groups; this is faster, and the result may also be better.

In addition to retrieval in the usual sense, the bit mark method can also perform "reverse retrieval" against the database. The bit value $W_n$ of a database string $S_n$ is taken as the former term and the bit value $W_t$ of the search keyword $S_t$ as the latter term of the bit implication operation. According to whether all the bits of the result are "1", that is, whether $W_n \to W_t = W_T$, those $S_n$ in the database whose constituent character elements may all be contained in the search keyword $S_t$ can be filtered out.

Naturally, the screening probability for reverse retrieval is calculated the other way round. Let the database and the keyword both use the 1 mark, let the number of bits used for marking be n, the number of "1"s in the bit value $W_n$ of $S_n$ be m, and the number of "1"s in the bit value of the search keyword be k. The probability that the search keyword $S_t$ contains or may contain all the character elements of $S_n$ can be calculated by:

$$P = \frac{C_k^m}{C_n^m} = \frac{\dfrac{k!}{m!\,(k-m)!}}{\dfrac{n!}{m!\,(n-m)!}} = \frac{k!\,(n-m)!}{n!\,(k-m)!}$$
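Reverse retrieval can be sketched in the same way as ordinary retrieval, with the roles of the two bit values swapped. The example below assumes one bit per letter, a tiny in-memory word list standing in for the database, and the probability $C_k^m / C_n^m$; all names are hypothetical.

```python
from math import comb


def bit_value(s: str) -> int:
    """1-mark bit value with one bit per letter a..z (case ignored)."""
    w = 0
    for ch in s.lower():
        if "a" <= ch <= "z":
            w |= 1 << (ord(ch) - ord("a"))
    return w


def reverse_candidates(records, text):
    """Records whose character elements may all occur in the input text,
    i.e. W_n -> W_t yields all ones."""
    w_t = bit_value(text)
    return [r for r in records if (bit_value(r) & ~w_t) == 0]


def reverse_probability(n: int, m: int, k: int) -> float:
    """Chance that the m bits of a record all fall among the k bits of the
    keyword, for n marking bits (uniform distribution assumed)."""
    return comb(k, m) / comb(n, m) if m <= k else 0.0


print(reverse_candidates(["day", "night", "big"], "day and night"))
print(reverse_probability(32, 3, 8))
```

As with ordinary retrieval, each candidate still has to be confirmed by a character comparison, because the bit test ignores the order of the character elements.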

Test analysis

The above method was tested on a database of Chinese characters with some English and digits: more than 267,000 records, 3,473,000 characters, and an average string length of 12.989; the bit value was stored in a long integer. The programming language was VB, the operating system Windows XP, the CPU a Celeron 800 MHz, with 256 MB of memory, a Gigabyte 810 motherboard and a 40 GB hard disk. The total time used by a search is recorded as T. Obviously, any bit mark search must read the bit values of all the records and compare the bit value of the keyword with them; the time used for this can be regarded as a constant, called $T_0$, but it is not easy to measure directly. The set of records that pass the bit value comparison is recorded as $R_1$. The time $T_1$ used for the character-by-character comparison within $R_1$ is also not easy to measure directly, but it is related to the size of $R_1$ and can be estimated. Another factor affecting the retrieval time is the size of the final result set $R_2$. From the T, $R_1$ and $R_2$ obtained in 120 tests, regression analysis yields the following equation:

$$T = 0.268 + 0.000008625\,R_1 + 0.00002702\,R_2$$

Adjusted coefficient of determination = 0.989

Significance test of the regression model: F = 5265.814, greater than the critical value

Significance test of the constant: $t_0$ = 86.610

Significance test of $R_1$: $t_1$ = 74.263

Significance test of $R_2$: $t_2$ = 25.489

All are greater than the critical value t(117) = 2.6185

It can be seen that the regression equation is highly reliable. The constant, that is, the time $T_0$ taken to read the bit values of all the records and compare them with the bit value of the keyword, is 0.268 seconds. The same 120 searches by the character-by-character comparison method took 2.1739 seconds on average, so the time $T_0$ used to compare the bit values is only about one-eighth of the time used by the usual character-by-character comparison. However, the overall time of bit mark string retrieval also depends on $R_1$ and $R_2$. In this test n = 31 and the average string length is 12.989; because marks overlap, the average number m of "1"s is about 11 to 12, taken here as 12. If the character elements of the database strings are normally distributed and the marked "1"s are evenly distributed, $R_1$ can be calculated from the screening probability. The final result set $R_2$ does not depend on the screening effect and is set to 240 here. In this ideal state, the search time for keywords whose bit values have k = 1 to 10 "1"s can be calculated from the regression equation as follows:

[Table: estimated search time for k = 1 to 10]
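A rough sketch of how such estimates could be produced is given below. It combines the regression equation with the values stated in the text (n = 31, m = 12, $R_2$ = 240) and makes two assumptions of its own: the record count is taken as exactly 267,000, and $R_1$ is estimated as the record count times $C_m^k / C_n^k$. The numbers it prints are therefore illustrative and are not the patent's own table.

```python
from math import comb

RECORDS = 267_000   # assumption: the text says "more than 267,000"
N_BITS, M_ONES, R2 = 31, 12, 240


def estimated_time(k: int) -> float:
    """Estimated search time in seconds for a keyword with k bits set."""
    # Assumed estimate of the preliminary result set R1.
    r1 = RECORDS * comb(M_ONES, k) / comb(N_BITS, k)
    # Regression equation from the test on the Chinese database.
    return 0.268 + 0.000008625 * r1 + 0.00002702 * R2


for k in range(1, 11):
    print(k, round(estimated_time(k), 4))
```

Under these assumptions the estimate falls quickly toward the constant term as k grows, and already for k = 1 it is below the 2.1739 seconds measured for character-by-character comparison.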

It can be seen that bit mark string retrieval has a great performance advantage over the usual string retrieval method, whose average time was 2.1739 seconds. However, the hardware, the data type used for marking, especially the size of n, the distribution of character elements in the database, and the way the character elements are grouped for marking all affect the time of bit mark string retrieval, so this regression equation is for reference only.

English has only 26 letters. If the strings are long and every record contains every letter, the bit mark search loses its filtering effect. However, a test on an English database with 242,000 records and 7,493,000 characters indicates that bit mark string retrieval is usually feasible for alphabetic text. The database string data type is varchar, the field length is 56, and the average string length is 30.846. The 26 letters are marked with 26 bits of a long integer, case-insensitively, ignoring spaces and other characters. Because letters are used with uneven frequency and letters repeat, there are only 3,213,052 "1"s after marking, an average of 13.2266 per record. On the one hand the marks overlap heavily; on the other hand not every record contains every letter.

The database was searched 120 times by character comparison, taking 4.573 seconds on average. The same searches were performed with the bit mark method. From the 120 values of T, $R_1$ and $R_2$, regression analysis gives the following equation:

$$T = 0.265 + 0.00001936\,R_1 + 0.00003623\,R_2$$

Adjusted coefficient of determination = 0.999

Significance test of the regression model: F = 55405.88

Greater than $F_{0.01}(2, 117) = 4.791$

Significance test of the constant: $t_0$ = 36.400

Significance test of $R_1$: $t_1$ = 121.733

Significance test of $R_2$: $t_2$ = 77.409

All are greater than the critical value t(117) = 2.6185

The following is the letter distribution table for the database:

| Letter | Records containing the letter | Proportion of records | Letter | Records containing the letter | Proportion of records |
|---|---|---|---|---|---|
| t | 227167 | 0.935136 | h | 114263 | 0.470365 |
| r | 222338 | 0.915257 | u | 109376 | 0.450248 |
| e | 221713 | 0.912685 | g | 97339 | 0.400697 |
| a | 213249 | 0.877842 | f | 83452 | 0.343531 |
| c | 210116 | 0.864945 | w | 71228 | 0.293211 |
| n | 204225 | 0.840695 | b | 67065 | 0.276074 |
| o | 201082 | 0.827757 | y | 60698 | 0.249864 |
| i | 195734 | 0.805742 | v | 59351 | 0.244319 |
| s | 195639 | 0.805351 | k | 47379 | 0.195036 |
| l | 169753 | 0.698791 | x | 15150 | 0.062365 |
| d | 159202 | 0.655357 | q | 7258 | 0.029878 |
| m | 125449 | 0.516413 | j | 6351 | 0.026144 |
| p | 122168 | 0.502906 | z | 6307 | 0.025963 |

It can be seen from the above table that the three letters t, r and e have the highest frequency in the database. With tree as the search keyword, the bit comparison actually retrieves $R_1$ = 188,491 records, of which $R_2$ = 7 contain tree; the search takes 3.655 seconds. The character-by-character comparison search run at the same time took 4.465 seconds. Estimating instead from the frequencies of the three letters t, r and e in the above table gives:

$R_1$ = 242,924 × 0.935136 × 0.915257 × 0.912685 = 242,924 × 0.781158 = 189,762

The retrieval time is then calculated from the regression equation:

T = 0.265 + 0.00001936 × 189,762 + 0.00003623 × 7 = 3.939
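The arithmetic of this estimate can be checked with a short Python snippet. It only restates the figures given in the text (record count, letter frequencies, regression coefficients, $R_2$ = 7) and treats the three letter frequencies as independent, as the calculation above does; the variable names are illustrative.

```python
from math import prod

RECORDS = 242_924
FREQ = {"t": 0.935136, "r": 0.915257, "e": 0.912685}  # from the letter table

# Estimated preliminary result set, treating the letters as independent.
r1 = RECORDS * prod(FREQ[ch] for ch in set("tree"))

# Regression equation for the English database, with R2 = 7.
t = 0.265 + 0.00001936 * r1 + 0.00003623 * 7

print(round(r1))    # estimated R1
print(round(t, 3))  # estimated time in seconds
```

The estimate is close to the measured values of 188,491 records and 3.655 seconds, which suggests that the independence assumption is reasonable for these three letters.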

It can be seen that even for a word such as tree, which is composed entirely of high-frequency letters, bit mark retrieval still has a performance advantage. In fact, most English words have more than 4 letters, and if a word contains one low-frequency letter the screening effect of the bit mark is very good, which is equivalent to the "short board effect". Thirty test records are listed for reference:

| Search keyword | Character search time (s) | Bit mark search time (s) | Bit mark / character (%) | $R_1$ | $R_2$ | Number of "1" bits in keyword |
|---|---|---|---|---|---|---|
| Inks and coatings | 4.417812 | 0.3595625 | 8.14 | 5078 | 1 | 10 |
| Custom Manufactured Packaging Equipment | 4.535031 | 0.2309375 | 5.09 | 8 | 1 | 17 |
| Reduce damage to equipment | 4.518875 | 0.2613437 | 5.78 | 921 | 1 | 14 |
| Utilize Daylight | 4.426813 | 0.2507812 | 5.67 | 280 | 128 | 11 |
| Production Area | 4.595 | 0.8316875 | 18.1 | 24143 | 83 | 11 |
| recovery equipment | 4.498 | 0.2399375 | 5.33 | 85 | 1 | 13 |
| RELIEVED FROM | 4.477719 | 0.2898125 | 6.47 | 2748 | 1 | 9 |
| Aircraft Parts | 4.494594 | 0.6908125 | 15.37 | 19556 | 1 | 8 |

Overlap probability of bit marks

Alphabetic text has a small number of letters, so letters repeat frequently. For example, bigger and biggest each have two g's, but only the 7th "bit" is marked as 1. In a Chinese string the probability that the same Chinese character occurs more than once is low, but several Chinese characters in one string may belong to the same group. In the grouping example above, 大 ("big") and 河 ("river") are both in group 20, and 长 ("long") and 落 ("falling") are in the same group, so their marks overlap on one "bit". Even if the grouping balances character frequencies, string marks will still overlap. With n bits used for marking and a string of m character elements, the probability that no overlap occurs at all is:

$$P = \frac{A_n^m}{n^m} = \frac{n(n-1)\cdots(n-m+1)}{n^m}$$

Obviously, when n is fixed, each time m increases by 1 the new factor in the numerator is 1 smaller than the previous one, while the new factor in the denominator is unchanged, so the probability of no overlap at all becomes lower and lower. With 32-bit marking and a string length m of 7, the probability of no overlap at all is:

$$P_{(7)} = \frac{A_{32}^{7}}{32^{7}} = \frac{32 \times 31 \times 30 \times 29 \times 28 \times 27 \times 26}{32 \times 32 \times 32 \times 32 \times 32 \times 32 \times 32} = \frac{16{,}963{,}914{,}240}{34{,}359{,}738{,}368} \approx 0.4937$$

The following table gives, for 32-bit marking, the probability that no overlap occurs at all when the string length m is 1 to 24:

[Table: probability of no overlap for n = 32 and m = 1 to 24]
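The no-overlap probability for any string length can be computed directly from the formula $A_n^m / n^m$. The sketch below assumes, as the formula does, that each character element falls into any of the n groups with equal probability; the function name is hypothetical.

```python
from math import perm


def p_no_overlap(n: int, m: int) -> float:
    """Probability that m character elements mark m distinct bits out of n,
    assuming every group is equally likely."""
    return perm(n, m) / n ** m


# Values for 32-bit marking and string lengths 1 to 24.
for m in range(1, 25):
    print(m, f"{p_no_overlap(32, m):.6f}")
```

`math.perm` requires Python 3.8 or later; on older versions the falling product can be computed with a simple loop.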

Overlap means loss of information. A certain degree of overlap is unavoidable and acceptable, but too high a proportion of overlap affects the performance of bit mark retrieval, so the probability of overlap also deserves attention. With n bits used for marking and a string of m characters whose marks fall on k bits, the probability is calculated by:

$$P = \frac{C_n^k \cdot A}{n^m}, \qquad \text{where } A = \sum \frac{m!}{m_1!\,m_2!\cdots m_k!}$$

The summation here is taken over all positive integer solutions of $m_1 + m_2 + \cdots + m_k = m$, and the number of solutions is $C_{m-1}^{k-1}$. With 32-bit marking, a string length of 7 characters and marks falling on 3 bits, i.e. $m_1 + m_2 + m_3 = 7$, there are 15 solutions, and the probability is calculated as follows:

$$P = \frac{C_{32}^{3} \cdot A}{32^{7}} = \frac{8{,}957{,}760}{34{,}359{,}738{,}368} \approx 0.000260705$$

With 32-bit marking and a string length of 7 characters, the proportions of k (1 to 7) after marking are as follows:

| k | Number of arrangements | Proportion | Cumulative proportion | k × arrangements |
|---|---|---|---|---|
| 7 | 16963914240 | 0.493714884 | 0.493714884 | 1.18747E+11 |
| 6 | 13701623040 | 0.398769714 | 0.892484598 | 82209738240 |
| 5 | 3383116800 | 0.098461658 | 0.990946256 | 16915584000 |
| 4 | 302064000 | 0.008791219 | 0.999737475 | 1208256000 |
| 3 | 8957760 | 0.000260705 | 0.99999818 | 26873280 |
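The arrangement counts in this table can be cross-checked numerically. The sketch below counts the strings of m character elements that mark exactly k distinct bits; it uses inclusion-exclusion for the number of surjections in place of the multinomial summation described in the text (the two give the same count), and the function name is hypothetical.

```python
from math import comb


def arrangements(n: int, m: int, k: int) -> int:
    """Number of length-m sequences over n equally likely groups that use
    exactly k distinct groups: C(n, k) times the surjections from m onto k."""
    surjections = sum((-1) ** i * comb(k, i) * (k - i) ** m for i in range(k + 1))
    return comb(n, k) * surjections


n, m = 32, 7
total = n ** m
for k in range(m, 0, -1):
    count = arrangements(n, m, k)
    print(k, count, f"{count / total:.9f}")

# Weighted average number of distinct bits set after marking.
mean_bits = sum(k * arrangements(n, m, k) for k in range(1, m + 1)) / total
print(mean_bits)
```

The loop also prints the rows for k = 2 and k = 1, which are too small to affect the cumulative proportion noticeably.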

The weighted average number of "1"s after marking is 6.376881392.

Logic algebra principles of bit marking and the 0 mark

A bit of the bit value W of a string S being "1" means that the proposition "the string S has a certain character element, or one of a certain group of character elements" is true. For two strings $S_a$ and $S_b$:

Certain k bits of $W_a$ being 1 means that the proposition "the string $S_a$ has certain k character elements, or one of each of certain k groups of character elements" is true. If the corresponding k bits of $W_b$ are all 1, the proposition "the string $S_b$ has those k character elements, or one of each of those k groups of character elements" is also true. Naturally, $W_b$ may have some other m − k bits that are 1 where the corresponding m − k bits of $W_a$ are 0. Expressed as a formula of logic algebra, this relationship is $W_a \to W_b = W_T$, which is intuitive and easy to understand. But not every programming language or database system has a "bit implication" operator, so in applications the formula needs to be transformed according to the principles of logic algebra.

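From the theorems of logic algebra, if $a \to b = 1$, then $\lnot a \mid b = 1$. Similarly, for bit values of n bits: if $W_a \to W_b = W_T$, then

$\lnot W_a \mid W_b = W_T$.

It also follows that:

$W_a \mid W_b = W_b$.

From the theorem $a \to b \Leftrightarrow \lnot b \to \lnot a$: if $W_a \to W_b = W_T$, then $\lnot W_b \to \lnot W_a = W_T$.

Therefore the bit value can also be marked starting from a datum $W_T$ whose every "bit" is 1: if the string contains a character element, the corresponding bit is marked as 0. This is called the 0 mark. Writing the 0-mark bit values as $\overline{W}_a = \lnot W_a$ and $\overline{W}_b = \lnot W_b$:

Obviously $\overline{W}_b \to \overline{W}_a = W_T$.

It also follows that: $\overline{W}_a \mathbin{\&} \overline{W}_b = \overline{W}_b$.

This is more intuitive with a truth table:

| a | b | $a \to b$ | $\lnot a \lor b$ | $\lnot b \to \lnot a$ |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |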

The above explains how to use the most common bit operators for bit value comparison. The screening probability is calculated on the same principle; of course, under the 0 mark, m and k refer to the numbers of 0s. The set of bit operators provided differs between programming languages and databases, and the specific use of other bit operators can be derived from the principles of logic algebra. When applying equivalent transformations of logic algebra, the aim should of course be a simpler formula, not an increasingly complicated one.
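The equivalence of these tests can be checked exhaustively on small values. The sketch below uses an 8-bit width purely to keep the loop small; the function names are hypothetical, and the 0-mark values are taken to be the bitwise complements of the 1-mark values.

```python
WIDTH = 8
ALL_ONES = (1 << WIDTH) - 1  # plays the role of W_T


def by_implication(wa: int, wb: int) -> bool:
    """(NOT wa) OR wb equals all ones."""
    return ((~wa | wb) & ALL_ONES) == ALL_ONES


def by_or(wa: int, wb: int) -> bool:
    """wa OR wb leaves wb unchanged."""
    return (wa | wb) == wb


def by_and(wa: int, wb: int) -> bool:
    """wa AND wb leaves wa unchanged."""
    return (wa & wb) == wa


def by_zero_mark(za: int, zb: int) -> bool:
    """Same test on 0-mark values: zb -> za equals all ones."""
    return ((~zb | za) & ALL_ONES) == ALL_ONES


for wa in range(ALL_ONES + 1):
    for wb in range(ALL_ONES + 1):
        za, zb = ~wa & ALL_ONES, ~wb & ALL_ONES
        results = {by_implication(wa, wb), by_or(wa, wb),
                   by_and(wa, wb), by_zero_mark(za, zb)}
        assert len(results) == 1  # all four tests agree
print("all tests agree")
```

Which form to use in practice depends on the operators the language or database offers; the AND and OR forms avoid the complement and the explicit all-ones constant.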

Another point is that if the programming language or database system provides a "bit" set or set to 0, the bit "tag" can be used directly. If the database system programming language does not provide a location 1, the "or" (or) operation of the bit can be used to mark 1.

First, the character elements are grouped. If 32 bits are used, the value assigned to the nth group is $2^{32-n}$. From the binary point of view, the nth bit counted from left to right is 1 and the remaining bits are 0; this is the "basic bit value". The "basic bit values" of all the character elements of a string are ORed together to get the bit value of the string. Of course, it is also possible to assign $2^{n-1}$ to the nth group as its "basic bit value"; from the binary point of view, the nth bit from right to left is then 1 and the remaining bits are 0. In fact, within a specific database it only matters that the record strings and the search keywords are marked with the same fixed correspondence between character elements and bits, whether from right to left, from left to right, or in some other order.

For the 0 mark, the nth group is first given a "basic bit value" whose nth bit is 0 and whose remaining bits are 1; the "basic bit values" of all the character elements of a specific string are then combined with the "and" operation to give the bit value of the string.

As for the calculation of the mark overlap probability, the 0 mark and the 1 mark behave the same way.

Multi-bit marking and group marking

Multi-bit mark
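As a minimal sketch of the two marking schemes, the Python below builds a 1-mark bit value with "or" and a 0-mark bit value with "and". The grouping rule `group_of` (character code modulo 32, counted from the right) is a hypothetical choice made only for illustration; any fixed correspondence between character elements and bits would do.

```python
N_BITS = 32
ALL_ONES = (1 << N_BITS) - 1


def group_of(ch: str) -> int:
    """Hypothetical grouping: character code modulo N_BITS (0-based)."""
    return ord(ch) % N_BITS


def one_mark(s: str) -> int:
    w = 0                                    # start with every bit 0
    for ch in s:
        w |= 1 << group_of(ch)               # OR in the basic bit value
    return w


def zero_mark(s: str) -> int:
    w = ALL_ONES                             # start with every bit 1
    for ch in s:
        w &= ALL_ONES ^ (1 << group_of(ch))  # AND with the 0-mark basic bit value
    return w


# The two bit values of the same string are bitwise complements of each other.
print(one_mark("abc") == (~zero_mark("abc") & ALL_ONES))
```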

Building on the single-"bit" mark, the following describes the multi-"bit" mark. If the number of bits n of the stored bit value is large and the strings are short, a combination of j bits can be used to mark one group of character elements. There are $\frac{n!}{j!(n-j)!}$ combinations of bits, and correspondingly the character elements are divided into $\frac{n!}{j!(n-j)!}$ groups. If one character element P of S belongs to the rth group, the j bits constituting the rth combination in W are marked as 1. When j is 1, this is the single-"bit" mark described above.

Let n be 32 and j be 2; the character elements are then divided into $\frac{32!}{2!(32-2)!}$ groups, that is, 496 groups. Using the 1 mark, if a character element belongs to the first group, bits 1 and 2 of W are marked as 1; if it belongs to the second group, bits 1 and 3 of W are marked as 1; if it belongs to the third group, bits 1 and 4 of W are marked as 1; and so on.

Assume that the bit value of the Chinese character "before" has bits 2 and 5 set to 1, and the bit value of the character "jing" has bits 23 and 29 set to 1; the word "foreground" then has the four bits 2, 5, 23, 29 set to 1. If the bit value of "far" has bits 2 and 23 set to 1 and the bit value of "big" has bits 5 and 29 set to 1, then the word "far big" also has the four bits 2, 5, 23, 29 set to 1. Searching with the bit value of "foreground" will therefore return "far big", and vice versa; that is, the search results will include strings made up only of "unrelated group character elements". However, since each group contains fewer Chinese characters than with the single-"bit" mark, the screening effect can still be improved. The purpose of using multi-bit marks is to improve the screening effect, and the screening probability is calculated on the same principle as for the single-bit mark. The following calculates the screening probability for a string length of 4 characters and a search keyword length of 2 characters when j is 1, 2, 3 and 4; the lower the value, the better:
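The multi-bit mark can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the ordering of combinations produced by `itertools.combinations` matches the "bits 1 and 2, bits 1 and 3, bits 1 and 4" ordering described above (with positions counted from 0 in the code), while the grouping rule `group_of` and the helper `bits` are hypothetical names introduced only for the example. The last lines reproduce the "foreground" and "far big" collision using the bit positions given in the text.

```python
from itertools import combinations

N_BITS, J = 32, 2
# All J-bit combinations of bit positions in lexicographic order:
# (0, 1), (0, 2), (0, 3), ... which the text counts from 1 as "1, 2", "1, 3", "1, 4".
COMBOS = list(combinations(range(N_BITS), J))


def group_of(ch: str) -> int:
    """Hypothetical grouping: character code modulo the number of groups."""
    return ord(ch) % len(COMBOS)


def multi_bit_mark(s: str) -> int:
    """1 mark in which each character sets the J bits of its group's combination."""
    w = 0
    for ch in s:
        for bit in COMBOS[group_of(ch)]:
            w |= 1 << bit
    return w


def bits(*positions: int) -> int:
    """Bit value with the given 1-based bit positions set, as numbered in the text."""
    w = 0
    for p in positions:
        w |= 1 << (p - 1)
    return w


foreground = bits(2, 5) | bits(23, 29)   # "before" + "jing"
far_big = bits(2, 23) | bits(5, 29)      # "far" + "big"
print(len(COMBOS), foreground == far_big)
```

The collision shows why a multi-bit result set still needs a secondary check: two different pairs of groups can light up the same four bits.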

j is 2: regardless of overlap, the number m of 1 bits after the string is marked is 4*2=8, and the number k of 1 bits after the search keyword is marked is 2*2=4.

j is 3: the character elements are divided into $\frac{32!}{3!(32-3)!}$ groups, that is, 4960 groups; regardless of overlap, m is 4*3=12 after the string is marked, and k is 2*3=6 after the search keyword is marked.

j is 4: the character elements are divided into $\frac{32!}{4!(32-4)!}$ groups, that is, 35960 groups; regardless of overlap, m is 4*4=16 after the string is marked, and k is 2*4=8 after the search keyword is marked.

$$P_1 = \frac{C_m^k}{C_n^k} = \frac{\dfrac{m!}{k!(m-k)!}}{\dfrac{n!}{k!(n-k)!}} = \frac{m!\,(n-k)!}{n!\,(m-k)!} = \frac{4!\,(32-2)!}{32!\,(4-2)!} = \frac{4 \times 3}{32 \times 31} = 0.0120968$$

$$P_2 = \frac{\dfrac{8!}{4!(8-4)!}}{\dfrac{32!}{4!(32-4)!}} = 0.00194661$$

$$P_3 = \frac{\dfrac{12!}{6!(12-6)!}}{\dfrac{32!}{6!(32-6)!}} = 0.00101965$$

$$P_4 = \frac{\dfrac{16!}{8!(16-8)!}}{\dfrac{32!}{8!(32-8)!}} = \frac{16 \times 15 \times 14 \times 13 \times 12 \times 11 \times 10 \times 9}{32 \times 31 \times 30 \times 29 \times 28 \times 27 \times 26 \times 25} = 0.00122358$$

Multi-bit marks are especially useful when the search keyword is 1 or 2 characters long. However, when the number of stored bits n is fixed, a larger j is not always better. As the above calculation shows, when n is 32 and the string length is 4, j = 3 gives the best effect; increasing j further does not improve the screening effect, the proportion of mark overlap increases, and the marking operation becomes more complicated. If long strings are to be marked with multiple bits, n must be large enough.

Group mark:

For example, the 32 bits of an unsigned long integer are divided into two groups of 17 and 15 bits, which are marked separately. The string length m is 4, so, regardless of mark overlap, each group has 4 bits set to 1 after marking; the keyword length is 2, so each group has 2 bits set to 1 after marking. The screening probability is:

$$P = \frac{4!\,(17-2)!}{17!\,(4-2)!} \times \frac{4!\,(15-2)!}{15!\,(4-2)!} = \frac{4 \times 3}{17 \times 16} \times \frac{4 \times 3}{15 \times 14} = \frac{12}{272} \times \frac{12}{210} = 0.002521$$

Obviously, the screening effect can also be improved in this way. Of course, to implement this optimization, the number of mark bits n must be several times larger than m.

This optimization method can also be used for long strings: two different character grouping methods are used, and two sets of data are marked separately to obtain two sets of bit values, $W_n$ and $W'_n$. When searching, first obtain the keyword's $W_t$ by the marking method of the first group, compare it with the $W_n$ values in the database, and filter out $R_1$. Then, within $R_1$, obtain the keyword's $W'_t$ by the marking method of the second group and compare it with the $W'_n$ values in the database to obtain $R_2$. Finally, the usual character-by-character comparison within $R_2$ gives the final result set $R_z$. Like the previous optimization method, this one increases the storage space for bit values. However, it does not have to read every $W'_n$ and compare it with $W'_t$.
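The screening probabilities quoted for multi-bit marks and group marks can be reproduced with a few lines of Python. This sketch assumes the ratio of combinations C(m, k) / C(n, k) that the figures 0.00194661 and 0.00122358 in the text are consistent with; the function name `screening_probability` is introduced only for the example, and mark overlap is ignored as in the text.

```python
from math import comb


def screening_probability(n: int, m: int, k: int) -> float:
    """C(m, k) / C(n, k): chance that k randomly placed keyword bits
    all fall on the m bits already set in an unrelated n-bit record."""
    return comb(m, k) / comb(n, k)


STRING_LEN, KEY_LEN, N = 4, 2, 32

# Multi-bit marks: each character sets j bits, so m = 4*j and k = 2*j.
multi_bit = {
    j: screening_probability(N, STRING_LEN * j, KEY_LEN * j) for j in (1, 2, 3, 4)
}

# Group marks: 32 bits split into independent groups of 17 and 15 bits.
grouped = screening_probability(17, STRING_LEN, KEY_LEN) * screening_probability(
    15, STRING_LEN, KEY_LEN
)

for j, p in multi_bit.items():
    print(f"j={j}: {p:.8f}")
print(f"17+15 group mark: {grouped:.6f}")
```

Running the loop over j makes it easy to see where the probability bottoms out for a given n and string length before committing to a marking scheme.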

Both the 0 mark and the 1 mark can be used with multi-bit marks and group marks; group marks can also be combined with multi-bit marks, and different groups can use the 0 mark and the 1 mark respectively. Of course, the choice should suit the application. 0 marks, multi-bit marks, and group marks can all be used for reverse retrieval. When there are many classes of character elements, it is not practical to let one bit correspond to one character element; if multi-bit marks are used, strings containing the "target group character elements" and strings containing only "unrelated group character elements" will be mixed together in the search results. Using a prime number to replace each character element avoids this kind of mixing in the search results. In an application, bit mark string search can be used for initial screening, prime number substitution search for secondary screening, and the usual character-by-character string comparison to get the final result.

The application of bit mark retrieval technology in speech input, machine translation, etc.

Chinese has many homophones and near-homophones, so in Chinese speech input and pinyin input the appropriate words must be selected according to their pinyin. The "reverse search" of the bit-marking method can be used for this purpose.

Ignoring tone, Chinese has more than 400 syllables, so it is not convenient to let one bit correspond to one syllable. Instead, the 64 bits of 8 bytes can be made to correspond to the initials and finals of Chinese Pinyin. Suppose there is a database of Chinese word collocations and basic sentence patterns. Among them is the basic sentence pattern "He graduated", whose pinyin is "tabiyele". After marking, its bit value is:

1000,0101,0000,0000,0000,0001,1011,0000,0000,0000,0000,0000,0000,0000,0000,0000

If the voice conversion or pinyin input is "tazaojiubiyele", its bit value $W_t$ after marking is:

1000,0101,0001,0010,0000,0001,1011,0000,1000,0000,1000,0000,0000,0000,0000,0000

Then "He graduated" is a basic sentence pattern that "tazaojiubiyele" can reference. After processing, "he zaojiu graduated" is obtained, in which "graduated" is the predicate, a verb, and the lexicon entries matching "zaojiu" include "early", "made" and "jujube". If grammar, semantics, word frequency and other aspects can play a supporting role, the input can be further converted into "he has already graduated". Because initials and finals combine in many ways, the initial results of the bit value comparison will also include pinyin such as "tebayile". On the basis of the bit mark string search, a prime number corresponding to each syllable can be used for secondary screening to obtain a closer result.
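The reverse retrieval in this example can be sketched in Python. To keep the sketch short it makes a simplifying assumption that differs from the text: each Latin letter a to z is treated as its own character element (one bit per letter), instead of mapping pinyin initials and finals onto 64 bits. The function names and the sample pattern list are hypothetical, and the input is assumed to be lowercase a to z only.

```python
def letter_mark(pinyin: str) -> int:
    """1 mark with one bit per letter a-z (a simplifying assumption)."""
    w = 0
    for ch in pinyin:
        w |= 1 << (ord(ch) - ord("a"))
    return w


def reverse_candidates(patterns: list[str], typed: str) -> list[str]:
    """Reverse retrieval: keep stored patterns whose marked bits all occur
    in the bit value of the (longer) typed input."""
    w_t = letter_mark(typed)
    kept = []
    for p in patterns:
        w_n = letter_mark(p)
        if (w_n & w_t) == w_n:      # W_n -> W_t holds on every bit
            kept.append(p)
    return kept


patterns = ["tabiyele", "tebayile", "woxuexi"]
print(reverse_candidates(patterns, "tazaojiubiyele"))
```

Both "tabiyele" and "tebayile" survive the bit comparison, which mirrors the remark above that a secondary screening step is still needed to separate them.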

Suppose the word collocations and basic sentence patterns include the collocation of "building" (lou) with its quantifier "dong", and "high building", in which the adjective "high" often modifies it; these can be marked by their initials and finals. Given the voice conversion or pinyin input "zhedongxiezilouhengao", mark it by initials and finals in the same way, compare it with the bit values $W_n$ of all the words and basic sentence patterns in the database, and filter out "dong building", "high building" and so on as references. After processing, "zhe dong xiezi building hen high" is obtained, where "xiezi" or "xiezilou" yields "writing" or "office building" from the thesaurus. If grammar, semantics, word frequency and other aspects can play a supporting role, "zhedongxiezilouhengao" is converted into "this office building is very high".

Similarly, other Chinese collocations and their corresponding bit values can be prepared: administrative divisions such as "province city" and "province county"; numerals such as "seven eight" and "ten-thousand thousand"; quantifiers such as "jin liang" and "yuan jiao"; even combinations of surnames such as "Zhang Li" and "Zhang Liu". If you enter "hubeishengxianningshi", the above method yields "hubei province xianning city". If you enter "Zhanglonglihu", you can get "Zhang long Li hu". If you enter "qiwansanqian", you can get "seven ten-thousand san thousand". Similarly, on the basis of the bit mark string search, a prime number corresponding to each syllable can be used for secondary screening to get a closer result.

In addition, the initials and finals of Chinese are not used with equal frequency. With the single-bit mark, where the 64 "bits" of 8-byte data correspond to the initials and finals of Hanyu Pinyin, searching with certain syllables such as "li" may screen poorly, because both l and i are a high-frequency initial and final. A double-bit mark can be used instead, that is, n is 32 and j is 2, giving 496 groups to correspond to the 400-odd syllables of Chinese, with the assignment tuned by measuring the database so that it is as balanced as possible. Of course, with 400 syllables marked with double bits, "non-target" sentence patterns and collocations will certainly appear in the preliminary results. On the basis of the bit mark search and screening, prime number substitution with one prime per syllable can be used for secondary screening to get closer results.

This technique can also be used in Chinese word segmentation. The words, reference collocations and sentence patterns in the lexicon are marked by pinyin; the keyword (actually a sentence) is also marked by pinyin to obtain its bit value, and the bit implication operation yields $R_1$, which effectively narrows the search range. Alternatively, the words, reference collocations and sentence patterns in the lexicon can be marked by groups of internal character codes, and the keyword marked by the same grouping and compared with the bit implication operation. This also effectively narrows the search range, after which other methods complete the segmentation of the keyword (that is, the sentence).

English, because of liaison and speech rhythm, is not as clearly divided into syllables as Chinese, but the technique can still be used to select reference vocabulary and phrases. Suppose the database contains the phrase "in the morning", marked according to its vowels and consonants ɪ, n, ð, ə, m, ɔː, n, ɪ, ŋ. After the speech input is converted, the sounds ð and ə are indistinct and can be dropped. The remaining, relatively loud sounds ɪ, n, m, ɔː, n, ɪ, ŋ are marked to obtain a bit value, which is compared with the bit values of the reference vocabulary and phrases in the database to obtain the phrase "in the morning", together with other phrases or words whose pronunciation includes ɪ, n, m, ɔː, n, ɪ, ŋ. On this basis, substituting prime numbers for syllables further narrows the scope, and "in the morning" is then selected according to the context, with the help of grammar, semantics, and word frequency.

If the basic sentence patterns of two languages are associated with each other, bit marks and prime number substitution can be used for machine translation. Suppose there is a library of basic Chinese sentence patterns with bit marks and prime number substitution, containing the pattern "I study", which corresponds to the basic English sentence pattern "I learn". Given the Chinese sentence meaning "I am learning French", after marking and prime substitution a reverse search quickly finds "I study" as a basic sentence pattern it can reference, and the corresponding English basic sentence pattern is "I learn". "Learning" is the predicate verb, and the word following it is the object, whose corresponding English word is "French". Then, referring to the grammar of the two languages, the sentence can be translated into "I am learning French".

The above describes the method of bit mark string retrieval together with the related probability formulas and principles of logical algebra, focusing on the screening of string records in a database. From a programming perspective, databases use many different data structures, and those data structures are not used only for databases. Bit mark string retrieval is a string algorithm that can be applied to string lookups in various data structures, such as fuzzy string lookups in large arrays. For fuzzy string search in small arrays it is not worthwhile, because marking the strings also takes time and space.

Detailed description

The invention has been implemented successfully for fuzzy retrieval in both Chinese and English string databases. The following is VB6.0 code using the "1" mark against a Chinese character string database in SQL SERVER 2000; bit mark string fuzzy search in other programming languages and databases can be implemented by reference to it.

1. Establish a database

Let the database shuku have a table biao, which has a field shuming, the data type is nvarchar, and the length is 40. Another field wei is created, the data type is "long integer", that is, 4 bytes, there are 32 bits, one of which is a sign bit, and the remaining 31 bits can be used for marking.

2. VB6.0 has no direct command to set a "bit" to "1" or "0", so the "or" operation of bits is used to "mark" the database strings

Dim shuzu(30) As Long
'Define a long integer array of 31 elements.
Dim x As Integer
Dim index As Integer
shuzu(0) = 1
For x = 1 To 30
    shuzu(x) = 2 * shuzu(x - 1)
Next
'Assign the 31 elements of the long integer array the values 1, 2, 4, 8, 16 ... 1073741824.
'From the binary point of view, one bit is 1 and the remaining bits are 0: the "basic bit value".
Dim biaostr As String
'Stores the string currently being processed
Dim weizhi As Long
'Stores the bit value of the string
Dim weizhilin As Long
'Stores the basic bit value of one character
biaors.MoveFirst
'Move to the first record of the database record set biaors
Do
    weizhilin = 0
    weizhi = 0
    biaostr = biaors.Fields("shuming")
    'Read the string of one record into the string variable biaostr
    For x = 1 To Len(biaostr)
        index = Abs(AscW(Mid(biaostr, x, 1)) Mod 31)
        'Take one character from biaostr, take its internal code modulo 31,
        'then the absolute value, and assign it to index; this groups the character elements.
        weizhilin = shuzu(index)
        'weizhilin is now one of 1, 2, 4, 8, 16 ... 1073741824.
        weizhi = weizhi Or weizhilin
        '"Or" the "basic bit value" weizhilin of the character into weizhi.
    Next
    'End of loop: weizhi is the "bit value" of the current string
    biaors.Fields("wei") = weizhi
    biaors.Update
    'Store the "bit value" weizhi in the field wei of the current record
    biaors.MoveNext
    'Process the next record
Loop While Not biaors.EOF

3. Use the "and" operation of bits to perform database string fuzzy retrieval

Dim shuzu(30) As Long
Dim x As Integer
Dim index As Integer
shuzu(0) = 1
For x = 1 To 30
    shuzu(x) = 2 * shuzu(x - 1)
Next
Dim weizhi As Long
'Stores the bit value of the string
Dim weizhilin As Long
'Stores the basic bit value of one character
Dim textstr As String
'Stores the search string
Dim strQuery As String
weizhilin = 0
weizhi = 0
textstr = Text1.Text
'Text1.Text is the search keyword in the text box
For x = 1 To Len(textstr)
    index = Abs(AscW(Mid(textstr, x, 1)) Mod 31)
    weizhilin = shuzu(index)
    weizhi = weizhi Or weizhilin
Next
'Mark the search keyword to get its "bit value" weizhi, using the same method as for the database strings.
'Single quotes in the keyword are doubled so that they cannot break the SQL string.
strQuery = "select * from (SELECT * FROM biao WHERE (wei & " & weizhi & ") = " & weizhi & ") DERIVEDTBL WHERE (shuming like '%" & Replace(textstr, "'", "''") & "%')"
'SQL SERVER 2000 has no bit implication operator, so the "bit value" weizhi of the keyword
'is combined with the "bit value" of each database record by the "and" operation, keeping the
'records for which the result equals weizhi (alternatively, (wei | weizhi) = wei can be used).
'This gives the preliminary result set R_1; the usual string fuzzy search (like) is then applied
'to R_1 to get the final result R_z. This is a SQL SERVER 2000 query; other databases may differ slightly.
Adodc1.RecordSource = strQuery
Adodc1.Refresh
'Execute search
DataList1.ListField = "shuming"
DataList1.ReFill
'Show the current search results in the list box.
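For readers without a VB6.0 or SQL SERVER 2000 environment, the same two-stage flow can be sketched in Python over an in-memory list. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, and the grouping uses `ord(ch) % 31`, which agrees with the `Abs(AscW(...) Mod 31)` rule only for character codes below 32768, since `AscW` returns a signed 16-bit value.

```python
def bit_value(s: str) -> int:
    """OR together the basic bit values of all characters (31 groups, 1 mark)."""
    w = 0
    for ch in s:
        w |= 1 << (ord(ch) % 31)
    return w


def build_index(titles: list[str]) -> list[tuple[str, int]]:
    """Step 2 of the text: store each string together with its bit value."""
    return [(t, bit_value(t)) for t in titles]


def fuzzy_search(index: list[tuple[str, int]], keyword: str) -> list[str]:
    """Step 3 of the text: bit screening first, ordinary substring test second."""
    w_t = bit_value(keyword)
    r1 = [t for t, w_n in index if (w_n & w_t) == w_t]   # preliminary set R_1
    return [t for t in r1 if keyword in t]               # final set R_z


index = build_index(["database systems", "bit operations", "string search"])
print(fuzzy_search(index, "bit"))
print(fuzzy_search(index, "tib"))
```

The keyword "tib" passes the bit screening for "bit operations" because character order is not recorded in the bit value, and it is only rejected by the substring test; this is the reason the second stage is kept.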

Claims

Rights request

1. A string fuzzy retrieval technique, characterized in that:

$W$ denotes n bits of one data item, each bit being 0. $W$ is used to mark the character element information constituting a string. Taking j bits (j = 1, 2, 3, ...) as a combination, there are $\frac{n!}{j!(n-j)!}$ combinations; correspondingly, all character elements are divided into $\frac{n!}{j!(n-j)!}$ groups, with one combination of bits corresponding to one group of character elements, and marked. If one character element $P_1$ of several character strings S belongs to the r-th group, the j bits constituting the r-th combination in $W$ are marked as 1; similarly, $W$ is marked according to the groups to which the other character elements $P_2$, $P_3$, $P_4$ ... of S belong. The $W$ obtained after all character elements have been marked records the character element information of S and is called the "bit value" of the several character strings S. This method is called the 1 mark.

$\overline{W}$ denotes n bits of one data item, each bit being 1. $\overline{W}$ is used to mark the character element information constituting a string. Taking j bits (j = 1, 2, 3, ...) as a combination, there are $\frac{n!}{j!(n-j)!}$ combinations; correspondingly, all character elements are divided into $\frac{n!}{j!(n-j)!}$ groups, with one combination of bits corresponding to one group of character elements, and marked. If one character element $P_1$ of several character strings S belongs to the r-th group, the j bits constituting the r-th combination in $\overline{W}$ are marked as 0; similarly, $\overline{W}$ is marked according to the groups to which the other character elements $P_2$, $P_3$, $P_4$ ... of S belong. The $\overline{W}$ obtained after all character elements have been marked records the character element information of S and is called the "bit value" of the several character strings S. This method is called the 0 mark.

Denote the 1-mark "bit value" of $S_a$ as $W_a$ and its 0-mark "bit value" as $\overline{W_a}$; denote the 1-mark "bit value" of $S_b$ as $W_b$ and its 0-mark "bit value" as $\overline{W_b}$. The following methods can be used for string comparison:

Alternatively, I. $S_a$ and $S_b$ are both marked with 1, and $W_a$ is compared with $W_b$: if for every bit of $W_a$ that is 1 the corresponding bit of $W_b$ is also 1, then $S_b$ contains (containing includes being equal; the same below) or may contain all the character elements of $S_a$.

Alternatively, II. $S_a$ and $S_b$ are both marked with 0, and $\overline{W_a}$ is compared with $\overline{W_b}$: if for every bit of $\overline{W_b}$ that is 1 the corresponding bit of $\overline{W_a}$ is also 1, then $S_b$ contains or may contain all the character elements of $S_a$.

Alternatively, III. $S_a$ is marked with 1 and $S_b$ is marked with 0, and $W_a$ is compared with $\overline{W_b}$: if no pair of corresponding bits of $W_a$ and $\overline{W_b}$ is simultaneously 1, then $S_b$ contains or may contain all the character elements of $S_a$.

Alternatively, IV. $S_a$ is marked with 0 and $S_b$ is marked with 1, and $\overline{W_a}$ is compared with $W_b$: if no pair of corresponding bits of $\overline{W_a}$ and $W_b$ is simultaneously 0, then $S_b$ contains or may contain all the character elements of $S_a$.

2. The method according to claim 1, wherein there are a plurality of specific operation methods for comparing $W_a$, $\overline{W_a}$ with $W_b$, $\overline{W_b}$ using bit operators. For example, letting $W_T$ be the n-bit data whose bits are all 1, method I can be implemented with $W_a \to W_b = W_T$ or with $W_a \lor W_b = W_b$. The implementation of method I with other bit operators, and the specific operation methods implementing methods II, III and IV, can be derived according to the principles of logical algebra.

3. The method according to claim 1, wherein: when $W_a$, $\overline{W_a}$ are compared with $W_b$, $\overline{W_b}$ and the criterion is not met, $S_b$ does not contain all character elements of $S_a$, that is, it does not contain $S_a$. If the criterion is met, the single-bit mark is used, and there is only one character element in each group, then $S_b$ contains "all character elements of $S_a$", but since the arrangement of character elements is not considered, it is not certain whether $S_b$ contains "$S_a$". If the criterion is met and either the single-bit mark is used with multiple character elements in each group, or the multi-bit mark is used, then $S_b$ only "may" contain "all character elements of $S_a$", and it is likewise not certain whether $S_b$ contains $S_a$. However, as needed, the character-by-character comparison method can be used to determine whether $S_b$ contains "$S_a$".

4. The method according to claim 1, wherein: the character element can be a character in the usual sense, or can be a Chinese character radical, a stroke, a Chinese phonetic letter, an initial, a final, a syllable; syllables, words; Chinese pinyin letters, initials, finals, syllables; phonetic symbols of other languages; or a combination of them.

5. The method according to claim 1, wherein: "several character strings" is a qualifier of S, emphasizing that S refers to one or more character strings. That is to say, in bit mark string search, the $W$ or $\overline{W}$ value to be compared can be the mark information of the character elements of one string, or the mark information of all character elements of multiple strings.

6. The method according to claim 1, wherein: the search keyword $S_t$ has a bit value $W_t$ or $\overline{W_t}$, and the character strings $S_1$, $S_2$, $S_3$, $S_4$ ... have respective bit values $W_1$, $W_2$, $W_3$, $W_4$ ... or $\overline{W_1}$, $\overline{W_2}$, $\overline{W_3}$, $\overline{W_4}$ .... By comparing $W_t$, $\overline{W_t}$ with $W_n$, $\overline{W_n}$, it is possible to filter out the $S_n$ that "contain or may contain all character elements of the search keyword $S_t$" and, if necessary, to judge by the character-by-character comparison method whether $S_n$ contains "$S_t$"; this is retrieval in the usual sense. Conversely, it is also possible to filter out the $S_n$ "all of whose character elements are or may be contained in the search keyword $S_t$" and, if necessary, to judge by the character-by-character comparison method whether $S_t$ contains "$S_n$"; this is called inverse retrieval.