WO2006074586A1 - Retrieval technology of character string marked with bit - Google Patents

Retrieval technology of character string marked with bit

Publication number
WO2006074586A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
bit
string
mark
bits
Prior art date
Application number
PCT/CN2005/001642
Other languages
French (fr)
Chinese (zh)
Inventor
Wenxin Xu
Original Assignee
Wenxin Xu
Priority date
Filing date
Publication date
Application filed by Wenxin Xu
Publication of WO2006074586A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures

Definitions

  • bit mark retrieval can significantly improve the speed of fuzzy retrieval of strings.
  • Another feature of bit tag retrieval is that multiple keyword queries are as convenient as a single keyword query.
  • the bit mark can be used for search in the usual sense, that is, determining whether a database string contains the keywords, and also for "reverse search", that is, determining whether the keywords contain a database string. Reverse search can be used in speech input, machine translation, pinyin input, and Chinese word segmentation to match basic sentence patterns or words. If the number of bits n available for marking is more than twice the average string length m, a combination of several bits can be used to mark one group of character elements, which improves filtering efficiency.
  • the improved pattern matching algorithm, i.e. the KMP algorithm, avoids backtracking for alphabetic scripts with small character sets, but is of little practical benefit for Chinese strings, where the character set is large and the frequency of any single character is low.
  • the BF algorithm and the KMP algorithm both compare the characters of the main string and the substring one by one.
  • in bit mark string retrieval, a "bit" or a "combination of bits" may correspond to characters in the usual sense, such as a, b, A, B, ∑, #, ∞, 中, 国; where needed, a bit may also correspond to a Chinese character radical, or even to a stroke;
  • for alphabetic scripts, such as the English string "day and night", a bit may correspond to a word, such as day, and, night, or to a letter combination, such as ay, ai
  • in Chinese pinyin input or speech input, a bit may represent an initial, a final, or a syllable of Hanyu Pinyin; for other languages it may represent a phonetic symbol, such as θ.
  • the character unit to which a bit or a combination of bits corresponds is called a "character element", denoted P.
  • bit mark string retrieval is easy to implement for strings in alphabetic scripts.
  • a 32-bit datum in which every bit is 0 is denoted W_F
  • bits 1 to 26 correspond to the 26 English letters; the other 6 bits may be ignored, or may correspond to punctuation marks.
  • the word big contains b, g, i, so the corresponding 3 bits, 2, 7, and 9, are marked "1", and the datum becomes 0100,0010,1000,0000,0000,0000,0000,0000, called the bit value of big
  • this bit value is denoted W_a, and this way of marking is the "1" mark.
  • bigger contains b, e, g, i, r, so the 5 bits 2, 5, 7, 9, and 18 of W_F are marked "1", and the datum becomes 0100,1010,1000,0000,0100,0000,0000,0000, called the bit value of bigger, denoted W_b.
  • each character element corresponds to a bit, which is not always possible.
  • for example, the GBK range has 21000 Chinese characters. With one bit per Chinese character, the data storing a bit value would occupy a great deal of space, much of it meaningless blanks, and the time spent reading data and comparing bits would increase to no purpose.
  • a "bit" can be used to correspond to multiple Chinese characters as needed.
  • the codes in the table are obtained with the CODE function in Excel. 大 and 河 both fall in group 20, and 长 and 落 also share a group, so only 8 of the corresponding bits are 1.
  • the bit value W_h of "大漠孤烟直长河落日圓" is:
  • a bit implication operation is performed on the bit values of two strings. If the result is equal to W_T, the string corresponding to the latter item only "may contain" the string corresponding to the preceding item.
  • the "bit value" of the "Sunset of the World” W k is: 0000, 1000, 0000, 1000, 0000, 0100, 0000, 0000, and the "bit implication" operation with w h : But "the desert is so long and the river is straight” does not include "the sunset”.
  • bit mark string search is different from the general string bitwise comparison method. Even if there is a bit implication relationship between the bit values of the two strings, the two strings do not necessarily have an inclusion relationship. If a "bit” corresponds to "one character” and the bit values of the two strings have a bit implication relationship, the string corresponding to the latter item "contains" all the character elements of the string corresponding to the previous item. However, since the arrangement of character elements is not considered, it is not certain whether the two strings have an inclusion relationship.
  • the string corresponding to the latter item "may" contain all the character elements of the string corresponding to the preceding item, and "may" not contain all of them, so it is not certain whether the two strings have an inclusion relationship.
  • the character-by-character comparison method can be used to determine whether there is an inclusion relationship between the two strings.
  • in bit mark string search, the bit values serving as the preceding and latter items of the bit operation may be the mark information of the character elements of one string, or the mark information of all the character elements of several strings; they are collectively called the bit value of "several strings" S. "Several strings" covers both singular and plural, that is, S refers to one or more strings.
  • the screening probability of bit comparison can be calculated by the following formula. The smaller the value, the better:
  • the length of the keyword used in the user search determines the size of k
  • the length of the record string determines the size of m
  • the length of the record strings and the length of the keywords used in searches cannot be controlled.
  • the controllable factor is the number of bits n used for marking. The number of "1"s after marking a string of a given length is affected by n, so it is important to keep n sufficiently large relative to m.
  • on a 32-bit CPU, marking with a long integer is most favorable for bit value comparison. With an unsigned integer, W has 32 bits; with a signed integer, only 31 bits are convenient for marking. For a database with many records whose strings average more than 16 characters, SQL Server 2000 allows marking with 63 bits of the bigint data type, with the character elements divided into 63 groups accordingly. Of course, on a 32-bit processor, storing bit values as bigint makes reading and comparing them take longer. In fact, any data type of any database that supports bit operations can be used to mark strings, and an unsigned type is naturally preferable. On a 64-bit CPU, 64 bits should of course be used to mark strings, to take full advantage of the CPU and improve the dispersion of the "bit values".
  • the records can be divided into two tables: the short-string table is marked with 32 bits and the long-string table with 64 bits, and queries use the UNION command to combine the results from the two tables.
  • the grouping of character elements is best with frequency equalization, and the speed of marking bit values is also taken into account.
  • taking the internal code of a Chinese character modulo a number and grouping by the remainder is easy to implement, but it is not the optimal grouping, and the division behind the modulo operation is slow. Taking a particular 5 bits of the character code to divide the Chinese characters into 32 groups could be considered; it would be faster, and the effect might be better.
  • the bit mark method can be used for "reverse retrieval" against database strings, i.e. with the W_n value of a database string S_n as the preceding item and the W_t value of the search keyword S_t as the latter item
  • the probability that the search key S t contains or may contain all of its character elements can be calculated by:
  • the constant T_0, that is, the time taken to read the bit values of all the records and compare them with the bit value of the keyword, is 0.268 seconds.
  • the 120 searches using the character-by-character comparison method took an average of 2.1739 seconds.
  • it can be seen that the time T_0 spent comparing bit values is only about one-eighth of the time usually taken by the character-by-character comparison method.
  • the average number m of "1"s is about 11 to 12, and 12 is used in the calculation here. If the character elements of the database strings are normally distributed and the marked "1"s are evenly distributed, the result R can be calculated from the probability.
  • the final result set R 2 has nothing to do with the filtering effect, here is specified as 240.
  • the search time after the keyword is marked with k from 1 to 10 can be calculated as follows according to the regression equation:
  • the database was searched by character search for 120 times, and the average time was 4.573 seconds. At the same time, the bit mark search was performed. According to the 120 times of T and RK 2, the following equation was obtained by regression analysis:
  • number of records (words) containing each letter, and the proportion of all records:
      Letter   Records containing it   Proportion
      t        227167                  0.935136
      r        222338                  0.915257
      e        221713                  0.912685
      a        213249                  0.877842
      c        210116                  0.864945
      n        204225                  0.840695
      h        114263                  0.470365
      u        109376                  0.450248
      g        97339                   0.400697
      f        83452                   0.343531
      w        71228                   0.293211
      b        67065                   0.276074
  • bit mark retrieval still has performance advantages. In fact, most words in English are more than 4 letters. If there is a low-frequency letter, the screening effect of the bit mark is very good, which is equivalent to the "short board effect".
  • alphabetic scripts have few letters, so letters repeat often. For example, bigger and biggest each have two g's, but only the 7th "bit" is marked 1.
  • the same Chinese character rarely repeats within a string, but several Chinese characters in a string may belong to the same group. In the grouping above, 大 and 河 are both in group 20, and 长 and 落 share a group, so their marks overlap on one "bit". Even if the grouping equalizes character frequency, string marks still overlap. With n bits and m character elements, the probability that no overlap occurs at all is:
  • $P = \frac{n(n-1)\cdots(n-m+1)}{n^m} = \frac{n!}{(n-m)!\,n^m}$. Clearly, with n fixed, each increase of m by 1 adds a numerator factor that is 1 smaller than the previous one while the added denominator factor stays n, so the probability of no overlap at all keeps falling.
  • Overlap means loss of information. A certain degree of overlap is unavoidable and acceptable, but the proportion of overlap is too high, which will affect the performance of bit mark retrieval, so the probability of overlap is also worthy of attention. Marked by n bits, the length of the string is m characters, and the overlap is k. The formula for calculating the probability is:
  • the weighted average number of Ts after marking is 6.376881392.
  • a bit of the bit value W of the string S is "1", which means that the proposition "the string S has a character element or one of a group of character elements" is true.
  • if a certain k bits of W_a are 1, the proposition "the string S_a has k character elements, or one element from each of k groups of character elements" is true; if the corresponding k bits of W_b are all 1, the proposition "the string S_b has those k character elements, or one element from each of those k groups" is also true.
  • W_b may have some other m−k bits equal to 1 where the corresponding m−k bits of W_a are 0.
  • the bit value can also be marked starting from a datum W_T in which every "bit" is 1. If the string contains a character element, the corresponding bit is marked 0; this is called the 0 mark.
  • if the programming language of the database system provides an operation that sets a bit to 1, it can be used directly for marking; if not, the bitwise "or" (OR) operation can be used to mark 1.
  • the character elements are grouped. If 32 bits are used, the nth group is assigned the value $2^{32-n}$; in binary, the nth bit from left to right is 1 and the remaining bits are 0. This is the "basic bit value". ORing the "basic bit values" of all the character elements of a string gives the bit value of the string. Of course, the nth group may instead be assigned the "basic bit value" $2^{n-1}$; in binary, the nth bit from right to left is 1 and the remaining bits are 0. In fact, within a given database, the record strings and the search keywords are both marked with a fixed correspondence between character elements and bits, whether from right to left, from left to right, or in some other order.
  • for the 0 mark, the nth group is first assigned a "basic bit value" whose nth bit is 0 and whose remaining bits are 1; the "basic bit values" of all the character elements of a given string are then combined with the "and" (AND) operation to give the bit value of the string.
  • the 0 mark and the 1 mark are equivalent in principle.
  • if the number of bits n of the stored bit value is large and the strings are short, a combination of j bits can be used to mark one group of character elements; there are $C_n^j$ such combinations, so the character elements are correspondingly divided into $C_n^j$ groups
  • the purpose of multi-bit marking is to improve the screening effect; the screening probability is calculated on the same principle as for single-bit marking.
  • the following calculates, for a string length of 4 characters and a search keyword length of 2 characters, the screening probability when j is 1, 2, 3, and 4; the lower the value, the better:
  • when j is 3, the character elements are divided into $C_{32}^{3}$ groups, that is, 4960 groups; ignoring overlap, the string
  • Multi-bit tags are especially useful if the search keyword length is 1, 2 characters.
  • when n is fixed, a larger j is not always better. With n = 32 and m = 4, j = 3 gives the best effect; increasing j further cannot improve the screening effect, while the proportion of overlapping marks rises and the marking operation becomes more complicated. To mark long strings with multiple bits, n must be large enough.
  • the 32 bits of an unsigned long integer are divided into two groups of 17 and 15 and marked separately.
  • the length m of the string is 4, regardless of the problem of overlapping marks, 4 after the mark, and the length of the keyword is 2, After 2 1s, the screening probability is:
  • the screening effect can also be improved.
  • the same bit n of the mark is several times larger than 11.
  • this optimization can be used for long strings: two different character grouping methods are used, and the data are marked separately under each to obtain two sets of values, W and W'.
  • when searching, first obtain the W_t of the keyword by the marking method of the first group, compare it with the W_n values of the database, and filter out R_1.
  • then obtain the W'_t of the keyword by the marking method of the second group and compare it with the corresponding W'_n values to obtain R_2; the final result set is then obtained from R_2 by the usual character-by-character comparison.
  • this optimization does not have to read all the W'_n values and compare them with W'_t.
  • the 0 mark and the 1 mark can perform multi-bit mark and group mark, and the group mark can also be combined with the multi-position mark, and different groups can also be 0 mark and 1 mark respectively.
  • 0 markers, multi-bit markers, and group markers can be used for reverse retrieval.
  • application of bit mark retrieval in speech input, machine translation, and the like: Chinese has many homophones, so Chinese speech input and pinyin input need to filter suitable words by pinyin; "reverse search" with the bit mark method can be used for this.
  • Chinese has more than 400 syllables. It is not convenient to use one bit to correspond to one syllable. 64 bits of 8 bytes can be used to correspond to the initials and finals of Chinese Pinyin.
  • the pinyin such as "tebayile" will appear.
  • a prime number corresponding to a syllable can be used for secondary screening to obtain a closer result.
  • the 64 "bits" of the 8-byte data correspond to the initials and finals of the Hanyu Pinyin.
  • the screening effect may be poor.
  • two-bit marking, that is, n is 32 and j is 2, giving 496 groups, can correspond to the 400-odd syllables of Chinese; the grouping should be measured against the database to make it as balanced as possible.
  • the "non-purpose" sentence pattern and collocation will definitely appear in the preliminary results.
  • a prime number substitution method, with one prime number corresponding to one syllable, can be used for secondary screening to obtain closer results.
  • This technique can also be used in Chinese word segmentation.
  • the words, reference collocations and sentence patterns in the lexicon are marked by pinyin.
  • the keywords (actually sentences) are also digitized by the pinyin value, and Ri is obtained.
  • the other relatively loud i, n, m, 0:, n, i, g are marked to obtain the bit value, and the database reference vocabulary,
  • the phrase values of the phrase are compared to obtain the phrase in the morning, and other pronunciations include i, n, m, o:, n, i, . Phrase or word.
  • the method of substituting prime numbers for syllables further narrows the scope, and then according to the context, with the help of grammar, semantics, and word frequency, select in the morning.
  • bit marks and prime numbers can be used for machine translation.
  • bit-marker string retrieval technology is a string algorithm that can be used for string lookups of various data structures, such as string fuzzy lookups in large arrays.
  • for fuzzy string search in small arrays it is not necessary, because marking the strings also takes time and space.
  • the invention has been well implemented in the Chinese character string database and the English string database fuzzy retrieval.
  • the following is VB6.0 code for the "1" mark on a Chinese character string database in SQL Server 2000; bit mark fuzzy string search in other programming languages and databases can be implemented by reference to it.
  • the database shuku has a table biao with a field shuming of data type nvarchar and length 40. Another field, wei, is created with the data type "long integer", that is, 4 bytes or 32 bits, of which one is the sign bit and the remaining 31 can be used for marking.
  • Index = Abs(AscW(Mid(biaostr, x, 1)) Mod 31) 'Take one character from the string variable biaostr, take its internal code modulo 31, take the absolute value, and assign it to Index; this groups the character elements.
  • weizhi = weizhi Or weizhilin 'OR the "basic bit value" weizhilin of one character into weizhi.
  • Adodc1.RecordSource = strQuery
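
The VB6 fragments above group each character by its internal code modulo 31 and OR a per-group "basic bit value" into the running bit value. A minimal Python sketch of the same idea follows. It assumes that VB6's AscW returns a signed 16-bit value and that Mod keeps the sign of the dividend, which is why the original takes Abs. The function names `group_index`, `bit_value`, and `may_contain` are hypothetical, and only characters in the Basic Multilingual Plane are considered.

```python
def group_index(ch: str) -> int:
    """Group number 0..30 for one character (sketch of Abs(AscW(c) Mod 31))."""
    code = ord(ch)
    if code >= 0x8000:
        # Assumption: emulate AscW's signed 16-bit result for code points >= 0x8000.
        code -= 0x10000
    # |a Mod 31| with a sign-of-dividend remainder equals |a| % 31.
    return abs(code) % 31


def bit_value(s: str) -> int:
    """1-mark bit value: OR together the basic bit value 2**group of every character."""
    weizhi = 0
    for ch in s:
        weizhi |= 1 << group_index(ch)  # basic bit value of this character's group
    return weizhi


def may_contain(w_key: int, w_record: int) -> bool:
    """True when every bit set in the keyword's value is also set in the record's value."""
    return (w_key & ~w_record) == 0
```

With 31 groups the largest basic bit value is 2**30, so the result fits in the 31 usable bits of a signed 4-byte integer such as the wei field. A record that passes `may_contain` is only a candidate and still needs the usual character-by-character check.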

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a character string retrieval technology in which one bit corresponds to several character cells and n bits correspond to all character cells; that is, all character cells are divided into n groups, and n bits of a datum, each equal to 0, denoted W, are used to mark the character cell information making up a character string. If one character cell P1 of a character string S belongs to the n-th group, the n-th bit of W is marked with 1; W is marked similarly for the other character cells P2, P3, P4… of S. After all character cells have been marked, W carries the information of S and is referred to as the 'bit value' of S; this way is referred to as the 1 mark. According to the rules of logic algebra, n bits of a datum, each equal to 1, also denoted W, may instead be used to mark the character cell information making up a character string. If one character cell P of S belongs to the n-th group, the n-th bit of W is marked with 0; this way is referred to as the 0 mark. By comparing the 'bit value' Wa of Sa with the 'bit value' Wb of Sb, it is possible to determine that Sb does not contain, contains, or may contain all character cells of the retrieval keyword Sa.

Description

Bit mark string retrieval technology
Technical Field
The invention is a string retrieval technique whose purpose is to speed up fuzzy string retrieval. One bit corresponds to several character elements and n bits correspond to all character elements; that is, all character elements are divided into n groups. The n bits of a datum, all 0, denoted W_F, are used to mark the character-element information that makes up a string. If a character element P_1 of a string S (S may stand for one or more strings) belongs to the nth group, the nth bit of W is marked 1; W is marked likewise according to the groups to which the other character elements P_2, P_3, P_4, ... of S belong. Once all character elements have been marked, W records the information of S and is called the "bit value" of S. This approach is called the 1 mark. By the principles of logic algebra, the n bits of a datum, all 1, denoted W_T, can also be used to mark the character-element information that makes up a string: if a character element P of S belongs to the nth group, the nth bit of the datum is marked 0. This approach is called the 0 mark.
By comparing the bit values of S_a with the bit values of S_b, it can be determined that S_b "does not contain", "contains", or "may contain" all the character elements of S_a. For example, a bit implication operation is applied to W_a and W_b; if the implication holds for every bit, S_b contains or may contain all the character elements of S_a. If necessary, the usual character-by-character comparison is then used to determine whether S_b contains S_a. Tests show that bit mark retrieval can significantly speed up fuzzy string retrieval. Besides speed, another feature of bit mark retrieval is that a query with multiple keywords is as convenient as a query with a single keyword. The bit mark can be used for retrieval in the usual sense, that is, determining whether a database string contains the keyword, and also for "reverse retrieval", that is, determining whether the keyword contains a database string. Reverse retrieval can be used in speech input, machine translation, pinyin input, and Chinese word segmentation to match basic sentence patterns or words.
If the number of bits n available for marking is more than twice the average string length m, a combination of several bits can correspond to one group of character elements for marking, which improves screening efficiency.
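
As a rough illustration of the 1 mark and the 0 mark described above, the following Python sketch marks strings both ways and applies the screening test to each. The grouping function `group_of` (code point modulo 32) and all function names are assumptions made for this example; they are not taken from the application.

```python
N_BITS = 32
ALL_ONES = (1 << N_BITS) - 1  # the all-ones datum (W_T in the text)


def group_of(ch: str) -> int:
    """Hypothetical grouping: code point modulo the number of bits."""
    return ord(ch) % N_BITS


def one_mark(s: str) -> int:
    """1 mark: start from all zeros and set the bit of each character's group."""
    w = 0
    for ch in s:
        w |= 1 << group_of(ch)
    return w


def zero_mark(s: str) -> int:
    """0 mark: start from all ones and clear the bit of each character's group."""
    w = ALL_ONES
    for ch in s:
        w &= ALL_ONES ^ (1 << group_of(ch))
    return w


def may_contain_1mark(w_key: int, w_rec: int) -> bool:
    """Every 1 bit of the keyword must also be 1 in the record."""
    return (w_key & ~w_rec & ALL_ONES) == 0


def may_contain_0mark(w_key: int, w_rec: int) -> bool:
    """Every 0 bit of the keyword must also be 0 in the record."""
    return ((ALL_ONES ^ w_key) & w_rec) == 0
```

In this sketch the 0-mark value is the bitwise complement of the 1-mark value, so both tests give the same answer for any pair of strings.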
As a string algorithm, bit mark string retrieval can be used for string searches in a variety of data structures.
Background Art
Conventional fuzzy string search proceeds by character-by-character comparison. For example, to determine whether the string S = "bdopfqew" contains the character f, the computer compares the first character of S, b, with f, then the second character, d, with f, and so on, until the fifth character of S is found to equal f and the match succeeds. This is the simplest case. If the substring T is two or more characters long, the simple pattern matching algorithm, i.e. the BF algorithm, compares S_1 with T_1; if they differ, it compares S_2 with T_1, and so on, until some character S_i of S equals T_1. It then compares the characters that follow, S_{i+1} and T_2; if these are also equal, it continues down the strings. When some character S_{i+n} of S differs from T_{1+n}, it goes back and starts a new round by comparing S_{i+1} with T_1. This process repeats until all the characters of T have been compared, in which case the match succeeds; otherwise the match fails. As the search keyword T grows longer, character matching becomes correspondingly more complex. The improved pattern matching algorithm, i.e. the KMP algorithm, avoids backtracking for alphabetic scripts with small character sets, but is of little practical benefit for Chinese strings, where the character set is large and the frequency of any single character is low. In short, both the BF algorithm and the KMP algorithm compare the characters of the main string and the substring one by one.
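
The BF procedure described above can be sketched in Python as follows. This is a minimal illustration of the simple matching loop, not code from the application; the function name `bf_find` and the 0-based return value are choices made for the example.

```python
def bf_find(s: str, t: str) -> int:
    """Return the 0-based index of the first occurrence of t in s, or -1."""
    n, m = len(s), len(t)
    for i in range(n - m + 1):       # candidate start position in the main string
        j = 0
        while j < m and s[i + j] == t[j]:
            j += 1                   # characters agree, keep comparing
        if j == m:                   # every character of t matched
            return i
        # mismatch: fall back and retry from the next start position
    return -1
```

For the example in the text, `bf_find("bdopfqew", "f")` finds the match at the fifth character, which is index 4 when counting from 0.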
On October 19, 2004, I filed a patent application for "prime number substitution string retrieval technology", application number 200410067258.X. That method effectively speeds up fuzzy string retrieval, but applying it to long strings with good results requires considerable space to store the products of the prime numbers. To speed up fuzzy string retrieval while reducing storage requirements, the present invention proposes using n bits of a datum to mark the composition information of a string. The datum after marking is called the "bit value" of the string. Fuzzy string retrieval is carried out by comparing the bit values of two strings, combined with the usual character-by-character comparison. Tests show speeds several times, and even more than ten times, that of ordinary character-by-character fuzzy retrieval.
Summary of the Invention
Basic method
In implementing bit mark string retrieval, a "bit" or a "combination of bits" may correspond to characters in the usual sense, such as a, b, A, B, ∑, #, ∞, 中, 国. Where needed, a bit may also correspond to a Chinese character radical, or even to a stroke. For alphabetic scripts, for example the English string "day and night", a bit may correspond to a word, such as day, and, night, or to a letter combination, such as ay, ai. In Chinese pinyin input or speech input, a bit may represent an initial, a final, or a syllable of Hanyu Pinyin; for other languages it may represent a phonetic symbol, such as θ. For convenience, the character unit to which a bit or a combination of bits corresponds is called a "character element", denoted P.
Bit mark string retrieval is easy to implement for strings in alphabetic scripts. Take a 32-bit datum in which every bit is 0, that is 0000,0000,0000,0000,0000,0000,0000,0000, denoted W_F. Let bits 1 to 26 of W_F, counted from left to right (right to left would also work), correspond to the 26 English letters; the other 6 bits may be ignored, or may correspond to punctuation marks. The word big contains b, g, i, so the corresponding 3 bits, 2, 7, and 9, are marked "1". The datum becomes 0100,0010,1000,0000,0000,0000,0000,0000, which is called the bit value of big, denoted W_a. This way of marking is the "1" mark.
Similarly, bigger contains b, e, g, i, r, so the 5 bits 2, 5, 7, 9, and 18 of W_F are marked "1". The datum becomes 0100,1010,1000,0000,0100,0000,0000,0000, which is called the bit value of bigger, denoted W_b.
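
The letter marking just described can be sketched in Python. The sketch assumes the left-to-right layout of the text (bit 1 is the leftmost of 32 bits and corresponds to the letter a) and ignores case and non-letters; the names `letter_bit_value` and `show` are hypothetical helpers.

```python
def letter_bit_value(word: str) -> int:
    """1-mark bit value of a word: bit k (counted from the left) is set for letter k."""
    w = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            pos = ord(ch) - ord("a") + 1   # 1..26
            w |= 1 << (32 - pos)           # bit 1 is the most significant bit
    return w


def show(w: int) -> str:
    """Format a 32-bit value as eight comma-separated groups of four bits."""
    bits = format(w, "032b")
    return ",".join(bits[i:i + 4] for i in range(0, 32, 4))
```

Calling `show(letter_bit_value("big"))` and `show(letter_bit_value("bigger"))` reproduces the two bit patterns given in the text. Repeated letters such as the two g's in bigger set the same bit only once.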
Figure imgf000004_0001
It can be seen that wherever big has a 1 bit, bigger must also have a 1; where bigger has a 1 bit, big may have a 1 or a 0, as at bits 5 and 18; and where bigger has a 0 bit, big must have a 0, as at bits 1, 3, 4, 6, 8, 10, and so on, 21 of the 26 letter bits in all. Applying the "bit implication" operation to W_a and W_b gives a result in which every bit is "1": 1111,1111,1111,1111,1111,1111,1111,1111, denoted W_T. As an unsigned long integer this is 4294967295. As a formula: $W_a \rightarrow W_b = W_T$; the string bigger contains big.
如果 biggest 的位值记为 Wc, 同样有: P a→ . = 7., 字符串 biggest 包含 bigo If the bit value of biggest is written as W c , there is also: P a → . = 7 ., the string biggest contains bigo
如果将 BIG的位值记为 WA, 将 big的位值 \^与之进行 "位蕴含" 运 算, 同样的有: iVa→WA = Wr , 字符串 BIG等于 big。 这里未考虑字母大小 写, 仅说明两个相同的字符串的位值进行 "位蕴含" 运算, 所有的位均为 1。 If the bit value of BIG is recorded as W A , the bit value of big is ^^ with the "bit implication" operation, the same is: iV a → W A = W r , the string BIG is equal to big. The case is not considered here, only the bit values of two identical strings are described as "bit-in", all bits are 1.
如果 digger的位值记为 Wd, 如果将 big的位值 \\^与之进行 "位蕴含" 运算, 结果为 1011,1111,1111 , 1 111,1111 ,1111 ,1111,1111 , 不等于 WT, 即:If the bit value of digger is written as W d , if the bit value \\^ of big is subjected to a "bit implication" operation, the result is 1011, 1111, 1111, 1 111, 1111, 1111, 1111, 1111, not equal to W. T , ie:
Wa→ Wd≠ WT, 字符串 digger不包含 big。 W a → W d ≠ W T , the string digger does not contain big.
也就是说, 通过对两个字符串的位值 W进行 "位蕴含" 运算, 如果结 果不等于 WT, 则后项所对应的字符串 "不包含" ( "不多于"及 "不等于", 下同)前项所对应的字符串。 That is to say, by performing a "bit implication" operation on the bit value W of two strings, if the result is not equal to W T , the string corresponding to the latter item "does not contain"("not more than" and "not equal"", the same below) The string corresponding to the previous item.
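The letter example above can be sketched in Python. This is a minimal illustration rather than the patent's implementation: the function names are hypothetical, and because Python has no bit-implication operator, `implies` computes it as (NOT a) OR b truncated to 32 bits.

```python
MASK = 0xFFFFFFFF  # W_T: all 32 bits set (4294967295)

def bit_value(word):
    """1-mark a word: bit n, counted from the left, stands for the n-th letter."""
    w = 0
    for ch in word.lower():          # case is ignored, as in the text
        if 'a' <= ch <= 'z':
            n = ord(ch) - ord('a') + 1
            w |= 1 << (32 - n)
    return w

def implies(wa, wb):
    """Bit implication wa -> wb on 32 bits, computed as (NOT wa) OR wb."""
    return (~wa | wb) & MASK

def nibbles(w):
    """Format a 32-bit value as eight comma-separated groups of four bits."""
    bits = format(w, '032b')
    return ','.join(bits[i:i + 4] for i in range(0, 32, 4))

print(nibbles(bit_value("big")))
print(implies(bit_value("big"), bit_value("bigger")) == MASK)
print(nibbles(implies(bit_value("big"), bit_value("digger"))))
```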
The scheme above ignores letter case. Distinguishing case would require 52 bits, which is also feasible. But assigning one bit to every character element is not always practical. The GBK range, for example, has 21,000 Chinese characters; with one bit per character, the stored bit values would occupy a great deal of space, much of it meaningless blanks, and the time spent reading the data and comparing bits would grow to no purpose. In practice, one "bit" can be made to correspond to several Chinese characters as needed. A simple scheme is to divide the 21,000 Chinese characters and other symbols of the GBK range into 32 groups by character code and to store the bit value of a string in one long integer. Take the string "大漠孤烟直长河落日圆":
Character  大     漠     孤     烟     直     长     河     落     日     圆
Code       46323  50350  47554  53708  54961  45988  47827  49892  51413  54450
Group      20     15     3      13     18     5      20     5      22     19
Character  霜     冷     长     河     落     日     天     涯
Code       52138  49380  45988  47827  49892  51413  52460  53700
Group      11     5      5      20     5      22     13     5
The codes in the table were obtained with the Excel function CODE. "大" and "河" both fall in group 20, and "长" and "落" both fall in group 5, so only 8 bits are set to 1, and the bit value $W_h$ of "大漠孤烟直长河落日圆" is:
0010,1000,0000,1010,0111,0100,0000,0000
In the same way, the bit value $W_i$ of "长河落日" is 0000,1000,0000,0000,0001,0100,0000,0000. Applying the "bit implication" operation to $W_i$ and $W_h$ gives $W_i \to W_h = W_T$, and the two strings do stand in a containment relation.
In "霜冷长河", "霜" is in group 11 while "冷" and "长" are both in group 5, so its bit value $W_j$ is 0000,1000,0010,0000,0001,0000,0000,0000. Applying the "bit implication" operation to $W_j$ and $W_h$ gives $W_j \to W_h \neq W_T$, and the two strings do not stand in a containment relation. That is, when one "bit" corresponds to several Chinese characters and the "bit implication" of two bit values is not equal to $W_T$, the string corresponding to the consequent "does not contain" "all the character elements" of the string corresponding to the antecedent, and therefore necessarily does not contain that "string". For alphabetic text, one bit can likewise be made to correspond to several words, to decide whether one string "does not contain" or "may contain" the character elements of another.
However, when the "bit implication" of two bit values equals $W_T$, the string corresponding to the consequent only "may contain" the string corresponding to the antecedent.
If the bit value of gibber is denoted $W_e$ and the "bit implication" operation is applied to $W_a$ and $W_e$, then likewise
$W_a \to W_e = W_T$
but the string gibber does not contain big.
The bit value $W_k$ of "落日天涯" is 0000,1000,0000,1000,0000,0100,0000,0000, and its "bit implication" with $W_h$ gives $W_k \to W_h = W_T$, but "大漠孤烟直长河落日圆" does not contain "落日天涯".
In other words, bit-mark string retrieval differs from ordinary character-by-character string comparison. Even if the bit values of two strings satisfy bit implication, the strings themselves need not stand in a containment relation. When one "bit" corresponds to "one character element" and the bit values satisfy bit implication, the string corresponding to the consequent "contains" all the character elements of the string corresponding to the antecedent; but because the order of the character elements is not taken into account, containment between the two strings cannot be confirmed. When one "bit" corresponds to "a group of character elements" with several elements in the group, the string corresponding to the consequent "may" or "may not" contain all the character elements of the antecedent's string, and containment is even less certain. Where needed, character-by-character comparison can then be applied to decide whether the two strings actually stand in a containment relation.
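The grouped Chinese-character example can be sketched as follows. The grouping rule used here, the two-byte GBK code modulo 32 plus 1, is an assumption: it reproduces every group number in the table above, but the text only says the characters are grouped "by character code". Function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # W_T

def group_of(ch):
    """Group 1..32 of a character (assumed rule: GBK code mod 32, plus 1)."""
    code = int.from_bytes(ch.encode('gbk'), 'big')
    return code % 32 + 1

def bit_value(s):
    """1-mark a string: set bit n (counted from the left) for each group n present."""
    w = 0
    for ch in s:
        w |= 1 << (32 - group_of(ch))
    return w

def may_contain(keyword, record):
    """True when the bit implication keyword -> record is all ones."""
    wk, wr = bit_value(keyword), bit_value(record)
    return ((~wk | wr) & MASK) == MASK

record = "大漠孤烟直长河落日圆"
for keyword in ("长河落日", "霜冷长河", "落日天涯"):
    # Second column: bit filter verdict; third column: true substring test.
    print(keyword, may_contain(keyword, record), keyword in record)
```

The last keyword passes the bit filter although it is not a substring, which is the false-positive case the text describes.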
In bit-mark string retrieval, the bit values used as antecedent and consequent of the bit operation may each hold the mark information of the character elements of a single string, or of all the character elements of several strings. Both cases are referred to as the bit value of "a number of strings" $S$; that is, $S$ denotes one or more strings.
Suppose bit-mark string retrieval is run against a database whose string records $S_1, S_2, S_3, \dots$ have bit values $W_1, W_2, W_3, \dots$. Let the bit value of the search keyword $S_t$ be $W_t$. If the bit implication of $W_t$ and $W_n$ equals $W_T$, then string $S_n$ contains, or may contain, all the character elements of $S_t$. The set of records $S_n$ that satisfy this condition is denoted $R_1$, and the time taken to obtain it $T_1$. Usually a character-by-character string comparison must then be run over $R_1$ to obtain the final result set $R_2$; the time this takes is denoted $T_2$. The total time of bit-mark string retrieval is therefore $T_1 + T_2$. To keep $T_1$ low on a 32-bit processor, the bit values are best stored as 32-bit data. For any given fuzzy string search $R_2$ is fixed, so the way to reduce $T_2$ is to make the preliminary result set $R_1$ as small as possible. The fewer bits $n$ used to store the bit value, the more character elements correspond to each bit, the poorer the filtering and the larger $R_1$; the longer the average string length, the larger $R_1$. One further factor affects the size of $R_1$: the longer the search keyword, the smaller $R_1$.
Let $n$ be the number of bits used for marking, $m$ the number of "1"s in the bit value of a string, and $k$ the number of "1"s in the bit value of the search keyword. The screening probability of the bit comparison can then be calculated as follows; the smaller the value, the better:
$$P = \frac{C_m^k}{C_n^k} = \frac{\dfrac{m!}{k!\,(m-k)!}}{\dfrac{n!}{k!\,(n-k)!}} = \frac{m!\,(n-k)!}{n!\,(m-k)!}$$
For $n = 32$, the screening probabilities for selected values of $m$ and $k$ are calculated as follows:
[Figure imgf000008_0002]
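As a rough aid, the screening probability can be evaluated numerically. The sketch below assumes the formula is the ratio of combinations $C_m^k / C_n^k$ (the chance that all $k$ keyword bits fall among the $m$ bits set in a record); the function name is hypothetical.

```python
from math import comb

def screening_probability(n, m, k):
    """C(m, k) / C(n, k); math.comb returns 0 when k > m, so the result is 0 then."""
    return comb(m, k) / comb(n, k)

# A few values for n = 32, chosen only for illustration.
for m in (8, 12, 16):
    print(m, [round(screening_probability(32, m, k), 6) for k in (1, 2, 4)])
```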
Of the three factors that affect the screening probability, the length of the keyword the user searches with determines $k$, and the length of the recorded strings determines $m$; neither the record length nor the keyword length can be controlled. The controllable factor is the number of bits $n$ used for marking. Moreover, for strings of a given length, the numbers of "1"s after marking, $m$ and $k$, also depend on $n$, so it is important to keep $n$ sufficiently large relative to $m$.
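The two-stage procedure described above (a bit test producing $R_1$, then a character comparison producing $R_2$) might look like this in outline. The letter-based marking, the in-memory lists and all names are illustrative assumptions; a real system would store the bit values alongside the records in the database.

```python
MASK = 0xFFFFFFFF

def bit_value(s):
    """Illustrative 1-mark over 26 letters (case ignored, other characters skipped)."""
    w = 0
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            w |= 1 << (ord(ch) - ord('a'))
    return w

def search(keyword, records, bit_values):
    """Stage 1 builds R1 by bit test; stage 2 builds R2 by substring comparison."""
    wt = bit_value(keyword)
    r1 = [s for s, w in zip(records, bit_values) if ((~wt | w) & MASK) == MASK]
    r2 = [s for s in r1 if keyword.lower() in s.lower()]
    return r1, r2

records = ["bigger", "digger", "gibber", "biggest"]
bit_values = [bit_value(s) for s in records]   # computed once, kept with the records
r1, r2 = search("big", records, bit_values)
print(r1, r2)
```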
Some suggestions:
1. On a 32-bit CPU, marking with a long integer is the most favourable for bit-value comparison. With an unsigned integer $W$ has 32 bits; with a signed integer only 31 bits are convenient for marking. For a database with many records whose average string length exceeds 16 character elements, SQL Server 2000 offers the data type bigint, 63 bits of which can be used for marking, with the character elements divided into 63 groups accordingly. Of course, on a 32-bit processor, reading and comparing bit values stored as bigint will take more time. In fact, any data type in any database that supports bit operations conveniently can be used to mark strings, and an unsigned data type is naturally preferable. On a 64-bit CPU, 64 bits should of course be used for marking, to make full use of the CPU and increase the dispersion of the "bit values".
2. On a 32-bit CPU, for a database whose strings differ greatly in length, the records can be split into two tables, with short strings marked with 32 bits and long strings with 64 bits, and the results of querying the two tables combined with the union command.
3. Character elements are best grouped so that group frequencies are balanced, while also keeping the computation of the mark fast. Grouping by the remainder of the Chinese character internal code under a modulus is easy to implement, but it is not the optimal grouping, and modulo and division operations are slow. One alternative is to take some 5 bits of the internal code to divide the Chinese characters into 32 groups, which is faster and may filter better.
Beyond retrieval in the usual sense, the bit-mark method can be used for "reverse retrieval" against the database strings: the value $W_n$ of a database string $S_n$ is taken as the antecedent and the value $W_t$ of the search keyword $S_t$ as the consequent of the bit implication, and the test of whether every bit of the result is "1", that is, $W_n \to W_t = W_T$, filters out those $S_n$ in the database whose constituent character elements may all be contained in the search keyword $S_t$.
Naturally, the screening probability for reverse retrieval is calculated the other way round. Suppose both the database and the keyword are 1-marked with $n$ bits, the bit value $W_n$ of $S_n$ has $m$ "1"s after marking, and the marked search keyword has $k$ "1"s. The probability that the search keyword $S_t$ contains, or may contain, all the character elements of $S_n$ can then be calculated as:
$$P = \frac{C_k^m}{C_n^m} = \frac{\dfrac{k!}{m!\,(k-m)!}}{\dfrac{n!}{m!\,(n-m)!}} = \frac{k!\,(n-m)!}{n!\,(k-m)!}$$
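Reverse retrieval swaps the roles of antecedent and consequent. The sketch below uses an illustrative 26-letter marking and a made-up word list; the function names are hypothetical, and the substring check in the second stage is only one possible way to confirm candidates.

```python
from math import comb

MASK = 0xFFFFFFFF

def bit_value(s):
    """Illustrative 1-mark over 26 letters (case ignored)."""
    w = 0
    for ch in s.lower():
        if 'a' <= ch <= 'z':
            w |= 1 << (ord(ch) - ord('a'))
    return w

def reverse_search(keyword, records):
    """Keep records whose marks are all present in the keyword (W_n -> W_t all ones)."""
    wt = bit_value(keyword)
    candidates = [s for s in records if ((~bit_value(s) | wt) & MASK) == MASK]
    confirmed = [s for s in candidates if s in keyword]   # character comparison
    return candidates, confirmed

def reverse_probability(n, m, k):
    """C(k, m) / C(n, m): chance that all m record bits fall among the k keyword bits."""
    return comb(k, m) / comb(n, m)

print(reverse_search("biggest", ["big", "bet", "dig", "site"]))
```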
Test analysis
The method above was tested on a database consisting mainly of Chinese characters, with some English text and digits: more than 267,000 records, more than 3,473,000 characters, an average string length of 12.989, bit values stored as long integers, programming language VB, operating system Windows XP, an 800 MHz Celeron CPU, 256 MB of memory, a Gigabyte 810 motherboard and a 40 GB hard disk. The overall retrieval time is denoted $T$. Every bit-mark search must read the bit values of all records and compare the keyword's bit value with each of them; the time this takes can be regarded as a constant, denoted $T_0$, but it is not easy to measure directly. The set of records that pass the bit-value comparison is denoted $R_1$. The time $T_1$ spent on character-by-character comparison over $R_1$ is also hard to measure directly, but it depends on the size of $R_1$, which can be obtained. Another factor affecting retrieval time is the size of the final result set $R_2$. From the $T$, $R_1$ and $R_2$ recorded in 120 tests, regression analysis gives the following equation:
$T = 0.268 + 0.000008625\,R_1 + 0.00002702\,R_2$
Adjusted coefficient of determination = 0.989
Significance test of the regression model: $F = 5265.814$, greater than the critical value [Figure imgf000010_0001]
Significance test of the constant: $t_0 = 86.610$
Significance test of $R_1$: $t_1 = 74.263$
Significance test of $R_2$: $t_2 = 25.489$
All are greater than the critical value $t(117) = 2.6185$
The regression equation is thus highly reliable. Its constant, 0.268 seconds, is $T_0$, the time needed to read the bit values of all records and compare the keyword's bit value with them. The character-by-character comparisons run for the same 120 searches took 2.1739 seconds on average, so the bit-value comparison time $T_0$ is only about one eighth of the time of the usual character-by-character method. The overall time of bit-mark retrieval, however, also depends on $R_1$ and $R_2$. In this test $n = 31$ and the average string length is 12.989; marks overlap, so the average number of "1"s, $m$, is about 11 to 12, and 12 is used here. If the distribution of character elements in the database strings is typical and the "1"s are evenly distributed after marking, $R_1$ can be calculated from the probability. The final result set $R_2$ does not depend on the filtering and is set to 240 here. Under these ideal conditions, the retrieval times for keywords with $k$ from 1 to 10 after marking can be calculated from the regression equation as follows:
[Figure imgf000010_0002]
[Figure imgf000011_0001]
Compared with the 2.1739-second average of the usual string retrieval method, bit-mark string retrieval has a large performance advantage. But the hardware, the data type used for marking (the size of $n$ in particular), the distribution of character elements in the database and the way they are grouped for marking all affect the retrieval time, so this regression equation is for reference only.
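The idealised calculation described above can be approximated as follows. This sketch assumes $R_1$ is the record count multiplied by the screening probability $C_m^k / C_n^k$ and rounds the record count to 267,000 (the text says "more than 267,000"), so its figures need not match the lost table exactly.

```python
from math import comb

N = 267_000              # approximate record count (assumption: rounded down)
n, m, R2 = 31, 12, 240   # values stated in the text

def estimated_time(k):
    """Regression estimate of T for a keyword with k marked bits."""
    r1 = N * comb(m, k) / comb(n, k)   # expected size of the preliminary set R1
    return 0.268 + 0.000008625 * r1 + 0.00002702 * R2

for k in range(1, 11):
    print(k, round(estimated_time(k), 3))
```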
English has only 26 letters; if the strings are long enough that every record contains every letter, bit-mark retrieval loses its filtering effect. Nevertheless, a test on an English database with 242,000 records and 7,493,000 characters shows that bit-mark string retrieval is generally workable for alphabetic text as well. The string field is of type varchar with length 56, and the average string length is 30.846. The 26 letters are marked with 26 bits of one long integer, ignoring case, spaces and other characters. Because English letters are used with unequal frequency, there are only 3,213,052 "1"s in total after marking, an average of 13.2266 per record. This shows on the one hand that a great deal of mark overlap occurs, and on the other that not every record contains every letter.
120 character-by-character searches of this database took 4.573 seconds on average. Bit-mark searches were run at the same time, and regression analysis on the $T$, $R_1$ and $R_2$ of the 120 runs gives the following equation:
$T = 0.265 + 0.00001936\,R_1 + 0.00003623\,R_2$
Adjusted coefficient of determination = 0.999
Significance test of the regression model: $F = 55405.88$, greater than $F_{0.01}(2, 117) = 4.791$
Significance test of the constant: $t_0 = 36.400$
Significance test of $R_1$: $t_1 = 121.733$
Significance test of $R_2$: $t_2 = 77.409$
All are greater than the critical value $t(117) = 2.6185$
The letter distribution of this database is tabulated below ("Records" is the number of records containing the letter, "Fraction" its share of all records):
Letter  Records  Fraction    Letter  Records  Fraction
t       227167   0.935136    h       114263   0.470365
r       222338   0.915257    u       109376   0.450248
e       221713   0.912685    g       97339    0.400697
a       213249   0.877842    f       83452    0.343531
c       210116   0.864945    w       71228    0.293211
n       204225   0.840695    b       67065    0.276074
o       201082   0.827757    y       60698    0.249864
i       195734   0.805742    v       59351    0.244319
s       195639   0.805351    k       47379    0.195036
l       169753   0.698791    x       15150    0.062365
d       159202   0.655357    q       7258     0.029878
m       125449   0.516413    j       6351     0.026144
p       122168   0.502906    z       6307     0.025963
As the table shows, t, r and e are the three most frequent letters in this database. Searching with tree as the keyword actually returned 7 records containing tree, with $R_1$ = 188491, in 3.655 seconds; the simultaneous character-by-character search took 4.465 seconds. From the frequencies of t, r and e in the table one can calculate:
$R_1 = 242{,}924 \times 0.935136 \times 0.915257 \times 0.912685 = 242{,}924 \times 0.781158 = 189762$
and then, from the regression equation, the retrieval time:
$T = 0.265 + 0.00001936 \times 189762 + 0.00003623 \times 7 = 3.939$
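The arithmetic for tree can be checked with a short script. It treats the occurrences of different letters as independent, which is the idealisation the calculation above relies on; the names are hypothetical and only the three fractions needed are included.

```python
RECORDS = 242_924
FRACTION = {"t": 0.935136, "r": 0.915257, "e": 0.912685}  # from the table above

def expected_r1(keyword):
    """Expected R1: record count times the product of the distinct letters' fractions."""
    p = 1.0
    for letter in set(keyword.lower()):
        p *= FRACTION[letter]
    return RECORDS * p

def estimated_time(r1, r2):
    """Regression equation fitted for the English database."""
    return 0.265 + 0.00001936 * r1 + 0.00003623 * r2

r1 = expected_r1("tree")
t = estimated_time(r1, 7)
print(round(r1), round(t, 3))
```

The estimate of $R_1$ is close to the measured 188491, and the estimated time sits between the measured bit-mark time (3.655 s) and the character-by-character time (4.465 s).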
Thus even for a word like tree, made up entirely of high-frequency letters, bit-mark retrieval still has a performance advantage. In practice most English words have more than 4 letters, and if even one of them is a low-frequency letter the bit mark filters very well, much like the "shortest stave" (weakest-link) effect. Thirty sets of test data are listed for reference:
[Figure imgf000013_0001]
Search keyword                           Char-by-char time (s)  Bit-mark time (s)  Ratio (%)  R1     R2   k
inks and coatings                        4.417812               0.3595625          8.14       5078   1    10
Custom Manufactured Packaging Equipment  4.535031               0.2309375          5.09       8      1    17
Reduce damage to equipment               4.518875               0.2613437          5.78       921    1    14
Utilize Daylight                         4.426813               0.2507812          5.67       280    128  11
Production Area                          4.595                  0.8316875          18.1       24143  83   11
recovery equipment                       4.498                  0.2399375          5.33       85     1    13
RELIEVED FROM                            4.477719               0.2898125          6.47       2748   1    9
Aircraft Parts                           4.494594               0.6908125          15.37      19556  1    8
Overlap probability of bit marks
Alphabetic scripts have few letters, so letters repeat often: words such as bigger and biggest contain two g's, yet only bit 7 is marked 1. For Chinese character strings the chance of the same character recurring within one string is low, but once the characters are grouped, several characters of one string may belong to the same group: in the grouping above, "大" and "河" are both in group 20 and "长" and "落" are both in group 5, and after marking they overlap on a single "bit". Even if the grouping balances character-element frequencies, string marks still overlap. Let $n$ be the number of bits used for marking and $m$ the number of character elements. The probability that no overlap occurs at all is:
$$P = \frac{A_n^m}{n^m} = \frac{n(n-1)\cdots(n-m+1)}{n^m}$$
Clearly, with $n$ fixed, each time $m$ increases by 1 the new factor in the numerator is 1 smaller than the previous one while the new factor in the denominator stays the same, so the probability of no overlap at all keeps falling. With 32 bits for marking and a string length $m$ of 7, the probability of no overlap at all is:
$$P_{(7)} = \frac{A_{32}^{7}}{32^{7}} = \frac{32 \times 31 \times 30 \times 29 \times 28 \times 27 \times 26}{32 \times 32 \times 32 \times 32 \times 32 \times 32 \times 32} = \frac{16{,}963{,}914{,}240}{34{,}359{,}738{,}368} \approx 0.4937$$
The table below gives the probability of no overlap at all for 32-bit marking and string lengths $m$ from 1 to 24:
[Figure imgf000015_0001]
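The no-overlap probability is straightforward to tabulate. The sketch below assumes each character element falls into any of the $n$ bits with equal probability, as the formula above does; the function name is hypothetical and `math.perm` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import perm

def no_overlap_probability(n, m):
    """A(n, m) / n**m: probability that m characters land on m distinct bits."""
    return Fraction(perm(n, m), n ** m)

for m in range(1, 25):
    print(m, round(float(no_overlap_probability(32, m)), 9))
```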
Overlap means loss of information. Some overlap is unavoidable and acceptable, but too high a proportion of overlap degrades the performance of bit-mark retrieval, so the probability of overlap deserves attention. For marking with $n$ bits and a string of $m$ character elements, the probability that the marks overlap into $k$ "1"s is:
$$P = \frac{C_n^k \cdot A}{n^m}, \quad \text{where } A = \sum \frac{m!}{m_1!\,m_2!\cdots m_k!}$$
The sum runs over all positive integer solutions of $m_1 + m_2 + \cdots + m_k = m$, of which there are $C_{m-1}^{k-1}$. With 32 bits for marking and a string of 7 character elements overlapping into 3 bits, that is, $m_1 + m_2 + m_3 = 7$, there are 15 solutions, and the probability is calculated as follows:
[Figure imgf000016_0001]
For 32-bit marking and a string of 7 character elements, the probabilities for $k$ = 1 to 7 after marking are given in the table below:
k  Arrangements  Fraction     Cumulative fraction  k × arrangements
7  16963914240   0.493714884  0.493714884          1.18747E+11
6  13701623040   0.398769714  0.892484598          82209738240
5  3383116800    0.098461658  0.990946256          16915584000
4  302064000     0.008791219  0.999737475          1208256000
3  8957760       0.000260705  0.99999818           26873280
[Figure imgf000017_0001]
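The arrangement counts above can be reproduced numerically. Instead of enumerating the compositions $m_1 + \cdots + m_k = m$, this sketch counts the same quantity $A$ by inclusion-exclusion (the number of ways to spread $m$ characters over $k$ bits so that every bit is used); that substitution, and the function names, are choices made for this illustration.

```python
from math import comb

def surjections(m, k):
    """Ways to assign m characters to k bits with every bit used (the sum A)."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** m for i in range(k + 1))

def arrangements(n, m, k):
    """Number of markings of m characters that leave exactly k bits set among n."""
    return comb(n, k) * surjections(m, k)

total = 32 ** 7
counts = {k: arrangements(32, 7, k) for k in range(1, 8)}
mean_ones = sum(k * c for k, c in counts.items()) / total

for k in sorted(counts, reverse=True):
    print(k, counts[k], round(counts[k] / total, 9))
print(mean_ones)
```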
The weighted average number of "1"s after marking is 6.376881392.
Logic algebra principles of bit marking, and 0-marking
That a given bit of the bit value $W$ of a string $S$ is "1" means that the proposition "string $S$ contains a certain character element, or one of a certain group of character elements" is true. For two strings $S_a$ and $S_b$, that some $k$ bits of $W_a$ are 1 means the proposition "string $S_a$ contains certain $k$ character elements, or one from each of certain $k$ groups" is true; if all the corresponding $k$ bits of $W_b$ are 1, the same proposition is true of $S_b$. Naturally, $W_b$ may have some other $m - k$ bits equal to 1 where $W_a$ has 0. In logic-algebra notation this relation is written $W_a \to W_b = W_T$, which is intuitive and easy to grasp. But not every programming language or database system has a "bit implication" operator, so in applications the expression must be transformed according to the principles of logic algebra.
From the logic-algebra theorem $a \to b = \overline{a} \mid b$, if $a \to b = 1$ then $\overline{a} \mid b = 1$. Likewise for bit values of $n$ bits: if $W_a \to W_b = W_T$, then
$\overline{W_a} \mid W_b = W_T$.
It also follows that:
$W_a \mid W_b = W_b$
From the theorem $a \to b \Leftrightarrow \overline{b} \to \overline{a}$, if $W_a \to W_b = W_T$, then $\overline{W_b} \to \overline{W_a} = W_T$.
So a datum $W_T$ whose "bits" are all 1 can be used as the starting point for marking: if the string contains a given character element, the corresponding bit is marked 0. This is called 0-marking. Writing the 0-mark bit values of $S_a$ and $S_b$ as $\overline{W_a}$ and $\overline{W_b}$:
Clearly $\overline{W_b} \to \overline{W_a} = W_T$.
It also follows that: $\overline{W_a} \mid \overline{W_b} = \overline{W_a}$
A truth table makes this more intuitive:
[Figure imgf000018_0001]
[Figure imgf000018_0002]
[Figure imgf000018_0003]
The above shows how to compare bit values using the most common bit operators; the principle behind the screening probability is the same. With 0-marking, of course, $m$ and $k$ both count 0s. Programming languages and databases differ in the bit operators they provide; the specific use of other bit operators can be derived from the principles of logic algebra. Equivalent transformations by logic algebra should of course aim at formulas that are simple and practical, not ever more complicated ones.
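The equivalent forms of the containment test can be compared side by side. The sketch assumes 32-bit values and obtains the 0-mark value by inverting the 1-mark value; all function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # W_T

def implies_all_ones(wa, wb):
    """(NOT wa) OR wb equals W_T."""
    return ((~wa | wb) & MASK) == MASK

def or_test(wa, wb):
    """wa OR wb equals wb."""
    return (wa | wb) == wb

def and_test(wa, wb):
    """wa AND wb equals wa."""
    return (wa & wb) == wa

def zero_mark(w):
    """0-mark value: a bit is 0 where the 1-mark value has 1."""
    return ~w & MASK

def zero_mark_test(za, zb):
    """Containment test on 0-mark values: za OR zb equals za."""
    return (za | zb) == za

wa, wb = 0b0101, 0b0111   # every 1-bit of wa is also set in wb
print(implies_all_ones(wa, wb), or_test(wa, wb), and_test(wa, wb),
      zero_mark_test(zero_mark(wa), zero_mark(wb)))
```

Which form to use depends on the operators the language or database offers; the OR and AND forms avoid the need for a fixed-width NOT.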
另外一点是, 如果编程语言或数据库系统提供了 "位" 的置 1 或置 0 的功能, 则位 "标记" 可直接运用。 如果数据库系统编程语言未提供位置 1, 则可用位的 "或" (or) 运算, 进行 1标记。 Another point is that if the programming language or database system provides a "bit" set or set to 0, the bit "tag" can be used directly. If the database system programming language does not provide a location 1, the "or" (or) operation of the bit can be used to mark 1.
首先对字符元进行分组, 如果用 32个位标记, 则对第 n组赋值 232Λ 从二进制来看, 则该值自左至右第 η个位为 1, 而其余位为 0, 称之为 "基 本位值"。 将一个字符串所有字符元的 "基本位值" 进行 "或" (or)运算, 即可得到该字符串的位值。 当然, 也可以对第 n组 "基本位值" 赋值 211·1, 从二进制来看, 则自右至左的第 n个位为 1, 而其余位为 0。 实际上, 在特 定数据库中, 对记录字符串及检索关键词进行标记, 字符元与位有固定的 对应即可, 而不论是自右至左, 还是自左至右, 或者其它秩序。 First, the character elements are grouped. If 32 bits are used, the value of the nth group is 2 32 Λ. From the binary point of view, the value is 1 from left to right, and the remaining bits are 0. Is the "basic bit value". The "base value" of all the character elements of a string is ORed to get the bit value of the string. Of course, it is also possible to assign a value of 2 11 · 1 to the nth group "base bit value". From the binary point of view, the nth bit from right to left is 1, and the remaining bits are 0. In fact, in a specific database, the record string and the search keyword are marked, and the character element has a fixed correspondence with the bit, whether from right to left, from left to right, or other order.
For 0-marking, the nth group of character elements is first given a "basic bit value" whose nth bit is 0 and whose remaining bits are 1; the basic bit values of all character elements of a given string are then combined with the bitwise "and" (AND) operation to obtain the bit value of the string.
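The 1-marking and 0-marking procedures described above can be sketched in Python as follows. The grouping rule (character code modulo the number of bits) and the names `group_of`, `mark_ones` and `mark_zeros` are assumptions made for illustration; any fixed assignment of character elements to groups would serve.

```python
# Sketch of 1-marking (OR) and 0-marking (AND) on a 32-bit value.
# The grouping rule and all names are illustrative assumptions.
N_BITS = 32
ALL_ONES = (1 << N_BITS) - 1


def group_of(ch: str) -> int:
    """Return the 0-based group index of a character element."""
    return ord(ch) % N_BITS


def mark_ones(s: str) -> int:
    """1-marking: start from all 0s and OR in each basic bit value."""
    w = 0
    for ch in s:
        w |= 1 << group_of(ch)  # basic bit value: exactly one bit is 1
    return w


def mark_zeros(s: str) -> int:
    """0-marking: start from all 1s and AND in each basic bit value."""
    w = ALL_ONES
    for ch in s:
        w &= ALL_ONES ^ (1 << group_of(ch))  # basic bit value: exactly one bit is 0
    return w


print(bin(mark_ones("abc")))
print(mark_zeros("abc") == ALL_ONES ^ mark_ones("abc"))
```

Under the same grouping, the 0-mark value is simply the bitwise complement of the 1-mark value, which is why the two schemes behave alike in the probability calculations.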
As for calculating the probability of mark overlap, 0-marking and 1-marking work in the same way.

Multi-bit marking and group marking

Multi-bit marking
Building on single-bit marking, multi-bit marking is described next. If the number of bits n available for storing the bit value is large and the strings are short, a combination of j bits can be used to mark one group of character elements. There are $\frac{n!}{j!(n-j)!}$ such bit combinations, so the character elements are correspondingly divided into $\frac{n!}{j!(n-j)!}$ groups. If a character element P of S belongs to group r, the j bits of W that make up the r-th bit combination are marked 1. When j is 1, this is the single-bit marking described above.

Let n be 32 and j be 2. The character elements are then divided into $\frac{32!}{2!(32-2)!}$ groups, i.e. 496 groups, and marked with "1". If a character element belongs to group 1, bits 1 and 2 of W are marked "1"; if it belongs to group 2, bits 1 and 3 of W are marked 1; if it belongs to group 3, bits 1 and 4 of W are marked 1; and so on.
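The assignment of bit combinations to groups for n = 32 and j = 2 can be sketched in Python with `itertools.combinations`, which enumerates the pairs in the same order as the example above: (1, 2), (1, 3), (1, 4), and so on. Mapping bit position p to the value 2^(p-1) is an assumption here; as noted earlier, any fixed correspondence works.

```python
from itertools import combinations

N_BITS = 32
J = 2

# Bit positions are numbered 1..N_BITS; group r (1-based) gets the r-th
# combination in lexicographic order.
COMBOS = list(combinations(range(1, N_BITS + 1), J))


def bit_value_for_group(r: int) -> int:
    """Return the value with the J bits of group r set to 1 (position p -> 2**(p-1))."""
    return sum(1 << (p - 1) for p in COMBOS[r - 1])


print(len(COMBOS))   # number of groups: 496
print(COMBOS[:3])    # bit positions for groups 1, 2 and 3
```

The bit value of a string is then the OR of `bit_value_for_group(r)` over the groups of its character elements, exactly as in single-bit marking.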
Suppose bits 2 and 5 of the bit value of the Chinese character "前" are 1 and bits 23 and 29 of the bit value of "景" are 1; the word "前景" ("prospect") then has bits 2, 5, 23 and 29 set to 1. Suppose further that bits 2 and 23 of "远" are 1 and bits 5 and 29 of "大" are 1; the word "远大" ("lofty") then also has bits 2, 5, 23 and 29 set to 1. A search with the bit value of "前景" will therefore return "远大", and vice versa: the results will contain strings made of "unrelated-group character elements". However, because each group contains fewer Chinese characters than with single-bit marking, the filtering effect can still improve. The purpose of multi-bit marking is to improve filtering, and its filtering probability is computed on the same principle as for single-bit marking. The filtering probability is computed below for a string of 4 character elements and a search keyword of 2 character elements, with j equal to 1, 2, 3 and 4; the lower the value, the better:
For j = 1 (single-bit marking), ignoring overlap, the string has m = 4 ones after marking and the search keyword has k = 2 ones.

For j = 2, ignoring overlap, the string has m = 4*2 = 8 ones after marking and the search keyword has k = 2*2 = 4 ones.

For j = 3, the character elements are divided into $\frac{32!}{3!(32-3)!}$ groups, i.e. 4960 groups; ignoring overlap, the string has m = 4*3 = 12 ones after marking and the search keyword has k = 2*3 = 6 ones.

For j = 4, the character elements are divided into $\frac{32!}{4!(32-4)!}$ groups, i.e. 35960 groups; ignoring overlap, the string has m = 4*4 = 16 ones after marking and the search keyword has k = 2*4 = 8 ones.

$$P_1 = \frac{C_m^k}{C_n^k} = \frac{\dfrac{m!}{k!(m-k)!}}{\dfrac{n!}{k!(n-k)!}} = \frac{m!\,(n-k)!}{n!\,(m-k)!} = \frac{4!\,(32-2)!}{32!\,(4-2)!} = \frac{4 \times 3}{32 \times 31} = 0.01209677$$

$$P_2 = \frac{\dfrac{8!}{4!(8-4)!}}{\dfrac{32!}{4!(32-4)!}} = \frac{8 \times 7 \times 6 \times 5}{32 \times 31 \times 30 \times 29} = 0.00194661$$

$$P_3 = \frac{\dfrac{12!}{6!(12-6)!}}{\dfrac{32!}{6!(32-6)!}} = \frac{12 \times 11 \times 10 \times 9 \times 8 \times 7}{32 \times 31 \times 30 \times 29 \times 28 \times 27} = 0.00101965$$

$$P_4 = \frac{\dfrac{16!}{8!(16-8)!}}{\dfrac{32!}{8!(32-8)!}} = \frac{16 \times 15 \times 14 \times 13 \times 12 \times 11 \times 10 \times 9}{32 \times 31 \times 30 \times 29 \times 28 \times 27 \times 26 \times 25} = 0.00122358$$

Multi-bit marking is especially useful when the search keyword is only 1 or 2 character elements long. For a fixed number of storage bits n, however, a larger j is not always better. As the calculation above shows, with n = 32 and a string length of 4, j = 3 gives the best result; increasing j further does not improve filtering, while the proportion of overlapping marks grows and the marking operation becomes more complex. If long strings are to be multi-bit marked, n must be large enough.

Group marking:
For example, split the 32 bits of an unsigned long integer into two groups of 17 and 15 bits and mark each group separately. Let the string length m be 4, so that, ignoring mark overlap, each group holds 4 ones after marking, and let the keyword length be 2, giving 2 ones after marking. The filtering probability is then:

$$P = \frac{4!\,(17-2)!}{17!\,(4-2)!} \times \frac{4!\,(15-2)!}{15!\,(4-2)!} = \frac{4 \times 3}{17 \times 16} \times \frac{4 \times 3}{15 \times 14} = \frac{12}{272} \times \frac{12}{210} = 0.002521$$

Clearly this also improves filtering. As before, this optimization requires the number of marking bits n to be several times m.

For long strings the approach can be adapted: two different groupings of the character elements are used and two data items are marked separately, giving two sets of values W and W'. At search time, the keyword's $W_t$ is obtained with the first marking method and compared with the $W_n$ values in the database to filter out $R_1$. Within $R_1$, the keyword's $W'_t$ is obtained with the second marking method and compared with the database's $W'_n$ values to obtain $R_2$; ordinary character-by-character comparison within $R_2$ then yields the final result set $R_z$. Compared with the previous optimization, the storage needed for bit values increases in the same way, but this approach does not have to read every $W'_n$ for comparison with $W'_t$.
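The filtering probabilities for multi-bit marking and group marking can be reproduced with a short Python sketch based on `math.comb`. The function name `filter_probability` is an illustrative choice; the formula is the ratio of binomial coefficients C(m, k) / C(n, k) used in the calculations above, with mark overlap ignored.

```python
from math import comb


def filter_probability(n: int, m: int, k: int) -> float:
    """C(m, k) / C(n, k): chance that k keyword bits all fall on m marked bits out of n."""
    return comb(m, k) / comb(n, k)


# Multi-bit marking: string of 4 elements, keyword of 2 elements, n = 32.
for j in (1, 2, 3, 4):
    print(j, round(filter_probability(32, 4 * j, 2 * j), 8))

# Group marking: 32 bits split into groups of 17 and 15 bits.
p_group = filter_probability(17, 4, 2) * filter_probability(15, 4, 2)
print(round(p_group, 6))
```

Running the loop for other values of n and string length is a quick way to choose j before committing to a marking scheme for a particular database.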
Both 0-marking and 1-marking can be used with multi-bit marking and group marking. Group marking can also be combined with multi-bit marking, and different groups can use 0-marking and 1-marking respectively; the choice should fit the application. 0-marking, multi-bit marking and group marking can all be used for reverse retrieval.

[...] when there are many kinds of character elements, mapping one bit to one character element is hard to implement, while with multi-bit marking, strings that contain only "unrelated-group character elements" and no "target-group character elements" get mixed into the search results. Substituting a prime number for each character element avoids this mixing. In practice, bit-marked string retrieval can serve as a preliminary filter, prime-number-substitution retrieval as a second pass, and, if needed, ordinary character-by-character string comparison then yields the final result.

Applications of bit-mark retrieval in speech input, machine translation and related areas:

Chinese has many homophonous characters and words, so Chinese speech input and pinyin input must select suitable words by their pinyin. The "reverse retrieval" of the bit-marking method can be used for this.
Ignoring tone, Chinese has more than 400 syllables, so it is impractical to map one bit to one syllable. Instead, the 64 bits of 8 bytes can be mapped to the initials and finals of Hanyu Pinyin. Suppose there is a database of Chinese word collocations and basic sentence patterns that contains the basic sentence pattern "他毕业了" ("he graduated"), with pinyin "tabiyele". Its bit value $W_n$ after marking is:

1000,0101,0000,0000,0000,0001,1011,0000,0000,0000,0000,0000,0000,0000,0000,0000

If speech conversion or pinyin input yields "tazaojiubiyele", its bit value $W_t$ after marking is:

1000,0101,0001,0010,0000,0001,1011,0000,1000,0000,1000,0000,0000,0000,0000,0000

"他毕业了" is then a basic sentence pattern that "tazaojiubiyele" can refer to, and processing gives "他zaojiu毕业了", in which "毕业" is the predicate, a verb. The lexicon entries with pinyin "zaojiu" are "早就" ("long ago"), "造就" ("bring up") and "枣酒" ("jujube wine"). If grammar, semantics, word frequency and other information can assist, the input can be further converted into "他早就毕业了" ("he graduated long ago"). Because initials and finals combine in many ways, the preliminary results of the bit-value comparison will also include sentence patterns whose pinyin is, for example, "tebayile". On top of bit marking, a second filtering pass by prime-number substitution, with one prime per syllable, gives a closer result.
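The reverse-retrieval comparison for this example can be sketched in Python using the two 64-bit values given above. The function name `input_covers_pattern` is an illustrative assumption; the test itself is the bit comparison in which every 1 bit of the stored pattern must also be 1 in the input.

```python
# Bit values copied from the example above (16 groups of 4 bits, 64 bits in total).
W_N = int(("1000,0101,0000,0000,0000,0001,1011,0000,"
           "0000,0000,0000,0000,0000,0000,0000,0000").replace(",", ""), 2)
W_T = int(("1000,0101,0001,0010,0000,0001,1011,0000,"
           "1000,0000,1000,0000,0000,0000,0000,0000").replace(",", ""), 2)


def input_covers_pattern(w_input: int, w_pattern: int) -> bool:
    """Reverse retrieval: every 1 bit of the stored pattern is also 1 in the input."""
    return (w_pattern & w_input) == w_pattern


print(input_covers_pattern(W_T, W_N))  # input "tazaojiubiyele" vs. pattern "tabiyele"
print(input_covers_pattern(W_N, W_T))  # the roles swapped
```

Only the first comparison succeeds, which is what makes "他毕业了" a candidate pattern for the longer input and not the other way round.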
Suppose the word collocations and basic sentence patterns include "栋楼", the collocation of "楼" ("building") with its measure word "栋", and "高楼", the collocation of "楼" with the adjective "高" ("tall") that often modifies it; these can be marked by their initials and finals. Suppose speech conversion or pinyin input yields "zhedongxiezilouhengao". It is likewise marked by initials and finals, and the bit values $W_n$ of all collocations and basic sentence patterns in the database are compared with it. This filters out collocations such as "栋楼" and "高楼" for reference, and processing gives "zhe栋xiezi楼hen高", where "xiezi" or "xiezilou" can be resolved from the lexicon to "写字" ("writing") or "写字楼" ("office building"). If grammar, semantics, word frequency and other information can assist, "zhedongxiezilouhengao" can be converted into "这栋写字楼很高" ("this office building is very tall").

Other Chinese collocations and their bit values can be obtained in the same way: administrative divisions such as "省市" and "省县"; numerals such as "七八" and "万千"; measure words such as "斤两" and "元角"; and even surname combinations such as "张李" and "张刘". If the input is "hubeishengxianningshi", the method above gives "hubei省xianning市". If the input is "zhanglonglihu", it gives "张long李hu". If the input is "qiwansanqian", it gives "七万san千". Here too, on top of bit marking, a second filtering pass by prime-number substitution, with one prime per syllable, gives a closer result.
In addition, the initials and finals of Chinese are not used with equal frequency. With single-bit marking, where the 64 bits of 8 bytes of data correspond to the pinyin initials and finals, a search with certain syllables such as "li" may filter poorly, because both "l" and "i" are a high-frequency initial and final. Double-bit marking can then be considered: with n = 32 and j = 2 there are 496 bit combinations, which can be mapped to the roughly 400 syllables of Chinese, with measurements made on the database so that the distribution is as even as possible. Of course, with double-bit marking over 400 syllables, "non-target" sentence patterns and collocations will certainly appear in the preliminary results. On top of bit-mark filtering, a second filtering pass by prime-number substitution, with one prime per syllable, gives a closer result.
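The prime-number substitution used for the second pass is not spelled out in this passage. One plausible reading, assumed here purely for illustration, is that each syllable is assigned its own prime, a phrase is represented by the product of the primes of its syllables, and containment is tested by divisibility. The table `SYLLABLE_PRIMES` and both function names are hypothetical.

```python
from typing import Sequence

# Illustrative assumption: each syllable gets its own prime, and a phrase is
# represented by the product of the primes of its syllables.
SYLLABLE_PRIMES = {
    "ta": 2, "bi": 3, "ye": 5, "le": 7, "te": 11,
    "ba": 13, "yi": 17, "zao": 19, "jiu": 23,
}


def prime_value(syllables: Sequence[str]) -> int:
    """Multiply together the primes assigned to the given syllables."""
    value = 1
    for s in syllables:
        value *= SYLLABLE_PRIMES[s]
    return value


def input_contains_pattern(input_value: int, pattern_value: int) -> bool:
    """All syllables of the pattern occur in the input iff its value divides the input's."""
    return input_value % pattern_value == 0


spoken = prime_value(["ta", "zao", "jiu", "bi", "ye", "le"])
print(input_contains_pattern(spoken, prime_value(["ta", "bi", "ye", "le"])))
print(input_contains_pattern(spoken, prime_value(["te", "ba", "yi", "le"])))
```

Under this assumption, a pattern such as "tebayile", which shares initials and finals with the input and so survives the bit filter, is rejected in the second pass because its syllables differ.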
The technique can also be applied to Chinese word segmentation. The words, reference collocations and sentence patterns in the lexicon are marked with bit values according to their pinyin, the keyword (in practice a sentence) is marked by pinyin in the same way, and the bit implication operation then gives $R_1$, which narrows the search range effectively. Alternatively, the words, reference collocations and sentence patterns in the lexicon can be marked with bit values according to a grouping of their internal character codes, and the keyword marked with the same grouping; the bit implication operation again narrows the search range effectively, and other methods are then used to complete the decomposition of the keyword (that is, the sentence).
In English, liaison and sound changes in connected speech are common, and syllables are not as clearly separated as in Chinese; bit-marked string retrieval can still be used to select reference words and phrases. Suppose the database contains the phrase "in the morning", marked according to its vowels and consonants i, n, ð, ə, m, ɔː, n, i, ŋ to obtain a bit value. Suppose that, after speech input conversion, the sounds ð and ə corresponding to "the" are indistinct and can be discarded. The other, more sonorous sounds i, n, m, ɔː, n, i, ŋ are marked to obtain a bit value, which is compared with the bit values of the reference words and phrases in the database. This yields the phrase "in the morning" together with other phrases or words whose pronunciation contains i, n, m, ɔː, n, i, ŋ. On this basis, prime-number substitution for syllables narrows the range further, and "in the morning" is then selected according to context, with grammar, semantics and word frequency as aids.
If correspondences are established between the basic sentence patterns of two languages, bit marking and prime-number substitution can be used for machine translation. Suppose there is a library of Chinese basic sentence patterns, with bit marks and prime-number substitution, that contains "我学习", corresponding to the English basic sentence pattern "I learn". Given the sentence "我在学习法语", marking and prime-number substitution followed by reverse retrieval quickly find "我学习" as a usable basic sentence pattern, whose English counterpart is "I learn". "学习" is the predicate verb and the "法语" that follows it is the object, corresponding to the word "French". With reference to the grammar of both languages, the sentence can then be translated as "I am learning French".
The above has described the method of bit-marked string retrieval together with the related probability formulas and principles of logical algebra, with the emphasis on filtering string records in a database. From a programming point of view, databases use many kinds of data structures, and those data structures are not used only in databases. As a string algorithm, bit-marked string retrieval can be used for string lookup in many data structures, for example fuzzy string search in a large array. For small arrays it is not necessarily worthwhile, because marking the strings also costs time and space.

Detailed description
The invention has been implemented successfully for fuzzy retrieval in Chinese character string databases and English string databases. Below is VB 6.0 code that marks a Chinese character string database in SQL Server 2000 with "1". Bit-marked fuzzy string retrieval in other programming languages and databases can follow this implementation.
1. Create the database

Suppose the database shuku has a table biao with a field shuming of data type nvarchar and length 40. Add a field wei of data type "long integer", that is, 4 bytes or 32 bits, of which one bit is the sign bit and the remaining 31 bits can be used for marking.
2. VB 6.0 has no command that directly sets a bit to "1" or "0", so the bitwise "Or" operation is used to "mark" the database strings

Dim shuzu(30) As Long
' Define a Long array with 31 elements.
Dim x As Integer
Dim index As Integer
shuzu(0) = 1
For x = 1 To 30
    shuzu(x) = 2 * shuzu(x - 1)
Next
' Assign the 31 elements the values 1, 2, 4, 8, 16, ... up to 1073741824.
' In binary each value has one bit set to 1 and the rest 0: the "basic bit values".
Dim biaostr As String
' Holds the string currently being processed
Dim weizhi As Long
' Holds the bit value of the string
Dim weizhilin As Long
' Holds the basic bit value of one character

biaors.MoveFirst
' Move to the first record of the database recordset biaors
Do
    weizhilin = 0
    weizhi = 0
    With biaors
        biaostr = .Fields("shuming")
    End With
    ' Read the string of one record into the string variable biaostr
    For x = 1 To Len(biaostr)
        index = Abs(AscW(Mid(biaostr, x, 1)) Mod 31)
        ' Take one character from biaostr, reduce its character code modulo 31,
        ' take the absolute value and assign it to index; this groups the character elements.
        weizhilin = shuzu(index)
        ' weizhilin is now one of 1, 2, 4, 8, 16, ... 1073741824.
        weizhi = weizhi Or weizhilin
        ' Bitwise "Or" of the character's basic bit value weizhilin into weizhi.
    Next
    ' Loop finished: weizhi is the "bit value" of the current string
    With biaors
        .Fields("wei") = weizhi
    End With
    biaors.Update
    ' Store the bit value weizhi in the field wei of the current record
    biaors.MoveNext
    ' Process the next record
Loop While Not biaors.EOF

Fuzzy retrieval of database strings using the bitwise "And" operation
Dim shuzu(30) As Long
Dim x As Integer
Dim index As Integer
shuzu(0) = 1
For x = 1 To 30
    shuzu(x) = 2 * shuzu(x - 1)
Next
' Define a Long array with 31 elements holding 1, 2, 4, 8, 16, ... 1073741824,
' exactly the same as the array used for "marking".
Dim weizhi As Long
' Holds the bit value of a string
Dim weizhilin As Long
' Holds the bit value of one character
Dim textstr As String
' Holds the search string
weizhilin = 0
weizhi = 0
textstr = Text1.Text
' Text1.Text is the search keyword in the text box
For x = 1 To Len(textstr)
    index = Abs(AscW(Mid(textstr, x, 1)) Mod 31)
    weizhilin = shuzu(index)
    weizhi = weizhi Or weizhilin
Next
' Mark the search keyword to obtain its "bit value" weizhi,
' using the same method as for marking the database strings.
strQuery = "SELECT * FROM (SELECT * FROM biao WHERE (wei & " & weizhi & ") = " & weizhi & ") DERIVEDTBL WHERE (shuming LIKE '%" & textstr & "%')"
' SQL Server 2000 has no bit implication operator. Here the keyword's bit value weizhi
' is ANDed with the bit value wei of each record, and the records whose AND result equals
' weizhi are selected (the test (wei | weizhi) = wei can be used instead). This gives the
' preliminary result set R1; an ordinary fuzzy string match then performs the second
' retrieval and gives the final result Rz. This is a SQL Server 2000 query; other
' databases may differ slightly.
Adodc1.RecordSource = strQuery
Adodc1.Refresh
' Execute the search
DataList1.ListField = "shuming"
DataList1.ReFill
' Show the current search results in the list box.
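The same marking and two-step retrieval can be sketched for an in-memory list, the "large array" case mentioned earlier. This Python sketch is an analogue, not a port: `ord()` stands in for VB's `AscW` (the two differ for some character codes, so the grouping is consistent only within this sketch), and the sample titles and function names are invented for illustration.

```python
# In-memory analogue of the marking and two-step retrieval shown above.
# The titles are made-up sample data; ord() stands in for VB's AscW.
N_BITS = 31


def bit_value(text: str) -> int:
    """OR together one basic bit value per character (character code mod 31)."""
    w = 0
    for ch in text:
        w |= 1 << (ord(ch) % N_BITS)
    return w


def fuzzy_search(records, keyword):
    """Return (preliminary, final) result lists for a substring search."""
    wk = bit_value(keyword)
    # Step 1: bit filter, keeps every record whose bit value covers the keyword's.
    preliminary = [s for s, w in records if (w & wk) == wk]
    # Step 2: ordinary substring comparison on the survivors only.
    final = [s for s in preliminary if keyword in s]
    return preliminary, final


titles = ["数据库原理", "字符串检索", "database design", "string search"]
records = [(t, bit_value(t)) for t in titles]  # computed once, like filling the field wei
preliminary, final = fuzzy_search(records, "检索")
print(final)
```

The bit filter can let through records that merely share groups with the keyword, so the substring check in the second step is what guarantees the final result is exact.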

Claims

1. A fuzzy string retrieval technique, characterized in that:

W denotes n bits of a data item, all initially 0, and is used to mark the character-element information that makes up a string. Taking j of these bits (j = 1, 2, 3, ...) as one combination, there are $\frac{n!}{j!(n-j)!}$ bit combinations in total; correspondingly, all character elements are divided into $\frac{n!}{j!(n-j)!}$ groups, and each bit combination corresponds to one group of character elements for marking. If a character element $P_1$ of one or more strings S belongs to group r, the j bits of W that constitute the r-th bit combination are marked 1. Similarly, W is marked according to the groups to which the other character elements $P_2$, $P_3$, $P_4$, ... of S belong. Once all character elements have been marked, W records the character-element information of S and is called the "bit value" of the one or more strings S. This method is called 1-marking.

W denotes n bits of a data item, all initially 1, and is used to mark the character-element information that makes up a string. Taking j of these bits (j = 1, 2, 3, ...) as one combination, there are $\frac{n!}{j!(n-j)!}$ bit combinations in total; correspondingly, all character elements are divided into $\frac{n!}{j!(n-j)!}$ groups, and each bit combination corresponds to one group of character elements for marking. If a character element $P_1$ of one or more strings S belongs to group r, the j bits of W that constitute the r-th bit combination are marked 0. Similarly, W is marked according to the groups to which the other character elements $P_2$, $P_3$, $P_4$, ... of S belong. Once all character elements have been marked, W records the character-element information of S and is called the "bit value" of the one or more strings S. This method is called 0-marking.

Denote the 1-mark "bit value" of $S_a$ by $W_a$ and its 0-mark "bit value" by $\overline{W}_a$; denote the 1-mark "bit value" of $S_b$ by $W_b$ and its 0-mark "bit value" by $\overline{W}_b$. Any of the following methods can be used for string comparison:

Either, I. $S_a$ and $S_b$ are both 1-marked, and $W_a$ is compared with $W_b$. If, for every bit of $W_a$ that is 1, the corresponding bit of $W_b$ is also 1, then $S_b$ contains (including equals, the same below) or may contain all character elements of $S_a$.

Or, II. $S_a$ and $S_b$ are both 0-marked, and $\overline{W}_a$ is compared with $\overline{W}_b$. If, for every bit of $\overline{W}_b$ that is 1, the corresponding bit of $\overline{W}_a$ is also 1, then $S_b$ contains or may contain all character elements of $S_a$.

Or, III. $S_a$ is 1-marked and $S_b$ is 0-marked, and $W_a$ is compared with $\overline{W}_b$. If no pair of corresponding bits of $\overline{W}_b$ and $W_a$ is 1 at the same time, then $S_b$ contains or may contain all character elements of $S_a$.

Or, IV. $S_a$ is 0-marked and $S_b$ is 1-marked, and $\overline{W}_a$ is compared with $W_b$. If no pair of corresponding bits of $\overline{W}_a$ and $W_b$ is 0 at the same time, then $S_b$ contains or may contain all character elements of $S_a$.
2. The method according to claim 1, characterized in that: there are several concrete ways of comparing $W_a$, $\overline{W}_a$ with $W_b$, $\overline{W}_b$ using bit operators. For example, denoting by $W_T$ the W whose bits are all 1, method I can be implemented with $W_a \rightarrow W_b = W_T$ or with $W_a \mid W_b = W_b$. Implementations of method I with other bit operators, and the concrete operations for methods II, III and IV, can all be derived from the principles of logical algebra.
3. The method according to claim 1, characterized in that: $W_a$, $\overline{W}_a$ are compared with $W_b$, $\overline{W}_b$. If the criterion is not met, $S_b$ does not contain all character elements of $S_a$ and therefore does not contain $S_a$. If the criterion is met, with single-bit marking and only one character element in each group, $S_b$ contains "all character elements of $S_a$", but because the order of the character elements is not considered, it is not certain that $S_b$ contains "$S_a$". If the criterion is met, with single-bit marking and several character elements in each group, or with multi-bit marking, $S_b$ only "may" contain "all character elements of $S_a$", and it is even less certain that $S_b$ contains $S_a$. If required, character-by-character comparison can then be used to determine whether $S_b$ contains "$S_a$".
4. The method according to claim 1, characterized in that: a character element can be a character in the usual sense or, as required, a radical or stroke of a Chinese character; a letter, initial, final or syllable of Hanyu Pinyin; a letter, syllable or word of another language; a phonetic symbol of another language; or a combination of these.
5. The method according to claim 1, characterized in that: "one or more strings" stands in apposition to S and emphasizes that S denotes one string or several strings. That is, in bit-marked string retrieval, the W or $\overline{W}$ values being compared may hold the marking information of the character elements of a single string, or the marking information of all character elements of several strings.
6. The method according to claim 1, characterized in that: the search keyword $S_t$ has the value $W_t$ or $\overline{W}_t$, and the strings $S_1$, $S_2$, $S_3$, $S_4$, ... have their respective bit values $W_1$, $W_2$, $W_3$, $W_4$, ... or $\overline{W}_1$, $\overline{W}_2$, $\overline{W}_3$, $\overline{W}_4$, .... By comparing $W_t$, $\overline{W}_t$ with $W_n$, $\overline{W}_n$, it is possible to filter out the $S_n$ that "contain or may contain all character elements of the search keyword $S_t$" and, if required, to use character-by-character comparison to determine whether $S_n$ contains "$S_t$"; this is retrieval in the usual sense. Conversely, it is possible to filter out the $S_n$ for which "the search keyword $S_t$ contains or may contain all of their character elements" and, if required, to use character-by-character comparison to determine whether $S_t$ contains "$S_n$"; this is called reverse retrieval.
PCT/CN2005/001642 2005-01-17 2005-10-08 Retrieval technology of character string marked with bit WO2006074586A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200510023383.5 2005-01-17
CNA2005100233835A CN1645374A (en) 2005-01-17 2005-01-17 Digit marking character string searching technology

Publications (1)

Publication Number Publication Date
WO2006074586A1 (en) 2006-07-20

Family

ID=34875846

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2005/001642 WO2006074586A1 (en) 2005-01-17 2005-10-08 Retrieval technology of character string marked with bit

Country Status (2)

Country Link
CN (2) CN1645374A (en)
WO (1) WO2006074586A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4271227B2 (en) * 2006-10-30 2009-06-03 株式会社エスグランツ Bit string search device, search method and program
CN101794283A (en) * 2009-02-03 2010-08-04 华为技术有限公司 Method and system for processing character strings and matcher
CN102682033A (en) * 2011-03-17 2012-09-19 环达电脑(上海)有限公司 Method for querying words by matching binary characteristic values
CN103870537B (en) * 2013-12-03 2017-02-01 山东金质信息技术有限公司 Intelligent word segmentation method for standard retrieval
CN106933938A (en) * 2015-12-30 2017-07-07 唯溥思株式会社 The document retrieval method and literature index method encoded using multibyte

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08147320A (en) * 1994-11-22 1996-06-07 Internatl Business Mach Corp <Ibm> Information retrieving method and system
CN1281191A (en) * 1999-07-19 2001-01-24 松下电器产业株式会社 Information retrieval method and information retrieval device
JP2002007411A (en) * 2000-06-21 2002-01-11 Hitachi Ltd Information retrieving method, its executing device and recording medium in which its processing program is recorded

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6785677B1 (en) * 2001-05-02 2004-08-31 Unisys Corporation Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector
JP2003152548A (en) * 2001-11-14 2003-05-23 Canon Inc Retrieving method of character string in data compression

Also Published As

Publication number Publication date
CN1645374A (en) 2005-07-27
CN101488127B (en) 2015-01-07
CN101488127A (en) 2009-07-22

Similar Documents

Publication Publication Date Title
Chan et al. Compressed indexes for dynamic text collections
JP3196868B2 (en) Relevant word form restricted state transducer for indexing and searching text
US6584458B1 (en) Method and apparatuses for creating a full text index accommodating child words
CN108573045B (en) Comparison matrix similarity retrieval method based on multi-order fingerprints
US9110980B2 (en) Searching and matching of data
CN101388012B (en) Phonetic check system and method with easy confusion tone recognition
US20020165707A1 (en) Methods and apparatus for storing and processing natural language text data as a sequence of fixed length integers
Hsu et al. Space-efficient data structures for top-k completion
WO2004109492A1 (en) Object representing and processing method and apparatus
JPS6126176A (en) Dictionary for processing language
US7676487B2 (en) Method and system for formatting and indexing data
WO2020037794A1 (en) Index building method for english geographical name, and query method and apparatus therefor
WO2006074586A1 (en) Retrieval technology of character string marked with bit
CN102867049A (en) Chinese PINYIN quick word segmentation method based on word search tree
US20090055358A1 (en) Efficient processing of mapped boolean queries via generative indexing
Flor A fast and flexible architecture for very large word n-gram datasets
Cordova et al. Simple and efficient fully-functional succinct trees
Lippert et al. A space-efficient construction of the Burrows–Wheeler transform for genomic data
CN101739142B (en) Five-stroke input system and method
Barsky et al. Suffix trees for inputs larger than main memory
Hon et al. Compressed index for dictionary matching
TWI269193B (en) Keyword sector-index data-searching method and it system
WO2006058476A1 (en) Prime number replacing character string search technology
CN114595665A (en) Method for constructing binary extremely-short code word character and word coding set
Arroyuelo et al. A Lempel-Ziv text index on secondary storage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05795753

Country of ref document: EP

Kind code of ref document: A1

WWW Wipo information: withdrawn in national office

Ref document number: 5795753

Country of ref document: EP