WO2007068160A1 - Pattern matching index search method - Google Patents

Pattern matching index search method

Info

Publication number
WO2007068160A1
Authority
WO
WIPO (PCT)
Prior art keywords
value
bit
records
prime
mark
Application number
PCT/CN2006/000979
Other languages
English (en)
Chinese (zh)
Inventor
Wenxin Xu
Original Assignee
Wenxin Xu
Priority date
2005-12-12
Filing date
2006-05-15
Publication date
2007-06-21
Application filed by Wenxin Xu

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/903 - Querying
    • G06F16/90335 - Query processing
    • G06F16/90344 - Query processing by using string matching techniques

Definitions

  • Pattern matching is widely used. There are two main families of string pattern matching methods: 1. "KMP" and "BM", which improve on the brute-force (BF) algorithm; 2. the "bit vector method".
  • Each character element of a string S is marked in W according to the group it belongs to. Once all character elements have been marked, W records the information of S and is called the "bit value" of S. By comparing the bit values W_a and W_b, it can be judged that S_a does not contain, contains, or may contain all the character elements of the search keyword S_b; this is called the "bit mark string retrieval technique". In general, the "bit mark" method is about one order of magnitude faster than "pattern matching".
  • The characteristics of the prime substitution and bit mark retrieval methods are as follows: 1. Retrieval makes a single "overall" comparison of the prime product values or bit values of the main string and the substring to determine that the main string does not contain, or may contain, all the character elements of the substring, which improves speed. 2. Only whether the main string contains the "character elements" of the substring is considered, regardless of whether the "distance" and "order" of the corresponding characters in the main string and the substring are the same; this is a "similarity comparison". On the basis of the "bit mark", "BM" or the "bit vector method" can then be used to obtain the "pattern matching" result. "Grammar analysis" built on the "similarity comparison" conforms better to the laws of language than "pattern matching", and the principle suits various fields of information processing.
  • The invention proposes "planning storage" and "selection mark" on the basis of "bit mark" and "prime substitution" to further improve search speed. Programmed in VC and run on a Celeron 800 CPU, the target record can generally be found among 4 million records in 0.01-0.5 seconds, which makes language processing based on reference sentence patterns feasible on mid-range and low-end CPUs.
  • Summary of the invention
  • The divisibility judgment "does not consider character element order", but the structure of the character elements can be analyzed on the basis of the "divisibility judgment", which suits the "cross-syllable synergy phenomenon" and the "disordered coordination phenomenon" of language.
  • The "prime substitution conversion technique" is not fast enough on its own, but the "bit mark string retrieval technique" can be used for the preliminary selection of sentence patterns.
  • The language analysis database can hold refined sentences, collocations, phrases, words and so on, loosely referred to as "basic sentence patterns"; the database is called the "basic sentence database" and is denoted jxk. The sentence patterns screened out of jxk whose character elements are all contained in the character elements of T are called "reference sentence patterns". The sentence pattern chosen to process T, after the "reference sentence patterns" have been analyzed and compared, is called the "basic sentence pattern". When a "reference sentence pattern" is chosen as the "basic sentence pattern", the pattern with more characters is preferred, which is referred to as "long word first". In language processing, besides the "long word first" principle, factors such as grammar, frequency, association, intonation, accent, pause and tone should be considered together. Language is complex. On the basis of the reference sentence pattern, stylistic information, citation information and related information can be supplied to improve the accuracy of converting a phonetic string into a text string.
  • “Synergy phenomenon” exists not only in language understanding but also in the field of visual cognition.
  • Images are richer than language. An image can first be pre-processed, its "picture elements" can then be recognized according to some rules, and prime substitution and bit marking can be applied to those "picture elements" to achieve quick search and preliminary comparison of images. As in language processing, the structure of the "image", that is, the "grammar of the primitives", can then be analyzed and compared.
  • Biological genes also act synergistically and can be treated in the same way; of course, discovering the specific synergies between genes is the work of biologists. Genes are units of genetic information.
  • The pattern matching index search method is a technique subordinate to the "prime substitution" and "bit mark string retrieval" techniques. The main points of the "bit mark string retrieval technique" are:
  • A mark can be made with "1" or with "0". This document uses the "1" mark.
  • The length k of the keyword T used for retrieval is "uncontrollable"; the controllable factor is the number of bits n used for marking.
  • The value of n must be greater than m and k, but the larger n is, the more space is needed to store the bit values. Taking n at about 2*max(m, k) is appropriate.
  • If the database strings have an average length of 70 characters, n = 128.
  • A "large character set" such as Chinese characters is easy to handle: the Chinese characters are divided into 128 groups accordingly, and if a character element of the string S belongs to the i-th group, the i-th bit is set to 1.
  • If character elements are grouped at random, the "1" bits of T may be spread evenly over the 4 items of W_t. To reduce the number of items to be compared, the groups containing "high-frequency character elements" can be concentrated appropriately in W_s1 and W_s2; after T is marked, its "1" bits are then concentrated in W_t1 and W_t2, which reduces the number of comparisons. This is called "high frequency matching".
  • For a database with several hundred "character elements" of unbalanced frequency, this can improve performance. If the "character elements" are numerous, low in frequency, or balanced in frequency, the method may not be effective.
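The bit marking and comparison described above can be sketched in Python. This is only a minimal sketch under stated assumptions: `group_of` (code point modulo 128) is a hypothetical stand-in for the frequency-based grouping the text describes, and the names `bit_value` and `may_contain` are illustrative rather than taken from the original.

```python
N_BITS = 128  # number of mark bits n, as in the text


def group_of(ch: str) -> int:
    """Hypothetical grouping of a character element into one of N_BITS groups.

    The text groups characters using frequency statistics; the modulo rule
    here is only a placeholder assumption.
    """
    return ord(ch) % N_BITS


def bit_value(s: str) -> int:
    """Set bit i to 1 for every group i that has a character element in s."""
    w = 0
    for ch in s:
        w |= 1 << group_of(ch)
    return w


def may_contain(w_s: int, w_t: int) -> bool:
    """True if every bit set in w_t is also set in w_s.

    False proves that S lacks some character element of T. True only means
    S may contain them all, because different elements can share a group.
    """
    return (w_s & w_t) == w_t


w_s = bit_value("changjiang")
print(may_contain(w_s, bit_value("jia")))   # elements j, i, a all occur in S
print(may_contain(w_s, bit_value("xyz")))   # none of x, y, z occurs in S
```

Records that pass `may_contain` still need a character-level comparison, since the mark ignores order, distance and group collisions.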
  • Pinyin (alphabetic) text belongs to the "small character set".
  • English has only 26 or 52 characters. If 1 letter corresponds to 1 bit, the marks overlap heavily and the screening purpose is hard to achieve.
  • A "chained" mark can be used instead: the 26 letters form 676 two-letter arrangements (aa, ab, ac, ..., zy, zz), and statistical analysis gives the frequency of each arrangement. With database strings averaging 70 letters and marks held in 4 unsigned long integers, n is 128 bits, and the 676 arrangements are divided into 128 groups balanced by frequency.
  • Let the string be "changjiang". Its arrangements are ch, ha, an, ng, gj, ji, ia, an, ng; for each one, the i-th bit corresponding to the group it belongs to is set to "1", which gives the bit value of "changjiang".
  • The items W_s1, W_s2, W_s3, W_s4 of the bit value of S are compared with the corresponding items W_t1, W_t2, W_t3, W_t4 of the bit value of T. All records satisfying the condition are collected, and a character-by-character string comparison of those records then yields R_2. If symbols and spaces in a string are taken into account, there are more arrangements. For very long texts, the "chained" unit can also be set to a three-letter arrangement (aaa, aab, aac, ...), followed by frequency statistics, grouping, marking and retrieval.
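The two-letter "chained" marking can be sketched as follows. The sketch assumes lowercase a-z input, and `pair_group` is a hypothetical placeholder: the text balances the 128 groups using measured frequencies, which are not given, so the pair index is simply folded modulo 128 here.

```python
N_BITS = 128


def pairs(s: str) -> list[str]:
    """Return the overlapping two-letter arrangements of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def pair_group(pair: str) -> int:
    """Hypothetical grouping of the 676 arrangements into 128 groups."""
    a, b = (ord(c) - ord("a") for c in pair)
    return (26 * a + b) % N_BITS


def chained_bit_value(s: str) -> int:
    """Bit value of s built from its two-letter arrangements."""
    w = 0
    for p in pairs(s):
        w |= 1 << pair_group(p)
    return w


print(pairs("changjiang"))
# ['ch', 'ha', 'an', 'ng', 'gj', 'ji', 'ia', 'an', 'ng']
```

The repeated arrangements "an" and "ng" set the same bits twice, so the bit value of "changjiang" carries seven distinct marks under this placeholder grouping.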
  • m and r can be set according to the situation.
  • “Chain segmentation” can be used not only to process long strings of databases, but also to handle long T in speech input reverse retrieval.
  • Zhongguojingjidechixukuaisuzengzhangyizhishixifangjingjixuejiesuojinjinledaodemituan.
  • The sentence has 28 syllables, that is, 28 P.
  • Direct prime substitution would require a large integer data type, and the bit mark would screen poorly, so the sentence is processed in segments.
  • Natural language is generally spoken in groups separated by pauses, but the computer may not be able to segment by pause.
  • "Chain segmentation" avoids such problems: the first segment takes syllables 1-15 (15 syllables in total), the second takes syllables 10-24 (15 syllables in total), and the third takes the syllables from the 19th onward.
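The segment boundaries in this example can be reproduced with a short sketch. The segment length of 15 and the overlap of 6 syllables are read off the 1-15 and 10-24 ranges above; the function name and its parameters are illustrative assumptions.

```python
def chain_segments(total: int, length: int = 15, overlap: int = 6) -> list[tuple[int, int]]:
    """Split syllables 1..total into overlapping (first, last) segments.

    Positions are 1-based and inclusive. Consecutive segments share
    `overlap` syllables; the last segment is cut off at `total`.
    """
    step = length - overlap
    segments = []
    start = 1
    while True:
        end = min(start + length - 1, total)
        segments.append((start, end))
        if end == total:
            break
        start += step
    return segments


print(chain_segments(28))  # [(1, 15), (10, 24), (19, 28)]
```

Because neighbouring segments overlap, a keyword that straddles a segment boundary still falls wholly inside at least one segment, provided it is no longer than the overlap.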
  • The following table shows the probability that a string of 3 P contains K when different proportions of the character elements are selected as K.
  • the number of P in the search keyword T is small, and in the case of Chinese, it is generally 2-6 Chinese characters.
  • Let T be "square is learning".
  • "yes" is a selected character element K.
  • W_t is the selection mark bit value.
  • Any S that contains "square is learning" must contain "yes".
  • The records that pass this screening are then compared with T by their characters or by other filter values.
  • Other filter values refer to another set of bit values, the prime product value F, or the logarithm L of F. As the above table shows, when 70% of the character elements are selected as K for marking the W value, a T of 3 character elements has a 97.3% probability of containing 1-3 K, which means that W_t has a value.
  • p should be taken as large as possible to reduce the probability that T does not contain K.
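The 97.3% figure is consistent with a simple independence model. Assuming, for illustration only, that each of the k character elements of T is a selected element K with probability p independently of the others, the probability that T contains at least one K is:

```latex
P(T \text{ contains at least one } K) = 1 - (1 - p)^{k}
  = 1 - 0.3^{3} = 0.973 \qquad (p = 0.7,\; k = 3)
```

Under the same model, lowering p to 0.4 gives 1 - 0.6^3 = 0.784, which illustrates why a large p is preferred.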
  • the search keyword T does not contain any K
  • The reference sentence pattern library jxk of a typical speech input database includes the following information: reference sentence pattern, pinyin string, bit mark value, frequency, and grammar information. The grammar information may include the part of speech of the reference sentence pattern, whether other components can be inserted within a collocation, and the like.
  • Reference sentence patterns and word combinations may be as short as 2 Chinese characters and as long as 7 or 8. For the test, more than 100,000 words were therefore combined with more than 40,000 three-character words to form 4,019,576 "nonsense" reference sentence patterns of five Chinese characters each.
  • The method is to organize the storage of jxk by bit value, so that each bit value W_n is compared with the keyword bit value W_t only once, or only a few times:
  • jxk is organized and stored by W. This can be realized with the "aggregate storage" of a general database, but the aim is only to store records with the same bit value in adjacent space; storage in order of the W value is not required. Of course, if records are sorted by the number of "1" bits in the W value, the number of "1" bits in T can be used during a search to determine the starting point of the bit value comparison.
  • jxk is sorted by bit value through a query, and a bit value index table, denoted syb, is generated.
  • The syb table has the field "mark", which stores the "index bit value": the distinct "general bit values" W extracted from jxk with duplicates removed. It is denoted V purely for convenience of explanation and does not differ from W in essence. The syb table also has the field "jishu", the number of times each "index bit value" V appears in jxk, and the address field "pJuzi", which gives, for each index bit value V, the address of its first record in jxk. See Figure 1 for the process.
  • V n 5648 records
  • V n 5649
  • the address of the first record is 0x00732C78.
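The organization of jxk by bit value and the syb index table can be sketched as below. The field names mark, jishu and pJuzi follow the text; everything else is an assumption for illustration: the 16-bit modulo grouping in `bit_value` is a placeholder, a list offset stands in for the record address, and the final character check is a simple set-style containment test.

```python
from itertools import groupby


def bit_value(s: str, n_bits: int = 16) -> int:
    """Placeholder bit mark: character code modulo n_bits picks the bit."""
    w = 0
    for ch in s:
        w |= 1 << (ord(ch) % n_bits)
    return w


def build_planned_storage(records):
    """Store records with equal bit values adjacently and build syb.

    Returns (jxk, syb). Each syb row is (mark, jishu, pJuzi): the index bit
    value V, how many records share it, and the offset of the first one.
    """
    jxk = sorted(records, key=bit_value)
    syb = []
    pos = 0
    for mark, group in groupby(jxk, key=bit_value):
        jishu = sum(1 for _ in group)
        syb.append((mark, jishu, pos))
        pos += jishu
    return jxk, syb


def search(jxk, syb, t):
    """Compare each distinct bit value with T's bit value only once."""
    v_t = bit_value(t)
    hits = []
    for mark, jishu, p_juzi in syb:
        if (mark & v_t) == v_t:                    # the whole run passes screening
            for rec in jxk[p_juzi:p_juzi + jishu]:
                if all(ch in rec for ch in t):     # character element check
                    hits.append(rec)
    return hits


jxk, syb = build_planned_storage(["abc", "bca", "xyz", "abd", "cab"])
print(syb)                      # [(14, 3, 0), (22, 1, 3), (1792, 1, 4)]
print(search(jxk, syb, "ab"))   # ['abc', 'bca', 'cab', 'abd']
```

Five records need only three bit value comparisons here; the saving grows with the average number of records per index bit value.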
  • jxk is bit-marked with "another character element grouping scheme", giving the bit value W of each record.
  • Double-table processing uses a reference sentence table jxk and a bit index table syb; in practice there is also a syllable information table yjb, which holds the syllable characters, statistical frequencies, mark groups, basic bit values, substitution primes and other information for the more than 400 syllables of Chinese Pinyin.
  • Single-table processing also has the syllable information table yjb, but combines the reference sentence table jxk and the bit index table syb into one table, syjxb:
  • syjxb
  • the reference sentence pattern is mainly the basic sentence pattern of 6 character elements
  • the CPU takes about 0.1 seconds, and the database is still 4,019,576 records.
  • Planning storage index lookup can still improve the response speed.
  • In some languages, sentences may consist of more character elements because of the language's own characteristics, so the extracted basic sentence patterns necessarily have more character elements.
  • The number of index value V types increases correspondingly, and using planning storage index lookup directly does not work well.
  • When the string length is 7 characters, the bit value after marking may have 1-7 bits set to "1".
  • When n is 31, the number of V types is 3,572,223; when n is 24, it is 536,154.
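Both counts equal the number of n-bit values with between 1 and 7 bits set, that is, the sum of the binomial coefficients C(n, i) for i = 1 to 7. A short check (the function name is illustrative):

```python
from math import comb


def count_bit_value_types(n: int, max_ones: int) -> int:
    """Number of distinct n-bit values with between 1 and max_ones bits set."""
    return sum(comb(n, i) for i in range(1, max_ones + 1))


print(count_bit_value_types(31, 7))  # 3572223
print(count_bit_value_types(24, 7))  # 536154
```

Reducing n from 31 to 24 cuts the number of possible index values by a factor of more than six, which is the motivation for marking with fewer bits.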
  • In forward retrieval, the database strings are generally longer and T is generally shorter; in reverse retrieval, the database strings are generally shorter and T is generally longer.
  • n the number of records, and the amount of memory.
  • the comparison screening takes less time.
  • When jxk is stored and indexed in aggregate by V, each contiguous area holds many records. If the database is too large to be loaded fully into memory and retrieval has to read data from external storage, the time for head seeking and positioning can be reduced. However, for a given k, a small n gives a poor screening effect.
  • Let the database and the keywords be marked with "1", let the number of bits used for marking be n, let S have m character elements, and let the search keyword T have k character elements.
  • The exact probability for reverse retrieval should take the overlap probability into account, but it can be roughly calculated with the following formula:
  • The "bit mark" is used for "index lookup" because bit operations are fast, but "prime substitution" also has merit.
  • Selective substitution with prime numbers: to avoid overflow, or having to use large data types, p must be small; but when p is small, the probability that T does not contain K is large.
  • n prime numbers m times product value F is: C '-i.
  • For a given selection ratio p, in an S of length m the number of K, mk, is mainly distributed near the mathematical expectation m*p. If the probability that S contains more than 6 K is small, the F value can be kept below 2,000,000; at a clock frequency of 800 MHz the divisibility judgment then takes about 100-200 ms, the screening probability is good, and a database with a large number of records can use it to organize storage.
  • The division instruction is time-consuming, and even more so when the dividend exceeds the length the instruction allows, which limits the size and the number m of the prime numbers.
  • Multiplication and division operations can be implemented by logarithmic and exponential operations.
  • Let the logarithm of the prime product F_a of S_a be L_a, and let the logarithm of the prime product F_b of S_b be L_b, both taken to base r. If power(r, (L_a - L_b)) is an integer, F_a is divisible by F_b, and S_a contains or may contain all the character elements of S_b.
  • Logarithms can handle longer strings, but floating-point operations introduce errors, so even if F_a is divisible by F_b, power(r, (L_a - L_b)) may not come out as an exact integer.
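The exact and logarithmic divisibility tests can be compared in a small sketch. The prime table, the use of natural logarithms (base e rather than a general base r), and the tolerance value are all assumptions made for illustration.

```python
import math

# Hypothetical substitution table: one small prime per character element.
PRIMES = {"a": 2, "b": 3, "c": 5, "d": 7, "e": 11}


def prime_product(s: str) -> int:
    """Prime product value F of a string."""
    f = 1
    for ch in s:
        f *= PRIMES[ch]
    return f


def divisible_exact(f_a: int, f_b: int) -> bool:
    """Exact divisibility judgment on the integer products."""
    return f_a % f_b == 0


def divisible_by_logs(l_a: float, l_b: float, tol: float = 1e-9) -> bool:
    """Approximate judgment: exp(l_a - l_b) should be an integer.

    The quotient is compared with its nearest integer using a relative
    tolerance, which absorbs floating-point error.
    """
    q = math.exp(l_a - l_b)
    return abs(q - round(q)) <= tol * max(1.0, q)


f_a = prime_product("abcd")   # 2 * 3 * 5 * 7 = 210
f_b = prime_product("ca")     # 5 * 2 = 10
f_c = prime_product("ae")     # 2 * 11 = 22
print(divisible_exact(f_a, f_b), divisible_by_logs(math.log(f_a), math.log(f_b)))
print(divisible_exact(f_a, f_c), divisible_by_logs(math.log(f_a), math.log(f_c)))
```

A strict `q == round(q)` test would be fragile for exactly the reason the text gives; the tolerance trades that fragility for a small chance of false positives, so candidates that pass still need a character-level check.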
  • the characteristics of a database are determined by three aspects: the average length of the string m, the expected value of the length of the search keyword T, and the number of records r.
  • the selection mark storage index scheme is determined by four aspects: the number of bits of the mark n, the ratio of the selection mark K, the number of sets of the mark K and the organization.
  • A string of length m may contain 0 to m K, but the number mk is mainly distributed near the mathematical expectation m*p. If n is set at or around m*p*2 for marking, the number of bit value types R_v is at most 2^n.
  • When p is not large, mk is mostly distributed below m*p and the marks overlap.
  • If the number of bit value types R_v is below r/10, the probability that V_t has a value is P > 90%, and the screening probability is below 10%, then the scheme is ideal; but when m is large this is not easy to achieve.
  • the following general quantile marks are used for retrieval, reverse search, and prime substitution to illustrate the main methods of comprehensive application.
  • the application can be modified according to database, T, and hardware conditions.
  • There are 108,928 V types, and each bit value corresponds to 36.9 basic sentence patterns on average.
  • The V value distribution of the "selection mark" is not as balanced as that of the "complete mark", and its screening effect is poorer, but the test shows that the scheme is also effective.
  • Planning storage can be performed using either the "complete mark bit value" or the "selection mark bit value". Because marks have a certain probability of overlapping, the screening effect is poor when the number of "1" bits in the V value of T after marking is relatively small, or, in reverse search, when the number of "1" bits in the V value of a database string S after marking is relatively small, so another set of "complete mark bit values" is used for a second screening. This is especially necessary when "selection mark bit values" are used for "planning storage". See the specific implementation section for the search code. A second or third screening can also be performed with other screening values such as F or L.
  • Let the first set of mark bit values be V_1; jxk is organized and stored by V_1 as v1jxk, and v1syb is generated. Let the second set of mark bit values be V_2; jxk is organized and stored by V_2 as v2jxk, and v2syb is generated.
  • V_1t and V_2t are 0.
  • the second step does not have a filtering effect and can be skipped.
  • When V_1t and V_2t are 0 at the same time, the records are checked directly in jxk, using character element comparison or other filter values, to get R_3. See Figure 2 for the search process.
  • Within V_1 and V_2, jxk can be sorted according to other needs; for example, in reverse retrieval for speech input, jxk can be arranged in descending order of the number of P.
  • the four schemes are not ideal.
  • "Graded organization" should make the screening effect of level 1 better than that of level 2. But the screening probability is good only when k*p is relatively large, which requires a large p; with a large p, m*p is large, so n must be large to maintain the screening probability, and the number of V types grows. Only a database with a particularly large number of records can readily adopt it.
  • For databases with long strings and many records, further sets V_3 and V_4 can be added for mutual and hierarchical organization. In general, long strings are not convenient for planning storage indexes; if their dependence on context is low, they can be divided into "short sentences" before planning storage indexing is performed.
  • the various methods mentioned above should be adjusted according to the characteristics of the database and T when applied to the reverse search.
  • the probability of "bit of 1", that is, kkb 4 is large, but the probability of "bit of 1", that is, mkb 3, is large after the mark overlap in the database, which affects the screening effect.
  • When two schemes are used for marking at the same time, it can be arranged so that, after marking with one scheme, the records of jxk with mkb 4 are organized and stored and the index table syb1 is generated; the records with mkb 3 are marked with the other scheme and organized and stored to generate syb2. When searching, T is marked with both schemes, and syb1 and syb2 are each used to locate records in jxk.
  • When the average m of the database is relatively large, two sets of selected elements, K_1 and K_2, are chosen for "prime substitution"; scheme 2 should cover, as far as possible, the records for which F_t has no value or overflows.
  • If the lengths m in the database differ widely, it can be divided into two tables: records with short m store the F value in 32- or 64-bit integers; records with long m store the F value in integers of 128 bits or more, convert it to the L value, or are handled with W and V.
  • Taking the intermediate prime number 97 as representative, 9 character elements are allowed.
  • K 16 of T
  • k 20 of T
  • Selecting K with p = 40% for "prime substitution", there is a 75.5% probability that kk < 10; with 2 sets of K used for substitution, the probability that F_t does not overflow is relatively large.
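Both figures can be checked numerically under two assumptions that are not stated next to them: that the F value is held in a 64-bit unsigned integer, and that the number of selected elements kk in a keyword of k = 20 elements follows a binomial distribution with p = 0.4. The function names are illustrative.

```python
from math import comb


def prob_fewer_than(limit: int, k: int, p: float) -> float:
    """P(kk < limit) when kk is binomial with k trials and success rate p."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(limit))


def max_elements(prime: int, bits: int = 64) -> int:
    """Largest count c such that prime**c still fits in `bits` unsigned bits."""
    c = 0
    while prime ** (c + 1) < 2 ** bits:
        c += 1
    return c


print(max_elements(97))                      # 9
print(round(prob_fewer_than(10, 20, 0.4), 3))  # 0.755
```

So a product of nine factors of size 97 fits in 64 bits while ten do not, and about three keywords in four stay within that limit under this model.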
  • Level 1 is screened by F value and level 2 by V value. Whether to adopt this should be analyzed and decided according to the characteristics of the database and T; complicated processing that does not help is unnecessary.
  • Figure 1 is a flow chart of the bit mark storage index and the mutual organization storage index.
  • Figure 2 is a flow chart of the forward search using the mutual organization storage index in one embodiment.
  • This document describes a variety of storage indexing schemes, which can be applied according to the characteristics of the database and the hardware environment. The following is the forward and reverse retrieval code for the "selection mark storage index", which has been run under VC. Other schemes can be implemented by reference to it.
  • weizhi is the selection mark index bit value V
  • weizhi2 is the complete mark bit value W
  • Bool mushi; // forward search is 0, reverse search is 1

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an index search method based on pattern matching, comprising: storing a database or other filter values according to the bit value while reducing the number n of bits used for marking; and selecting a portion of the character elements to be marked, or marking in an unbalanced manner, group by group. The database can also be stored according to the product of prime numbers, the number n of prime numbers used can be reduced by substituting selected portions of the character elements, or unbalanced marking can be performed, in order to speed up character search based on prime number substitution. A combination of the different methods can also be adopted according to the characteristics of the database and of the search keywords.
PCT/CN2006/000979 2005-12-12 2006-05-15 Procede de recherche dans un index par correspondance de motifs WO2007068160A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200510111376.0 2005-12-12
CNA2005101113760A CN1983249A (zh) 2005-12-12 2005-12-12 字符串规划存贮索引查找技术

Publications (1)

Publication Number Publication Date
WO2007068160A1 true WO2007068160A1 (fr) 2007-06-21

Family

ID=38162544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2006/000979 WO2007068160A1 (fr) 2005-12-12 2006-05-15 Procede de recherche dans un index par correspondance de motifs

Country Status (2)

Country Link
CN (1) CN1983249A (fr)
WO (1) WO2007068160A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013128333A1 (fr) * 2012-03-01 2013-09-06 International Business Machines Corporation Recherche d'une chaîne de meilleur appariement parmi un ensemble de chaînes
CN107169046A (zh) * 2017-04-25 2017-09-15 广东网金控股股份有限公司 一种数据库索引查找方法、装置及用户终端

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1179662A (zh) * 1996-09-25 1998-04-22 松下电器产业株式会社 模式匹配装置
CN1205486A (zh) * 1997-07-15 1999-01-20 三星电子株式会社 考虑到距离和方向的模式匹配装置及其方法

Also Published As

Publication number Publication date
CN1983249A (zh) 2007-06-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06741869

Country of ref document: EP

Kind code of ref document: A1