CN103365998B - Similar character string retrieval method


Info

Publication number
CN103365998B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310294529.4A
Other languages
Chinese (zh)
Other versions
CN103365998A (en)
Inventor
胡颢继
王晓玲
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201310294529.4A
Publication of CN103365998A
Application granted
Publication of CN103365998B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a similar character string retrieval method comprising: using the overlap information between inverted lists in an inverted index to obtain a gram discriminability index; combining the gram discriminability index with the inverted-list length ratio to form a composite index; and, according to the composite index, obtaining a candidate set through a generalized prefix filter and computing the true edit distance of the strings in the candidate set to obtain the retrieval result. The method improves retrieval efficiency, and the improvement is especially large as the edit distance threshold increases.

Description

A similar character string retrieval method

Technical field

The invention belongs to the technical field of databases, and in particular relates to a method for retrieving similar character string data.

Background art

In recent years, the database community has studied string similarity queries extensively. Such queries are widely used in data fusion, data cleaning, duplicate detection, natural language processing, and bioinformatics. For example, database records from different sources that represent the same real-world entity may contain errors, so these records must be fused when a data warehouse is built. In protein analysis, information about a new protein sequence must be obtained quickly, and the known sequences most similar to it can be used to describe it. In existing search engines, user input often contains errors, so the engine needs to offer error correction that prompts users and helps them identify mistakes in their input.

Recently proposed methods for string similarity queries include VChunk, asymmetric-Search, and Adaptive-Search. VChunk uses a statistics-based chunk boundary dictionary to split strings and generate signatures for retrieval. asymmetric-Search reduces the number of matching items, and thereby improves performance, by using different signatures for the query and the index (grams for one, chunks for the other). Adaptive-Search rests on an experimental observation: when a prefix filter is used, moderately increasing the number of grams used to access the inverted lists can improve performance. The first two methods are built on the traditional prefix filter, which produces large candidate sets. The third method tries to improve the traditional prefix filter by increasing the number of grams actually used to access the inverted index, but because it relies only on a heuristic rule drawn from experimental observation, it does not reach the best performance.

Summary of the invention

The invention addresses the following defects of the prior art: the candidate set becomes too large as the edit distance threshold increases, the filter cannot be chosen adaptively from among several filters, and the quantitative indicators used for query optimization are imprecise. To this end, it proposes a similar character string retrieval method.

The similar character string retrieval method proposed by the invention comprises the following steps:

Step 1: using the overlap information between inverted lists in the inverted index, obtain the gram discriminability index;

Step 2: combine the gram discriminability index with the inverted-list length ratio to form a composite index;

Step 3: according to the composite index, obtain a candidate set through the generalized prefix filter, and obtain the retrieval result by computing the true edit distance of the strings in the candidate set.

In the proposed method, before step 1, an inverted index is first built for the set of strings to be searched, and the MinHash signatures corresponding to the inverted index are generated.

In the proposed method, the inverted index is built by numbering the strings, splitting them into q-grams, and mapping each resulting gram, as an entry, to the strings that contain it.

In the proposed method, a hash function is selected according to string length, and the MinHash signatures are generated from it together with the inverted index.

In the proposed method, the gram discriminability index in step 1 is the number of grams in the query string's gram set that have the same MinHash signature as the target gram, not counting the target gram itself.

In the proposed method, step 2 takes the inverted-list length to be the number of inverted-list elements actually used, and combines the gram discriminability index with the inverted-list length index by a weighted sum, in proportion to list length, to form the composite index.

In the proposed method, obtaining the retrieval result through the generalized prefix filter in step 3 comprises the following steps:

Step A1: split the query string into grams, and sort the grams in ascending order of the composite index;

Step A2: select the first τ+1 grams that do not overlap one another in the string as the base gram query set, where τ is the distance threshold;

Step A3: repeatedly add a new gram and compute the resulting gain, until the newly added gram yields no gain or no grams remain;

Step A4: merge the inverted lists of all selected grams, add every string that occurs at least LB times to the candidate set, and compute the true edit distance for the strings in the candidate set to obtain the retrieval result, where LB is the lower bound on the number of common grams used by the generalized prefix filter.

In the proposed method, the gain in step A3 is given by the following formula:

G = \left( \frac{\frac{|C_i|}{|I(g_n)|} + 1}{J_s(C_i, I(g_n)) + 1} - 1 \right) \cdot cost(Q)

where G is the gain, C_i is the current candidate set and |C_i| its size, |I(g_n)| is the length of the inverted list of g_n, J_s is the Jaccard similarity, and cost(Q) is the average time taken by one true edit distance computation for the query.

The invention proposes a generalized prefix filter and, using the overlap information between inverted lists, a new index for measuring gram quality. With the optimized generalized prefix filter, retrieval efficiency improves, and the improvement is especially large when the edit distance threshold is high. The invention also overcomes the inability of traditional methods to select the best filter adaptively.

Description of drawings

Fig. 1 is a flow chart of the similar character string retrieval method of the invention;

Fig. 2 shows the inverted index in the embodiment;

Fig. 3 shows the MinHash signatures in the embodiment.

Detailed description

The invention is described in further detail below with reference to the specific embodiment and the drawings. Except where specifically noted below, the processes, conditions, and experimental methods for implementing the invention are common general knowledge in the field, and the invention places no special limitation on them.

The method introduces a new generalized prefix filter that shrinks the candidate set as far as possible. It makes full use of the information in the inverted index: it uses the overlap between inverted lists as an index and adds high-quality grams to reduce the candidate set without degrading performance. Although this adds estimation overhead compared with the prior art, it greatly improves retrieval performance. As shown in Fig. 1, the method comprises the following steps:

Step 1: using the overlap information between inverted lists in the inverted index, obtain the gram discriminability index. The index is the number of grams in the query string's gram set that have the same MinHash signature as the target gram, not counting the target gram itself.

Further, before step 1, an inverted index is built for the set of strings to be searched, and the MinHash signatures corresponding to the inverted index are generated. The inverted index is built by numbering the strings, splitting them into q-grams, and mapping each resulting gram, as an entry, to the strings that contain it. A hash function is selected according to string length, and the MinHash signatures are generated from it together with the inverted index.

Step 2: combine the gram discriminability index with the inverted-list length ratio to form a composite index. The inverted-list length is taken to be the number of inverted-list elements actually used, and the gram discriminability index and the inverted-list length index are combined by a weighted sum, in proportion to list length, to form the composite index.

Step 3: according to the composite index, obtain a candidate set through the generalized prefix filter, and compute the true edit distance of the strings in the candidate set to obtain the retrieval result. Obtaining the retrieval result through the generalized prefix filter comprises the following steps:

Step A1: split the query string into grams, and sort the grams in ascending order of the composite index;

Step A2: select the first τ+1 grams that do not overlap one another in the string as the base gram query set, where τ is the distance threshold;

Step A3: repeatedly add a new gram and compute the resulting gain, until the newly added gram yields no gain or no grams remain. The gain in step A3 is given by the following formula:

G = \left( \frac{\frac{|C_i|}{|I(g_n)|} + 1}{J_s(C_i, I(g_n)) + 1} - 1 \right) \cdot cost(Q)

where G is the gain, C_i is the current candidate set and |C_i| its size, |I(g_n)| is the length of the inverted list of g_n, J_s is the Jaccard similarity, and cost(Q) is the average time taken by one true edit distance computation for the query.
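The gain in step A3 can be sketched in code. This is a minimal illustration, not the patented implementation: it assumes the formula reads G = ((|C_i| / |I(g_n)| + 1) / (J_s + 1) - 1) * cost(Q), which is one reading of the flattened formula in claim 1, and it computes the Jaccard similarity exactly from sets rather than estimating it from MinHash signatures. The function name `gain` and its parameters are hypothetical. Under this reading, with an exact Jaccard value the bracketed term equals |C_i \ I(g_n)| / |I(g_n)|, the number of candidates the new gram's list does not contain divided by the length of the list that must be scanned.

```python
def gain(candidates, posting_list, verify_cost):
    """Gain of adding a gram, under the assumed reading of the formula.

    candidates   -- ids in the current candidate set C_i
    posting_list -- ids in the inverted list I(g_n) of the new gram
    verify_cost  -- cost(Q), average time of one edit distance computation
    """
    c, p = set(candidates), set(posting_list)
    if not p:
        return 0.0  # assumption: an empty list is treated as giving no gain
    jaccard = len(c & p) / len(c | p)
    return ((len(c) / len(p) + 1) / (jaccard + 1) - 1) * verify_cost


# Candidates {1, 3} against the list {1, 2, 3, 4}: every candidate is
# already in the list, so the bracketed term is zero.
print(gain({1, 3}, {1, 2, 3, 4}, 1.0))  # 0.0
```

The zero result for candidates {1, 3} and list {1, 2, 3, 4} agrees with the embodiment, where adding st to the base grams ri and ck brings no gain.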

Step A4: merge the inverted lists of all selected grams, add every string that occurs at least LB times to the candidate set, and compute the true edit distance for the strings in the candidate set to obtain the retrieval result, where LB is the lower bound on the number of common grams used by the generalized prefix filter.

In the invention, the generalized prefix filter is a filtering scheme in which, during a similarity query, only the first several grams of the gram set sorted in ascending order of the composite index are used to filter out irrelevant strings.

Embodiment:

In the invention, the preprocessing stage has two parts: building the inverted index and generating the MinHash signatures.

The inverted index records the mapping from grams to strings. Table 1 shows a string set containing 5 strings, numbered 0 to 4.

Table 1. Strings of this embodiment

id  String
0   rich
1   stick
2   stich
3   stuck
4   stuch

First, all 5 strings are split into 2-grams, and an inverted index is built with every gram that occurs as an entry, giving the inverted index shown in Fig. 2. Each inverted list stores all strings in the set that contain the corresponding gram.
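The index construction described above can be sketched as follows. This is a minimal illustration using the strings of Table 1; the function names `qgrams` and `build_inverted_index` are hypothetical, and sorted Python lists stand in for the inverted lists of Fig. 2.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the overlapping q-grams of s in positional order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q=2):
    """Map each gram to the sorted ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(sid)
    return {gram: sorted(ids) for gram, ids in index.items()}


STRINGS = ["rich", "stick", "stich", "stuck", "stuch"]  # Table 1, ids 0..4
INDEX = build_inverted_index(STRINGS)
print(INDEX["ch"])  # [0, 2, 4]
print(INDEX["st"])  # [1, 2, 3, 4]
```

Each string id is stored once per gram, so a list's length equals the number of strings that contain the gram.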

Generating MinHash signatures requires permutations, which are usually simulated with hash functions. Table 2 lists two hash functions. Strings of the same length use the same group of hash functions; in this embodiment each group contains only one hash function.

Table 2. Hash functions used by the different groups

String length  Hash function
4              (x+1)%2
5              (2x+3)%5

The first hash function applies to string 0, whose length is 4; the second applies to strings 1 to 4, whose lengths are all 5. Take ch as an example. The length-4 interval contains only one string, so the MinHash signature value there is (0+1)%2=1. The length-5 interval contains 4 strings, and the signature is the position of the first nonzero entry after the hash function permutes them. Because the ids in an inverted list are exactly the positions that are 1 in the vectorized form, it suffices to feed each id in the list to the hash function and take the smallest result. Applying the second hash function to the two nonzero positions gives (2*2+3)%5=2 and (2*4+3)%5=1; the latter is smaller, so it becomes the MinHash signature for the length-5 interval. The final MinHash signature of ch is 1-1. Fig. 3 shows the MinHash signatures generated from the inverted index in Fig. 2.
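The signature computation for ch can be reproduced with a short sketch. It assumes one hash function per string-length interval, as in Table 2, and takes the minimum hash value over the ids in each gram's inverted list; the names `build_index`, `minhash_signatures`, and `HASHES` are hypothetical.

```python
from collections import defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "stuch"]  # Table 1, ids 0..4
# Table 2: one hash function per string-length interval.
HASHES = {4: lambda x: (x + 1) % 2, 5: lambda x: (2 * x + 3) % 5}


def build_index(strings, q=2):
    """Map each q-gram to the set of ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i + q]].add(sid)
    return index


def minhash_signatures(strings, index, hashes):
    """For each gram, keep the smallest hash value per length interval."""
    signatures = {}
    for gram, ids in index.items():
        per_interval = {}
        for sid in ids:
            interval = len(strings[sid])
            value = hashes[interval](sid)
            per_interval[interval] = min(value, per_interval.get(interval, value))
        signatures[gram] = per_interval
    return signatures


SIGNATURES = minhash_signatures(STRINGS, build_index(STRINGS), HASHES)
print(sorted(SIGNATURES["ch"].items()))  # [(4, 1), (5, 1)]
```

The two values for ch correspond to the signature written as 1-1 in the text: 1 in the length-4 interval and 1 in the length-5 interval.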

For gram quality, the invention proposes a new composite index based on the length of the gram's inverted list and the gram's discriminability. The discriminability index measures how many candidate results a gram produces. The lower the index, the less likely the gram is to produce candidate strings together with other grams, so the gram's quality is higher and it should be preferred over other grams. In the generalized gram filter, choosing any D+LB grams guarantees a correct result, but filters built from different gram combinations perform very differently. Gram discriminability is the quantitative index used to guide the choice of grams, and it represents gram quality.

Gram discriminability is a relative index and must be defined with respect to a specific q-gram set. Given the q-gram set Gq(S) of a query string S and any gram g_i ∈ Gq(S), let I(g_i) denote the set of all strings in the inverted list whose entry in the inverted index is g_i. For each e_j ∈ I(g_i), count the grams in Gq(S) whose inverted lists contain e_j and subtract 1; the result is the discriminability of that element. Summing the discriminability of every element in the inverted list I(g_i) gives the discriminability of g_i.

Because inverted lists are usually long, computing this index exactly during a query is very time-consuming. The invention therefore estimates gram discriminability from the MinHash signatures. The estimate is computed in the same way as the true value, with the MinHash signatures taking the place of the original inverted lists. For example, let the query be strick with τ=1; its 2-gram set is {st, tr, ri, ic, ck}. Take st as an example. The length L of strick is 6 and τ=1, so by the properties of string retrieval only the part of the inverted lists in the interval [L-τ, L+τ] needs to be considered, that is, only the MinHash signature of interval 5. In length interval 5 the MinHash signature of st is 0. Among all the 2-grams of the query, 3 grams have signature 0 under the same hash function (ck, ic, and st), so the estimated discriminability of st is 3-1=2.

To merge inverted-list length and discriminability into the total weight of a gram, each is normalized and the two are added with a weight of 1/2 each. Table 3 lists the merged weight of each gram in each interval, which expresses the gram's quality in that interval.
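The MinHash-based estimate for the query strick can be sketched as follows. The signature values are an assumption in the sense that they are not quoted from Fig. 3; they were obtained by applying the Table 2 hash functions to the Table 1 strings. The function name `estimate_discriminability` is hypothetical, as is the choice to return None for a gram that is absent from the index (shown as null in Table 3) and 0 for a gram with no entry in the requested interval.

```python
# MinHash signatures of the query grams per length interval (assumed values,
# derived from the Table 2 hash functions and the Table 1 strings).
SIGNATURES = {
    "st": {5: 0},
    "ri": {4: 1},
    "ic": {4: 1, 5: 0},
    "ck": {5: 0},
}


def estimate_discriminability(query_grams, signatures, interval):
    """Count the other query grams sharing each gram's signature."""
    result = {}
    for gram in query_grams:
        if gram not in signatures:
            result[gram] = None  # gram never occurs in the indexed strings
            continue
        sig = signatures[gram].get(interval)
        if sig is None:
            result[gram] = 0  # no inverted list in this length interval
            continue
        same = sum(1 for other in query_grams
                   if signatures.get(other, {}).get(interval) == sig)
        result[gram] = same - 1  # exclude the target gram itself
    return result


QUERY_GRAMS = ["st", "tr", "ri", "ic", "ck"]  # 2-grams of "strick"
print(estimate_discriminability(QUERY_GRAMS, SIGNATURES, 5))
# {'st': 2, 'tr': None, 'ri': 0, 'ic': 2, 'ck': 2}
```

For interval 5 this gives 2 for st, ic, and ck, matching the value 3-1=2 worked out for st in the text.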

Table 3. Weights from gram discriminability and inverted-list length ratio

gram  Discriminability 4  Discriminability 5  Length 5  Merged weight 5  Total weight
st    0                   2                   4         0.42             0.42
tr    null                null                null      null             null
ri    1                   0                   0         0                0
ic    1                   2                   2         0.29             0.29
ck    0                   2                   2         0.29             0.29

When the weights of different intervals are merged, they are summed in proportion to the number of strings each interval contains. For st, for example, the total is 0*0.2+0.42*0.8=0.336.
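The weighting arithmetic can be checked with a small sketch. It assumes that normalization means dividing each value by its sum over the query grams, an assumption that reproduces the merged weights 0.42 and 0.29 of Table 3 but is not stated explicitly in the text. The function names `merged_weights` and `combine_intervals` are hypothetical.

```python
def merged_weights(discriminability, list_length):
    """Average of sum-normalized discriminability and list length per gram."""
    d_total = sum(discriminability.values()) or 1
    l_total = sum(list_length.values()) or 1
    return {
        gram: 0.5 * discriminability[gram] / d_total
              + 0.5 * list_length[gram] / l_total
        for gram in discriminability
    }


def combine_intervals(weight_by_interval, strings_per_interval):
    """Sum interval weights in proportion to the strings in each interval."""
    total = sum(strings_per_interval.values())
    return sum(weight * strings_per_interval[interval] / total
               for interval, weight in weight_by_interval.items())


# Interval-5 columns of Table 3 (tr is omitted because it is null).
DISCRIMINABILITY_5 = {"st": 2, "ri": 0, "ic": 2, "ck": 2}
LENGTH_5 = {"st": 4, "ri": 0, "ic": 2, "ck": 2}
WEIGHTS_5 = merged_weights(DISCRIMINABILITY_5, LENGTH_5)
print({gram: round(w, 2) for gram, w in WEIGHTS_5.items()})
# {'st': 0.42, 'ri': 0.0, 'ic': 0.29, 'ck': 0.29}

# One string of length 4 and four of length 5, as in Table 1.
print(round(combine_intervals({4: 0.0, 5: 0.42}, {4: 1, 5: 4}), 3))  # 0.336
```

The proportions 0.2 and 0.8 in the text come from the one length-4 string and the four length-5 strings of Table 1.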

The search framework based on the generalized prefix filter has two steps: generating the base query gram set and adding new grams. Before execution, the grams generated from the query are sorted in ascending order of the composite quality index. According to the final weights in Table 3, the order is ri, ic, ck, st. Grams that cover a common position are all affected by an edit operation at that position, so a new gram that covers an already selected position brings no performance gain. When grams are chosen, therefore, a new gram must not share a position with any gram already selected. For example, in the string abcdef, once gram ab is selected, bc cannot be used, because ab and bc share b. After sorting, the first τ+1 grams with no shared positions are selected as the base query gram set. This embodiment selects ri and ck (ic is skipped because it shares a position with ri) and merges the inverted lists of these two grams; all elements appearing in the lists, 1 and 3, become candidates.

Adding new grams is a loop that stops when there is no further performance gain. The gain is

G = \left( \frac{\frac{|C_i|}{|I(g_n)|} + 1}{J_s(C_i, I(g_n)) + 1} - 1 \right) \cdot cost(Q)

where C_i is the current candidate set and |C_i| its size, |I(g_n)| is the length of the inverted list of g_n, J_s is the Jaccard similarity, and cost(Q) is the average time taken by one true edit distance computation for the query. If the gain is greater than 1, the new gram improves performance and is kept. New grams are tried one by one in the sorted order, so st is evaluated next. The calculation shows that adding st brings no performance gain, so it is not added and the loop stops. In this embodiment the candidate set is {1, 3}, that is, {stick, stuck}. After the true edit distance between each candidate and strick is computed, only string 1 is within distance 1 of the query strick, so string 1 is the result that satisfies the condition.
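The base gram selection and the final verification of the embodiment can be sketched as follows. This is a simplified illustration rather than the full method: it covers only the base query set (the gain loop of step A3 is omitted), it treats the candidates of the base set as the union of the selected inverted lists restricted to lengths in [L-τ, L+τ], as the embodiment does, and the names `select_base_grams`, `edit_distance`, `search`, and `POSTINGS` are hypothetical.

```python
STRINGS = ["rich", "stick", "stich", "stuck", "stuch"]  # Table 1, ids 0..4
# Inverted lists of the query grams that occur in the index.
POSTINGS = {"st": [1, 2, 3, 4], "ri": [0], "ic": [0, 1, 2], "ck": [1, 3]}


def select_base_grams(ranked, tau):
    """Pick the first tau+1 grams, in rank order, that share no position."""
    chosen, used = [], set()
    for gram, start in ranked:
        span = set(range(start, start + len(gram)))
        if span & used:
            continue  # overlaps an already selected gram
        chosen.append(gram)
        used |= span
        if len(chosen) == tau + 1:
            break
    return chosen


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, ranked, tau):
    """Return (base grams, candidate ids, verified result ids)."""
    base = select_base_grams(ranked, tau)
    low, high = len(query) - tau, len(query) + tau
    candidates = sorted({sid for gram in base for sid in POSTINGS.get(gram, [])
                         if low <= len(STRINGS[sid]) <= high})
    results = [sid for sid in candidates
               if edit_distance(query, STRINGS[sid]) <= tau]
    return base, candidates, results


# Grams of "strick" in ascending weight order, with their start positions.
RANKED = [("ri", 2), ("ic", 3), ("ck", 4), ("st", 0)]
print(search("strick", RANKED, 1))  # (['ri', 'ck'], [1, 3], [1])
```

The length restriction removes string 0 (rich, length 4) from the list of ri, which is why the candidates are only strings 1 and 3.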

The scope of protection of the invention is not limited to the above embodiment. Variations and advantages that those skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.

Claims (4)

1. A similar character string retrieval method, characterized by comprising the following steps:
Step 1: using the overlap information between inverted lists in an inverted index, obtain a gram discriminability index; wherein the gram discriminability index is the number of grams, in the gram set of the query string, that have the same MinHash signature as the target gram, the count excluding the target gram;
Step 2: combine the gram discriminability index with the inverted-list length ratio to form a composite index; wherein the inverted-list length is computed as the number of inverted-list elements actually used, and the gram discriminability index and the inverted-list length index are combined by a weighted sum, in proportion to list length, to form the composite index;
Step 3: according to the composite index, obtain a candidate set through a generalized prefix filter, and obtain the retrieval result by computing the true edit distance of the strings in the candidate set; wherein obtaining the retrieval result through the generalized prefix filter comprises the following steps:
Step A1: split the query string into grams, and sort the grams in ascending order of the composite index;
Step A2: select the first τ+1 grams that do not overlap one another in the string as the base gram query set, where τ is the distance threshold;
Step A3: repeatedly add a new gram and compute the resulting gain, until the newly added gram yields no gain or no grams remain; wherein the gain is given by the following formula:
G = \left( \frac{\frac{|C_i|}{|I(g_n)|} + 1}{J_s(C_i, I(g_n)) + 1} - 1 \right) \cdot cost(Q);
where G is the gain, C_i is the current candidate set and |C_i| its size, |I(g_n)| is the length of the inverted list of g_n, g_n is the gram with sequence number n after splitting, J_s is the Jaccard similarity, and cost(Q) is the average time taken by one true edit distance computation for the query;
Step A4: merge the inverted lists of all selected grams, add every string that occurs at least LB times to the candidate set, and compute the true edit distance for the strings in the candidate set to obtain the retrieval result; wherein LB is the lower bound on the number of common grams used by the generalized prefix filter.
2. The similar character string retrieval method according to claim 1, characterized in that, before step 1, an inverted index is first built for the set of strings to be searched, and the MinHash signatures corresponding to the inverted index are generated.
3. The similar character string retrieval method according to claim 2, characterized in that the inverted index is built by numbering the strings, splitting them into q-grams, and mapping each resulting gram, as an entry, to the strings that contain it.
4. The similar character string retrieval method according to claim 2, characterized in that a hash function is selected according to the length of the strings, and the MinHash signatures are generated from it together with the inverted index.
CN201310294529.4A 2013-07-12 2013-07-12 A kind of similar character string search method Active CN103365998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310294529.4A CN103365998B (en) 2013-07-12 2013-07-12 A kind of similar character string search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310294529.4A CN103365998B (en) 2013-07-12 2013-07-12 A kind of similar character string search method

Publications (2)

Publication Number Publication Date
CN103365998A CN103365998A (en) 2013-10-23
CN103365998B true CN103365998B (en) 2016-08-24

Family

ID=49367339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310294529.4A Active CN103365998B (en) 2013-07-12 2013-07-12 A kind of similar character string search method

Country Status (1)

Country Link
CN (1) CN103365998B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426759A (en) * 2015-10-30 2016-03-23 百度在线网络技术(北京)有限公司 URL legality determining method and apparatus
CN107153652B (en) * 2016-03-03 2020-10-30 创新先进技术有限公司 Method and device for converting target character string into normalized character string
CN105913094B (en) * 2016-05-03 2019-06-21 中国科学院信息工程研究所 A search method for minimum distance string calculation
CN106897990B (en) * 2016-08-31 2019-10-25 广东工业大学 Character defect detection method of tire mold
CN108255836B (en) * 2016-12-28 2020-12-25 普天信息技术有限公司 Character string matching method and device
CN108984695B (en) * 2018-07-04 2021-04-06 科大讯飞股份有限公司 Character string matching method and device
CN110008383B (en) * 2019-04-11 2021-07-27 北京安护环宇科技有限公司 Black and white list retrieval method and device based on multiple indexes
CN110502629B (en) * 2019-08-27 2020-09-11 桂林电子科技大学 LSH-based connection method for filtering and verifying similarity of character strings
CN110533035B (en) * 2019-08-28 2022-02-15 海南阿凡题科技有限公司 Student homework page number identification method based on text matching
CN111078821B (en) * 2019-11-27 2023-12-08 泰康保险集团股份有限公司 Dictionary setting method, dictionary setting device, medium and electronic equipment
CN112486989B (en) * 2020-11-28 2021-08-27 河北省科学技术情报研究院(河北省科技创新战略研究院) Multi-source data granulation fusion and index classification and layering processing method

Citations (1)

Publication number Priority date Publication date Assignee Title
CN101763405A (en) * 2009-11-16 2010-06-30 陆嘉恒 Approximate character string searching technology based on synonym rule

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8271499B2 (en) * 2009-06-10 2012-09-18 At&T Intellectual Property I, L.P. Incremental maintenance of inverted indexes for approximate string matching

Non-Patent Citations (3)

Title
Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search; Jiannan Wang et al.; Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD); 2012-05-24; pp. 85-96 *
Efficient Exact Edit Similarity Query Processing with the Asymmetric Signature Scheme; Jianbin Qin et al.; Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD); 2011-06-16; pp. 1033-1044 *
VChunkJoin: An Efficient Algorithm for Edit Similarity Joins; Wei Wang et al.; IEEE Trans. on Knowledge and Data Engineering; 2012-04-17; pp. 1916-1929 *

Also Published As

Publication number Publication date
CN103365998A (en) 2013-10-23

Similar Documents

Publication Publication Date Title
CN103365998B (en) A kind of similar character string search method
Zhang et al. Bed-tree: an all-purpose index structure for string similarity search based on edit distance
CN109359172B (en) An Entity Alignment Optimization Method Based on Graph Partitioning
CN103914494B (en) Method and system for identifying identity of microblog user
CN108959258B (en) A Domain-Specific Integrated Entity Linking Method Based on Representation Learning
US9720986B2 (en) Method and system for integrating data into a database
US20150254307A1 (en) System and methods for rapid data analysis
CN106294762B (en) Entity identification method based on learning
CN112000783B (en) Patent recommendation method, device, device and storage medium based on text similarity analysis
AU2014201516A1 (en) Resolving similar entities from a transaction database
CN109165382B (en) Similar defect report recommendation method combining weighted word vector and potential semantic analysis
CN107291895B (en) A Fast Hierarchical Document Query Method
CN106682172A (en) Keyword-based document research hotspot recommending method
CN103049575A (en) Topic-adaptive academic conference searching system
CN107038263B (en) A kind of chess game optimization method based on data map, Information Atlas and knowledge mapping
Tao Approximate string joins with abbreviations
Li et al. A partition-based method for string similarity joins with edit-distance constraints
CN106649557B (en) Semantic association mining method for defect report and mail list
WO2016029230A1 (en) Automated creation of join graphs for unrelated data sets among relational databases
CN102541960A (en) Method and device of fuzzy retrieval
CN105550169A (en) Method and device for identifying point of interest names based on character length
Kudryavtseva et al. Modeling cluster development using programming methods: Case of Russian arctic regions
CN106980639B (en) Short text data aggregation system and method
CN114281809A (en) Multi-source heterogeneous data cleaning method and device
CN108038211A (en) A kind of unsupervised relation data method for detecting abnormality based on context

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant