WO2018161548A1 - 一种基于二值码字典树的搜索方法 - Google Patents

一种基于二值码字典树的搜索方法 Download PDF

Info

Publication number
WO2018161548A1
WO2018161548A1 PCT/CN2017/104398 CN2017104398W WO2018161548A1 WO 2018161548 A1 WO2018161548 A1 WO 2018161548A1 CN 2017104398 W CN2017104398 W CN 2017104398W WO 2018161548 A1 WO2018161548 A1 WO 2018161548A1
Authority
WO
WIPO (PCT)
Prior art keywords
binary code
node
dictionary tree
substring
queried
Prior art date
Application number
PCT/CN2017/104398
Other languages
English (en)
French (fr)
Inventor
段凌宇
黄祎程
王哲
高文
Original Assignee
北京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学 filed Critical 北京大学
Publication of WO2018161548A1 publication Critical patent/WO2018161548A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures

Definitions

  • the invention relates to computer vision technology, in particular to a search method based on a binary code dictionary tree.
  • Binary code has the advantages of easy storage, easy indexing, fast comparison, etc. It is the first choice for processing large-scale data applications. Although the Hamming distance between binary codes is very fast (millions of comparisons can be completed in one second), when the data size is extremely large, the way of linearly scanning the entire data set is still unable to achieve real-time retrieval. . Therefore, it is necessary to design an efficient indexing algorithm to improve the retrieval speed of binary code under large-scale data sets.
  • a common method of indexing binary codes and performing neighborhood search is to use a hash table, in which the binary code is directly inserted into the hash table as a key value (address).
  • a large number of tests have shown that the retrieval speed of this method is significantly higher than that of linear scanning.
  • the use of hash tables in practice consumes a lot of memory, essentially swapping space for time.
  • a binary code with an index length of d requires 2 d hash buckets. When d grows to 64, the memory consumption of 2 64 ⁇ 10 19 hash buckets is unacceptable.
  • MIH Multi-Index Hashing
  • MIH uses a set of hash tables to index substrings of binary codes.
  • the MIH divides the binary code into a number of mutually exclusive (non-overlapping disjoint) substrings, and uses a hash table index for each substring separately, and no longer indexes the entire binary code.
  • This segmentation strategy enables efficient indexing of long vectors.
  • Experimental results show that MIH can achieve significant retrieval acceleration on long vectors of lengths of 64, 128, and 256.
  • the problem with the hash table-based indexing method is that it is necessary to enumerate all possible Hamming distances of the query vector q not to exceed the neighbors of r, and look up the corresponding hash bucket to check whether it exists. Given the vector length d and the search radius r, the total number of hash buckets that need to be looked up is
  • the present invention proposes a search method based on a binary code dictionary tree that overcomes the above problems or at least partially solves the above problems.
  • the present invention provides a search method based on a binary code dictionary tree, comprising:
  • a binary code dictionary tree of the j-th substring is established; the number of the binary code dictionary trees is m; each binary code dictionary tree includes: an internal node and Leaf node
  • m, j are positive integers, r is a predetermined non-negative integer value, and j is less than or equal to m.
  • the method further includes:
  • the combined deduplication test is performed to obtain the search result of the image to be queried.
  • the step of establishing a binary code dictionary tree of the j-th substring for all the j-th substrings of all images in the database including:
  • the node of the binary code dictionary tree is established by taking the first b bits, and the binary code dictionary tree of the j-th substring is constructed;
  • the root node of the binary code dictionary tree establishes a branch according to the first smallest index unit on the left side of the j-th substring; for the i-th node, the i-th sub-string from left to right according to the i-th sub-string a minimum index unit establishes a branch; the leaf node is a last node of the binary code dictionary tree;
  • Each node in the binary code dictionary tree corresponds to a character string, and the root node corresponds to an empty string; for the node of the i-th layer, the corresponding character string is the first i minimum index in the j-th sub-string a string of units of length i*c;
  • b and c are positive integers
  • b is a multiple of c
  • the nodes of the root node and the i-th layer are internal nodes of the binary code dictionary tree
  • i is a positive integer less than or equal to b/c.
  • each leaf node in each binary code dictionary tree is a last node in the form of a container
  • the container contains all the strings inserted into the last node, and the strings contain the same prefix, that is, a string of length b corresponding to the last node.
  • the Hamming distance in the binary code dictionary tree corresponding to the j-th substring of all images in the database is not exceeded.
  • the steps of the binary code include:
  • the step of acquiring the binary code of each image in the database and dividing each binary code into the m-segment sub-string includes:
  • Segmentation strategy is used to divide each binary code into m non-overlapping disjoint substrings
  • the step of obtaining the binary code of the image to be queried and the m-segment substring of the binary code including:
  • Segmentation strategy is used to divide each binary code into m non-overlapping disjoint substrings
  • the process of the tree is implemented in the offline phase.
  • the present invention provides a binary code neighbor search method, including:
  • each binary code dictionary tree includes: an internal node and a leaf node;
  • m, j are positive integers, r is a predetermined non-negative integer value, and j is less than or equal to m.
  • the step of establishing a binary code dictionary tree of the j-th substring for all the j-th substrings of the binary code comprises:
  • the node of the binary code dictionary tree is established by taking the first b bits, and the binary code dictionary tree of the j-th substring is obtained;
  • the root node of the binary code dictionary tree establishes a branch according to the first smallest index unit on the left side of the j-th substring; for the i-th node, the i-th sub-string from left to right according to the i-th sub-string a minimum index unit establishes a branch; the leaf node is a last node of the binary code dictionary tree;
  • Each node in the binary code dictionary tree corresponds to a character string, and the root node corresponds to an empty string; for the node of the i-th layer, the corresponding character string is the first i minimum index in the j-th sub-string a string of units of length i*c;
  • b and c are positive integers, b is a multiple of c, and the nodes of the root node and the i-th layer are internal nodes of the binary code dictionary tree.
  • each leaf node in each binary code dictionary tree is in a container shape The last node of the existence; the container contains all the strings inserted into the last node, and these strings contain the same prefix;
  • the Hamming distance in the binary code dictionary tree corresponding to the j-th substring of all binary codes in the database is not exceeded.
  • the steps of the binary code include:
  • the search method based on the binary code dictionary tree proposed by the present invention checks the node appearing in the binary code dictionary tree in the search process by establishing a binary code dictionary tree of all substrings of the database image.
  • the above elements can effectively avoid the missing search problem of the MIH scheme in the prior art, thereby reducing the number of searches and improving the search speed.
  • 1(a) and 1(b) are schematic diagrams showing a binary code dictionary tree in an embodiment of the present invention.
  • 1(c) is a schematic diagram of searching for r neighbors by using MBNT according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a process of performing image search using MBNT according to an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a search method based on a binary code dictionary tree according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart diagram of a binary code neighbor search method according to an embodiment of the present invention.
  • K-nearest neighbor search finds the K vectors closest to the Hamming distance in the data set compared to the given query vector
  • r-neighbor search finds all the data sets in the data set with a Hamming distance of no more than a fixed value (r) All vectors.
  • the second problem is mainly solved, that is, the r-nearest neighbor search, but can also be used to solve the K-nearest neighbor search.
  • the formal description of the r-neighbor search problem is as follows:
  • H( ⁇ ) represents the Hamming distance
  • the hash table-based method directly uses the binary code as an address in the hash table. Given the query q and the search radius r, enumerate all possible r nearest neighbor vectors of q and then find the corresponding hash bucket to determine if it exists. The method requires a hash table containing 2d buckets to index the d-bit binary code. When d is 64 bits or more, the memory consumption of the method becomes unacceptable.
  • MIH establishes m hash tables T 1 , . . . , T m for m substrings.
  • n is the total number of binary codes in the data set. Since there are 2 s buckets in the hash table, the average number of elements in each bucket is n/2 s . The number of candidates to test is
  • n In most application scenarios n ⁇ 2 s , which means that lookup dominates the cost function. On the other hand, n is much smaller than 2 s, resulting in a large number of empty buckets in the hash table. Checking such empty buckets is unnecessary and time consuming.
  • a binary multi-word binary code dictionary tree is used to avoid checking for missing, thereby reducing the overhead of the searching phase and improving the search speed.
  • the embodiment of the present invention is mainly for solving the problem of accurate neighbor search in the Hamming space, that is, given a query vector q, searching for r neighbors (or K nearest neighbors) of q in the database, that is, the query vector q is All r neighbors in Hamming space.
  • the embodiment of the invention proposes a new data structure-blocking multi-fork binary codeword
  • the tree-tree structure (MULTI-BLOCK N-ARY TRIE, MBNT), which is related to the MBNT in the Hamming spatial neighbor search method (as shown in Fig. 1b), uses it to solve the neighbor search problem in the Hamming space.
  • a minimum index unit in a binary code dictionary tree will be set; when a query vector is given, a neighbor search is performed by traversing nodes in the binary code dictionary tree (as shown in FIG. 1b). . For this reason, by checking the elements appearing on the nodes in the binary code dictionary tree, the search for missing problems can be effectively avoided, thereby improving the search speed, that is, achieving efficient access speed and memory overhead.
  • the embodiment of the present invention also uses the same segmentation strategy as MIH to cope with long vector retrieval.
  • the binary code of each image in the database is divided into m non-overlapping disjoint substrings and a binary code dictionary tree of all the same substrings is respectively established.
  • the following discusses how to design an efficient index structure/dictionary tree structure for each substring.
  • FIG. 1(a) shows a binary code dictionary tree.
  • Figure 1(b) shows a block multi-fork binary code dictionary tree;
  • Figure 1 (c) shows an example of using MBNT to search for r neighbors.
  • the binary code dictionary tree is a tree structure, and each node u on the tree corresponds to a character string w, and the strings corresponding to all descendants satisfying the node all have the same prefix w.
  • the root node corresponds to an empty string.
  • the node with a depth of l in the binary code dictionary tree represents a set of strings containing the same first l character, and its branch (son node) is based on the l+1th character of the string.
  • a dictionary tree constructed with a binary code is a special case where the binary code dictionary tree is a binary tree because each node has only two branches, 0 and 1.
  • Figure 1(a) is an example of a binary code dictionary tree.
  • the MBNT in the embodiment of the present invention refers to consecutive c bits in the binary code, called a block, as the smallest index unit in the binary code dictionary tree. Specifically, given a binary code of length s bits, each c bits are treated as one block, and s/c blocks are shared, and each block can express 2 c different symbols.
  • the MBNT becomes a binary dictionary tree.
  • the reason why the several bits are combined into one block is that the access speed of the binary code dictionary tree can be improved and the memory overhead can be reduced, because the string of the index s bits requires the binary code dictionary tree of the s layer, when s is large The binary code dictionary tree will become very deep.
  • each leaf node in the binary code dictionary tree exists in the form of a container containing all the strings in the database prefixed by the string corresponding to the leaf.
  • b ⁇ s the first b bits (b ⁇ s) are used to establish a node of the binary code dictionary tree, thereby reducing memory overhead.
  • a static structure a full 2 c -tree, is used to implement a block-multi-fork binary code dictionary tree.
  • a 1-bit Boolean flag indicates whether each node actually exists.
  • the container is implemented using a static array.
  • the following example illustrates the use of MBNT to search for r neighbors.
  • the r neighbors are searched by traversing the MBNT.
  • the Hamming distance of the string corresponding to the node and the first l block of q is calculated. If the distance is greater than r, the traversal is terminated at this node.
  • the leaf node is reached, the corresponding container is checked, and all binary codes with a Hamming distance h ⁇ r from q are returned.
  • Figure 1(c) is an example of searching for r neighbors using MBNT, and algorithm 1 is pseudo code.
  • 1/density(b) represents the speedup of the MBNT algorithm relative to the hash table algorithm during the lookup phase.
  • the image search is divided into two stages, an offline stage. And online stages.
  • the offline phase works mainly to extract features and build a dictionary tree in the database.
  • the specific steps are as follows:
  • Step 01 Extract features from the database image and compress (binarize) the features into a compact binary code.
  • features are extracted for each image in the database, and a compact binary code of each image is obtained.
  • this is not limited by the method of extracting features and the method of feature compression (binarization).
  • Binary codes generated by arbitrary features such as SIFT, VLAD, FisherVector, GIST, CNN features, etc.
  • arbitrary binarization methods such as local sensitivity hash LSH, iterative quantization ITQ, spectral hash SH, DeepHash, etc.
  • Step 02 Index the binary code.
  • the binary code is divided into m segments that do not overlap the disjoint substrings, and m binary code dictionary trees (MBNTs) are constructed, and the jth segment of all images is inserted into the jth MBNT.
  • MBNTs binary code dictionary trees
  • the compact binary code of each image is divided into m-segment sub-strings, and a binary code dictionary tree is established for the same sub-substring of all images.
  • Step 03 Extract features from the query image and compress (binarize) the features into a compact binary code.
  • the feature may be extracted in any manner in the prior art, and a compact binary code is obtained, which is not limited in this embodiment.
  • Step 04 Using a search algorithm, look up the neighbors of the query vector in the Hamming space in the binary code dictionary tree of the response established in the offline mode.
  • Step 05 Return the search result according to the ID of the neighbor.
  • the binary code of each image may be obtained by using any method in the prior art.
  • the binary code here may be a compact binary code, which is not limited in this embodiment.
  • obtaining a binary code of a database image the binary code length being d; dividing each binary code into m non-overlapping disjoint substrings by using a segmentation strategy;
  • the length of the first v segment is The length of the post mv segment is Where d, m, s are all positive integers, and m is less than or equal to d.
  • each binary code dictionary tree includes: an internal node and a leaf node.
  • the nodes of the tree structure are divided into internal nodes and external nodes.
  • An external node refers to a leaf node (which is characterized by no child nodes, at the lowest level of the tree), and an internal node refers to a non-leaf node.
  • the root node also belongs to the internal node.
  • obtaining a binary code of an image to be queried the binary code length being d
  • Segmentation strategy is used to divide each binary code into m non-overlapping disjoint substrings
  • the Hamming distance in the binary code dictionary tree corresponding to the j-th substring of all images in the database is not exceeded. The binary code.
  • m, j are all positive integers, r is a predetermined non-negative integer value, and j is less than or equal to m.
  • the foregoing steps 301 and 302 can be completed in an offline mode, that is, in an offline phase, and the subsequent steps 303 to 305 can be completed in an online manner.
  • the above method further includes the following step 306:
  • a substring distance less than or equal to r' is a necessary condition rather than a sufficient condition that the entire string distance is less than or equal to r.
  • the query result in the above step 305 is a superset of the correct result (ie, the search result).
  • step 302 may include steps 3021 to 3022 not shown in the following figures;
  • a node of the binary code dictionary tree is established by taking the first b bits, and a binary code dictionary tree of the j-th substring is constructed;
  • the root node of the binary code dictionary tree is based on the first side of the left side of the jth substring a minimum index unit establishes a branch; for a node of the i-th layer, a branch is established according to the i-th smallest index unit of the j-th sub-string from left to right; the leaf node is a last node of the binary code dictionary tree;
  • Each node in the binary code dictionary tree corresponds to a character string, and the root node corresponds to an empty string; for the node of the i-th layer, the corresponding character string is the first i minimum index in the j-th sub-string a string of units of length i*c;
  • b and c are positive integers
  • b is a multiple of c
  • the nodes of the root node and the i-th layer are internal nodes of the binary code dictionary tree
  • i is a positive integer less than or equal to b/c.
  • each leaf node in each binary code dictionary tree in this embodiment may be a last node in the form of a container
  • the container contains all the strings inserted into the last node, and the strings contain the same prefix, that is, a string of length b corresponding to the last node.
  • the ID of the image is actually recorded in the container of the leaf in the specific implementation process instead of the binary code itself (similar to the inverted index.
  • the binary code can be inferred from the path from the root to the leaf, so the record Binary code does not make sense). If there are two sub-strings of the image binary code in the database that are identical, they will be recorded in the same leaf in the corresponding dictionary tree. It is necessary to record the IDs of multiple images in one leaf, so the container is necessary. of.
  • step 304 may include steps 3041 to 3044 which are not shown in the following figures;
  • the above embodiment is mainly an image (video) search in a general scene.
  • the depth and memory overhead of the binary code dictionary tree are reduced, and the access speed of the binary code dictionary tree is improved.
  • the record information of the nodes in the binary code dictionary tree of the block multi-fork can effectively avoid the missing problem and improve the search speed.
  • an embodiment of the present invention further provides a binary code neighbor search method.
  • the method shown in FIG. 4 includes the following steps:
  • each binary code dictionary tree includes: an internal node and a leaf node.
  • the minimum index unit of the binary code dictionary tree is determined according to the length of the j-th substring and the preset parameter value c;
  • the node of the binary code dictionary tree is established by taking the first b bits, and the binary code dictionary tree of the j-th substring is obtained;
  • the root node of the binary code dictionary tree establishes a branch according to the first smallest index unit on the left side of the j-th substring; for the i-th node, the i-th sub-string from left to right according to the i-th sub-string a minimum index unit establishes a branch; the leaf node is a last node of the binary code dictionary tree;
  • Each node in the binary code dictionary tree corresponds to a character string, and the root node corresponds to an empty string; for the node of the i-th layer, the corresponding character string is the first i minimum index in the j-th sub-string a string of units of length i*c;
  • b and c are positive integers, b is a multiple of c, and the nodes of the root node and the i-th layer are internal nodes of the binary code dictionary tree.
  • Each leaf node in each binary code dictionary tree is a last node in the form of a container; the container contains all the strings inserted into the last node, and these strings contain the same prefix.
  • the Hamming distance in the binary code dictionary tree corresponding to the j-th substring of all binary codes in the database is not exceeded. The binary code.
  • the Hamming distance in the binary code dictionary tree corresponding to the j-th substring of all binary codes in the database is not exceeded.
  • the steps of the binary code include:
  • m, j are positive integers, r is a predetermined non-negative integer value, and j is less than or equal to m.
  • the binary code neighbor search method provided in this embodiment can be applied to any binary code neighbor search process to achieve efficient access speed and reduce memory overhead in the search phase.
  • DSP digital signal processor
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种基于二值码字典树的搜索方法,包括:获取数据库中每一图像的二值码,将每个二值码划分为m段子串(301);针对数据库中所有图像的第j段子串,建立该第j段子串的一个二值码字典树(302);所述二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点;获取待查询图像的二值码以及该二值码的m段子串(303);针对待查询图像二值码的第j段子串,在数据库中所有图像的第j段子串对应的二值码字典树中查找汉明距离不超过r'的二值码(304);遍历待查询图像二值码的所有子串,获得每一子串的查询结果(305);j小于等于m;根据待查询图像二值码的所有子串的查询结果,进行合并去重测试,获取待查询图像的搜索结果(306)。该方法在汉明空间精确近邻搜索时可以降低查找数量,提高搜索速度。

Description

一种基于二值码字典树的搜索方法 技术领域
本发明涉及计算机视觉技术,具体涉及一种基于二值码字典树的搜索方法。
背景技术
近年来,高维向量的二进制表达问题(binary representation)获得了广泛的关注。二进制编码的目标是将特征压缩为紧凑的二值码(binary code)。二值码具有易存储、易索引、对比速度快等优点,是处理大规模数据应用的首选。尽管二值码之间的汉明距离比对速度非常快(1秒内能完成数百万次比对),但当数据规模特别大时,线性扫描整个数据集的方式仍然无法实现实时的检索。因此,设计高效的索引算法来提高大规模数据集下二值码的检索速度是十分必要的。
常见的一种索引二值码并进行近邻搜索的方法是使用哈希表,其中二值码直接作为键值(地址)插入哈希表中。大量测试表明该方法的检索速度相比线性扫描有显著提高。然而,实践中使用哈希表需消耗大量内存,本质上是以空间换时间。理想状况下为索引长度为d的二值码需要2d个哈希桶。当d增长至64时,264≈1019个哈希桶的内存消耗是不可接受的。
为了能够处理长向量,业内人士提出了分段索引哈希(Multi-Index Hashing,简称MIH),采用一组哈希表来索引二值码的子串(substrings)。特别地,MIH将二值码划分成若干互斥(不重叠不相交)的子串,对每个子串单独采用一个哈希表索引,不再索引整个二值码。这种分段的策略实现了对长向量的高效索引,实验结果表明MIH在长度为64,128,256的长向量上能够实现显著的检索加速。
基于哈希表的索引方法的问题在于,需要枚举查询向量q的所有可能的汉明距离不超过r的近邻,并查找(lookup)对应的哈希桶检查其是否存在。给定向量长度d以及搜索半径r,需要查找的哈希桶的总数为
Figure PCTCN2017104398-appb-000001
其中L(d,r)随r指数增长。当r很大时,搜索范围的增长速度非常快。然而,在实际应用中发现大部分哈希桶都是空的,访问空的桶(称之为查找缺失),既不必要且将浪费大量时间。
发明内容
鉴于上述问题,本发明提出了克服上述问题或者至少部分地解决上述问题的一种基于二值码字典树的搜索方法。
为此目的,第一方面,本发明提出一种基于二值码字典树的搜索方法,包括:
获取数据库中每一图像的二值码,将每个二值码划分为m段子串;
针对数据库中所有图像的第j段子串,建立该第j段子串的一个二值码字典树;所述二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点;
获取待查询图像的二值码以及该二值码的m段子串;
针对待查询图像二值码的第j段子串,在数据库中所有图像的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000002
的二值码;
遍历待查询图像二值码的所有子串,获得每一子串的查询结果;
其中:m,j均为正整数,r为预先确定的非负整数值,且j小于等于m。
可选地,所述方法还包括:
根据待查询图像二值码的所有子串的查询结果,进行合并去重测试,获取待查询图像的搜索结果。
可选地,针对数据库中所有图像的第j段子串,建立该第j段子串的一个二值码字典树的步骤,包括:
依据第j段子串长度和预设的参数值c,确定二值码字典树的最小索引单位;
以及,根据预设的参数值b和最小索引单位,取前b个比特建立二值码字典树的节点,构建该第j段子串的二值码字典树;
其中,所述二值码字典树的根节点依据该第j段子串左侧的第一个最小索引单位建立分支;对于第i层的节点,依据该第j段子串从左至右的第i个最小索引单位建立分支;所述叶子节点为所述二值码字典树的末节点;
所述二值码字典树中的每个节点对应于一个字符串,根节点对应于空串;对于第i层的节点,其对应的字符串为该第j段子串中的前i个最小索引单位组成的长度为i*c的字符串;
其中b,c均为正整数,b为c的倍数,所述根节点和第i层的节点均为所述二值码字典树的内部节点,i为小于等于b/c的正整数。
可选地,每一个二值码字典树中的每一个叶子节点均为以容器形式存在的末节点;
所述容器内含有所有插入到这个末节点的字符串,这些字符串含有相同的前缀,即该末节点所对应的长度为b的字符串。
可选地,针对待查询图像二值码的第j段子串,在数据库中所有图像的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000003
的二值码的步骤,包括:
从所述二值码字典树的根节点开始遍历该二值码字典树;
对于该二值码字典树中的每一个节点,计算该节点对应的字符串与所述待查询图像二值码的第j段子串的汉明距离;
若计算的汉明距离大于r’,则遍历在当前节点处停止;
或者,当遍历至叶子节点时,在叶子节点所属的容器中获取相应的汉明距离不超过
Figure PCTCN2017104398-appb-000004
的二值码。
可选地,获取数据库中每一图像的二值码,将每个二值码划分为m段子串的步骤,包括:
获取数据库图像的二值码,该二值码长度为d;
采用分段策略将每个二值码划分为m个不重叠不相交的子串;
若d为m的倍数,则将二值码分为m段长度相同的子串,每段长度均为s=d/m;
若d不是m的倍数,令v等于d除以m所得的余数,则分段时,前v段的长度为
Figure PCTCN2017104398-appb-000005
后m-v段的长度为
Figure PCTCN2017104398-appb-000006
和/或,获取待查询图像的二值码以及该二值码的m段子串的步骤,包括:
获取待查询图像的二值码,该二值码长度为d;
采用分段策略将每个二值码划分为m个不重叠不相交的子串;
若d为m的倍数,则将二值码分为m段长度相同的子串,每段长度均为s=d/m;
若d不是m的倍数,令v等于d除以m所得的余数,则分段时,前v段的长度为
Figure PCTCN2017104398-appb-000007
后m-v段的长度为
Figure PCTCN2017104398-appb-000008
其中d,m,s均为正整数,m小于等于d。
可选地,若分段后每一字串长度s=32;则b=30,c=3或b=28,c=4;以及,建立数据库中所有图像的m段子串的二值码字典树的过程为离线阶段实现。
第二方面,本发明提供一种二值码近邻搜索方法,包括:
对数据库中的所有二值码进行分段处理,获得m段子串;
针对所有二值码的第j段子串,建立该第j段子串的一个二值码 字典树;所述二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点;
接收待查询的二值码,将待查询的二值码分为m段子串;
针对待查询二值码的第j段子串,在数据库中所有二值码的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000009
的二值码;
遍历待查询二值码的所有子串,获得每一子串的查询结果;
根据待查询二值码的所有子串的查询结果,进行合并去重测试,获取待查询二值码的查询结果;
其中:m,j均为正整数,r为预先确定的非负整数值,且j小于等于m。
可选地,针对所有二值码的第j段子串,建立该第j段子串的一个二值码字典树的步骤,包括:
依据第j段子串长度和预设的参数值c,确定二值码字典树的最小索引单位;
以及,根据预设的参数值b和最小索引单位,取前b个比特建立二值码字典树的节点,获得该第j段子串的二值码字典树;
其中,所述二值码字典树的根节点依据该第j段子串左侧的第一个最小索引单位建立分支;对于第i层的节点,依据该第j段子串从左至右的第i个最小索引单位建立分支;所述叶子节点为所述二值码字典树的末节点;
所述二值码字典树中的每个节点对应于一个字符串,根节点对应于空串;对于第i层的节点,其对应的字符串为该第j段子串中的前i个最小索引单位组成的长度为i*c的字符串;
其中b,c均为正整数,b为c的倍数,所述根节点和第i层的节点均为所述二值码字典树的内部节点。
可选地,每一个二值码字典树中的每一个叶子节点均为以容器形 式存在的末节点;所述容器内含有所有插入到这个末节点的字符串,这些字符串含有相同的前缀;
相应地,针对待查询二值码的第j段子串,在数据库中所有二值码的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000010
的二值码的步骤,包括:
从所述二值码字典树的根节点开始遍历该二值码字典树;
对于该二值码字典树中的每一个节点,计算该节点对应的字符串与所述待查询图像二值码的第j段子串的汉明距离;
若计算的汉明距离大于r’,则遍历在当前节点处停止;
或者,当遍历至叶子节点时,在叶子节点所属的容器中获取相应的汉明距离不超过
Figure PCTCN2017104398-appb-000011
的二值码。
由上述技术方案可知,本发明提出的基于二值码字典树的搜索方法,通过建立数据库图像的所有子串的二值码字典树,进而在搜索过程中检查出现在二值码字典树的节点上的元素,可有效避免现有技术中MIH方案的查找缺失问题,从而能降低查找数量,提高搜索速度。
附图说明
图1(a)和图1(b)为本发明一实施例中的二值码字典树的示意图;
图1(c)为本发明一实施例中的采用MBNT查找r近邻的示意图;
图2为本发明一实施例中的利用MBNT进行图像搜索的过程示意图;
图3为本发明一实施例中的基于二值码字典树的搜索方法的流程示意图;
图4为本发明一实施例中的二值码近邻搜索方法的流程示意图。
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。
汉明空间中的近邻搜索问题有两个,即K近邻搜索和r近邻搜索。其中,K近邻搜索要找到数据集中与给定查询向量相比汉明距离最近的K个向量;r近邻搜索要找到数据集中所有与查询向量相比汉明距离不超过一个固定值(r)的所有向量。这两个问题本质上是可以互相转化的。
本发明实施例中主要是解决第二个问题,即r近邻搜索,但亦可用于解决K近邻搜索。r近邻搜索问题的形式化描述如下:
定义1
给定数据集
Figure PCTCN2017104398-appb-000012
以及查询q,其中bi,q∈{0,1}d(均为长度为d的二值码),q的r近邻Dr(q,B)定义为所有B中与q相比差别不超过r个比特的向量:
Dr(q,B)={bi∈B:H(bi,q)≤r}
其中H(·)代表汉明距离。
传统方案中,基于哈希表的方法将二值码直接作为哈希表中的地址。给定查询q以及搜索半径r,需枚举q的所有可能的r近邻向量,然后查找相应的哈希桶确定其是否存在。该方法需要一个包含2d个桶的哈希表来索引d比特的二值码。当d为64比特或者更大时,该方法的内存消耗变得不可接受。
另外,现有技术的MIH将二值码划分为m个不重叠不相交的子串,每个子串的长度均为s=d/m比特。根据鸽巢原理,如果两个二值码p和q的差别(汉明距离)不超过r比特,则它们的m个子串中 至少有一个子串的差别不超过
Figure PCTCN2017104398-appb-000013
比特。故MIH为m个子串建立了m个哈希表T1,...,Tm。给定查询q以及相应的子串q1,...,qm,MIH首先为每个子串qj在Tj中查找r′近邻,记为Dr′(qj)。然后将所有r′近邻集合合并为一个候选项集合S=∪jDr′(qj)。最后,测试所有集合S中的候选项并仅保留q的r近邻。
下面考虑MIH的时间消耗。先计算每一个子串带来的消耗,然后乘以m以得到总的结果。假设二值码在汉明空间中呈均匀分布。对于每一个子串,查找阶段需要检查的哈希桶的数量为
lookupMIH(s)=L(s,r′)
记n为数据集中二值码的总数。由于哈希表中共有2s个桶,平均每个桶内的元素数量为n/2s。则需要测试的候选项数量为
Figure PCTCN2017104398-appb-000014
故,搜索查询q的r近邻的总的时间消耗为
Figure PCTCN2017104398-appb-000015
在大多数应用场景下n<<2s,这意味着lookup在cost函数中占据主导。另一方面,n远小于2s导致哈希表中存在着大量的空桶,检查这样的空桶是不必要且浪费时间的。
由此,本发明实施例中采用分块多叉的二值码字典树来避免检查缺失,从而降低查找阶段的开销,提高搜索速度。
结合上述内容,本发明实施例主要是为了解决汉明空间中的精确近邻搜索问题,即给定一个查询向量q,在数据库中查找q的r近邻(或K近邻),也就是查询向量q在汉明空间中的所有r近邻。
本发明实施例提出了一种新的数据结构——分块多叉二值码字 典树结构(MULTI-BLOCK N-ARY TRIE,MBNT),即涉及汉明空间近邻搜索方法中的MBNT(如图1b所示),并利用它解决汉明空间的近邻搜索问题。
在本发明的MBNT中,例如,将设置二值码字典树中的最小索引单位;在给定一个查询向量时,通过遍历二值码字典树中的节点实现近邻搜索(如图1b所示)。为此,通过检查出现在二值码字典树中节点上的元素,查找缺失问题能够被有效避免,从而能提高搜索速度,即实现了高效的访问速度和内存开销。
另外,本发明实施例还使用了与MIH相同的分段策略以应对长向量检索。例如,将数据库中每个图像的二值码划分为m个不重叠不相交的子串并分别建立所有相同段子串的二值码字典树。下面讨论如何为每个子串设计高效的索引结构/字典树结构。
图1a中示出了8个6比特的二值码B={000000;000010;000011;000101;010010;011000;011101;011111},图1(a)示出的是一种二值码字典树;图1(b)示出了一种分块多叉的二值码字典树;图1(c)为使用MBNT搜索r近邻的示例。其中查询向量q=111101,搜索半径r=2;MBNT中的遍历路径加粗标出。
在本发明实施例中,二值码字典树是一个树结构,树上的每个节点u对应于一个字符串w,且满足该节点的所有后代所对应的字符串均含有相同的前缀w。根节点对应于空字符串。二值码字典树中深度为l的节点代表了含有相同的前l个字符的字符串集合,其分支(儿子节点)则基于字符串的第l+1个字符。用二值码构造的字典树是一种特殊情况,此时的二值码字典树为一颗二叉树,因为每个节点仅有0和1两个分支。图1(a)是二值码字典树的一个示例。
本发明实施例中的MBNT将二值码中连续的c个比特——称作一个块,作为二值码字典树中的最小索引单位。具体的,给定长度为s比特的二值码,将每c个比特看做一个块,则共有s/c个块,每 个块可以表达2c个不同的符号。
特别地,若设定c=1,MBNT就变成了二叉字典树。将若干个比特合并为一个块的原因在于既能提高二值码字典树的访问速度又能降低内存开销,因为索引s比特的字符串需要s层的二值码字典树,当s很大时,二值码字典树将变得很深。
此外,将二值码字典树中的每个叶子节点以容器形式存在,其中包含数据库中所有以该叶子所对应的字符串为前缀的字符串。在实践中,仅采用前b个比特(b≤s)建立二值码字典树的节点,以此来降低内存开销。图1(b)是MBNT的一个示例,其中数据集B={000000,000010,000011,000101,010010,011000,011101,011111}。在图1(b)中b=4,c=2。
特别说明的是:采用一个静态结构——满2c叉树,来实现分块多叉的二值码字典树。一个1比特的布尔型标志即可表示每个节点是否真实存在。容器采用一个静态数组来实现。b应设定为c的倍数。增大b可以帮助降低叶子密度,但同时会增加树上的访问时间。通常,对于32比特的子串设定b=30,c=3或b=28,c=4。
以下举例说明使用MBNT搜索r近邻
给定查询向量q以及搜索半径r,采用遍历MBNT的方式来搜索r近邻。遍历从根开始,设定初始汉明距离h=0。当遍历当第l层的节点时,计算出该节点对应的字符串与q的前l个块的汉明距离,若该距离大于r,则遍历在这个节点处终止。当到达叶子节点时,检查对应的容器,返回所有与q的汉明距离h≤r的二值码。图1(c)是使用MBNT搜索r近邻的一个示例,算法1为伪代码。
Figure PCTCN2017104398-appb-000016
Figure PCTCN2017104398-appb-000017
此外,给定b的取值,MBNT至多包含p=2b个叶子节点。假设二值码在汉明空间中呈均匀分布,且总共插入n个二值码到MBNT中。则任意叶子节点在MBNT中不存在的概率为
Figure PCTCN2017104398-appb-000018
此时,由于p的值很大(1-1/p)p≈1/e。
于是,叶子节点密度的期望值为
Figure PCTCN2017104398-appb-000019
用lookupMIH(s)表示查找阶段总共需要枚举的r近邻的数量,则MBNT的期望查找数量为:
Figure PCTCN2017104398-appb-000020
其中1/density(b)代表了MBNT算法在查找阶段相对哈希表算法的加速比。当s=32,b=30,n=5*107时,density(b)≈0.045。这意味着95.5%的查找项能够被MBNT算法剔除。
进一步地,结合图2所示,图像搜索分为两个阶段,离线阶段 和在线阶段。
离线阶段的工作主要为数据库中图像提取特征和构建字典树,具体步骤为:
步骤01:对数据库图像提取特征,并将特征压缩(二值化)为紧凑的二值码。
本实施例中是针对数据库中每一图像提取特征,并获取每一图像的紧凑二值码。
在实际使用时,该处不受提取特征方法以及特征压缩(二值化)方法所限。任意的特征(如SIFT,VLAD,FisherVector,GIST,CNN特征等)以及任意的二值化方法(如局部敏感性哈希LSH,迭代量化ITQ,谱哈希SH,DeepHash等)产生的二值码均可适用于本发明的方法。
步骤02:对二值码进行索引。
例如,将二值码划分为m段不重叠不相交的子串,并构建m个二值码字典树(MBNT),将所有图像的第j段串插入到第j个MBNT中。
可理解的是,本实施例中是对每一图像的紧凑二值码划分为m段子串,针对所有图像的相同段子串建立一个二值码字典树。
在线阶段的主要步骤有:
步骤03:对查询图像提取特征,并将特征压缩(二值化)为紧凑的二值码。
本实施例中可采用现有技术中的任意方式提取特征,并获得紧凑二值码,本实施例不对其进行限定。
步骤04:利用搜索算法,在上述离线方式建立的响应的二值码字典树中查找查询向量在汉明空间中的近邻。
步骤05:根据近邻的ID返回搜索结果。
为更清楚的说明搜索过程,参见如图3所示,图3所示的方法 包括下述的步骤:
301、获取数据库中每一图像的二值码,将每个二值码划分为m段子串。
本实施例中可采用现有技术中的任意方式获取每一图像的二值码,这里的二值码可为紧凑二值码,本实施例不对其进行限定。
举例来说,获取数据库图像的二值码,该二值码长度为d;采用分段策略将每个二值码划分为m个不重叠不相交的子串;
若d为m的倍数,则将二值码分为m段长度相同的子串,每段长度均为s=d/m;
若d不是m的倍数,令v等于d除以m所得的余数,则分段时,前v段的长度为
Figure PCTCN2017104398-appb-000021
后m-v段的长度为
Figure PCTCN2017104398-appb-000022
其中d,m,s均为正整数,m小于等于d。
302、针对数据库中所有图像的第j段子串,建立该第j段子串的一个二值码字典树。
本实施例中,二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点。树结构的节点分为内部节点和外部节点。外部节点指的就是叶子节点(其特征为没有子节点,在树的最下层),内部节点指非叶子节点。根节点也属于内部节点。
应说明的是,在本步骤中,遍历所有段子串,建立有所有图像相同段的二值码字典树。
303、获取待查询图像的二值码以及该二值码的m段子串。
举例来说,获取待查询图像的二值码,该二值码长度为d;
采用分段策略将每个二值码划分为m个不重叠不相交的子串;
若d为m的倍数,则将二值码分为m段长度相同的子串,每段长度均为s=d/m;
若d不是m的倍数,令v等于d除以m所得的余数,则分段时,前v段的长度为
Figure PCTCN2017104398-appb-000023
后m-v段的长度为
Figure PCTCN2017104398-appb-000024
其中d,m,s均为正整数,m小于等于d。
304、针对待查询图像二值码的第j段子串,在数据库中所有图像的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000025
的二值码。
305、遍历待查询图像二值码的所有子串,获得每一子串的查询结果。
上述的m,j均为正整数,r为预先确定的非负整数值,且j小于等于m。
特别说明的是,上述步骤301和步骤302可为离线方式完成,即属于离线阶段,后续步骤303至步骤305可为在线方式完成。
在实际应用中,上述方法还包括下述的步骤306:
306、根据待查询图像二值码的所有子串的查询结果,进行合并去重测试,获取待查询图像的搜索结果。
需要说明的是,某一子串距离小于等于r’是整串距离小于等于r的必要条件而非充分条件。进而上述步骤305中的查询结果是正确结果(即搜索结果)的超集。
在合并去重之后,还需要进行一步必要的测试工作。即把所有返回的结果计算一遍整串的汉明距离确定是否小于等于r,剔除大于r的项,进而得到搜索结果。
在一种可选的实现方式中,前述的步骤302可包括下述的图中未示出的步骤3021至步骤3022;
3021、依据第j段子串长度和预设的参数值c,确定二值码字典树的最小索引单位;
3022、根据预设的参数值b和最小索引单位,取前b个比特建立二值码字典树的节点,构建该第j段子串的二值码字典树;
其中,所述二值码字典树的根节点依据该第j段子串左侧的第一 个最小索引单位建立分支;对于第i层的节点,依据该第j段子串从左至右的第i个最小索引单位建立分支;所述叶子节点为所述二值码字典树的末节点;
所述二值码字典树中的每个节点对应于一个字符串,根节点对应于空串;对于第i层的节点,其对应的字符串为该第j段子串中的前i个最小索引单位组成的长度为i*c的字符串;
其中b,c均为正整数,b为c的倍数,所述根节点和第i层的节点均为所述二值码字典树的内部节点,i为小于等于b/c的正整数。
进一步地,本实施例了中的每一个二值码字典树中的每一个叶子节点均可为以容器形式存在的末节点;
所述容器内含有所有插入到这个末节点的字符串,这些字符串含有相同的前缀,即该末节点所对应的长度为b的字符串。
特别说明的是,在具体实现过程中叶子的容器内记录的实际上是图像的ID而不是二值码本身(类似倒排索引。从根到叶子的路径即可推断出二值码,故记录二值码没有意义)。若数据库中有两个图像二值码的某一段子串完全相同,则它们在对应字典树中将被记录在同一个叶子内,一个叶子内可能需要记录多个图像的ID,故容器是必要的。
在另一种可选的实现方式中,前述的步骤304可包括下述的图中未示出的步骤3041至步骤3044;
3041、从所述二值码字典树的根节点开始遍历该二值码字典树;
3042、对于该二值码字典树中的每一个节点,计算该节点对应的字符串与所述待查询图像二值码的第j段子串的汉明距离;
3043、若计算的汉明距离大于r’,则遍历在当前节点处停止;或者,当遍历至叶子节点时,在叶子节点所属的容器中获取相应的汉明距离不超过
Figure PCTCN2017104398-appb-000026
的二值码。
实验证明,若分段后每一字串长度s=32;则b=30,c=3或b=28,c=4。
上述实施例主要为一般场景下的图像(视频)搜索。通过将二值码的若干比特合并成一个块,降低了二值码字典树的深度和内存开销,提高了二值码字典树的访问速度。
在搜索时,利用分块多叉的二值码字典树中节点的记录信息能够有效避免查找缺失问题,提高搜索速度。
另一方面,本发明实施例还提供一种二值码近邻搜索方法,如图4所示,图4所示的方法包括下述的步骤:
401、对数据库中的所有二值码进行分段处理,获得m段子串;
402、针对所有二值码的第j段子串,建立该第j段子串的一个二值码字典树。
本实施例中国,二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点。
也就是说,依据第j段子串长度和预设的参数值c,确定二值码字典树的最小索引单位;
以及,根据预设的参数值b和最小索引单位,取前b个比特建立二值码字典树的节点,获得该第j段子串的二值码字典树;
其中,所述二值码字典树的根节点依据该第j段子串左侧的第一个最小索引单位建立分支;对于第i层的节点,依据该第j段子串从左至右的第i个最小索引单位建立分支;所述叶子节点为所述二值码字典树的末节点;
所述二值码字典树中的每个节点对应于一个字符串,根节点对应于空串;对于第i层的节点,其对应的字符串为该第j段子串中的前i个最小索引单位组成的长度为i*c的字符串;
其中b,c均为正整数,b为c的倍数,所述根节点和第i层的节点均为所述二值码字典树的内部节点。
每一个二值码字典树中的每一个叶子节点均为以容器形式存在的末节点;所述容器内含有所有插入到这个末节点的字符串,这些字符串含有相同的前缀。
403、接收待查询的二值码,将待查询的二值码分为m段子串;
404、针对待查询二值码的第j段子串,在数据库中所有二值码的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000027
的二值码。
针对待查询二值码的第j段子串,在数据库中所有二值码的第j段子串对应的二值码字典树中查找汉明距离不超过
Figure PCTCN2017104398-appb-000028
的二值码的步骤,包括:
从所述二值码字典树的根节点开始遍历该二值码字典树;
对于该二值码字典树中的每一个节点,计算该节点对应的字符串与所述待查询图像二值码的第j段子串的汉明距离;
若计算的汉明距离大于r’,则遍历在当前节点处停止;
或者,当遍历至叶子节点时,在叶子节点所属的容器中获取相应的汉明距离不超过
Figure PCTCN2017104398-appb-000029
的二值码。
405、遍历待查询二值码的所有子串,获得每一子串的查询结果;
406、根据待查询二值码的所有子串的查询结果,进行合并去重测试,获取待查询二值码的查询结果;
其中:m,j均为正整数,r为预先确定的非负整数值,且j小于等于m。
本实施例提供的二值码近邻搜索方法可应用在任意二值码近邻搜索过程中,以实现高效的访问速度,并降低查找阶段的内存开销。
本领域的技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施 例。
本领域技术人员可以理解,实施例中的各步骤可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本发明实施例的一些或者全部部件的一些或者全部功能。本发明还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如,计算机程序和计算机程序产品)。
虽然结合附图描述了本发明的实施方式,但是本领域技术人员可以在不脱离本发明的精神和范围的情况下做出各种修改和变型,这样的修改和变型均落入由所附权利要求所限定的范围之内。

Claims (10)

  1. 一种基于二值码字典树的搜索方法,其特征在于,包括:
    获取数据库中每一图像的二值码,将每个二值码划分为m段子串;
    针对数据库中所有图像的第j段子串,建立该第j段子串的一个二值码字典树;所述二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点;
    获取待查询图像的二值码以及该二值码的m段子串;
    针对待查询图像二值码的第j段子串,在数据库中所有图像的第j段子串对应的二值码字典树中查找汉明距离不超过
    Figure PCTCN2017104398-appb-100001
    的二值码;
    遍历待查询图像二值码的所有子串,获得每一子串的查询结果;
    其中:m,j均为正整数,r为预先确定的非负整数值,且j小于等于m。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    根据待查询图像二值码的所有子串的查询结果,进行合并去重测试,获取待查询图像的搜索结果。
  3. 根据权利要求1所述的方法,其特征在于,针对数据库中所有图像的第j段子串,建立该第j段子串的一个二值码字典树的步骤,包括:
    依据第j段子串长度和预设的参数值c,确定二值码字典树的最小索引单位;
    以及,根据预设的参数值b和最小索引单位,取前b个比特建立二值码字典树的节点,构建该第j段子串的二值码字典树;
    其中,所述二值码字典树的根节点依据该第j段子串左侧的第一个最小索引单位建立分支;对于第i层的节点,依据该第j段子串从左至右的第i个最小索引单位建立分支;所述叶子节点为所述二值码 字典树的末节点;
    所述二值码字典树中的每个节点对应于一个字符串,根节点对应于空串;对于第i层的节点,其对应的字符串为该第j段子串中的前i个最小索引单位组成的长度为i*c的字符串;
    其中b,c均为正整数,b为c的倍数,所述根节点和第i层的节点均为所述二值码字典树的内部节点,i为小于等于b/c的正整数。
  4. 根据权利要求3所述的方法,其特征在于,
    每一个二值码字典树中的每一个叶子节点均为以容器形式存在的末节点;
    所述容器内含有所有插入到这个末节点的字符串,这些字符串含有相同的前缀。
  5. 根据权利要求4所述的方法,其特征在于,针对待查询图像二值码的第j段子串,在数据库中所有图像的第j段子串对应的二值码字典树中查找汉明距离不超过
    Figure PCTCN2017104398-appb-100002
    的二值码的步骤,包括:
    从所述二值码字典树的根节点开始遍历该二值码字典树;
    对于该二值码字典树中的每一个节点,计算该节点对应的字符串与所述待查询图像二值码的第j段子串的汉明距离;
    若计算的汉明距离大于r’,则遍历在当前节点处停止;
    或者,当遍历至叶子节点时,在叶子节点所属的容器中获取相应的汉明距离不超过
    Figure PCTCN2017104398-appb-100003
    的二值码。
  6. 根据权利要求1至5任一所述的方法,其特征在于,获取数据库中每一图像的二值码,将每个二值码划分为m段子串的步骤,包括:
    获取数据库图像的二值码,该二值码长度为d;
    采用分段策略将每个二值码划分为m个不相交不重叠的子串;
    若d为m的倍数,则将二值码分为m段长度相同的子串,每段长度均为s=d/m;
    若d不是m的倍数,令v等于d除以m所得的余数,则分段时,前v段的长度为
    Figure PCTCN2017104398-appb-100004
    后m-v段的长度为
    Figure PCTCN2017104398-appb-100005
    和/或,获取待查询图像的二值码以及该二值码的m段子串的步骤,包括:
    获取待查询图像的二值码,该二值码长度为d;
    采用分段策略将每个二值码划分为m个不相交不重叠的子串;
    若d为m的倍数,则将二值码分为m段长度相同的子串,每段长度均为s=d/m;
    若d不是m的倍数,令v等于d除以m所得的余数,则分段时,前v段的长度为
    Figure PCTCN2017104398-appb-100006
    后m-v段的长度为
    Figure PCTCN2017104398-appb-100007
    其中d,m,s均为正整数,m小于等于d。
  7. 根据权利要求6所述的方法,其特征在于:
    若分段后每一字串长度s=32;则b=30,c=3或b=28,c=4;
    以及,建立数据库中所有图像的m段子串的二值码字典树的过程为离线阶段实现。
  8. 一种二值码近邻搜索方法,其特征在于,包括:
    对数据库中的所有二值码进行分段处理,获得m段子串;
    针对所有二值码的第j段子串,建立该第j段子串的一个二值码字典树;所述二值码字典树的数量为m个;每一二值码字典树包括:内部节点和叶子节点;
    接收待查询的二值码,将待查询的二值码分为m段子串;
    针对待查询二值码的第j段子串,在数据库中所有二值码的第j段子串对应的二值码字典树中查找汉明距离不超过
    Figure PCTCN2017104398-appb-100008
    的二值码;
    遍历待查询二值码的所有子串,获得每一子串的查询结果;
    根据待查询二值码的所有子串的查询结果,进行合并去重测试, 获取待查询二值码的查询结果;
    其中:m,j均为正整数,r为预先确定的非负整数值,且j小于等于m。
  9. 根据权利要求8所述的方法,其特征在于,针对所有二值码的第j段子串,建立该第j段子串的一个二值码字典树的步骤,包括:
    依据第j段子串长度和预设的参数值c,确定二值码字典树的最小索引单位;
    以及,根据预设的参数值b和最小索引单位,取前b个比特建立二值码字典树的节点,获得该第j段子串的二值码字典树;
    其中,所述二值码字典树的根节点依据该第j段子串左侧的第一个最小索引单位建立分支;对于第i层的节点,依据该第j段子串从左至右的第i个最小索引单位建立分支;所述叶子节点为所述二值码字典树的末节点;
    所述二值码字典树中的每个节点对应于一个字符串,根节点对应于空串;对于第i层的节点,其对应的字符串为该第j段子串中的前i个最小索引单位组成的长度为i*c的字符串;
    其中b,c均为正整数,b为c的倍数,所述根节点和第i层的节点均为所述二值码字典树的内部节点。
  10. 根据权利要求8或9所述的方法,其特征在于,
    每一个二值码字典树中的每一个叶子节点均为以容器形式存在的末节点;所述容器内含有所有插入到这个末节点的字符串,这些字符串含有相同的前缀;
    相应地,针对待查询二值码的第j段子串,在数据库中所有二值码的第j段子串对应的二值码字典树中查找汉明距离不超过
    Figure PCTCN2017104398-appb-100009
    的二值码的步骤,包括:
    从所述二值码字典树的根节点开始遍历该二值码字典树;
    对于该二值码字典树中的每一个节点,计算该节点对应的字符串 与所述待查询图像二值码的第j段子串的汉明距离;
    若计算的汉明距离大于r’,则遍历在当前节点处停止;
    或者,当遍历至叶子节点时,在叶子节点所属的容器中获取相应的汉明距离不超过
    Figure PCTCN2017104398-appb-100010
    的二值码。
PCT/CN2017/104398 2017-03-10 2017-09-29 一种基于二值码字典树的搜索方法 WO2018161548A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710142528.6A CN106980656B (zh) 2017-03-10 2017-03-10 一种基于二值码字典树的搜索方法
CN201710142528.6 2017-03-10

Publications (1)

Publication Number Publication Date
WO2018161548A1 true WO2018161548A1 (zh) 2018-09-13

Family

ID=59338160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/104398 WO2018161548A1 (zh) 2017-03-10 2017-09-29 一种基于二值码字典树的搜索方法

Country Status (2)

Country Link
CN (1) CN106980656B (zh)
WO (1) WO2018161548A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980656B (zh) * 2017-03-10 2018-07-10 北京大学 一种基于二值码字典树的搜索方法
CN107679073A (zh) * 2017-08-25 2018-02-09 中国科学院信息工程研究所 一种压缩网页指纹库构建方法和压缩网页快速相似性匹配方法
CN107862026B (zh) * 2017-10-31 2021-01-01 北京小度信息科技有限公司 数据存储方法及装置、数据查询方法及装置、电子设备
CN110188242B (zh) * 2019-05-30 2020-09-04 北京三快在线科技有限公司 无人驾驶设备定位方法、装置、无人驾驶设备和存储介质
CN110516118A (zh) * 2019-08-13 2019-11-29 出门问问(武汉)信息科技有限公司 一种字符串匹配方法、设备及计算机存储介质
CN111553670B (zh) * 2020-04-28 2021-10-15 腾讯科技(深圳)有限公司 一种交易处理方法、装置及计算机可读存储介质
CN112069286B (zh) * 2020-08-28 2024-01-02 喜大(上海)网络科技有限公司 字典树参数更新方法、装置、设备及存储介质
CN112988834B (zh) * 2021-02-07 2023-03-10 潍坊北大青鸟华光照排有限公司 一种字典短语的查询方法
WO2024108449A1 (zh) * 2022-11-23 2024-05-30 北京小米移动软件有限公司 一种信号量化方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345496A (zh) * 2013-06-28 2013-10-09 新浪网技术(中国)有限公司 多媒体信息检索方法和系统
US20140365500A1 (en) * 2013-06-11 2014-12-11 InfiniteBio Fast, scalable dictionary construction and maintenance
CN104951559A (zh) * 2014-12-30 2015-09-30 大连理工大学 一种基于位权重的二值码重排方法
CN105989001A (zh) * 2015-01-27 2016-10-05 北京大学 图像搜索方法及装置、图像搜索系统
CN106980656A (zh) * 2017-03-10 2017-07-25 北京大学 一种基于二值码字典树的搜索方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140365500A1 (en) * 2013-06-11 2014-12-11 InfiniteBio Fast, scalable dictionary construction and maintenance
CN103345496A (zh) * 2013-06-28 2013-10-09 新浪网技术(中国)有限公司 多媒体信息检索方法和系统
CN104951559A (zh) * 2014-12-30 2015-09-30 大连理工大学 一种基于位权重的二值码重排方法
CN105989001A (zh) * 2015-01-27 2016-10-05 北京大学 图像搜索方法及装置、图像搜索系统
CN106980656A (zh) * 2017-03-10 2017-07-25 北京大学 一种基于二值码字典树的搜索方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, ZHE: "Component hashing of variable-length binary aggregated desc- riptors for fast image search", 2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP 2014), October 2014 (2014-10-01), pages 2217 - 2221, XP055538086 *

Also Published As

Publication number Publication date
CN106980656A (zh) 2017-07-25
CN106980656B (zh) 2018-07-10

Similar Documents

Publication Publication Date Title
WO2018161548A1 (zh) 一种基于二值码字典树的搜索方法
US11048966B2 (en) Method and device for comparing similarities of high dimensional features of images
US7433869B2 (en) Method and apparatus for document clustering and document sketching
US9141676B2 (en) Systems and methods of modeling object networks
CN104035949B (zh) 一种基于局部敏感哈希改进算法的相似性数据检索方法
Chakrabarti et al. An efficient filter for approximate membership checking
Lu et al. Efficiently Supporting Edit Distance Based String Similarity Search Using B $^+ $-Trees
CN107368527B (zh) 基于数据流的多属性索引方法
CN109166615B (zh) 一种随机森林哈希的医学ct图像存储与检索方法
CN111868710A (zh) 搜索大规模非结构化数据的随机提取森林索引结构
US9298757B1 (en) Determining similarity of linguistic objects
WO2020057272A1 (zh) 一种索引数据存储及检索方法、装置及存储介质
CN107357843B (zh) 基于数据流结构的海量网络数据查找方法
US11520738B2 (en) Internal key hash directory in table
Wang et al. An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms.
Eghbali et al. Online nearest neighbor search in binary space
Matsui et al. Pqtable: Nonexhaustive fast search for product-quantized codes using hash tables
US8886677B1 (en) Integrated search engine devices that support LPM search operations using span prefix masks that encode key prefix length
Zhou et al. Adaptive subspace symbolization for content-based video detection
CN114911826A (zh) 一种关联数据检索方法和系统
Arroyuelo A dynamic pivoting algorithm based on spatial approximation indexes
Chauhan et al. Finding similar items using lsh and bloom filter
Huang et al. A Multi-Block N-ary trie structure for exact r-neighbour search in hamming space
Hou et al. Quick search algorithms based on ethnic facial image database
Mishra et al. Efficient edit distance based string similarity search using deletion neighborhoods

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17899262

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.02.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17899262

Country of ref document: EP

Kind code of ref document: A1