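WO2022252316A1 - Method and device for finding an optimal complete partition index of a metric space, and related components - Google Patents

Method and device for finding an optimal complete partition index of a metric space, and related components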

Info

Publication number
WO2022252316A1
Authority
WO
WIPO (PCT)
Prior art keywords
division
space
data
arrangement
selection
Prior art date
Application number
PCT/CN2021/102674
Other languages
English (en)
French (fr)
Inventor
毛睿
戴英龙
赖裕雄
王毅
刘刚
陆克中
陆敏华
陈倩婷
Original Assignee
深圳计算科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳计算科学研究院
Publication of WO2022252316A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/77 Software metrics

Definitions

  • the present application relates to the technical field of computer software, in particular to a method, a device and related components for finding an optimal complete partition index of a metric space.
  • when a partition-based metric space index is created, the data space is divided into subregions, based either on the space or on the data distribution. According to the logical shape of the resulting subspaces, most existing partition-based metric space indexes fall into two groups: those based on spherical partitioning and those based on hyperplane partitioning.
  • the index method based on spherical partitioning uses support points and radii as parameters to divide the space into multiple spherical subspaces.
  • two typical forms are the spherical partitioning of space represented by the vp-tree and the hierarchical spherical partitioning represented by the M-tree.
  • the vp-tree directly uses the distance from the data to the support point to divide the data into two parts, the inside of the ball and the outside of the ball, while the M-tree uses the form of the minimum bounding sphere to divide the data in a balanced manner.
  • the core idea of hyperplane partitioning is to assign each data item to the region represented by its nearest support point; the partitioned space logically takes the form of a Voronoi diagram.
  • the most basic forms of hyperplane partitioning are the hyperplane tree gh-tree proposed by Jeffrey K. Uhlmann and the GNA-tree proposed by Sergey Brin.
  • the result of hyperplane division has good geometric properties, and the regions obtained by the division do not overlap each other.
  • BM-index partitions using weighted information about the distances from the data to the support points, and is currently one of the few examples of optimizing an index through the shape of the partition boundary.
  • one reason for this situation is that performance comparisons between different indexes are carried out in each index's own experiments, while the indexing conditions of different index methods are often not the same and performance is determined jointly by several factors: differences in support points, in partitioning method, in index balance and so on all strongly affect index performance. When different indexes are compared directly, there is no unified model that can objectively evaluate the strengths and weaknesses of different methods, so the inherent differences between partitioning methods that lie behind the experiments cannot be reflected objectively.
  • the embodiment of the present application provides a method, device and related components for finding an optimal complete partition index of a metric space, aiming at measuring the performance of different partitioning methods to determine the optimal partitioning method.
  • the embodiment of the present application provides a method for finding an optimal complete partition index of a metric space, including:
  • the weight vector of the preset division method is set as the normal vector candidate set for dividing the support point space hyperplane
  • the performance difference of the corresponding division mode is determined according to the number of data falling into the target area, so as to determine the optimal division mode.
  • the embodiment of the present application provides a device for finding an optimal complete partition index of a metric space, including:
  • the data mapping unit is used to input the data in the preset data set into the metric space, and select n support points from the data set by using the point selection method to form a support point space, and then map the data to the support points space;
  • the candidate set setting unit is used to set the weight vector of the preset division mode as the normal vector candidate set for dividing the support point space hyperplane;
  • the normal vector selection unit is used to select n linearly independent normal vectors in the normal vector candidate set according to various selection arrangements to obtain corresponding multiple selection arrangement results, and use each selection arrangement result as a corresponding division method;
  • a complete linear division unit configured to use n linearly independent normal vectors in each division mode to perform a complete linear division on the support point space, to obtain a division result corresponding to each arrangement mode;
  • the performance determination unit is configured to, in the division results corresponding to each arrangement mode, determine the performance difference of the corresponding division mode according to the number of data falling into the target area, so as to determine the optimal division mode.
  • an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and operable on the processor; when the processor executes the computer program, the method for finding an optimal complete partition index of a metric space as described in the first aspect is realized.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method for finding an optimal complete partition index of a metric space as described in the first aspect is implemented.
  • the embodiment of the present application provides a method, device and related components for finding an optimal complete partition index of a metric space. The method includes: inputting the data of a preset data set into a metric space, selecting n support points from the data set by a point selection method to form a support point space, and then mapping the data to the support point space; setting the weight vectors of preset division methods as a normal vector candidate set for the hyperplanes that divide the support point space; selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and using each selection arrangement result as a corresponding division method; using the n linearly independent normal vectors of each division method to completely linearly divide the support point space, obtaining the division result corresponding to each arrangement; and, in the division result corresponding to each arrangement, determining the performance difference of the corresponding division method according to the number of data items falling into the target area, so as to determine the optimal division method.
  • the weight vectors corresponding to different division methods are used as hyperplanes to completely linearly divide the support point space, so that the performance differences between division methods can be determined from the different division results and the division method with the best performance can be obtained.
  • FIG. 1 is a schematic flowchart of a method for finding an optimal complete partition index in a metric space provided by an embodiment of the present application
  • FIG. 2 is a schematic subflow diagram of a method for finding an optimal complete partition index in a metric space provided by an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a device for finding an optimal complete partition index of a metric space provided by an embodiment of the present application
  • FIG. 4 is a sub-schematic block diagram of a device for finding an optimal complete partition index of a metric space provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of an example of a method for finding an optimal fully partitioned index in a metric space provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a method for finding an optimal complete partition index in a metric space provided by an embodiment of the present application, which specifically includes steps S101 to S105.
  • in this embodiment, a support point space is first constructed from the data set and the data are mapped into it; a normal vector candidate set is then built from the weight vectors of different partitioning methods, and n linearly independent normal vectors are selected from it to completely linearly partition the support point space. It follows that the number of weight vectors in the normal vector candidate set must be greater than or equal to n.
  • since there are many ways to select n linearly independent normal vectors, each selection arrangement can be used as a corresponding partitioning method to completely linearly partition the support point space. The performance of the different partitioning methods can then be judged by how much data falls into the target area, and the partitioning method with the best performance can be selected among them.
  • the index performance can be regarded as the partition performance of the partitioning method, that is, whether the index can partition as effectively as possible. For example, complete linear partitioning divides the data into n^k parts; if n^k − 1 parts can be excluded, only the last part needs to be kept, and the other n^k − 1 parts need not be searched.
  • this embodiment does not need to know the specific form that the partition boundary of a partitioning method takes in the metric space; it only needs the linear expression of that unknown form in the support point space to compare performance, which greatly reduces the time cost of the comparison and improves its efficiency.
  • while preserving the performance characteristics of the different partitioning methods as far as possible, a unified partitioning model, complete linear partitioning, is adopted to unify them.
  • the unified model (that is, the method for finding the optimal complete partition index of the metric space provided in this embodiment) is used to judge the performance differences between partitioning methods; using the same support points and the same number of sub-regions reduces the influence of factors other than the partitioning itself, making the comparison fairer and the conclusions more reliable.
  • complete linear partitioning is used to realize both spherical partitioning and hyperplane partitioning.
  • spherical partitioning and hyperplane partitioning are thereby unified: the various partition boundaries are mapped to the support point space and expressed there as linear expressions, which turns the problem of differing partition boundaries into a problem of differing linear expressions. The approach is no longer limited to one partitioning method; partition performance is improved by finding the linear expression with the highest exclusion rate.
  • using the unified model (that is, the method for finding the optimal complete partition index of the metric space provided by this embodiment), the linear expression of the partition boundary in the support point space makes it possible to find a better partitioning method.
  • the partition boundary is also mapped to the support point space, and the partition boundary is expressed in the form of a linear equation in the support point space.
  • the x-axis represents d(s,p1)
  • the y-axis represents d(s,p2)
  • a metric space is an abstraction that covers a wide range of data types.
  • a metric space can be defined as a two-tuple (S, d), where S is a finite, non-empty data set and d is a distance function defined on S with the following properties:
  • the support point space F_{P,d}(S) is the image of S in R^n:
  • the data set described in this embodiment may be random uniform vector, DNA, protein, two-dimensional map boundary data, etc.
  • the data dimension of the data set may be 1-5 dimensions
  • the data format is uniform.
  • the step S101 includes:
  • the point selection algorithm adopts the farthest-first traversal algorithm (Farthest First Traversal, FFT), which can pick out points at the corners of the data and has linear space and time complexity.
  • other point selection algorithms may also be used, such as the k-means point selection algorithm (with its various methods for determining k and for choosing the initial points).
  • the step S103 includes:
  • the normal vector candidate set includes m weight vectors, from which n linearly independent normal vectors are selected; there are then [Figure PCTCN2021102674-appb-000003] selection arrangement results in total, and each selection arrangement result can be used as a division method for dividing the support point space.
  • the m vectors in this embodiment may be linearly dependent or linearly independent, but the n selected normal vectors must be linearly independent, because only then is the information in every direction of the n-dimensional vector space taken into account.
  • if the normal vector candidate set contains exactly n weight vectors, they are the n linearly independent normal vectors derived from one particular division method.
  • this embodiment selects m weight vectors, which is equivalent to combining multiple division methods, and selects different numbers of normal vectors from different division methods, a total of n, so that it can be used in the current multiple division methods Find the optimal combination of partition vectors, that is, the optimal partition method.
  • the size of m can be determined according to the actual situation, that is, in the actual application process, which division methods need to be combined to find the optimal division method, and then the weight vectors of these division methods can be input.
  • the step S104 includes:
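  • using [Figure PCTCN2021102674-appb-000006] division methods yields [Figure PCTCN2021102674-appb-000007] division results. For example, completely linearly dividing the support point space with [Figure PCTCN2021102674-appb-000008] division methods yields [Figure PCTCN2021102674-appb-000009] division results, that is, 12 division results.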
  • the step S104 further includes: steps S201-S204.
  • an (n, k) complete linear division is performed using n linearly independent normal vectors. Specifically, for the support point space of n support points, n linearly independent ordered vectors are selected as the normal vectors of the dividing hyperplanes and labeled v1, ..., vn in turn. The support point space is then divided into k first subspaces using k parallel hyperplanes with v1 as the normal vector; each first subspace is further divided into k second subspaces using k parallel hyperplanes with v2 as the normal vector (the parallel hyperplanes acting on different subspaces may differ); and so on, until the n normal vectors are exhausted, producing k^n subspaces.
  • the region is divided by three hyperplanes, each of which divides the space into two parts, totaling eight subspaces.
  • the target area is the r-neighborhood of the division boundary L corresponding to each division mode.
  • the r-neighborhood of the division boundary L refers to an area "near" the division boundary L.
  • when the center q of the range search R(q,r) falls into this area, neither of the regions on the two sides of the division boundary can be excluded during the range search; the area is denoted Nr(L).
  • the step S105 includes:
  • the division method corresponding to the n linearly independent normal vectors with the least number of data falling into the r-neighborhood is determined as the optimal division method.
  • the pruning capability of each index is judged by how much data falls into the r-neighborhood after division, so as to determine the performance of each division method. Therefore, when n linearly independent normal vectors are completely linearly divided, the data falling into the r-neighborhood is the least, and the corresponding division method is the optimal division method.
  • Fig. 3 is a schematic block diagram of a device 300 for finding an optimal complete partition index of a metric space provided by an embodiment of the present application, and the device 300 includes:
  • the data mapping unit 301 is used to input the data in the preset data set into the metric space, and select n support points from the data set by using a point selection method to form a support point space, and then map the data to the support point space;
  • the candidate set setting unit 302 is used to set the weight vector of the preset division method as the normal vector candidate set for dividing the support point space hyperplane;
  • the normal vector selection unit 303 is used to select n linearly independent normal vectors in the normal vector candidate set according to various selection arrangements, to obtain corresponding multiple selection arrangement results, and use each selection arrangement result as a corresponding division method;
  • a complete linear division unit 304 configured to use n linearly independent normal vectors in each division mode to perform a complete linear division on the support point space, to obtain a division result corresponding to each arrangement mode;
  • the performance determination unit 305 is configured to, in the division results corresponding to each arrangement mode, determine the performance difference of the corresponding division mode according to the number of data falling into the target area, so as to determine the optimal division mode.
  • the data mapping unit 301 includes:
  • the farthest first traversal unit is configured to select n support points from the data set to form a support point space by using the farthest first traversal algorithm.
  • the normal vector selection unit 303 includes:
  • the selection arrangement result acquisition unit is used to select n linearly independent normal vectors from m weight vectors to obtain [Figure PCTCN2021102674-appb-000010] selection arrangement results, and to use each selection arrangement result as a corresponding division method.
  • the complete linear division unit 304 includes:
  • the complete linear division unit 304 further includes:
  • a marking unit 401 configured to mark n linearly independent normal vectors as v1, v2, ..., vn in sequence;
  • the first division unit 402 is configured to divide the support point space into k first subspaces by using k parallel hyperplanes of v1 normal vectors;
  • the second division unit 403 is configured to obtain k second subspaces by using the parallel hyperplane division of k v2 normal vectors for each first subspace;
  • the exhaustion unit 404 is used to continue in the same way until the n linearly independent normal vectors are exhausted, obtaining k^n subspaces.
  • the target area is the r-neighborhood of the division boundary L corresponding to each division mode.
  • the performance determination unit 305 includes:
  • the determination unit is configured to determine the division mode corresponding to the n linearly independent normal vectors with the least number of data falling in the r-neighborhood as the optimal division mode.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed, the steps provided in the above-mentioned embodiments can be realized.
  • the storage medium may include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes.
  • the embodiment of the present application also provides a computer device, which may include a memory and a processor.
  • a computer program is stored in the memory.
  • when the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented.
  • the computer equipment may also include components such as various network interfaces and power supplies.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

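A method and device for finding an optimal complete partition index of a metric space, and related components. The method includes: inputting the data of a preset data set into a metric space, selecting n support points from the data set by a point selection method to form a support point space, and then mapping the data to the support point space (S101); setting the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space (S102); selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method (S103); performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement (S104); and, in the partition result corresponding to each arrangement, determining the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method (S105). The performance differences between partitioning methods can thus be determined, and the partitioning method with the best performance obtained.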

Description

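Method and device for finding an optimal complete partition index of a metric space, and related components
This application is based on, and claims priority to, Chinese patent application No. 202110612925.1 filed on June 2, 2021, the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the technical field of computer software, and in particular to a method and device for finding an optimal complete partition index of a metric space, and related components.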
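Background
Most existing tree indexes for metric spaces are based on distance partitioning: the distances between the data and the support points are used to divide the data space into several smaller regions, so that some regions can be excluded during a range search and search efficiency improves. When a partition-based metric space index is created, the data space is divided into multiple subregions, based either on the space or on the data distribution. According to the logical shape of the resulting subspaces, most existing partition-based metric space indexes fall into two groups: those based on spherical partitioning and those based on hyperplane partitioning.
Index methods based on spherical partitioning use a support point and a radius as parameters to divide the space into several spherical subspaces. Two typical forms are the spherical partitioning of space represented by the vp-tree and the hierarchical spherical partitioning represented by the M-tree. The vp-tree uses the distance from the data to the support point directly, dividing the data into a part inside the ball and a part outside it, whereas the M-tree partitions the data in a balanced way using minimum bounding spheres.
The core idea of hyperplane partitioning is to assign each data item to the region represented by its nearest support point; the partitioned space logically takes the form of a Voronoi diagram. The most basic forms of hyperplane partitioning are the hyperplane tree gh-tree proposed by Jeffrey K. Uhlmann and the GNA-tree proposed by Sergey Brin. The result of hyperplane partitioning has good geometric properties: the regions it produces do not overlap.
However, among current optimizations of classic indexes, few approach the problem through the shape of the partition boundary. BM-index, which partitions using weighted information about the distances from the data to the support points, is one of the few examples of optimizing an index through the shape of the partition boundary. One reason for this situation is that performance comparisons between different indexes are carried out in each index's own experiments, while the indexing conditions of different index methods are often not the same and performance is determined jointly by several factors: differences in support points, in partitioning method, in index balance and so on all strongly affect index performance. When different indexes are compared directly, there is no unified model that can objectively evaluate the strengths and weaknesses of different methods, so the inherent differences between partitioning methods that lie behind the experiments cannot be reflected objectively; differences in the data sets and environments used in the experiments further reduce the objectivity of conclusions drawn from experimental results alone. The other reason is that partitioning methods in a metric space take varied forms that are hard to unify, and most mathematical tools cannot be used in a metric space, so the exploration of partitioning methods has no clear research direction, which makes it harder to optimize an index through its partitioning method.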
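Summary of the Application
The embodiments of this application provide a method and device for finding an optimal complete partition index of a metric space, and related components, with the aim of measuring the performance of different partitioning methods in order to determine the optimal one.
In a first aspect, an embodiment of this application provides a method for finding an optimal complete partition index of a metric space, including:
inputting the data of a preset data set into a metric space, selecting n support points from the data set by a point selection method to form a support point space, and then mapping the data to the support point space;
setting the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space;
selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method;
performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement;
in the partition result corresponding to each arrangement, determining the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method.
In a second aspect, an embodiment of this application provides a device for finding an optimal complete partition index of a metric space, including:
a data mapping unit, configured to input the data of a preset data set into a metric space, select n support points from the data set by a point selection method to form a support point space, and then map the data to the support point space;
a candidate set setting unit, configured to set the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space;
a normal vector selection unit, configured to select n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and to take each selection arrangement result as a corresponding partitioning method;
a complete linear partitioning unit, configured to perform complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement;
a performance determination unit, configured to determine, in the partition result corresponding to each arrangement, the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method.
In a third aspect, an embodiment of this application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the method for finding an optimal complete partition index of a metric space according to the first aspect is implemented.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the method for finding an optimal complete partition index of a metric space according to the first aspect is implemented.
The embodiments of this application provide a method and device for finding an optimal complete partition index of a metric space, and related components. The method includes: inputting the data of a preset data set into a metric space, selecting n support points from the data set by a point selection method to form a support point space, and then mapping the data to the support point space; setting the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space; selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method; performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement; and, in the partition result corresponding to each arrangement, determining the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method. The embodiments use the weight vectors corresponding to different partitioning methods as hyperplanes to completely linearly partition the support point space, so that the performance differences between partitioning methods can be determined from the different partition results and the partitioning method with the best performance can be obtained.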
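Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic flowchart of a method for finding an optimal complete partition index of a metric space provided by an embodiment of this application;
Figure 2 is a schematic sub-flowchart of a method for finding an optimal complete partition index of a metric space provided by an embodiment of this application;
Figure 3 is a schematic block diagram of a device for finding an optimal complete partition index of a metric space provided by an embodiment of this application;
Figure 4 is a schematic sub-block diagram of a device for finding an optimal complete partition index of a metric space provided by an embodiment of this application;
Figure 5 is a schematic diagram of an example in a method for finding an optimal complete partition index of a metric space provided by an embodiment of this application.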
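Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are evidently some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.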
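Referring to Figure 1, Figure 1 is a schematic flowchart of a method for finding an optimal complete partition index of a metric space provided by an embodiment of this application, which specifically includes steps S101 to S105.
S101: inputting the data of a preset data set into a metric space, selecting n support points from the data set by a point selection method to form a support point space, and then mapping the data to the support point space;
S102: setting the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space;
S103: selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method;
S104: performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement;
S105: in the partition result corresponding to each arrangement, determining the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method.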
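In this embodiment, a support point space is first constructed from the data set and the data of the data set are mapped into it; a normal vector candidate set is then built from the weight vectors of different partitioning methods, and n linearly independent normal vectors are selected from it to completely linearly partition the support point space. It follows that the number of weight vectors in the normal vector candidate set must be greater than or equal to n. Since there are many ways to select n linearly independent normal vectors, each selection arrangement can be used as a corresponding partitioning method to completely linearly partition the support point space. The performance of the different partitioning methods can then be judged by how much data falls into the target region, and the partitioning method with the best performance can be selected among them.
When range searches are run over several partitioning methods, the embodiments of this application do not require a separate set of index-building code to be written for each method in order to compare them; it is enough to pass in, as parameters, the linear expressions to which the different partitioning methods map in the support point space. Judging index performance across a combination of partitioning methods makes it possible to find a better partition. Index performance can be regarded as the partition performance of the partitioning method, that is, whether the index can partition as effectively as possible. For example, complete linear partitioning divides the data into n^k parts; if n^k − 1 parts can be excluded, only the last part needs to be kept, and the other n^k − 1 parts need not be searched. In addition, this embodiment does not need to know the specific form that the partition boundary of a partitioning method takes in the metric space; it only needs the linear expression of that unknown form in the support point space to compare performance, which greatly reduces the time cost of the comparison and improves its efficiency.
By analysing the different partitioning methods, this embodiment adopts a unified partitioning model, complete linear partitioning, to unify them while preserving their performance characteristics as far as possible. The unified model (that is, the method for finding an optimal complete partition index of a metric space provided by this embodiment) is used to judge the performance differences between partitioning methods; using the same support points and the same number of sub-regions reduces the influence of factors other than the partitioning itself on index performance, making the comparison fairer and the conclusions more reliable. Complete linear partitioning is also used to realize both spherical partitioning and hyperplane partitioning, which unifies the two: the various partition boundaries are mapped to the support point space and expressed there as linear expressions, turning the problem of differing partition boundaries into a problem of differing linear expressions. The approach is no longer limited to one partitioning method; partition performance is improved by finding the linear expression with the highest exclusion rate. Furthermore, using the unified model, the linear expression of the partition boundary in the support point space makes it possible to find a better partitioning method.
It will be understood that, in practice, when the data are mapped to the support point space, the partition boundary is mapped there as well, and in the support point space the boundary is expressed as a linear equation. Take GH partitioning as an example: suppose two support points p1 and p2 are selected. GH partitioning assigns each data item to its nearer support point, so the partition boundary is d(s,p1) = d(s,p2), where s is any data point. If the data are mapped to the support point space with p1 and p2 as support points, the x-axis represents d(s,p1), the y-axis represents d(s,p2), and the partition boundary is expressed in that space as x = y.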
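The GH example above can be checked with a few lines of Python. This is a minimal sketch, not the applicants' implementation: the Euclidean distance, the sample points and the function names `gh_side_metric` and `gh_side_pivot_space` are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Distance function d; Euclidean distance is an assumption of this sketch."""
    return math.dist(a, b)


def gh_side_metric(s, p1, p2, d=euclidean):
    """GH partitioning in the metric space: assign s to its nearer support point."""
    return 0 if d(s, p1) <= d(s, p2) else 1


def gh_side_pivot_space(s, p1, p2, d=euclidean):
    """The same decision taken in the support point space, where the boundary is x = y."""
    x, y = d(s, p1), d(s, p2)  # image of s is (d(s,p1), d(s,p2))
    return 0 if x - y <= 0 else 1


p1, p2 = (0.0, 0.0), (4.0, 0.0)
points = [(1.0, 1.0), (3.5, -2.0), (2.5, 5.0), (-1.0, 0.5)]
sides = [gh_side_pivot_space(s, p1, p2) for s in points]
print(sides)
```

Both functions make the same decision for every point, which is the sense in which the metric-space boundary d(s,p1) = d(s,p2) becomes the line x = y after mapping.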
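A metric space is an abstraction that covers a wide range of data types. A metric space can be defined as a two-tuple (S, d), where S is a finite, non-empty data set and d is a distance function defined on S with the following properties:
(1) Non-negativity: for any x, y ∈ S, d(x,y) ≥ 0, and d(x,y) = 0 ⇔ x = y.
(2) Symmetry: for any x, y ∈ S, d(x,y) = d(y,x).
(3) Triangle inequality: for any x, y, z ∈ S, d(x,y) + d(y,z) ≥ d(x,z).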
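Given a metric space (M, d), a data set S = {si | si ∈ M, i = 1, 2, ..., m}, and n support points P = {p1, p2, ..., pn} selected from S, for
Figure PCTCN2021102674-appb-000001
the distances d(s,pi) to the support points are taken as coordinates, which defines a mapping from M to an n-dimensional space. Writing s^P for the image of s in the n-dimensional space, we have:
F_{P,d}: M -> R^n : s^P ≡ F_{P,d}(s) = (f1(s), f2(s), ..., fn(s)) = (d(s,p1), d(s,p2), ..., d(s,pn)) ∈ F_{P,d}(M).
The support point space F_{P,d}(S) is then the image of S in R^n:
F_{P,d}(S) = {s^P | s^P = (d(s,p1), d(s,p2), ..., d(s,pn)), s ∈ S}.
Suppose the metric space contains three data points s1, s2 and s3, with d(s2,s1) = 12, d(s2,s3) = 23 and d(s1,s3) = 13. When s1 and s3 are chosen as support points, the resulting support point space has dimension 2, and the images of s1, s2 and s3 in the support point space are s1^P = (d(s1,s1), d(s1,s3)) = (0, 13), s2^P = (d(s2,s1), d(s2,s3)) = (12, 23) and s3^P = (d(s3,s1), d(s3,s3)) = (13, 0).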
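The three-point example can be reproduced directly. The sketch below is illustrative only; the lookup table `D` and the helper names `d` and `to_pivot_space` are hypothetical, and the distances are the ones given in the text.

```python
# Pairwise distances from the worked example (symmetric, so each pair is stored once).
D = {("s1", "s2"): 12, ("s2", "s3"): 23, ("s1", "s3"): 13}


def d(a, b):
    """Distance lookup that respects d(x, x) = 0 and symmetry."""
    if a == b:
        return 0
    return D.get((a, b), D.get((b, a)))


def to_pivot_space(s, pivots, dist):
    """Map s to its image (d(s,p1), ..., d(s,pn)) in the support point space."""
    return tuple(dist(s, p) for p in pivots)


pivots = ["s1", "s3"]  # the two support points chosen in the example
images = {s: to_pivot_space(s, pivots, d) for s in ["s1", "s2", "s3"]}
print(images)  # {'s1': (0, 13), 's2': (12, 23), 's3': (13, 0)}
```

Only the distance function is needed to build the images, which is why the mapping works for data types such as strings or sequences that have no coordinates of their own.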
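In addition, the data set in this embodiment may consist of random uniform vectors, DNA, proteins, two-dimensional map boundary data and so on; the data dimension of the data set may be 1 to 5, and the data format is uniform.
In one embodiment, step S101 includes:
selecting n support points from the data set using the farthest-first traversal algorithm to form the support point space.
In this embodiment, the point selection algorithm is the farthest-first traversal algorithm (Farthest First Traversal, FFT), which can pick out points at the corners of the data and has linear space and time complexity. In other embodiments, other point selection algorithms may of course be used, such as the k-means point selection algorithm (with its various methods for determining k and for choosing the initial points).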
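As an illustration of farthest-first traversal, the following sketch picks each new support point as the data item farthest from the support points chosen so far. The starting index, the Euclidean distance and the function name `farthest_first_traversal` are assumptions; the source does not say how the first point is chosen.

```python
import math


def farthest_first_traversal(data, n, dist=math.dist, start=0):
    """Select n support points; each new one maximizes its distance to those already chosen."""
    if not 0 < n <= len(data):
        raise ValueError("n must be between 1 and len(data)")
    chosen = [start]
    # min_dist[i] is the distance from data[i] to its nearest chosen support point.
    min_dist = [dist(x, data[start]) for x in data]
    while len(chosen) < n:
        nxt = max(range(len(data)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, x in enumerate(data):
            min_dist[i] = min(min_dist[i], dist(x, data[nxt]))
    return [data[i] for i in chosen]


data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
pivots = farthest_first_traversal(data, 3)
print(pivots)  # [(0.0, 0.0), (10.0, 10.0), (10.0, 0.0)]
```

Each round scans the data once, so selecting n support points costs a number of distance computations proportional to n times the data size, with one extra array of working memory.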
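In one embodiment, there are m weight vectors, where m is greater than n;
step S103 includes:
selecting n linearly independent normal vectors from the m weight vectors to obtain
Figure PCTCN2021102674-appb-000002
selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method.
In this embodiment, the normal vector candidate set includes m weight vectors, from which n linearly independent normal vectors are selected; there are then in total
Figure PCTCN2021102674-appb-000003
selection arrangement results, and each selection arrangement result can be used as a partitioning method for partitioning the support point space.
Note that the m vectors in this embodiment may be linearly dependent or linearly independent, but the n selected normal vectors must be linearly independent, because only then is the information in every direction of the n-dimensional vector space taken into account.
In addition, if the normal vector candidate set contains exactly n weight vectors, they are the n linearly independent normal vectors derived from one particular partitioning method. This embodiment instead uses m weight vectors, which amounts to combining several partitioning methods and selecting different numbers of normal vectors from each, n in total, so that the best combination of partition vectors among the current partitioning methods, that is, the optimal partitioning method, can be found. The size of m can be set according to the situation: whichever partitioning methods need to be combined in a given application, the weight vectors of those methods are supplied as input.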
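The selection step can be sketched as an enumeration that keeps only the linearly independent choices. The counting formula in the source is an image that is not reproduced here, so treating a selection arrangement as an ordered choice (a permutation) is an assumption, as are the sample weight vectors and the names `rank` and `candidate_partitions`.

```python
from fractions import Fraction
from itertools import permutations


def rank(rows):
    """Rank of a small matrix by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def candidate_partitions(weight_vectors, n):
    """Ordered selections of n candidate indices whose vectors are linearly independent."""
    return [idx for idx in permutations(range(len(weight_vectors)), n)
            if rank([weight_vectors[i] for i in idx]) == n]


# Hypothetical candidate set with m = 4 weight vectors in a 2-dimensional support point space.
weights = [(1, 0), (0, 1), (1, -1), (2, 0)]
parts = candidate_partitions(weights, 2)
print(len(parts))  # 10: the pairs built from (1, 0) and (2, 0) are dependent and dropped
```

With four candidates and n = 2 there are 12 ordered pairs before the independence filter; the filter removes the two orderings of the parallel pair.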
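In one embodiment, step S104 includes:
performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each of the
Figure PCTCN2021102674-appb-000004
partitioning methods, to obtain
Figure PCTCN2021102674-appb-000005
partition results.
In this embodiment, using
Figure PCTCN2021102674-appb-000006
partitioning methods yields
Figure PCTCN2021102674-appb-000007
partition results. For example, completely linearly partitioning the support point space with
Figure PCTCN2021102674-appb-000008
partitioning methods yields
Figure PCTCN2021102674-appb-000009
partition results, that is, 12 partition results.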
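In one embodiment, as shown in Figure 2, step S104 further includes steps S201 to S204.
S201: labeling the n linearly independent normal vectors v1, v2, ..., vn in turn;
S202: dividing the support point space into k first subspaces using k parallel hyperplanes with normal vector v1;
S203: for each first subspace, dividing it into k second subspaces using k parallel hyperplanes with normal vector v2;
S204: and so on, until the n linearly independent normal vectors are exhausted, yielding k^n subspaces.
In this embodiment, an (n, k) complete linear partition is performed using n linearly independent normal vectors. Specifically, for the support point space of n support points, n linearly independent ordered vectors are selected as the normal vectors of the partitioning hyperplanes and labeled v1, ..., vn in turn. The support point space is then divided into k first subspaces using k parallel hyperplanes with v1 as the normal vector; each first subspace is further divided into k second subspaces using k parallel hyperplanes with v2 as the normal vector (the parallel hyperplanes acting on different subspaces may differ); and so on, until the n normal vectors are exhausted, producing k^n subspaces.
For example, as shown in Figure 5, the region is partitioned by three hyperplanes, each of which splits the space into two parts, giving eight subspaces in total.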
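A minimal sketch of the (n, k) complete linear partition follows. The source does not say where the parallel hyperplanes are placed, so this sketch assumes they are positioned to split every cell into k groups of near-equal size along the current normal vector (which takes k − 1 cuts per cell); the name `complete_linear_partition` and the sample data are hypothetical.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def complete_linear_partition(points, normals, k):
    """Split the images level by level: one normal vector per level, k parts per cell."""
    cells = [list(points)]
    for v in normals:
        next_cells = []
        for cell in cells:
            # Order the cell along v; parallel hyperplanes with normal v cut this ordering.
            ordered = sorted(cell, key=lambda x: dot(v, x))
            for j in range(k):
                lo = j * len(ordered) // k
                hi = (j + 1) * len(ordered) // k
                next_cells.append(ordered[lo:hi])
        cells = next_cells
    return cells


# Images at the corners of a unit cube, three axis-aligned normal vectors, k = 2.
points = list(product((0, 1), repeat=3))
normals = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
cells = complete_linear_partition(points, normals, 2)
print(len(cells))  # 8, i.e. k^n with k = 2 and n = 3
```

Because the cuts are chosen per cell, the hyperplanes used inside different cells may sit at different offsets, which matches the remark in the text that the parallel hyperplanes acting on different subspaces may differ.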
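In addition, for spherical partitioning, once the sphere is mapped to the support point space its linear expression also takes the form of a hyperplane; for example, the partition boundary of VP (i.e. the VP-Tree, the vantage point tree for nearest neighbour search) is expressed in the support point space as x = r.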
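Seen this way, both boundaries are hyperplanes w·x = c in the support point space and differ only in their weight vector and offset. The sketch below illustrates this under assumed values: the radius 5.0, the sample images and the `side` helper are hypothetical, and reading the GH and VP boundaries as the weight vectors (1, −1) and (1, 0) is an interpretation of the text rather than something it states explicitly.

```python
def side(image, w, c):
    """Return 0 if the image lies on or below the hyperplane w . x = c, else 1."""
    return 0 if sum(wi * xi for wi, xi in zip(w, image)) <= c else 1


# Each boundary is a (weight vector, offset) pair in a 2-dimensional support point space.
VP = ((1.0, 0.0), 5.0)   # x = r with an assumed radius r = 5
GH = ((1.0, -1.0), 0.0)  # x = y

for image in [(3.0, 7.0), (8.0, 2.0)]:
    print(image, side(image, *VP), side(image, *GH))
```

Passing a different (weight vector, offset) pair is all it takes to switch partitioning method, which is the point of comparing methods through their linear expressions.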
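In one embodiment, the target region is the r-neighborhood of the partition boundary L corresponding to each partitioning method.
In this embodiment, the r-neighborhood of the partition boundary L is a region "near" the boundary L: when the center q of a range search R(q,r) falls into this region, neither of the regions on the two sides of the boundary can be excluded during the range search. It is denoted Nr(L).
From the definition of the r-neighborhood, the more likely q is to fall into the r-neighborhood of a partition boundary during a range search, the less likely that boundary is to exclude the other half of the data, and the weaker its pruning power. In other words, the size of the r-neighborhood is negatively correlated with the pruning power of the partition boundary.
In one embodiment, step S105 includes:
determining as the optimal partitioning method the one corresponding to the n linearly independent normal vectors with the fewest data items falling into the r-neighborhood.
In this embodiment, the pruning power of each index is judged by how much data falls into the r-neighborhood after partitioning, which determines how well each partitioning method performs. The partitioning method for which the fewest data items fall into the r-neighborhood after complete linear partitioning with n linearly independent normal vectors is therefore the optimal partitioning method.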
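The selection rule can be sketched as follows. The source does not give a formula for the r-neighborhood, so this sketch assumes one that follows from the triangle inequality: a range query R(q,r) maps into a hypercube of half-width r around the image of q, so a hyperplane w·x = c cannot exclude either side when |w·x − c| ≤ r·Σ|wi|. The candidate boundaries, the data and all function names are hypothetical, and each candidate is simplified to a flat list of hyperplanes.

```python
def in_r_neighborhood(x, w, c, r):
    """True if image x is too close to the hyperplane w . x = c for either side to be pruned."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - c) <= r * sum(abs(wi) for wi in w)


def neighborhood_count(images, hyperplanes, r):
    """Number of images lying in the r-neighborhood of at least one boundary."""
    return sum(any(in_r_neighborhood(x, w, c, r) for w, c in hyperplanes) for x in images)


def best_partition(images, candidates, r):
    """Pick the candidate with the fewest images in its r-neighborhood."""
    return min(candidates, key=lambda name: neighborhood_count(images, candidates[name], r))


images = [(1, 9), (2, 8), (5, 5), (5, 6), (9, 1), (6, 5)]
candidates = {
    "GH": [((1, -1), 0)],   # boundary x = y
    "VP": [((1, 0), 3.5)],  # boundary x = 3.5
}
r = 0.5
counts = {name: neighborhood_count(images, hps, r) for name, hps in candidates.items()}
best = best_partition(images, candidates, r)
print(counts, best)  # {'GH': 3, 'VP': 0} VP
```

On this toy data the x = y boundary runs through a cluster of images while x = 3.5 passes through a gap, so the second candidate leaves nothing in its r-neighborhood and is selected.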
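Figure 3 is a schematic block diagram of a device 300 for finding an optimal complete partition index of a metric space provided by an embodiment of this application. The device 300 includes:
a data mapping unit 301, configured to input the data of a preset data set into a metric space, select n support points from the data set by a point selection method to form a support point space, and then map the data to the support point space;
a candidate set setting unit 302, configured to set the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space;
a normal vector selection unit 303, configured to select n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and to take each selection arrangement result as a corresponding partitioning method;
a complete linear partitioning unit 304, configured to perform complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement;
a performance determination unit 305, configured to determine, in the partition result corresponding to each arrangement, the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method.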
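In one embodiment, the data mapping unit 301 includes:
a farthest-first traversal unit, configured to select n support points from the data set using the farthest-first traversal algorithm to form the support point space.
In one embodiment, there are m weight vectors, where m is greater than n;
the normal vector selection unit 303 includes:
a selection arrangement result acquisition unit, configured to select n linearly independent normal vectors from the m weight vectors to obtain
Figure PCTCN2021102674-appb-000010
selection arrangement results, and to take each selection arrangement result as a corresponding partitioning method.
In one embodiment, the complete linear partitioning unit 304 includes:
a partition result acquisition unit, configured to perform complete linear partitioning of the data using the n linearly independent normal vectors of each of the
Figure PCTCN2021102674-appb-000011
partitioning methods, to obtain
Figure PCTCN2021102674-appb-000012
partition results.
In one embodiment, as shown in Figure 4, the complete linear partitioning unit 304 further includes:
a labeling unit 401, configured to label the n linearly independent normal vectors v1, v2, ..., vn in turn;
a first partitioning unit 402, configured to divide the support point space into k first subspaces using k parallel hyperplanes with normal vector v1;
a second partitioning unit 403, configured to divide each first subspace into k second subspaces using k parallel hyperplanes with normal vector v2;
an exhaustion unit 404, configured to continue in the same way until the n linearly independent normal vectors are exhausted, yielding k^n subspaces.
In one embodiment, the target region is the r-neighborhood of the partition boundary L corresponding to each partitioning method.
In one embodiment, the performance determination unit 305 includes:
a determination unit, configured to determine as the optimal partitioning method the one corresponding to the n linearly independent normal vectors with the fewest data items falling into the r-neighborhood.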
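Since the device embodiments correspond to the method embodiments, refer to the description of the method embodiments for the device embodiments; the details are not repeated here.
An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the steps provided in the above embodiments can be implemented. The storage medium may include a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or any other medium that can store program code.
An embodiment of this application also provides a computer device, which may include a memory and a processor. A computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented. The computer device may of course also include components such as network interfaces and a power supply.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; refer to the description of the method for the relevant details. It should be noted that a person of ordinary skill in the art may make improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims of this application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.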

Claims (10)

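  1. A method for finding an optimal complete partition index of a metric space, characterized by comprising:
    inputting the data of a preset data set into a metric space, selecting n support points from the data set by a point selection method to form a support point space, and then mapping the data to the support point space;
    setting the weight vectors of preset partitioning methods as a candidate set of normal vectors for the hyperplanes that partition the support point space;
    selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method;
    performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement;
    in the partition result corresponding to each arrangement, determining the performance difference of the corresponding partitioning method according to the number of data items falling into a target region, thereby determining the optimal partitioning method.
  2. The method for finding an optimal complete partition index of a metric space according to claim 1, characterized in that said selecting n support points from the data set by a point selection method to form a support point space comprises:
    selecting n support points from the data set using the farthest-first traversal algorithm to form the support point space.
  3. The method for finding an optimal complete partition index of a metric space according to claim 1, characterized in that there are m weight vectors, where m is greater than n;
    said selecting n linearly independent normal vectors from the normal vector candidate set according to each of multiple selection arrangements to obtain the corresponding selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method, comprises:
    selecting n linearly independent normal vectors from the m weight vectors to obtain
    Figure PCTCN2021102674-appb-100001
    selection arrangement results, and taking each selection arrangement result as a corresponding partitioning method.
  4. The method for finding an optimal complete partition index of a metric space according to claim 3, characterized in that said performing complete linear partitioning of the data using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement, comprises:
    performing complete linear partitioning of the data using the n linearly independent normal vectors of each of the
    Figure PCTCN2021102674-appb-100002
    partitioning methods, to obtain
    Figure PCTCN2021102674-appb-100003
    partition results.
  5. The method for finding an optimal complete partition index of a metric space according to claim 1, characterized in that said performing complete linear partitioning of the support point space using the n linearly independent normal vectors of each partitioning method, to obtain the partition result corresponding to each arrangement, comprises:
    labeling the n linearly independent normal vectors v1, v2, ..., vn in turn;
    dividing the support point space into k first subspaces using k parallel hyperplanes with normal vector v1;
    for each first subspace, dividing it into k second subspaces using k parallel hyperplanes with normal vector v2;
    and so on, until the n linearly independent normal vectors are exhausted, yielding k^n subspaces.
  6. The method for finding an optimal complete partition index of a metric space according to claim 5, characterized in that the target region is the r-neighborhood of the partition boundary L corresponding to each partitioning method.
  7. The method for finding an optimal complete partition index of a metric space according to claim 6, characterized in that said determining, in the partition result corresponding to each arrangement, the performance difference of the corresponding partitioning method according to the number of data items falling into the target region comprises:
    determining as the optimal partitioning method the one corresponding to the n linearly independent normal vectors with the fewest data items falling into the r-neighborhood.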
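  8. A device for finding an optimal complete partition index of a metric space, characterized by comprising:
    a data mapping unit, configured to input the data of a preset data set into a metric space, select n support points from the data set by a support point selection method to form a support point space, and then map the data into the support point space;
    a candidate set setting unit, configured to set the weight vectors of preset partitioning schemes as a candidate set of normal vectors of the hyperplanes that partition the support point space;
    a normal vector selection unit, configured to select, according to each of a plurality of selection-and-arrangement modes, n linearly independent normal vectors from the candidate set of normal vectors to obtain a corresponding plurality of selection-and-arrangement results, and to take each selection-and-arrangement result as a corresponding partitioning scheme;
    a complete linear partition unit, configured to perform a complete linear partition of the support point space using the n linearly independent normal vectors of each partitioning scheme, to obtain a partition result corresponding to each arrangement mode;
    a performance determination unit, configured to determine, in the partition result corresponding to each arrangement mode, the performance difference of the corresponding partitioning scheme according to the number of data items falling within a target region, and thereby determine the optimal partitioning scheme.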
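  9. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for finding an optimal complete partition index of a metric space according to any one of claims 1 to 7.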
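  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method for finding an optimal complete partition index of a metric space according to any one of claims 1 to 7.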
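
The method of claims 1 to 7 can be illustrated with a minimal Python/NumPy sketch. It is not the applicant's implementation, and all function names (`farthest_first_pivots`, `map_to_pivot_space`, `candidate_partitions`, `near_boundary`, `best_partition`) are hypothetical. Several details that the claims leave open are filled in by assumption: the farthest-first traversal starts from the first object; the selection results are enumerated as unordered combinations, used in index order (the exact count in claim 3 is given only as a formula image); each subspace is cut at k - 1 equal-frequency values, since the claims do not fix the hyperplane offsets; and the r-neighborhood is measured as Euclidean distance to the cutting hyperplanes in the support point space.

```python
from itertools import combinations

import numpy as np


def farthest_first_pivots(data, n, dist):
    """Choose n support point indices by farthest-first traversal (claim 2)."""
    pivots = [0]  # assumption: the traversal starts from the first object
    nearest = np.array([dist(x, data[0]) for x in data], dtype=float)
    while len(pivots) < n:
        nxt = int(np.argmax(nearest))  # object farthest from all chosen pivots
        pivots.append(nxt)
        nearest = np.minimum(nearest, [dist(x, data[nxt]) for x in data])
    return pivots


def map_to_pivot_space(data, pivots, dist):
    """Map each object x to the vector (d(x, p1), ..., d(x, pn))."""
    return np.array([[dist(x, data[p]) for p in pivots] for x in data], dtype=float)


def candidate_partitions(candidates, n):
    """Yield each n-subset of candidate normals that is linearly independent."""
    candidates = np.asarray(candidates, dtype=float)
    for idx in combinations(range(len(candidates)), n):
        normals = candidates[list(idx)]
        if np.linalg.matrix_rank(normals) == n:
            yield idx, normals


def near_boundary(points, normals, k, r):
    """Mark points lying within distance r of any partition boundary."""
    points = np.asarray(points, dtype=float)
    near = np.zeros(len(points), dtype=bool)

    def split(idx, level):
        if level == len(normals) or len(idx) == 0:
            return
        v = normals[level] / np.linalg.norm(normals[level])
        proj = points[idx] @ v
        # assumption: k - 1 equal-frequency cut values produce the k slabs
        cuts = np.quantile(proj, np.arange(1, k) / k)
        near[idx] |= (np.abs(proj[:, None] - cuts[None, :]) <= r).any(axis=1)
        slab = np.searchsorted(cuts, proj, side="right")
        for s in range(k):  # recurse into each of the k subspaces
            split(idx[slab == s], level + 1)

    split(np.arange(len(points)), 0)
    return near


def best_partition(points, candidates, n, k, r):
    """Return (count, indices) of the scheme with the fewest near-boundary points."""
    return min(
        (int(near_boundary(points, normals, k, r).sum()), idx)
        for idx, normals in candidate_partitions(candidates, n)
    )
```

A typical call sequence would be `pivots = farthest_first_pivots(data, n, dist)`, then `points = map_to_pivot_space(data, pivots, dist)`, then `best_partition(points, candidates, n, k, r)`. Because the recursion cuts each subspace separately, the order of the normals within a scheme can change the result; a fuller sketch might therefore enumerate ordered arrangements rather than combinations.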
PCT/CN2021/102674 2021-06-02 2021-06-28 Method, device and related components for finding an optimal complete partition index of a metric space WO2022252316A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110612925.1A CN113282337B (zh) 2021-06-02 2021-06-02 Method, device and related components for finding an optimal complete partition index of a metric space
CN202110612925.1 2021-06-02

Publications (1)

Publication Number Publication Date
WO2022252316A1 true WO2022252316A1 (zh) 2022-12-08

Family

ID=77283090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102674 WO2022252316A1 (zh) 2021-06-02 2021-06-28 Method, device and related components for finding an optimal complete partition index of a metric space

Country Status (2)

Country Link
CN (1) CN113282337B (zh)
WO (1) WO2022252316A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796083B (zh) * 2023-06-29 2023-12-22 山东省国土测绘院 Spatial data partitioning method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070250476A1 (en) * 2006-04-21 2007-10-25 Lockheed Martin Corporation Approximate nearest neighbor search in metric space
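CN106528790A (zh) * 2016-11-08 2017-03-22 深圳大学 Method and device for selecting support points in a metric space
CN109669971A (zh) * 2018-12-18 2019-04-23 广东奥博信息产业股份有限公司 Metric space outlier detection method based on fast random dense support points
CN111831660A (zh) * 2020-07-16 2020-10-27 深圳大学 Method, device, computer equipment and storage medium for evaluating metric space partitioning schemes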

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
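CN104281652B (zh) * 2014-09-16 2017-10-17 深圳大学 Method for partitioning data one support point at a time in a metric space
CN107480258A (zh) * 2017-08-15 2017-12-15 佛山科学技术学院 Metric space outlier detection method based on multiple kinds of support points
CN109508349A (zh) * 2018-10-29 2019-03-22 广东奥博信息产业股份有限公司 Metric space outlier detection method and device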

Also Published As

Publication number Publication date
CN113282337A (zh) 2021-08-20
CN113282337B (zh) 2023-02-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21943673

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE