WO2022267094A1 - Method, apparatus and related device for constructing a metric space index based on Euclidean distance

Info

Publication number
WO2022267094A1
Authority
WO
WIPO (PCT)
Prior art keywords
support point
euclidean distance
support
algorithm
dimension
Prior art date
Application number
PCT/CN2021/104409
Other languages
English (en)
French (fr)
Inventor
毛睿
陈家颖
王毅
秦建斌
刘刚
陆克中
陆敏华
陈倩婷
Original Assignee
深圳计算科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳计算科学研究院
Publication of WO2022267094A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90324Query formulation using system suggestions
    • G06F16/90328Query formulation using system suggestions using search space presentation or visualization, e.g. category or range presentation and selection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification

Definitions

  • the present invention relates to the technical field of data processing, in particular to a method, device and related equipment for constructing a metric space index based on Euclidean distance.
  • the search result of the approximate nearest neighbor search method is not necessarily the data p closest to the search point q, but it must be very close to the nearest data p, that is, errors are allowed.
  • metric index is an indexing method that uses the distances from the data to the support points to build a prefix tree over the data, ordered by those support-point distances.
  • this method still cannot avoid the disadvantages of the traditional tree index algorithm, and it will not be as good as linear scanning when the number of selected support points is relatively large.
  • a metric space approximate nearest neighbor search method based on compression and Euclidean distance is needed, so that after the data is mapped to the support point space, a Euclidean-distance approximate nearest neighbor algorithm is used for the search; this extends the set of distance functions to which all Euclidean-distance-based algorithms apply and improves accuracy and query speed.
  • the purpose of the present invention is to provide a method, device and related equipment for constructing a metric space index based on Euclidean distance, aiming at solving the problems of slow query speed and low accuracy in the prior art.
  • the embodiment of the present invention provides a method for constructing a metric space index based on Euclidean distance, including:
  • the original dimension is estimated by a dimension estimation algorithm
  • the mapping support point is selected by a support point selection algorithm, and the number of the mapping support point is greater than the value of the original dimension;
  • the similarity between the data mapped to the support point space is calculated by Euclidean distance, and the index is constructed by Euclidean distance approximate nearest neighbor algorithm.
  • an embodiment of the present invention provides a device for constructing a metric space index based on Euclidean distance, including:
  • a dimension estimation unit configured to obtain an original data set and, according to the type of the original data set, estimate the original dimension with a dimension estimation algorithm;
  • a support point selection unit configured to select a mapping support point through a support point selection algorithm according to the original dimension, and the number of the mapping support points is greater than the value of the original dimension
  • mapping unit configured to map the original data set into a support point space through a distance function and the mapping support points
  • a dimensionality reduction unit is used to reduce the dimensionality of the data in the support point space through a dimensionality reduction algorithm
  • the index construction unit is used to calculate the similarity between the data mapped to the support point space through the Euclidean distance according to the support point space after dimension reduction, and construct the index through the Euclidean distance approximation nearest neighbor algorithm.
  • the embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and operable on the processor; when executing the computer program, the processor implements the Euclidean distance-based metric space index construction method described in the first aspect above.
  • an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the method for constructing a metric space index based on Euclidean distance described in the first aspect above.
  • the present invention constructs a metric space index based on the Euclidean distance through a Euclidean-distance approximate nearest neighbor algorithm; the index can then be used at search time, and the originally complicated distance calculation is simplified into the well-known and relatively simple Euclidean distance calculation, which improves accuracy and query speed.
  • FIG. 1 is a schematic flowchart of a method for constructing a metric space index based on Euclidean distance provided by an embodiment of the present invention
  • FIG. 2 is a schematic subflow diagram of step S102 of the method for constructing a metric space index based on Euclidean distance provided by an embodiment of the present invention.
  • FIG. 3 is a structural block diagram of an apparatus for constructing a metric space index based on Euclidean distance provided by an embodiment of the present invention.
  • a metric space is a data type abstraction with very broad coverage. It abstracts complex data objects into points in a metric space and uses the triangle inequality of a user-defined distance function to discard irrelevant data and reduce the number of direct distance calculations. Abstracting data into points of a metric space improves generality but loses coordinate information: the only available information is the distance value. The lack of coordinates limits metric space research to a narrow set of techniques and has greatly restricted progress. Therefore, the support point space model is used to transform the metric space, which has no coordinates, into a support point space, which has coordinates.
  • the metric space is a pair (M, d), where M is a finite and non-empty data set, and d is a distance function defined on M.
  • the distance function satisfies:
  • the support point space F_{P,d}(S) is the image of S in R^n:
  • Step S101 Obtain an original data set, and estimate the original dimension through a dimension estimation algorithm according to the type of the original data set;
  • the dimension estimation algorithm converts the data into the form of a distance matrix, and then estimates the dimension through the method of eigenvalues.
  • Step S102 According to the original dimension, select a mapping support point through a support point selection algorithm, and the number of the mapping support point is greater than the value of the original dimension;
  • because the data is mapped into the support point space through the selected mapping support points, the mapped data necessarily differs from the original data: only some points are selected as support points, so the information carried by the data not selected as support points is lost. To reduce this loss as much as possible, two measures are available: 1. use a good point selection algorithm, such as FFT and its improved variants; 2. increase the number of support points. It is therefore necessary to ensure that the number of selected mapping support points is larger than the value of the original dimension, to reduce the loss of precision.
  • a good point selection algorithm such as FFT and its related improved algorithm
  • the number of mapping support points is three times the value of the original dimension.
  • when the number of mapping support points is reduced, the dimension of the mapped data decreases accordingly and the data precision drops, but the storage cost is lower; when the number of mapping support points is increased, the dimension of the mapped data increases accordingly and the data precision rises, but the storage cost is higher. A balance between storage cost and data precision is therefore needed, and that balance is a number of mapping support points equal to 3 times the dimension of the original data set.
  • the number of mapping support points may also be around three times the dimension value of the original data set, subject to actual operation.
  • the support point selection algorithm is FFT algorithm
  • mapping support points by the support point selection algorithm includes:
  • S201 Randomly select a piece of data from the original data set as the first support point, and store it in an initially empty set of support points;
  • S202 Use all the data in the original data set except as support points as non-support points and store them in an initially empty non-support point set;
  • S203 Calculate the distances from all the non-support points to each support point in the support point set, and take the minimum value and store it in an initially empty minimum distance set;
  • S204 Select a non-support point corresponding to a maximum value in the minimum distance set as a second support point, and add it to the support point set;
  • the calculating the distances from all the non-support points to each support point in the support point set and taking the minimum value and storing it in an initially empty minimum distance set includes:
  • p_j represents a support point in the support point set P
  • x_i represents a non-support point in the original data set X
  • the distance function is used to map the multidimensional data in the metric space to the multidimensional data in the support point space with coordinates according to the distance between the data in the original data set and each support point.
  • the dimension reduction algorithm is used to reduce the dimensionality of the multi-dimensional data in the support point space, extract the main feature components of the data, alleviate the curse of dimensionality, and make each feature of the data after dimensionality reduction independent of each other.
  • the dimension of the reduced data is the same as the original dimension estimated by the dimension estimation algorithm. In this case the data accuracy is the highest: a higher dimension does not improve accuracy and a lower dimension reduces it, so in actual use it is enough to meet the requirements.
  • the dimensionality reduction algorithm is a principal component analysis algorithm.
  • S105 According to the support point space after dimension reduction, calculate the similarity between the data after being mapped to the support point space by Euclidean distance, and construct an index by Euclidean distance approximation nearest neighbor algorithm.
  • the Euclidean-distance approximate nearest neighbor algorithm calculates the similarity between the coordinates (coordinates in the support point space) that represent each data item of the metric space. The smaller the Euclidean distance, the more similar the items; the items are then sorted by similarity to form the index.
  • the approximate nearest neighbor algorithm for the Euclidean distance may be an algorithm such as PQ, HNSW, etc., and these algorithms can quickly calculate the Euclidean distance.
  • PCA will give a matrix for matrix multiplication
  • a distance codebook can be obtained, and the index codebook can be searched through the distance codebook.
  • the Euclidean distance between two points can be obtained, so we can compare the similarity of the two data through the Euclidean distance, which reduces the distance calculation time and the data transmission time from the storage device to the CPU, saving transmission time.
  • the codebook is the coordinates or serial numbers of a section of center points provided by approximate nearest neighbor algorithms such as PQ and HNSW.
  • Minkowski distance function is used in the support point space to calculate the distance stretching generated by mapping the data from the metric space to the support point space.
  • the distance d(x, y) between two points x and y in the metric space is compared with the distance L_p(x^p, y^p) between their images in the support point space, where k is the number of support points, k ≥ 2.
  • p is the order parameter of the Minkowski distance function.
  • when p takes a specific value, L_p denotes a specific distance. For example, when p is 1 it is the Manhattan distance, and when p is 2 it is the Euclidean distance.
  • in an incomplete support point space, L_1, L_2 and L_∞ all have errors, with upper bounds L_1(x^p, y^p) ≤ k·d(x, y) and L_∞(x^p, y^p) ≤ d(x, y); the accuracy of L_∞ was computed experimentally and compared with that of L_1 and L_2, and the comparison is not listed here.
  • L_2 has better stability, and it has higher accuracy than L_1 and L_∞ when the number of support points is relatively low and the amount of data accessed is relatively small.
  • an apparatus 300 for constructing a metric space index based on Euclidean distance including:
  • a dimension estimation unit 301 configured to obtain an original data set and, according to the type of the original data set, estimate the original dimension with a dimension estimation algorithm;
  • the support point selection unit 302 is configured to select a mapping support point through a support point selection algorithm according to the original dimension, and the number of the mapping support points is greater than the value of the original dimension;
  • a mapping unit 303 configured to map the original data set into a support point space through a distance function and the mapped support points
  • a dimensionality reduction unit 304 configured to perform dimensionality reduction on the data in the support point space through a dimensionality reduction algorithm
  • the index construction unit 305 is configured to construct an index by using the Euclidean distance approximation nearest neighbor algorithm according to the support point space after dimensionality reduction.
  • An embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the above metric space index construction method based on Euclidean distance when executing the computer program.
  • a computer readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the above-mentioned method for constructing a metric space index based on Euclidean distance.
  • the disclosed devices, apparatuses and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a division by logical function.
  • in actual implementation there may be other division methods, and units with the same function may also be combined into one unit; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a storage medium
  • the software product includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present invention.
  • the aforementioned storage medium includes: various media capable of storing program codes such as a U disk, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

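The invention discloses a method, apparatus and related device for constructing a metric space index based on Euclidean distance. The method includes: obtaining an original data set and, according to the type of the original data set, estimating the original dimension with a dimension estimation algorithm; selecting mapping support points with a support point selection algorithm according to the original dimension, the number of mapping support points being greater than the value of the original dimension; mapping the original data set in the metric space to a support point space through a distance function and the mapping support points; reducing the dimensionality of the data in the support point space with a dimensionality reduction algorithm; and, according to the dimension-reduced support point space, constructing an index with a Euclidean-distance approximate nearest neighbor algorithm. Because the metric space index is built with a Euclidean-distance approximate nearest neighbor algorithm, searches can use this index, and the originally complicated distance calculation is simplified into the well-known and comparatively simple Euclidean distance calculation, which improves accuracy and query speed.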

Description

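Method, apparatus and related device for constructing a metric space index based on Euclidean distance
Technical Field
The present invention relates to the technical field of data processing, and in particular to a method, apparatus and related device for constructing a metric space index based on Euclidean distance.
Background
With high-dimensional data, the curse of dimensionality causes the performance of traditional exact search methods such as tree indexes to degrade sharply, to the point of being worse than a linear scan. Approximate nearest neighbor search was developed in response. The result of an approximate nearest neighbor search is not necessarily the data item p closest to the query point q, but it is guaranteed to be very close to that nearest item; in other words, some error is allowed.
Among approximate nearest neighbor algorithms for non-metric spaces, most target only the Euclidean distance. They perform well under the Euclidean distance but cannot be extended to other distance functions, because these search algorithms are designed for specific distance functions such as the Euclidean distance.
There has been little research on approximate nearest neighbor algorithms for metric spaces. One known approach is the metric index, an indexing method that uses the distances from the data to the support points to build a prefix tree over the data, ordered by those support-point distances. This method still cannot avoid the drawbacks of traditional tree index algorithms, and it performs worse than a linear scan when a relatively large number of support points is selected.
A metric space approximate nearest neighbor search method based on compression and Euclidean distance is therefore needed, so that after the data is mapped to the support point space, a Euclidean-distance approximate nearest neighbor algorithm is used for the search; this extends the set of distance functions to which all Euclidean-distance-based algorithms apply and improves accuracy and query speed.
Summary of the Invention
The purpose of the present invention is to provide a method, apparatus and related device for constructing a metric space index based on Euclidean distance, aiming to solve the problems of slow query speed and low accuracy in the prior art.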
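In a first aspect, an embodiment of the present invention provides a method for constructing a metric space index based on Euclidean distance, including:
obtaining an original data set and, according to the type of the original data set, estimating the original dimension with a dimension estimation algorithm;
selecting mapping support points with a support point selection algorithm according to the original dimension, the number of the mapping support points being greater than the value of the original dimension;
mapping the original data set into a support point space through a distance function and the mapping support points;
reducing the dimensionality of the data in the support point space with a dimensionality reduction algorithm;
according to the dimension-reduced support point space, calculating the similarity between the data mapped to the support point space by Euclidean distance, and constructing an index with a Euclidean-distance approximate nearest neighbor algorithm.
In a second aspect, an embodiment of the present invention provides an apparatus for constructing a metric space index based on Euclidean distance, including:
a dimension estimation unit, configured to obtain an original data set and, according to the type of the original data set, estimate the original dimension with a dimension estimation algorithm;
a support point selection unit, configured to select mapping support points with a support point selection algorithm according to the original dimension, the number of the mapping support points being greater than the value of the original dimension;
a mapping unit, configured to map the original data set into a support point space through a distance function and the mapping support points;
a dimensionality reduction unit, configured to reduce the dimensionality of the data in the support point space with a dimensionality reduction algorithm;
an index construction unit, configured to calculate, according to the dimension-reduced support point space, the similarity between the data mapped to the support point space by Euclidean distance, and to construct an index with a Euclidean-distance approximate nearest neighbor algorithm.
In a third aspect, an embodiment of the present invention provides a computer device including a memory, a processor, and a computer program stored on the memory and operable on the processor; when executing the computer program, the processor implements the method for constructing a metric space index based on Euclidean distance described in the first aspect above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the method for constructing a metric space index based on Euclidean distance described in the first aspect above.
The present invention constructs a metric space index based on the Euclidean distance through a Euclidean-distance approximate nearest neighbor algorithm; the index can then be used at search time, and the originally complicated distance calculation is simplified into the well-known and comparatively simple Euclidean distance calculation, which improves accuracy and query speed.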
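Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the method for constructing a metric space index based on Euclidean distance provided by an embodiment of the present invention;
FIG. 2 is a schematic sub-flowchart of step S102 of the method for constructing a metric space index based on Euclidean distance provided by an embodiment of the present invention.
FIG. 3 is a structural block diagram of the apparatus for constructing a metric space index based on Euclidean distance provided by an embodiment of the present invention.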
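Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.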
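A metric space is a data type abstraction with very broad coverage. It abstracts complex data objects into points in a metric space and uses the triangle inequality of a user-defined distance function to discard irrelevant data and reduce the number of direct distance calculations. Abstracting data into points of a metric space improves generality but loses coordinate information: the only available information is the distance value. The lack of coordinates limits metric space research to a narrow set of techniques and has greatly restricted progress. Therefore, a support point space model is adopted to transform the metric space, which has no coordinates, into a support point space, which has coordinates.
The metric space is a pair (M, d), where M is a finite, non-empty data set and d is a distance function defined on M.
The distance function satisfies:
for any x, y ∈ M, d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y;
for any x, y ∈ M, d(x, y) = d(y, x);
for any x, y, z ∈ M, d(x, y) + d(y, z) ≥ d(x, z).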
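The three conditions can be checked mechanically on a small finite data set. The sketch below is only an illustration: `is_metric`, `table_distance` and the toy distance table are hypothetical names and values introduced here, not part of the patent.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the three metric conditions for d on a finite set of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            return False                      # non-negativity
        if (d(x, y) <= tol) != (x == y):
            return False                      # d(x, y) = 0 exactly when x = y
        if abs(d(x, y) - d(y, x)) > tol:
            return False                      # symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            return False                      # triangle inequality
    return True

# Hypothetical toy data: three items with hand-picked pairwise distances.
TABLE = {
    frozenset(("s1", "s2")): 12,
    frozenset(("s2", "s3")): 23,
    frozenset(("s1", "s3")): 13,
}

def table_distance(x, y):
    """Look up a pairwise distance; an item is at distance 0 from itself."""
    return 0 if x == y else TABLE[frozenset((x, y))]

print(is_metric(["s1", "s2", "s3"], table_distance))
```

A squared difference such as `(x - y) ** 2` on the integers 0, 1, 2 fails the check, because it breaks the triangle inequality (1 + 1 < 4).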
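For a metric space (M, d) and data S = {s_i | s_i ∈ M, i = 1, 2, ..., m}, n support points P = {p_1, p_2, ..., p_n} are selected from S. For
Figure PCTCN2021104409-appb-000001
taking the distances d(s, p_i) from a data item to the support points as coordinates defines a mapping from M to an n-dimensional space. Writing s^P for the image of s in the n-dimensional space, the mapping function F_{P,d} is:
F_{P,d}(s) = (f_1(s), f_2(s), ..., f_n(s)) = (d(s, p_1), d(s, p_2), ..., d(s, p_n)) ∈ F_{P,d}(M);
The support point space F_{P,d}(S) is the image of S in R^n:
F_{P,d}(S) = {s^P | s^P = (d(s, p_1), d(s, p_2), ..., d(s, p_n)), s ∈ S}.
For example, given three data items s_1, s_2, s_3 in a metric space with d(s_2, s_1) = 12, d(s_2, s_3) = 23 and d(s_1, s_3) = 13, choosing s_1 and s_3 as the two support points gives a support point space of dimension 2, and the images of s_1, s_2, s_3 in the support point space are s_1^P = (d(s_1, s_1), d(s_1, s_3)) = (0, 13), s_2^P = (d(s_2, s_1), d(s_2, s_3)) = (12, 23), s_3^P = (d(s_3, s_1), d(s_3, s_3)) = (13, 0).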
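The three-item example can be reproduced with a few lines of code. This is a minimal sketch; the function name `to_support_point_space` and the lookup table `PAIRWISE` are illustrative assumptions, and only the three distances come from the text.

```python
def to_support_point_space(items, support_points, d):
    """Map each item s to its image (d(s, p_1), ..., d(s, p_n))."""
    return {s: tuple(d(s, p) for p in support_points) for s in items}

# Pairwise distances taken from the example in the text.
PAIRWISE = {
    frozenset(("s1", "s2")): 12,
    frozenset(("s2", "s3")): 23,
    frozenset(("s1", "s3")): 13,
}

def d(x, y):
    """Distance lookup; an item is at distance 0 from itself."""
    return 0 if x == y else PAIRWISE[frozenset((x, y))]

images = to_support_point_space(["s1", "s2", "s3"], ["s1", "s3"], d)
print(images)  # {'s1': (0, 13), 's2': (12, 23), 's3': (13, 0)}
```

Each image has one coordinate per support point, so two support points give the two-dimensional space described above.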
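The above are the definitions of the metric space and related concepts.
Referring to FIG. 1, a method for constructing a metric space index based on Euclidean distance includes steps S101-S105:
Step S101: obtain an original data set and, according to the type of the original data set, estimate the original dimension with a dimension estimation algorithm;
In this embodiment, the dimension estimation algorithm converts the data into a distance matrix and then estimates the dimension from its eigenvalues.
Different data types have different true dimensions, and the true dimension of a data set is not always publicly known, so it has to be estimated in this way. The estimate gives a dimension for the original data set, which facilitates subsequent processing and precision calculation.
Step S102: according to the original dimension, select mapping support points with a support point selection algorithm, the number of the mapping support points being greater than the value of the original dimension;
In this embodiment, because the data is mapped into the support point space through the selected mapping support points, the mapped data necessarily differs from the original data: only some points are selected as support points, so the information carried by the data not selected as support points is lost. To reduce this loss as much as possible, two measures are available: 1. use a good point selection algorithm, such as FFT and its improved variants; 2. increase the number of support points. It is therefore necessary to ensure that the number of selected mapping support points is larger than the value of the original dimension, to reduce the loss of precision.
Preferably, the number of mapping support points is 3 times the value of the original dimension.
Specifically, reducing the number of mapping support points reduces the dimension of the mapped data and lowers the data precision, but also lowers the storage cost; increasing the number of mapping support points increases the dimension of the mapped data and raises the data precision, but also raises the storage cost. A balance between storage cost and data precision is therefore needed, and that balance is a number of mapping support points equal to 3 times the dimension of the original data set.
Of course, the number of mapping support points may also be close to 3 times the dimension of the original data set, subject to actual operation.
Referring to FIG. 2, in an embodiment, the support point selection algorithm is the FFT (farthest-first traversal) algorithm;
Selecting the mapping support points with the support point selection algorithm includes:
S201: randomly select one data item from the original data set as the first support point and store it in an initially empty support point set;
S202: take all data in the original data set other than the support points as non-support points and store them in an initially empty non-support point set;
S203: calculate the distance from each non-support point to every support point in the support point set, and store the minimum of these distances in an initially empty minimum distance set;
S204: select the non-support point corresponding to the maximum value in the minimum distance set as the second support point and add it to the support point set;
S205: and so on (repeating steps S202-S204) until the support point set contains K+1 support points; the first support point is then removed, leaving K support points as the mapping support points.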
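Steps S201-S205 can be sketched as follows. This is an illustrative reading of the steps rather than the patent's reference implementation; the function name, the `seed` parameter and the incremental update of the minimum distances are assumptions made here for brevity.

```python
import random

def farthest_first_traversal(data, d, k, seed=None):
    """Select k mapping support points following steps S201-S205."""
    if not 0 < k < len(data):
        raise ValueError("k must satisfy 0 < k < len(data)")
    rng = random.Random(seed)
    first = rng.randrange(len(data))          # S201: random first support point
    chosen = [first]
    # S202/S203: for every item, the distance to its nearest chosen support point
    nearest = [d(x, data[first]) for x in data]
    while len(chosen) < k + 1:                # S205: stop at K+1 support points
        # S204: pick the non-support point whose nearest-support distance is largest
        candidates = [i for i in range(len(data)) if i not in chosen]
        nxt = max(candidates, key=lambda i: nearest[i])
        chosen.append(nxt)
        # Refresh the minimum distances against the newly added support point
        nearest = [min(nearest[i], d(data[i], data[nxt])) for i in range(len(data))]
    return [data[i] for i in chosen[1:]]      # S205: drop the random first point

points = [0, 1, 2, 10, 11, 20]
print(farthest_first_traversal(points, lambda a, b: abs(a - b), k=2, seed=0))
```

Keeping a running minimum per item avoids recomputing all distances to all support points in every round; each round costs one distance evaluation per item.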
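In an embodiment, calculating the distance from each non-support point to every support point in the support point set and storing the minimum in an initially empty minimum distance set includes:
calculating, by the following formula, the minimum of the distances from each non-support point to the support points in the support point set:
Figure PCTCN2021104409-appb-000002
where p_j denotes a support point in the support point set P, and x_i denotes a non-support point in the original data set X;
Figure PCTCN2021104409-appb-000003
denotes the distance from a non-support point to a support point in the original data set;
when evaluating the formula, it suffices to keep p_j fixed and let x_i range over all the non-support points in the original data set X, which gives the distances from all the non-support points to the support points in the support point set.
Specifically, refer to the following table:
Suppose there are n support points p_1, p_2, ..., p_n, n < k (k is the total number of support points to select), and the original data set contains m non-support points in total. The FFT method for finding the next support point is:
Figure PCTCN2021104409-appb-000004
Table 1
As shown in Table 1, each column gives the distances d_n, n = 1, 2, 3, ..., n, from all data in the original data set to one support point; the minimum distance D_n = min(d_n) is found in each column; the maximum max(D_1, D_2, ..., D_n) of these minimum distances is then found, and the data item corresponding to this maximum distance is taken as the next support point.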
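S103: map the original data set into the support point space through the distance function and the mapping support points;
The similarity between the mapped data of the original data set is calculated through the distance function;
In this embodiment, the distance function is used to map the multidimensional data in the metric space to multidimensional data in the support point space, which has coordinates, according to the distances from the data in the original data set to each support point.
S104: reduce the dimensionality of the data in the support point space with a dimensionality reduction algorithm;
In this embodiment, the dimensionality reduction algorithm reduces the dimensionality of the multidimensional data in the support point space, extracts the principal feature components of the data, alleviates the curse of dimensionality, and makes the features of the reduced data mutually independent.
Preferably, the dimension of the reduced data is the same as the original dimension estimated by the dimension estimation algorithm. In this case the data precision is the highest: a higher dimension does not improve accuracy and a lower dimension reduces it, so in actual use it is enough to meet the requirements.
Specifically, the dimensionality reduction algorithm is principal component analysis (PCA).
S105: according to the dimension-reduced support point space, calculate the similarity between the data mapped to the support point space by Euclidean distance, and construct an index with a Euclidean-distance approximate nearest neighbor algorithm.
In this embodiment, the Euclidean-distance approximate nearest neighbor algorithm calculates the similarity between the coordinates (coordinates in the support point space) that represent each data item of the metric space. The smaller the Euclidean distance, the more similar the items; the items are then sorted by similarity to form the index.
Specifically, the Euclidean-distance approximate nearest neighbor algorithm may be PQ, HNSW or a similar algorithm; these algorithms can compute Euclidean distances quickly.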
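For orientation, the kind of query the index answers can be shown with an exact brute-force search over support-point coordinates. This is deliberately not PQ or HNSW: it is a plain linear scan used here as a stand-in baseline, and `build_index` and `nearest` are hypothetical helper names.

```python
import math

def build_index(vectors):
    """Baseline 'index': keep (id, vector) pairs for a linear scan."""
    return list(enumerate(vectors))

def nearest(index, query, top_k=1):
    """Return the ids of the top_k vectors closest to query in Euclidean distance."""
    ranked = sorted(index, key=lambda item: math.dist(item[1], query))
    return [item_id for item_id, _ in ranked[:top_k]]

# Support-point-space coordinates of three items (illustrative values).
index = build_index([(0, 13), (12, 23), (13, 0)])
print(nearest(index, (11, 20)))      # [1]
print(nearest(index, (1, 12), 2))    # [0, 1]
```

An approximate method would replace the linear scan with a compressed or graph-based structure, trading a small loss of accuracy for much less work per query.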
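The use of the index is explained below with DNA as an example:
The previously built index is a codebook that holds the compressed data and simplifies distance calculation;
At query time, DNA fragment data is input, for example the fragment "AGTC";
The estimated dimension for the "AGTC" fragment is obtained with the dimension estimation algorithm;
Four support points p1, p2, p3, p4 are selected with the support point selection algorithm;
The distance function (edit distance) gives the distances d1, d2, d3, d4 from a data item in the "AGTC" fragment to each support point; these four values are the coordinates (d1, d2, d3, d4) that represent the data item in the support point space;
Mapping with PCA (PCA provides a matrix, and a matrix multiplication is performed) gives the mapped coordinates (d'1, d'2, d'3, d'4) of (d1, d2, d3, d4);
The previously obtained index is then used: computing (d'1, d'2, d'3, d'4) against the codebook gives a distance codebook, and looking up the index codebook through the distance codebook gives the Euclidean distance between two points. The similarity of two data items can thus be compared by the size of the Euclidean distance, which reduces both the distance calculation time and the time to transfer data from the storage device to the CPU.
The one or several fragments closest to the DNA fragment are returned.
Here, the codebook is the coordinates or sequence numbers of a set of center points provided by an approximate nearest neighbor algorithm such as PQ or HNSW; the approximate nearest neighbors are obtained by computing the Euclidean distance from the query point to each center point (this is where the originally complicated distance calculation is simplified into the well-known and comparatively simple Euclidean distance calculation).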
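The first part of this walkthrough, turning a DNA fragment into support-point coordinates with the edit distance, can be sketched as below. The four support-point strings are hypothetical values chosen for illustration, and the PCA and codebook lookup steps are left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def query_coordinates(fragment, support_points):
    """Coordinates (d1, ..., dn) of a fragment in the support point space."""
    return tuple(edit_distance(fragment, p) for p in support_points)

# Hypothetical support points p1..p4; real ones would come from the selection step.
SUPPORT_POINTS = ["AGTC", "AGGC", "TTTT", "AG"]
coords = query_coordinates("AGTC", SUPPORT_POINTS)
print(coords)  # (0, 1, 3, 2)
```

From this point on, every comparison works on short numeric tuples rather than on strings, which is what allows a Euclidean-distance index to be used.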
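The following derivation shows that the Euclidean distance performs well in a metric space:
Specifically, distances of the Minkowski family are compared after mapping, where L_1 is the Manhattan distance, L_2 is the Euclidean distance and L_∞ is the Chebyshev distance.
The Minkowski distance function is used in the support point space to calculate how much distances stretch or shrink when the data is mapped from the metric space to the support point space.
Specifically, the distance d(x, y) between two points x, y in the metric space is compared with the distance L_p(x^p, y^p) between their images in the support point space, where
Figure PCTCN2021104409-appb-000005
k is the number of support points, k ≥ 2.
Here p is the order parameter of the Minkowski distance function; a specific value of p gives a specific distance: for example, p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.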
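The comparison can be tried numerically. The sketch below assumes the standard Minkowski form of L_p over the support-point coordinates (the formula image itself is not reproduced in this text) and reuses the images (0, 13) and (12, 23) of s_1 and s_2 from the earlier three-item example; `minkowski` is a helper name introduced here.

```python
import math

def minkowski(u, v, p):
    """L_p distance between coordinate tuples u and v; p may be math.inf."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)                     # Chebyshev distance
    return sum(t ** p for t in diffs) ** (1.0 / p)

x_img, y_img = (0, 13), (12, 23)   # images of s_1 and s_2
d_xy, k = 12, 2                    # original distance d(s_1, s_2), number of support points

l1 = minkowski(x_img, y_img, 1)           # 22.0
l2 = minkowski(x_img, y_img, 2)           # sqrt(244), about 15.62
linf = minkowski(x_img, y_img, math.inf)  # 12

print(l1, l2, linf)
print(linf <= d_xy, l1 <= k * d_xy, l2 <= math.sqrt(k) * d_xy)
```

Here s_1 is itself a support point, so L_∞ recovers d(s_1, s_2) = 12 exactly, while L_1 and L_2 overestimate it.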
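Figure PCTCN2021104409-appb-000006
In an incomplete support point space:
For the distance function L_1:
① when x and y are both support points, let p_t = x and p_l = y:
Figure PCTCN2021104409-appb-000007
Figure PCTCN2021104409-appb-000008
therefore 2d(x, y) ≤ L_1(x^p, y^p) ≤ k·d(x, y);
② when neither x nor y is a support point:
Figure PCTCN2021104409-appb-000009
③ when one of x or y is a support point, say x, let p_t = x:
Figure PCTCN2021104409-appb-000010
therefore d(x, y) ≤ L_1(x^p, y^p) ≤ k·d(x, y).
For the distance function L_2:
① when x and y are both support points, let p_t = x and p_l = y:
Figure PCTCN2021104409-appb-000011
therefore
Figure PCTCN2021104409-appb-000012
② when neither x nor y is a support point:
Figure PCTCN2021104409-appb-000013
③ when one of x or y is a support point, say x, let p_t = x:
Figure PCTCN2021104409-appb-000014
therefore
Figure PCTCN2021104409-appb-000015
The cases x = y and x ≠ y yield the same inequalities, so they are not discussed separately.
In a complete support point space, it can be proved mathematically that L_∞ has no error, so L_∞ is the best.
In practice, however, when the data is large it is hard to map the data to a complete support point space; the data can only be mapped to an incomplete support point space, where L_1, L_2 and L_∞ all have errors, with upper bounds L_1(x^p, y^p) ≤ k·d(x, y),
Figure PCTCN2021104409-appb-000016
L_∞(x^p, y^p) ≤ d(x, y). The accuracy of L_∞ was computed experimentally and compared with that of L_1 and L_2; the comparison is not listed here.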
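The upper bounds quoted above follow from the triangle inequality alone. The short derivation below assumes the standard Minkowski form of L_p over the k support-point coordinates; because the original formula images are not reproduced in this text, it is a reconstruction and not necessarily the patent's own derivation, and the L_2 line in particular is the bound this reasoning yields rather than a transcription of the missing figure.

```latex
\begin{aligned}
\lvert d(x,p_i) - d(y,p_i) \rvert &\le d(x,y), \qquad i = 1,\dots,k
  && \text{(triangle inequality)} \\
L_\infty(x^p,y^p) &= \max_{1 \le i \le k} \lvert d(x,p_i) - d(y,p_i) \rvert \le d(x,y) \\
L_1(x^p,y^p) &= \sum_{i=1}^{k} \lvert d(x,p_i) - d(y,p_i) \rvert \le k\, d(x,y) \\
L_2(x^p,y^p) &= \Bigl( \sum_{i=1}^{k} \lvert d(x,p_i) - d(y,p_i) \rvert^{2} \Bigr)^{1/2} \le \sqrt{k}\, d(x,y)
\end{aligned}
```

Each coordinate difference is at most d(x, y), so summing k of them, or taking the root of the sum of their squares, can inflate the distance by at most a factor of k or √k, while the maximum cannot inflate it at all.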
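Experiments show that, in approximate nearest neighbor search, L_2 is more stable and already has higher accuracy than L_1 and L_∞ when the number of support points is relatively low and the amount of data accessed is relatively small.
With the amount of data accessed held constant, as the number of support points increases the accuracy of L_∞ gradually approaches that of L_2 and may even exceed it, which matches the expectation for L_∞ (the closer to a complete support point space, the smaller the error of L_∞). By that point, however, L_2 already has high, acceptable accuracy, and in ordinary circumstances the data cannot be mapped to a complete support point space (the data is too large). In an incomplete support point space, L_2 performs best.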
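Referring to FIG. 3, an apparatus 300 for constructing a metric space index based on Euclidean distance includes:
a dimension estimation unit 301, configured to obtain an original data set and, according to the type of the original data set, estimate the original dimension with a dimension estimation algorithm;
a support point selection unit 302, configured to select mapping support points with a support point selection algorithm according to the original dimension, the number of the mapping support points being greater than the value of the original dimension;
a mapping unit 303, configured to map the original data set into a support point space through a distance function and the mapping support points;
a dimensionality reduction unit 304, configured to reduce the dimensionality of the data in the support point space with a dimensionality reduction algorithm;
an index construction unit 305, configured to construct an index with a Euclidean-distance approximate nearest neighbor algorithm according to the dimension-reduced support point space.
An embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the method for constructing a metric space index based on Euclidean distance described above when executing the computer program.
Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the method for constructing a metric space index based on Euclidean distance described above.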
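Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed devices, apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are only illustrative: the division of the units is only a division by logical function, and in actual implementation there may be other division methods; units with the same function may also be combined into one unit, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disk.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.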

Claims (10)

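  1. A method for constructing a metric space index based on Euclidean distance, characterized by comprising:
    obtaining an original data set and, according to the type of the original data set, estimating an original dimension with a dimension estimation algorithm;
    selecting mapping support points with a support point selection algorithm according to the original dimension, the number of the mapping support points being greater than the value of the original dimension;
    mapping the original data set into a support point space through a distance function and the mapping support points;
    reducing the dimensionality of the data in the support point space with a dimensionality reduction algorithm;
    according to the dimension-reduced support point space, calculating the similarity between the data mapped to the support point space by Euclidean distance, and constructing an index with a Euclidean-distance approximate nearest neighbor algorithm.
  2. The method for constructing a metric space index based on Euclidean distance according to claim 1, characterized in that the number of the mapping support points is 3 times the value of the original dimension.
  3. The method for constructing a metric space index based on Euclidean distance according to claim 1, characterized in that the support point selection algorithm is the FFT algorithm;
    selecting the mapping support points with the support point selection algorithm comprises:
    randomly selecting one data item from the original data set as a first support point and storing it in an initially empty support point set;
    taking all data in the original data set other than the support points as non-support points and storing them in an initially empty non-support point set;
    calculating the distance from each non-support point to every support point in the support point set and storing the minimum of these distances in an initially empty minimum distance set;
    selecting the non-support point corresponding to the maximum value in the minimum distance set as a second support point and adding it to the support point set;
    and so on, until the support point set contains K+1 support points, then removing the first support point to obtain K support points as the mapping support points.
  4. The method for constructing a metric space index based on Euclidean distance according to claim 3, characterized in that calculating the distance from each non-support point to every support point in the support point set and storing the minimum of these distances in an initially empty minimum distance set comprises:
    calculating, by the following formula, the minimum of the distances from each non-support point to the support points in the support point set:
    Figure PCTCN2021104409-appb-100001
    where p_j denotes a support point in the support point set P, and x_i denotes a non-support point in the original data set X;
    Figure PCTCN2021104409-appb-100002
    denotes the distance from a non-support point to a support point in the original data set;
    wherein, when evaluating the formula, p_j is kept fixed and x_i ranges over all the non-support points in the original data set X, so as to obtain the distances from all the non-support points to the support points in the support point set.
  5. The method for constructing a metric space index based on Euclidean distance according to claim 1, characterized in that the Euclidean-distance approximate nearest neighbor algorithm is the PQ algorithm or the HNSW algorithm.
  6. The method for constructing a metric space index based on Euclidean distance according to claim 1, characterized in that the dimension of the data in the support point space after dimensionality reduction is equal to the original dimension.
  7. The method for constructing a metric space index based on Euclidean distance according to claim 1, characterized in that the dimensionality reduction algorithm is principal component analysis.
  8. An apparatus for constructing a metric space index based on Euclidean distance, characterized by comprising:
    a dimension estimation unit, configured to estimate an original dimension with a dimension estimation algorithm according to the type of an original data set;
    a support point selection unit, configured to select mapping support points with a support point selection algorithm according to the original dimension, the number of the mapping support points being greater than the dimension of the original data set;
    a mapping unit, configured to map the original data set in the metric space to a support point space through a distance function and the mapping support points;
    a dimensionality reduction unit, configured to reduce the dimensionality of the data in the support point space with a dimensionality reduction algorithm;
    an index construction unit, configured to calculate, according to the dimension-reduced support point space, the similarity between the data mapped to the support point space by Euclidean distance, and to construct an index with a Euclidean-distance approximate nearest neighbor algorithm.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, characterized in that the processor, when executing the computer program, implements the method for constructing a metric space index based on Euclidean distance according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the method for constructing a metric space index based on Euclidean distance according to any one of claims 1 to 7.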
PCT/CN2021/104409 2021-06-22 2021-07-05 Method, apparatus and related device for constructing a metric space index based on Euclidean distance WO2022267094A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110689178.1A CN113407786A (zh) 2021-06-22 2021-06-22 Method, apparatus and related device for constructing a metric space index based on Euclidean distance
CN202110689178.1 2021-06-22

Publications (1)

Publication Number Publication Date
WO2022267094A1 (zh) 2022-12-29

Family

ID=77682145

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/104409 WO2022267094A1 (zh) 2021-06-22 2021-07-05 Method, apparatus and related device for constructing a metric space index based on Euclidean distance

Country Status (2)

Country Link
CN (1) CN113407786A (zh)
WO (1) WO2022267094A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020147703A1 (en) * 2001-04-05 2002-10-10 Cui Yu Transformation-based method for indexing high-dimensional data for nearest neighbour queries
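CN103279551A (zh) * 2013-06-06 2013-09-04 浙江大学 Fast exact nearest neighbor retrieval method for high-dimensional data based on Euclidean distance
CN106503245A (zh) * 2016-11-08 2017-03-15 深圳大学 Method and apparatus for selecting a support point set
CN106528790A (zh) * 2016-11-08 2017-03-22 深圳大学 Method and apparatus for selecting support points in a metric space
CN107480258A (zh) * 2017-08-15 2017-12-15 佛山科学技术学院 Metric space outlier detection method based on multiple kinds of support points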

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
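CN103631928B (zh) * 2013-12-05 2017-02-01 中国科学院信息工程研究所 Clustering index method and system based on locality-sensitive hashing
US10162878B2 (en) * 2015-05-21 2018-12-25 Tibco Software Inc. System and method for agglomerative clustering
CN105260742A (zh) * 2015-09-29 2016-01-20 深圳大学 Unified classification method and system for multiple data types
CN108460123B (zh) * 2018-02-24 2020-09-08 湖南视觉伟业智能科技有限公司 High-dimensional data retrieval method, computer device and storage medium
CN109508349A (zh) * 2018-10-29 2019-03-22 广东奥博信息产业股份有限公司 Metric space outlier detection method and apparatus
CN110070100A (zh) * 2019-03-01 2019-07-30 广东奥博信息产业股份有限公司 Multi-factor integrated agricultural meteorological outlier detection method and apparatus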

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
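CN117892231A (zh) * 2024-03-18 2024-04-16 天津戎军航空科技发展有限公司 Intelligent management method for carbon fiber magazine production data
CN117892231B (zh) * 2024-03-18 2024-05-28 天津戎军航空科技发展有限公司 Intelligent management method for carbon fiber magazine production data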

Also Published As

Publication number Publication date
CN113407786A (zh) 2021-09-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946583

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.04.2024)