WO2022063150A1 - Data storage method and device, and data query method and device - Google Patents

Data storage method and device, and data query method and device

Info

Publication number
WO2022063150A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster center
center point
data
target
cluster
Prior art date
Application number
PCT/CN2021/119760
Other languages
French (fr)
Chinese (zh)
Inventor
楼仁杰
李飞飞
占超群
魏闯先
Original Assignee
阿里云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里云计算有限公司
Publication of WO2022063150A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • The present specification relates to the technical field of data processing, and in particular to a data storage method and device, and a data query method and device.
  • The K-means clustering algorithm clusters the data according to similarity distance, so as to divide all the data into multiple spaces.
  • The query vector only needs to search the data points in the same space as itself to ensure high accuracy.
  • The embodiments of this specification provide a data storage method.
  • This specification also relates to a data storage device, a data query method and device, two kinds of computing devices, and two kinds of computer-readable storage media, so as to solve the technical defects existing in the prior art.
  • a data storage method comprising:
  • clustering the data set to be stored, and determining a plurality of cluster center points;
  • for each cluster center point in the plurality of cluster center points, determining the corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • storing the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • determining the corresponding nearest neighbor cluster center point including:
  • the point is determined as the adjacent cluster center point corresponding to the first cluster center point.
  • storing the data to be stored in the data set to be stored according to the clustering subspace includes:
  • the first target cluster center point determines the first target neighbor cluster center point corresponding to the first target cluster center point
  • the first data to be stored is stored in a clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • the determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set includes:
  • sorting the second distances from nearest to farthest, selecting the first N second distances in the sorted order, where N is the second preset value, and determining the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
  • the determining the first target neighbor cluster center point corresponding to the first target cluster center point according to the first data to be stored and the first target cluster center point includes:
  • the method further includes:
  • a search space corresponding to the data to be queried is determined.
  • a data query method comprising:
  • a search space corresponding to the data to be queried is determined.
  • determining the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points includes:
  • sorting the fourth distances from nearest to farthest, selecting the first N fourth distances in the sorted order, where N is the fourth preset value, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center point corresponding to the data to be queried.
  • determining the target cluster subspace corresponding to the second target cluster center point includes:
  • obtaining each neighbor cluster center point corresponding to the second target cluster center point, and determining each such neighbor cluster center point as a second target neighbor cluster center point corresponding to the second target cluster center point;
  • the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point is determined as the target cluster subspace.
  • determining the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace includes:
  • sorting the fifth distances from nearest to farthest, selecting the first N fifth distances in the sorted order, where N is the fifth preset value, and determining the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  • the calculating the fifth distance between the target cluster subspace and the data to be queried includes:
  • a sixth distance between the target point and the data to be queried is calculated, and the sixth distance is determined as a fifth distance between the target cluster subspace and the data to be queried.
  • a data storage device comprising:
  • a first determining module configured to cluster the data set to be stored, and determine a plurality of cluster center points
  • a second determining module configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • the storage module is configured to store the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • a data query device comprising:
  • the third determination module is configured to determine the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points;
  • a fourth determination module configured to determine a target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;
  • a fifth determining module is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions to implement the following methods:
  • clustering the data set to be stored, and determining a plurality of cluster center points;
  • for each cluster center point in the plurality of cluster center points, determining the corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • storing the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions to implement the following methods:
  • a search space corresponding to the data to be queried is determined.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the data storage method.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the data query method.
  • The data storage method provided in this specification clusters the data set to be stored and determines multiple cluster center points; then, for each cluster center point in the multiple cluster center points, it determines the corresponding neighbor cluster center point according to the cluster center point, and determines the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace; finally, it stores the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • In this case, the data to be stored is not stored directly in the single-layer space centered on a cluster center point; instead, the corresponding neighbor cluster center point is further determined according to the cluster center point.
  • The data to be stored is stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point, which combines the clustering algorithm with the idea of the nearest neighbor graph.
  • The neighbor graph relationship of second-level subspaces is introduced, and each first-level space is further divided into second-level clustering subspaces, thereby improving the retrieval accuracy of the entire index.
  • This second-level subspace division method also avoids the cost of multi-layer clustering, so the index is built faster.
  • That is, this specification improves retrieval accuracy without introducing additional index construction and storage cost.
  • FIG. 1 is a structural diagram of an ANN index based on K-means clustering provided by an embodiment of this specification;
  • FIG. 3 is a structural diagram of an index combining K-means clustering and a nearest neighbor graph provided by an embodiment of this specification;
  • FIG. 5 is a schematic structural diagram of a data storage device provided by an embodiment of the present specification.
  • FIG. 6 is a schematic structural diagram of a data query apparatus provided by an embodiment of the present specification.
  • FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present specification.
  • FIG. 8 is a structural block diagram of another computing device provided by an embodiment of the present specification.
  • ANN retrieval: Approximate Nearest Neighbor (ANN) search exploits the tendency of data to form a clustered distribution as the amount of data grows. It classifies or encodes the data in the database by analyzing and clustering it; for the target data, it predicts the data category to which the target belongs from its data characteristics, and returns some or all of that category as the retrieval result. That is, the N nearest data items (i.e., vectors) in the high-dimensional space are retrieved quickly through pre-built indexes, but only approximately accurate results can be returned, and absolute accuracy cannot be guaranteed.
  • The core idea of approximate nearest neighbor retrieval is to search for data items that are likely to be near neighbors rather than being restricted to returning only the most probable items, improving retrieval efficiency at the cost of an acceptable loss of accuracy.
  • Vector index: a type of index structure that provides ANN retrieval capabilities for high-dimensional vector data.
  • K-means clustering algorithm: an iteratively solved cluster analysis algorithm. Its steps are to randomly select K objects as the initial cluster center points, calculate the distance between each object and each cluster center point, and assign each object to the cluster center point closest to it; a cluster center point and the objects assigned to it represent a cluster.
  • The K-means clustering algorithm originated from a vector quantization method in signal processing, and is also a classic cluster analysis method in the field of data mining.
  • Voronoi space: a subdivision of hyperspace, characterized in that any position in a subspace is closer to the center point of that subspace than to the center points of adjacent subspaces, and each subspace contains one and only one center point.
  • The K-means clustering algorithm clusters the data according to similarity distance, so as to divide all the data into multiple spaces.
  • The query vector then only needs to search the data points in the same space as itself.
  • However, for a vector index that relies on the K-means clustering algorithm alone for space division, retrieval accuracy drops severely when the data clusters poorly, and when the query vector falls on the boundary of multiple spaces,
  • the system must search the multiple spaces adjacent to the query vector to ensure accuracy.
  • Each space will be adjacent to many other spaces, which further exacerbates this problem.
  • Figure 1 is an ANN index structure diagram based on K-means clustering.
  • The K-means clustering algorithm is used to cluster the data to be stored, and 8 different K-means cluster center points, C0-C7, are determined.
  • Each cluster center point defines one Voronoi space.
  • When the data q to be queried lies at a space boundary, the spaces of C0, C2 and C3 all have to be searched at the same time, which greatly reduces retrieval efficiency.
  • this specification proposes a data storage method and device, and a data query method and device.
  • After the to-be-stored data set is clustered and multiple cluster center points are determined, for each cluster center point the corresponding neighbor cluster center point is determined, and the space formed by the cluster center point and the corresponding neighbor cluster center point is determined as a clustering subspace; the data to be stored is then stored in the clustering subspace.
  • In this way, the clustering algorithm is combined with the idea of the nearest neighbor graph.
  • On top of K-means clustering, the neighbor graph relationship of second-level subspaces is introduced, and each first-layer space Ci is further divided into second-layer clustering subspaces, thereby improving the retrieval efficiency and accuracy of the entire index.
  • In this specification, a data storage method is provided.
  • This specification also relates to a data storage device, a data query method and device, two computing devices, and two computer-readable storage media, which are explained in detail one by one in the following embodiments.
  • FIG. 2 shows a flowchart of a data storage method provided according to an embodiment of the present specification, which specifically includes the following steps:
  • Step 202 Cluster the data set to be stored, and determine a plurality of cluster center points.
  • the to-be-stored data set is a set composed of all to-be-stored data, and the to-be-stored data set includes a plurality of to-be-stored data.
  • The K-means clustering algorithm can be used to cluster the data set to be stored, so as to determine the multiple cluster center points.
  • K data items can be randomly selected from the data set to be stored, and the selected K data items are determined as K cluster center points.
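  • Step 202 can be sketched as follows. This is only an illustrative sketch that assumes plain Lloyd-style K-means with Euclidean distance; the function name `kmeans`, the iteration count, the seed and the sample data are assumptions made for illustration and do not come from the specification.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Toy Lloyd-style K-means over tuples of floats; returns k center points."""
    rng = random.Random(seed)
    # Randomly select K data items as the initial cluster center points.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each data item to the cluster center point closest to it.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

# Hypothetical 2-D data set with two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(data, k=2)
print(sorted(centers))
```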
  • Step 204: For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace.
  • After the data set to be stored has been clustered and multiple cluster center points have been determined, for each cluster center point in the multiple cluster center points the corresponding neighbor cluster center point is determined according to the cluster center point, and the space formed by the cluster center point and the corresponding neighbor cluster center point is determined as a clustering subspace.
  • Each cluster center point Ci finds the n neighbor cluster center points closest to it; each cluster center point Ci and its n corresponding neighbor cluster center points can then form multiple clustering subspaces.
  • The corresponding neighbor cluster center point is determined, and the specific implementation process may be as follows:
  • Sort the first distances from nearest to farthest, select the first N first distances in the sorted order, where N is the first preset value, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
  • The first preset value can be set in advance, such as 2, 3 or 4. After the calculated first distances are sorted from nearest to farthest, the first preset value is used to select the multiple neighbor cluster center points closest to the cluster center point.
  • For example, suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is taken as the first cluster center point and C1, C2, ..., Ck as the second cluster center points, and the distances between C0 and each of C1, C2, ..., Ck are calculated. Assume that the distance from C0 to each of C1, C2, ..., C7 is a, the distance to each of C8, C9, ..., C20 is b, and the distance to each of C21, C22, ..., Ck is c, where a < b < c, and that the first preset value is 7. The neighbor cluster center points determined for C0 are then C1, C2, ..., C7.
  • the above steps are also performed for C1, and the adjacent cluster center points corresponding to C1 are determined to be C0, C2, C7, C8, C9, C10, and C11; the above steps are also performed for C2, and the adjacent cluster center points corresponding to C2 are determined to be C0, C1, C3, C12, C13, C14, C15; and so on, until the nearest neighbor cluster center point corresponding to Ck is determined.
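  • The selection by first preset value can be sketched as follows, assuming Euclidean distance. The function name `neighbor_centers`, the parameter `n` (standing in for the first preset value) and the four sample coordinates are illustrative assumptions rather than part of the specification.

```python
import math

def neighbor_centers(centers, n):
    """Map each center index i to the indices of its n closest other centers."""
    neighbors = {}
    for i, ci in enumerate(centers):
        # First distances: from center i to every other ("second") center.
        others = [(math.dist(ci, cj), j)
                  for j, cj in enumerate(centers) if j != i]
        others.sort()  # nearest to farthest
        neighbors[i] = [j for _, j in others[:n]]
    return neighbors

# Hypothetical cluster center points C0..C3.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = neighbor_centers(centers, n=2)
print(neighbors)  # {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2, 1]}
```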
  • Alternatively, the corresponding neighbor cluster center points can be selected by setting a distance threshold.
  • In this case, for each cluster center point in the plurality of cluster center points, the corresponding neighbor cluster center point is determined according to the cluster center point, and the specific implementation process may be as follows:
  • Among the first distances, the first distances smaller than the first distance threshold are determined, and the second cluster center points corresponding to those first distances are determined as the neighbor cluster center points corresponding to the first cluster center point.
  • The first distance threshold may be set in advance.
  • For example, suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is taken as the first cluster center point and C1, C2, ..., Ck as the second cluster center points, and the distances between C0 and each of C1, C2, ..., Ck are calculated. Assume that the distance from C0 to each of C1, C2, ..., C7 is 1, the distance to each of C8, C9, ..., C20 is 2, the distance to each of C21, C22, ..., Ck is 3, and the first distance threshold is 1.5. The neighbor cluster center points determined for C0 are then C1, C2, ..., C7.
  • the above steps are also performed for C1, C2, .
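  • The threshold-based variant can be sketched with the distances of the example for C0 (1, 2 and 3, with a first distance threshold of 1.5). The value k = 25 and the variable names are assumptions chosen only so that the sketch is runnable.

```python
# First distances from C0 to every other cluster center point, as in the example.
first_distances = {j: 1.0 for j in range(1, 8)}            # C1..C7
first_distances.update({j: 2.0 for j in range(8, 21)})     # C8..C20
first_distances.update({j: 3.0 for j in range(21, 26)})    # C21..Ck, k = 25 assumed

first_distance_threshold = 1.5

# Keep every center whose first distance is below the threshold.
neighbors_of_c0 = [j for j, d in sorted(first_distances.items())
                   if d < first_distance_threshold]
print(neighbors_of_c0)  # [1, 2, 3, 4, 5, 6, 7]
```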
  • Each cluster center point Ci and one of its neighbor cluster center points Cj constitute a clustering subspace B(i, j); that is, each cluster center point Ci and its multiple neighbor cluster center points can constitute multiple clustering subspaces.
  • For example, the determined neighbor cluster center points corresponding to C0 are C1, C2, ..., C7, and the neighbor cluster center points corresponding to C1 are C0, C2, C7, C8, C9, C10, C11.
  • C0 forms the clustering subspaces B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6) and B(0,7) with C1, C2, C3, C4, C5, C6 and C7, respectively.
  • C1 forms the clustering subspaces B(1,0), B(1,2), B(1,7), B(1,8), B(1,9), B(1,10) and B(1,11) with C0, C2, C7, C8, C9, C10 and C11, respectively.
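  • The enumeration of clustering subspaces B(i, j) from the neighbor lists of C0 and C1 can be sketched as follows. Representing a subspace as an ordered pair `(i, j)` is an assumption made for illustration; the specification only names the subspaces B(i, j).

```python
# Neighbor cluster center points of C0 and C1, as listed in the example.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    1: [0, 2, 7, 8, 9, 10, 11],
}

# One clustering subspace B(i, j) per (center, neighbor) pair.
# B(0, 1) and B(1, 0) are kept as distinct subspaces, as in the example.
subspaces = [(i, j) for i, js in neighbors.items() for j in js]

print(len(subspaces))  # 14
print(subspaces[:3])   # [(0, 1), (0, 2), (0, 3)]
```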
  • Step 206 According to the clustering subspace, store the data to be stored in the data set to be stored.
  • After the corresponding neighbor cluster center point has been determined for each cluster center point, and the space formed by the cluster center point and the corresponding neighbor cluster center point has been determined as a clustering subspace, the to-be-stored data in the to-be-stored data set is stored according to the clustering subspace.
  • Each data item in the data set to be stored must be stored in its corresponding clustering subspace to facilitate subsequent retrieval and query. Therefore, for each data item to be stored, one calculation is needed to determine in which clustering subspace to store it.
  • According to the clustering subspace, the data to be stored in the data set to be stored is stored, and the specific implementation process may be as follows:
  • determine the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set;
  • according to the first data to be stored and the first target cluster center point, determine the first target neighbor cluster center point corresponding to the first target cluster center point;
  • store the first data to be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • The first data to be stored may be any data item in the data set to be stored, and the above steps must be performed once for each data item in the data set, so as to determine its corresponding clustering subspace for storage. That is, each data item in the to-be-stored data set serves once as the above-mentioned first to-be-stored data.
  • To determine the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set, in specific implementation the second distance between the first to-be-stored data and each cluster center point of the plurality of cluster center points may be calculated; the second distances are then sorted from nearest to farthest, the first N second distances in the sorted order are selected, where N is the second preset value, and the cluster center points corresponding to the selected second distances are determined as the first target cluster center point corresponding to the first data to be stored.
  • The single closest clustering subspace can be selected, or the two closest clustering subspaces, and so on. Therefore, when determining the first target cluster center point corresponding to the first data to be stored, the selection can be made by the second preset value, which can be set in advance, such as 1, 2, 3 or 4.
  • For example, suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set to be stored are C0, C1, C2, ..., C7. The distances between X1 and each of C0, C1, C2, ..., C7 are calculated; assume that the distance between X1 and C0 is 1, between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7.
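  • Using the second distances of the example above, the selection of the first target cluster center point might look like the following sketch. The function name `target_centers` and the choices of 1 and 2 for the second preset value are assumptions; the example itself does not state which value is used.

```python
def target_centers(second_distances, second_preset_value):
    """Return the center indices with the smallest second distances."""
    ranked = sorted(second_distances, key=second_distances.get)  # nearest first
    return ranked[:second_preset_value]

# Second distances between X1 and C0..C7, taken from the example.
second_distances = {0: 1.0, 1: 1.5, 2: 2.0, 3: 3.0,
                    4: 4.0, 5: 5.0, 6: 6.0, 7: 7.0}

print(target_centers(second_distances, 1))  # [0]
print(target_centers(second_distances, 2))  # [0, 1]
```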
  • Alternatively, the first target cluster center point closest to the first data to be stored may be selected not by the second preset value but by setting a distance threshold.
  • In this case, the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set is determined, and the specific implementation process may be as follows:
  • the cluster center point corresponding to a second distance smaller than the second distance threshold is determined as the first target cluster center point corresponding to the first data to be stored.
  • The second distance threshold may be set in advance.
  • For example, suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set to be stored are C0, C1, C2, ..., C7. The distances between X1 and each of C0, C1, C2, ..., C7 are calculated; assume that the distance between X1 and C0 is 1, between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7.
  • To determine, according to the first data to be stored and the first target cluster center point, the first target neighbor cluster center point corresponding to the first target cluster center point, in specific implementation the neighbor cluster center points corresponding to the first target cluster center point may be obtained first; the third distances between the first data to be stored and those neighbor cluster center points are then calculated and sorted from nearest to farthest, the first N third distances in the sorted order are selected, where N is the third preset value, and the neighbor cluster center points corresponding to the selected third distances are determined as the first target neighbor cluster center point corresponding to the first target cluster center point.
  • For each to-be-stored data item, the target neighbor cluster center point corresponding to its first target cluster center point is also determined, that is, the first target neighbor cluster center point that corresponds to the first target cluster center point and is closest to the first data to be stored.
  • The selection may be made by a third preset value, which may be set in advance, such as 1, 2, 3 or 4.
  • For example, suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7. The distances between X1 and each of C1, C2, ..., C7 are calculated (if they have been calculated before, the results can be reused here). Assume that the distance between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7. The 7 distances are sorted from nearest to farthest; assuming that the first third distance (1.5) is selected, C1 is determined as the first target neighbor cluster center point corresponding to C0.
  • the corresponding target neighbor cluster center points are determined according to the above method.
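  • With the third distances of the example for X1 and a third preset value of 1, the choice of the first target neighbor cluster center point can be sketched as follows. The function name `target_neighbor_centers` and the `(i, j)` pair used for the resulting subspace are illustrative assumptions.

```python
def target_neighbor_centers(third_distances, third_preset_value):
    """Return the neighbor center indices with the smallest third distances."""
    ranked = sorted(third_distances, key=third_distances.get)  # nearest first
    return ranked[:third_preset_value]

target_center = 0  # first target cluster center point C0 of X1
# Third distances between X1 and the neighbor centers C1..C7 of C0.
third_distances = {1: 1.5, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 6.0, 7: 7.0}

chosen = target_neighbor_centers(third_distances, third_preset_value=1)
subspaces = [(target_center, j) for j in chosen]
print(chosen)     # [1]
print(subspaces)  # [(0, 1)], i.e. B(0, 1)
```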
  • Alternatively, the first target neighbor cluster center point that corresponds to the first target cluster center point and is closest to the first to-be-stored data may be selected not by the third preset value but by setting a distance threshold. In this case, the first target neighbor cluster center point corresponding to the first target cluster center point is determined according to the first data to be stored and the first target cluster center point, and the specific implementation process may be as follows:
  • the neighbor cluster center point corresponding to a third distance smaller than the third distance threshold is determined as the first target neighbor cluster center point corresponding to the first target cluster center point.
  • The third distance threshold may be set in advance.
  • For example, suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7. The distances between X1 and each of C1, C2, ..., C7 are calculated; assume that the distance between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7. Assuming that the distance between X1 and C1, 1.5, is less than the third distance threshold, C1 is determined as the first target neighbor cluster center point corresponding to C0.
  • the corresponding target neighbor cluster center points are determined according to the above method.
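  • The threshold-based choice of the first target neighbor cluster center point can be sketched in the same way. The example does not give a numeric third distance threshold, so the value 1.8 below is purely an assumption picked so that only the distance 1.5 falls under it.

```python
# Third distances between X1 and the neighbor centers C1..C7 of C0.
third_distances = {1: 1.5, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 6.0, 7: 7.0}

third_distance_threshold = 1.8  # assumed value, not stated in the example

# Keep every neighbor center whose third distance is below the threshold.
target_neighbors = [j for j, d in third_distances.items()
                    if d < third_distance_threshold]
print(target_neighbors)  # [1]
```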
  • The first data to be stored can then be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • For example, suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the first target neighbor cluster center point is C1; X1 is then stored in the clustering subspace B(0,1) formed by C0 and C1.
  • other to-be-stored data in the to-be-stored data set are stored in sequence according to the above method.
  • all the encoded to-be-stored data will be stored in an inverted structure according to the belonging clustering subspace B(i, j) and then written to the index file.
  • the storage in the inverted structure refers to arranging and storing each data to be stored according to the number (id) of the clustering subspace, and writing it into the index file.
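  • A minimal sketch of the inverted arrangement, assuming each item has already been assigned a subspace id (i, j). The data ids X2 to X4 and the helper `build_inverted_lists` are hypothetical, and the on-disk layout of the index file is not specified in the text, so it is left out here.

```python
from collections import defaultdict


def build_inverted_lists(assignments):
    """Group data ids by clustering subspace id and order the groups by that id.

    assignments: iterable of (data_id, (i, j)) pairs, where (i, j) denotes B(i, j).
    Returns a list of ((i, j), [data_id, ...]) entries sorted by subspace id.
    """
    lists = defaultdict(list)
    for data_id, subspace_id in assignments:
        lists[subspace_id].append(data_id)
    return sorted(lists.items())


# X1 -> B(0,1) follows the example in the text; the other entries are made up.
assignments = [("X1", (0, 1)), ("X2", (2, 3)), ("X3", (0, 1)), ("X4", (0, 2))]
inverted = build_inverted_lists(assignments)
for subspace_id, data_ids in inverted:
    print(subspace_id, data_ids)
```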
  • When data is to be queried, the clustering subspaces can be searched according to the following steps 208-212.
  • Step 208: From the plurality of cluster center points, determine a second target cluster center point corresponding to the data to be queried.
  • Specifically, a second target cluster center point close to the data to be queried needs to be determined from the plurality of cluster center points.
  • The fourth distances between the data to be queried and the cluster center points are sorted from nearest to farthest, the top-ranked fourth distances (as many as the fourth preset value) are selected, and the cluster center points corresponding to the selected fourth distances are determined as the second target cluster center points corresponding to the data to be queried.
  • That is, for the data to be queried, the distance between it and each cluster center point Ci is calculated, so as to determine the cluster center points close to the data to be queried, namely as many second target cluster center points as the fourth preset value.
  • The fourth preset value can be set in advance, for example 2, 3 or 4. After the calculated fourth distances are sorted from nearest to farthest, the fourth preset value is used to select multiple second target cluster center points close to the data to be queried.
  • For example, the cluster center points are C0, C1, C2, ..., C7. For the data q to be queried, the distances between q and C0, C1, C2, ..., C7 are calculated. Assume that the distance between q and C0 is 0.5, between q and C1 is 3, between q and C2 is 1, between q and C3 is 0.8, between q and C4 is 3.2, between q and C5 is 5, between q and C6 is 7, and between q and C7 is 5.5. Assuming that the fourth preset value is 3, the second target cluster center points corresponding to q are C0, C2 and C3.
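  • The selection by the fourth preset value in the example above amounts to a top-k selection over the fourth distances. The following sketch reproduces it with the distances from the text; the function name `nearest_centers` is an assumption made for illustration.

```python
# Fourth distances between the query q and each cluster center point
# (values taken from the example in the text).
fourth_distances = {
    "C0": 0.5, "C1": 3, "C2": 1, "C3": 0.8,
    "C4": 3.2, "C5": 5, "C6": 7, "C7": 5.5,
}


def nearest_centers(distances, k):
    """Sort center points from nearest to farthest and keep the first k."""
    return sorted(distances, key=distances.get)[:k]


second_targets = nearest_centers(fourth_distances, 3)
print(second_targets)  # ['C0', 'C3', 'C2']
```

  • The result is ordered by distance, so it lists C3 before C2; as a set it matches the C0, C2 and C3 given in the text.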
  • Alternatively, the multiple target cluster center points close to the data to be queried may be screened not by the fourth preset value but by a distance threshold.
  • In this case, the second target cluster center point corresponding to the data to be queried is determined, and the specific implementation process may be as follows:
  • A fourth distance smaller than the fourth distance threshold is determined, and the cluster center point corresponding to that fourth distance is determined as a second target cluster center point corresponding to the data to be queried.
  • The fourth distance threshold may be set in advance.
  • For example, the cluster center points are C0, C1, C2, ..., C7. For the data q to be queried, the distances between q and C0, C1, C2, ..., C7 are calculated. Assume that the distance between q and C0 is 0.5, between q and C1 is 3, between q and C2 is 1, between q and C3 is 0.8, between q and C4 is 3.2, between q and C5 is 5, between q and C6 is 7, and between q and C7 is 5.5. Assuming that the fourth distance threshold is 1.5, the second target cluster center points corresponding to q are C0, C2 and C3.
  • Step 210: Determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point.
  • Specifically, after the second target cluster center point corresponding to the data to be queried is determined from the plurality of cluster center points, the target cluster subspace corresponding to the second target cluster center point is further determined according to the data to be queried and the second target cluster center point.
  • That is, the multiple clustering subspaces in which the data to be queried may be stored can be selected according to the data to be queried and the second target cluster center points obtained in the previous step.
  • The specific implementation process is as follows:
  • Obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each neighbor cluster center point as a second target neighbor cluster center point corresponding to the second target cluster center point;
  • The cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point is determined as the target cluster subspace.
  • Each second target cluster center point has its own corresponding neighbor cluster center points. Therefore, for each second target cluster center point, the corresponding second target neighbor cluster center points are determined first, and then the corresponding target cluster subspaces are determined.
  • For example, the second target cluster center points corresponding to the data to be queried are C0, C2 and C3. The neighbor cluster center points corresponding to C0 are C1, C2, ..., C7, so the second target neighbor cluster center points corresponding to the second target cluster center point C0 are C1, C2, ..., C7, and the target cluster subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), B(0,7);
  • The neighbor cluster center points corresponding to C2 are C0, C1, C3, C12, C13, C14, C15, so the second target neighbor cluster center points corresponding to the second target cluster center point C2 are C0, C1, C3, C12, C13, C14, C15, and the target cluster subspaces corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), B(2,15);
  • The neighbor cluster center points corresponding to C3 are C0, C2, C4, C16, C17, C18, C19, so these are the second target neighbor cluster center points corresponding to the second target cluster center point C3.
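  • Step 210 can be read as enumerating one candidate subspace B(i, j) for every pair of a second target cluster center point Ci and one of its neighbor cluster center points Cj. The sketch below uses the neighbor lists from the example; the dictionary layout and the name `target_subspaces` are assumptions for illustration only.

```python
# Neighbor cluster center points of each second target cluster center point,
# using the indices from the example (C0, C2, C3).
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    2: [0, 1, 3, 12, 13, 14, 15],
    3: [0, 2, 4, 16, 17, 18, 19],
}


def target_subspaces(second_targets, neighbors):
    """Pair each target center i with each of its neighbors j, giving B(i, j)."""
    return [(i, j) for i in second_targets for j in neighbors[i]]


candidates = target_subspaces([0, 2, 3], neighbors)
print(len(candidates))  # 21 candidate subspaces: 7 per target center point
print(candidates[:3])   # [(0, 1), (0, 2), (0, 3)]
```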
  • Step 212: According to the data to be queried and the target clustering subspace, determine a search space corresponding to the data to be queried.
  • The search space refers to the space in which the data to be queried is finally queried.
  • The search space corresponding to the data to be queried is determined according to the data to be queried and the target clustering subspace, and the specific implementation process may be as follows:
  • The fifth distances between the target clustering subspaces and the data to be queried are sorted from nearest to farthest, the top-ranked fifth distances (as many as the fifth preset value) are selected, and the target clustering subspaces corresponding to the selected fifth distances are determined as the search space corresponding to the data to be queried.
  • To obtain a fifth distance, the midpoint between the second target cluster center point and the second target neighbor cluster center point can be determined as the target point; the sixth distance between the target point and the data to be queried is then calculated, and the sixth distance is determined as the fifth distance between the target clustering subspace and the data to be queried. That is, the distance from the target clustering subspace to the data to be queried is represented by the distance from the mean center point of the cluster center point Ci and the neighbor cluster center point Cj to the data to be queried.
  • Multiple target clustering subspaces are obtained by the above steps. For each target clustering subspace, the distance between it and the data to be queried needs to be determined, so as to decide whether that target clustering subspace is used as the search space for querying the data to be queried.
  • For example, the target clustering subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), B(0,7); the target clustering subspaces corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), B(2,15); and the target clustering subspaces corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), B(3,19).
  • The distance between each of these target clustering subspaces and q is determined, and the nearest target clustering subspaces are used as the final search space.
  • For example, B(0,2), B(0,3), B(2,0) and B(3,0) are selected as the final search space, and the data q to be queried is queried in this search space.
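  • The midpoint-based ranking of step 212 can be sketched as follows. The text gives no coordinates, so the 2-D positions of the center points and of q below are invented purely for illustration, as are the function names; Euclidean distance is also an assumption. `math.dist` requires Python 3.8 or later.

```python
import math


def subspace_distance(query, center_i, center_j):
    """Fifth distance: distance from the query to the midpoint of Ci and Cj."""
    midpoint = [(a + b) / 2 for a, b in zip(center_i, center_j)]
    return math.dist(query, midpoint)


def select_search_space(query, centers, candidates, k):
    """Rank candidate subspaces B(i, j) by midpoint distance and keep the k nearest."""
    ranked = sorted(
        candidates,
        key=lambda ij: subspace_distance(query, centers[ij[0]], centers[ij[1]]),
    )
    return ranked[:k]


# Hypothetical coordinates, not taken from the text.
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 2.0), 3: (2.0, 0.0)}
q = (0.4, 0.9)
candidates = [(0, 1), (0, 2), (0, 3), (2, 0), (3, 0)]

search_space = select_search_space(q, centers, candidates, 4)
print(search_space)
```

  • Because B(i, j) and B(j, i) share the same midpoint, they always receive the same fifth distance under this measure, which is consistent with the example selecting both B(0,2) and B(2,0) as well as both B(0,3) and B(3,0).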
  • The data storage method provided in this specification clusters the data set to be stored and determines multiple cluster center points; then, for each cluster center point in the multiple cluster center points, it determines the corresponding neighbor cluster center point according to the cluster center point, and determines the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace; it then stores the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • When data is queried, the second target cluster center point corresponding to the data to be queried can be determined from the plurality of cluster center points; the target clustering subspace corresponding to the second target cluster center point is then determined according to the data to be queried and the second target cluster center point; the search space corresponding to the data to be queried is further determined according to the data to be queried and the target clustering subspace; and the search is performed in the search space.
  • In this way, the data to be stored is not directly stored in a single-layer space centered on a cluster center point. Instead, the neighbor cluster center point is further determined according to the cluster center point, and the data to be stored is then stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point, which combines the clustering algorithm with the idea of the nearest neighbor graph.
  • That is, on the basis of K-means clustering, the neighbor graph relationship of the second-level subspace is introduced, and the first-level space is further divided into more detailed second-level clustering subspaces, thereby improving the retrieval accuracy of the entire index.
  • Compared with a two-layer hierarchical clustering structure, this two-layer subspace division method saves the cost of multi-layer clustering, so the index is built faster. Compared with a single-layer K-means clustering structure, this specification improves retrieval accuracy without introducing additional index construction and storage overhead.
  • For example, if the stored data is simply clustered by the K-means clustering algorithm and 8 different K-means cluster center points C0-C7 are determined, each cluster center point delimits one Voronoi space.
  • When the data q to be queried lies on a space boundary, the multiple spaces of C0, C2 and C3 have to be searched at the same time.
  • In contrast, the data storage method provided in this specification, after determining the 8 different K-means cluster center points C0-C7, further determines the neighbor cluster center points corresponding to C0-C7, so that the space is divided a second time into second-level clustering subspaces.
  • When the same data q is queried, the method provided in this specification only needs to search in the clustering subspaces B(0,2), B(0,3), B(2,0) and B(3,0), which greatly reduces the search range and improves search efficiency and accuracy.
  • FIG. 4 shows a flowchart of a data query method provided according to an embodiment of the present specification, which specifically includes the following steps:
  • Step 402: From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried. It should be noted that the specific implementation process of step 402 is the same as that of the above-mentioned step 208, and details are not described herein again.
  • Step 404: Determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point.
  • It should be noted that the specific implementation process of step 404 is the same as that of the above-mentioned step 210, and details are not described herein again.
  • Step 406: Determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • It should be noted that the specific implementation process of step 406 is the same as that of the above-mentioned step 212, and details are not described herein again.
  • The data query method provided in this specification can determine the second target cluster center point corresponding to the data to be queried from a plurality of cluster center points, then determine the target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, further determine the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, and perform the search in the search space.
  • In this way, the clustering algorithm is combined with the idea of the nearest neighbor graph. That is, on the basis of K-means clustering, the neighbor graph relationship of the second-level subspace is introduced, and the first-level space is further divided into more detailed second-level clustering subspaces, which narrows the search scope when the data to be queried is subsequently queried and thereby improves the retrieval efficiency and accuracy of the entire index.
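  • Steps 402 to 406, followed by the final search inside the selected search space, can be put together as in the sketch below. All coordinates, data ids and function names are hypothetical; Euclidean distance and a brute-force scan of the selected subspaces are assumptions, since the text does not fix either. `math.dist` requires Python 3.8 or later.

```python
import math


def find_search_space(q, centers, neighbors, n_centers, n_subspaces):
    """Steps 402-406: pick target centers, enumerate B(i, j), keep the nearest ones."""
    # Step 402: the n_centers cluster center points nearest to q.
    targets = sorted(centers, key=lambda i: math.dist(q, centers[i]))[:n_centers]
    # Step 404: every (target center, neighbor center) pair is a candidate subspace.
    candidates = [(i, j) for i in targets for j in neighbors[i]]

    # Step 406: rank candidates by the distance from q to the midpoint of Ci and Cj.
    def midpoint_distance(ij):
        ci, cj = centers[ij[0]], centers[ij[1]]
        return math.dist(q, [(a + b) / 2 for a, b in zip(ci, cj)])

    return sorted(candidates, key=midpoint_distance)[:n_subspaces]


def search(q, search_space, inverted_lists):
    """Scan only the selected subspaces and return (data_id, distance) of the nearest item."""
    best = None
    for subspace in search_space:
        for data_id, vector in inverted_lists.get(subspace, []):
            distance = math.dist(q, vector)
            if best is None or distance < best[1]:
                best = (data_id, distance)
    return best


# Hypothetical index contents.
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 4.0)}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
inverted_lists = {
    (0, 2): [("X1", (0.0, 1.0))],
    (2, 0): [("X2", (0.5, 2.5))],
    (0, 1): [("X3", (1.0, 0.0))],
}

q = (0.5, 1.5)
search_space = find_search_space(q, centers, neighbors, n_centers=2, n_subspaces=2)
best = search(q, search_space, inverted_lists)
print(search_space, best[0])
```

  • In this toy setup only two of the stored subspaces are scanned, so X3 in B(0,1) is never compared against q.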
  • FIG. 5 shows a schematic structural diagram of a data storage apparatus provided by an embodiment of this specification.
  • The apparatus includes:
  • the first determining module 502, configured to cluster the data set to be stored and determine a plurality of cluster center points;
  • the second determining module 504, configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • the storage module 506, configured to store the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • The second determining module 504 is further configured to:
  • sort the first distances from nearest to farthest, select the top-ranked first distances (as many as the first preset value), and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
  • The storage module 506 is further configured to:
  • determine, according to the first data to be stored and the first target cluster center point, the first target neighbor cluster center point corresponding to the first target cluster center point;
  • store the first data to be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • The storage module 506 is further configured to:
  • sort the second distances from nearest to farthest, select the top-ranked second distances (as many as the second preset value), and determine the cluster center point corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
  • The storage module 506 is further configured to:
  • sort the third distances from nearest to farthest, select the top-ranked third distances (as many as the third preset value), and determine the neighbor cluster center point corresponding to the selected third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.
  • The apparatus further includes:
  • a third determining module, configured to determine a second target cluster center point corresponding to the data to be queried from among the plurality of cluster center points;
  • a fourth determining module, configured to determine a target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;
  • a fifth determining module, configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • With the data storage apparatus provided in this specification, the data to be stored is not directly stored in a single-layer space centered on a cluster center point. Instead, the neighbor cluster center point is further determined according to the cluster center point, and the data to be stored is then stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point, which combines the clustering algorithm with the idea of the nearest neighbor graph. That is, on the basis of K-means clustering, the neighbor graph relationship of the second-level subspace is introduced, and the first-level space is further divided into more detailed second-level clustering subspaces, thereby improving the retrieval accuracy of the entire index.
  • Compared with a two-layer hierarchical clustering structure, this two-layer subspace division method saves the cost of multi-layer clustering, so the index is built faster. Compared with a single-layer K-means clustering structure, this specification improves retrieval accuracy without introducing additional index construction and storage overhead.
  • The above is a schematic solution of a data storage device according to this embodiment. It should be noted that the technical solution of the data storage device and the technical solution of the above-mentioned data storage method belong to the same concept. For details not described in the technical solution of the data storage device, refer to the description of the technical solution of the above-mentioned data storage method.
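  • As a rough illustration of what modules 504 and 506 compute, the sketch below derives the neighbor cluster center points of each center and assigns one item to a subspace B(i, j). It assumes the cluster center points from module 502 are already available, that distances are Euclidean, and that the second and third preset values are both 1; the coordinates and function names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_neighbor_centers(centers, n_neighbors):
    """Module 504 (sketch): for each center, the n_neighbors closest other centers."""
    result = {}
    for i, ci in centers.items():
        others = [j for j in centers if j != i]
        result[i] = sorted(others, key=lambda j: math.dist(ci, centers[j]))[:n_neighbors]
    return result


def assign_subspace(x, centers, neighbors):
    """Module 506 (sketch): nearest center i, then the nearest of i's neighbors j."""
    i = min(centers, key=lambda c: math.dist(x, centers[c]))
    j = min(neighbors[i], key=lambda c: math.dist(x, centers[c]))
    return (i, j)


# Hypothetical cluster center points, standing in for the output of module 502.
centers = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (0.0, 4.0), 3: (10.0, 10.0)}
neighbors = nearest_neighbor_centers(centers, n_neighbors=2)

x1 = (1.0, 0.5)
print(neighbors[0])                             # [1, 2]
print(assign_subspace(x1, centers, neighbors))  # (0, 1), i.e. B(0,1)
```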
  • FIG. 6 shows a schematic structural diagram of a data query apparatus provided by an embodiment of the present specification.
  • The apparatus includes:
  • the third determining module 602, configured to determine the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points;
  • the fourth determining module 604, configured to determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;
  • the fifth determining module 606, configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • The third determining module 602 is further configured to:
  • sort the fourth distances from nearest to farthest, select the top-ranked fourth distances (as many as the fourth preset value), and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
  • The fourth determining module 604 is further configured to:
  • obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each neighbor cluster center point as a second target neighbor cluster center point corresponding to the second target cluster center point;
  • determine the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point as the target cluster subspace.
  • The fifth determining module 606 is further configured to:
  • sort the fifth distances from nearest to farthest, select the top-ranked fifth distances (as many as the fifth preset value), and determine the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  • the fifth determining module 606 is further configured to:
  • The data query device provided in this specification can determine the second target cluster center point corresponding to the data to be queried from a plurality of cluster center points, then determine the target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, further determine the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, and perform the search in the search space.
  • In this way, the clustering algorithm is combined with the idea of the nearest neighbor graph. That is, on the basis of K-means clustering, the neighbor graph relationship of the second-level subspace is introduced, and the first-level space is further divided into more detailed second-level subspaces, which narrows the search range when the data to be queried is queried and thereby improves the retrieval efficiency and accuracy of the entire index.
  • The above is a schematic solution of a data query apparatus according to this embodiment. It should be noted that the technical solution of the data query device and the technical solution of the above-mentioned data query method belong to the same concept. For details not described in the technical solution of the data query device, refer to the description of the technical solution of the above-mentioned data query method.
  • FIG. 7 shows a structural block diagram of a computing device 700 according to an embodiment of the present specification.
  • Components of the computing device 700 include, but are not limited to, a memory 710 and a processor 720.
  • The processor 720 is connected to the memory 710 through the bus 730, and the database 750 is used for storing data.
  • Computing device 700 also includes access device 740, which enables computing device 700 to communicate via one or more networks 760.
  • Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 740 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • The above components of computing device 700 and other components not shown in FIG. 7 may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is only an example and does not limit the scope of the present specification. Those skilled in the art can add or replace other components as required.
  • Computing device 700 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smart phones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 700 may also be a mobile or stationary server.
  • The processor 720 is configured to execute computer-executable instructions to implement the following method:
  • cluster the data set to be stored, and determine a plurality of cluster center points;
  • for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • store, according to the clustering subspace, the to-be-stored data in the to-be-stored data set.
  • the above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned data storage method belong to the same concept. For details not described in detail in the technical solution of the computing device, refer to the description of the technical solution of the above-mentioned data storage method.
  • FIG. 8 shows a structural block diagram of a computing device 800 according to an embodiment of the present specification.
  • Components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820.
  • The processor 820 is connected to the memory 810 through the bus 830, and the database 850 is used for storing data.
  • Computing device 800 also includes access device 840, which enables computing device 800 to communicate via one or more networks 860.
  • Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 840 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • The above components of computing device 800 and other components not shown in FIG. 8 may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is only an example and does not limit the scope of this specification. Those skilled in the art can add or replace other components as required.
  • Computing device 800 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smart phones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 800 may also be a mobile or stationary server.
  • the processor 820 is configured to execute the following computer-executable instructions to implement the following method:
  • a search space corresponding to the data to be queried is determined.
  • An embodiment of the present specification further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the operation steps of the above data storage method.
  • An embodiment of the present specification further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the operation steps of the above data query method.
  • The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data storage method and device, and a data query method and device. The data storage method comprises: clustering a data set to be stored, and determining a plurality of clustering center points (202); for each clustering center point in the plurality of clustering center points, determining a corresponding neighbor clustering center point according to the clustering center point, and determining a space formed by the clustering center point and the corresponding neighbor clustering center point as a clustering subspace (204); and storing, according to the clustering subspace, data to be stored in said data set (206). In the data storage method, a clustering algorithm is combined with a proximity graph concept, and a proximity graph relationship of a two-layer subspace is introduced on the basis of one-layer clustering to further divide a one-layer space into more detailed two-layer clustering subspaces, to improve the retrieval accuracy of the whole index.

Description

Data storage method and device, and data query method and device
This application claims priority to Chinese Patent Application No. 202011035973.0, filed on September 27, 2020 and entitled "Data Storage Method and Device, Data Query Method and Device", the entire contents of which are incorporated herein by reference.
Technical Field
This specification relates to the technical field of data processing, and in particular to a data storage method and device and a data query method and device.
Background
With the rapid development of computer and network technology, a large amount of data is being generated, which places great pressure on data storage and data query. The k-means clustering algorithm clusters data according to similarity distance, dividing all of the data into multiple spaces. During a data query, the query vector only needs to search the data points that lie in the same space as itself to achieve high accuracy.
However, when the query vector falls on the boundary between multiple spaces, the system must search several spaces adjacent to the query vector to maintain accuracy. Because the vectors have very high dimensionality, each space is adjacent to a very large number of other spaces, which aggravates the problem. A simpler and more convenient method for data storage and data query operations or processing is therefore needed.
Summary of the Invention
In view of this, the embodiments of this specification provide a data storage method. This specification also relates to a data storage device, a data query method and device, two computing devices, and two computer-readable storage media, so as to address the technical deficiencies of the prior art.
According to a first aspect of the embodiments of this specification, a data storage method is provided, the method comprising:
clustering a data set to be stored, and determining a plurality of cluster center points;
for each cluster center point of the plurality of cluster center points, determining the corresponding neighbor cluster center points according to the cluster center point, and determining the space formed by the cluster center point and a corresponding neighbor cluster center point as a cluster subspace;
storing the data to be stored in the data set to be stored according to the cluster subspace.
Optionally, the determining of the corresponding neighbor cluster center points according to the cluster center point comprises:
determining the cluster center point as a first cluster center point, and determining the cluster center points of the plurality of cluster center points other than the first cluster center point as second cluster center points;
calculating a first distance between the first cluster center point and each of the second cluster center points;
sorting the first distances from nearest to farthest, selecting a first preset number of top-ranked first distances, and determining the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
Optionally, the storing of the data to be stored in the data set to be stored according to the cluster subspace comprises:
determining a first target cluster center point corresponding to first data to be stored in the data set to be stored;
determining, according to the first data to be stored and the first target cluster center point, a first target neighbor cluster center point corresponding to the first target cluster center point;
storing the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighbor cluster center point.
Optionally, the determining of the first target cluster center point corresponding to the first data to be stored in the data set to be stored comprises:
calculating a second distance between the first data to be stored and each cluster center point of the plurality of cluster center points;
sorting the second distances from nearest to farthest, selecting a second preset number of top-ranked second distances, and determining the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
Optionally, the determining, according to the first data to be stored and the first target cluster center point, of the first target neighbor cluster center point corresponding to the first target cluster center point comprises:
obtaining each neighbor cluster center point corresponding to the first target cluster center point;
calculating a third distance between the first data to be stored and each of the neighbor cluster center points;
sorting the third distances from nearest to farthest, selecting a third preset number of top-ranked third distances, and determining the neighbor cluster center points corresponding to the selected third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.
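The storage steps above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names `nearest` and `assign_subspace`, the use of Euclidean distance, and the choice of 1 for both the second and the third preset number are assumptions made purely for illustration.

```python
import math

def nearest(point, candidate_ids, centers, k=1):
    """Return the k candidate center ids closest to `point` (nearest first)."""
    return sorted(candidate_ids, key=lambda c: math.dist(point, centers[c]))[:k]

def assign_subspace(point, centers, neighbors):
    """Map a vector to a (target center id, target neighbor center id) pair."""
    # "Second distance": the vector against every cluster center point.
    target = nearest(point, range(len(centers)), centers)[0]
    # "Third distance": the vector against the neighbors of the target center only.
    target_neighbor = nearest(point, neighbors[target], centers)[0]
    return target, target_neighbor

# Hypothetical toy data: three centers and their precomputed neighbor lists.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

storage = {}  # cluster subspace -> vectors stored in it
for vec in [(1.0, 0.2), (0.2, 1.0), (3.5, 0.5)]:
    storage.setdefault(assign_subspace(vec, centers, neighbors), []).append(vec)
```

In this toy data the first two vectors share the target center 0 but land in different subspaces, (0, 1) and (0, 2), which is the finer second-layer split the method aims for.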
Optionally, after the storing of the data to be stored in the data set to be stored according to the cluster subspace, the method further comprises:
determining, from the plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
According to a second aspect of the embodiments of this specification, a data query method is provided, the method comprising:
determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
Optionally, the determining, from the plurality of cluster center points, of the second target cluster center point corresponding to the data to be queried comprises:
calculating a fourth distance between the data to be queried and each cluster center point of the plurality of cluster center points;
sorting the fourth distances from nearest to farthest, selecting a fourth preset number of top-ranked fourth distances, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center point corresponding to the data to be queried.
Optionally, the determining, according to the data to be queried and the second target cluster center point, of the target cluster subspace corresponding to the second target cluster center point comprises:
obtaining each neighbor cluster center point corresponding to the second target cluster center point, and determining each of these neighbor cluster center points as a second target neighbor cluster center point corresponding to the second target cluster center point;
determining the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point as the target cluster subspace.
Optionally, the determining, according to the data to be queried and the target cluster subspace, of the search space corresponding to the data to be queried comprises:
calculating a fifth distance between the target cluster subspace and the data to be queried;
sorting the fifth distances from nearest to farthest, selecting a fifth preset number of top-ranked fifth distances, and determining the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
Optionally, the calculating of the fifth distance between the target cluster subspace and the data to be queried comprises:
determining the midpoint between the second target cluster center point and the second target neighbor cluster center point as a target point;
calculating a sixth distance between the target point and the data to be queried, and determining the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.
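The query steps above (fourth distance, candidate subspaces, midpoint-based fifth distance) can be sketched as follows. This is a hedged illustration only: `search_space`, `midpoint`, the parameters `n_targets` and `n_subspaces` (standing in for the fourth and fifth preset numbers), and the Euclidean metric are all assumptions rather than names taken from the specification.

```python
import math

def midpoint(a, b):
    """Midpoint of two center points; used as the subspace's representative."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def search_space(query, centers, neighbors, n_targets=1, n_subspaces=2):
    # Fourth distance: query against every center; keep the closest n_targets.
    targets = sorted(range(len(centers)),
                     key=lambda c: math.dist(query, centers[c]))[:n_targets]
    # Each target center paired with each of its neighbors is a candidate subspace.
    candidates = [(t, nb) for t in targets for nb in neighbors[t]]

    # Fifth distance: query against the midpoint of the subspace's two centers.
    def fifth_distance(subspace):
        center, neighbor = subspace
        return math.dist(query, midpoint(centers[center], centers[neighbor]))

    return sorted(candidates, key=fifth_distance)[:n_subspaces]

# Hypothetical toy index: four centers with precomputed neighbor lists.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (-4.0, 0.0)]
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 2]}
result = search_space((1.0, 0.5), centers, neighbors)
```

Here the query is closest to center 0, and only two of its three subspaces are kept for scanning; the subspace facing away from the query, (0, 3), is skipped.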
According to a third aspect of the embodiments of this specification, a data storage device is provided, the device comprising:
a first determination module, configured to cluster a data set to be stored and determine a plurality of cluster center points;
a second determination module, configured to, for each cluster center point of the plurality of cluster center points, determine the corresponding neighbor cluster center points according to the cluster center point, and determine the space formed by the cluster center point and a corresponding neighbor cluster center point as a cluster subspace;
a storage module, configured to store the data to be stored in the data set to be stored according to the cluster subspace.
According to a fourth aspect of the embodiments of this specification, a data query device is provided, the device comprising:
a third determination module, configured to determine, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
a fourth determination module, configured to determine, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
a fifth determination module, configured to determine, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
According to a fifth aspect of the embodiments of this specification, a computing device is provided, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
clustering a data set to be stored, and determining a plurality of cluster center points;
for each cluster center point of the plurality of cluster center points, determining the corresponding neighbor cluster center points according to the cluster center point, and determining the space formed by the cluster center point and a corresponding neighbor cluster center point as a cluster subspace;
storing the data to be stored in the data set to be stored according to the cluster subspace.
According to a sixth aspect of the embodiments of this specification, a computing device is provided, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
According to a seventh aspect of the embodiments of this specification, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed by a processor, implement the steps of the data storage method.
According to an eighth aspect of the embodiments of this specification, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed by a processor, implement the steps of the data query method.
With the data storage method provided in this specification, the data set to be stored can be clustered and a plurality of cluster center points determined; then, for each cluster center point of the plurality, the corresponding neighbor cluster center points are determined according to that cluster center point, and the space formed by the cluster center point and a corresponding neighbor cluster center point is determined as a cluster subspace; the data to be stored in the data set is then stored according to the cluster subspace. In this case, once a cluster center point has been determined, the data to be stored is not stored directly in the first-layer space centered on that cluster center point. Instead, the neighbor cluster center points close to that cluster center point are further determined, and the data to be stored is stored in the cluster subspace formed by the cluster center point and the corresponding neighbor cluster center point. This combines a clustering algorithm with the idea of a proximity graph: on top of the first-layer clustering, a proximity graph relationship over second-layer subspaces is introduced, and each first-layer space is further divided into second-layer cluster subspaces, which improves the retrieval accuracy of the whole index. Moreover, compared with a two-layer hierarchical clustering structure, this second-layer subspace division avoids the overhead of multi-layer clustering, so the index is built faster; compared with a single-layer clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
Description of the Drawings
FIG. 1 is a structure diagram of an ANN index based on K-means clustering provided by an embodiment of this specification;
FIG. 2 is a flowchart of a data storage method provided by an embodiment of this specification;
FIG. 3 is a structure diagram of an index combining K-means clustering and a proximity graph provided by an embodiment of this specification;
FIG. 4 is a flowchart of a data query method provided by an embodiment of this specification;
FIG. 5 is a schematic structural diagram of a data storage device provided by an embodiment of this specification;
FIG. 6 is a schematic structural diagram of a data query device provided by an embodiment of this specification;
FIG. 7 is a structural block diagram of a computing device provided by an embodiment of this specification;
FIG. 8 is a structural block diagram of another computing device provided by an embodiment of this specification.
Detailed Description
Many specific details are set forth in the following description to provide a thorough understanding of this specification. However, this specification can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the substance of this specification. This specification is therefore not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit one or more embodiments of this specification. As used in one or more embodiments of this specification and the appended claims, the singular forms "a", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of this specification refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, and so on may be used in one or more embodiments of this specification to describe various kinds of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, a first could also be called a second, and similarly a second could also be called a first. Depending on the context, the word "if" as used herein can be interpreted as "at the time of", "when", or "in response to determining".
First, the terms used in one or more embodiments of this specification are explained.
ANN retrieval: approximate nearest neighbor search (Approximate Nearest Neighbor Search) exploits the tendency of data to form clustered distributions as the data volume grows. It classifies or encodes the data in a database by analyzing and clustering it; for target data, it predicts the data category to which the target belongs from its features and returns some or all of that category as the retrieval result. In other words, the nearest N neighboring data items (i.e., vectors) in a high-dimensional space are retrieved quickly through a pre-built index, but only approximately accurate results are returned and absolute accuracy is not guaranteed. The core idea of approximate nearest neighbor retrieval is to search for data items that are likely to be neighbors rather than being limited to returning only the most likely items, improving retrieval efficiency at the cost of an acceptable loss of accuracy.
Vector index: a class of index structure dedicated to providing ANN retrieval over high-dimensional vector data.
K-means clustering algorithm: the k-means clustering algorithm is an iteratively solved cluster analysis algorithm. Its steps are to randomly select K objects as the initial cluster center points, then calculate the distance between each object and each cluster center point and assign each object to the cluster center point closest to it; a cluster center point together with the objects assigned to it represents a cluster. The K-means clustering algorithm originates from a vector quantization method in signal processing and is also a classic cluster analysis method in the field of data mining.
Voronoi space: a partition of hyperspace, characterized in that any position within a subspace is closest to the center point of that subspace and relatively farther from the center points of adjacent subspaces, and each subspace contains one and only one center point.
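As a rough illustration of the K-means definition above, the following Python sketch runs the usual assign-then-update loop. The definition in the text only spells out the random initialization and the nearest-center assignment; the mean-update step shown here is the standard Lloyd iteration and is included as an assumption, as are the function name `kmeans`, the fixed iteration count, the seed, and the Euclidean metric.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means: returns (centers, clusters) after a fixed number of rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly pick K objects as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each object goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[closest].append(p)
        # Update step (assumed, standard Lloyd iteration): move each center
        # to the mean of its members; an empty cluster keeps its old center.
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical toy data: two well-separated groups of 2-D vectors.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, k=2)
```

The assignment step is also what induces the Voronoi spaces described above: the set of positions whose closest center is a given center point is exactly that center's Voronoi space.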
Next, the basic idea of the data storage method and the data query method provided in this specification is briefly described.
The k-means clustering algorithm clusters data according to similarity distance, dividing all of the data into multiple spaces; during a data query, the query vector only needs to search the data points that lie in the same space as itself. However, for a vector index that relies on K-means clustering alone for space division, retrieval accuracy drops severely when the data clusters poorly. In addition, when the query vector falls on the boundary between multiple spaces, the system must search several spaces adjacent to the query vector to maintain accuracy, and because the vectors have very high dimensionality, each space is adjacent to a very large number of other spaces, which aggravates the problem.
For example, FIG. 1 is a structure diagram of an ANN index based on K-means clustering. As shown in FIG. 1, the data to be stored is clustered by the K-means clustering algorithm, which determines eight different K-means cluster center points, C0 to C7, each of which defines a Voronoi space. When the data to be queried q lies on a space boundary, the spaces of C0, C2 and C3 all have to be searched, which greatly reduces retrieval efficiency.
Therefore, to improve retrieval efficiency and accuracy, this specification proposes a data storage method and device and a data query method and device. After the data set to be stored has been clustered and a plurality of cluster center points determined, the corresponding neighbor cluster center points are determined for each cluster center point of the plurality, and the space formed by the cluster center point and a corresponding neighbor cluster center point is determined as a cluster subspace; the data to be stored is then stored in that cluster subspace. This combines the clustering algorithm with the idea of a proximity graph: on top of K-means clustering, a proximity graph relationship over second-layer subspaces is introduced, and each first-layer space Ci is further divided into second-layer cluster subspaces, which improves the retrieval efficiency and accuracy of the whole index.
This specification provides a data storage method, and also relates to a data storage device, a data query method and device, two computing devices, and two computer-readable storage media, each of which is described in detail in the following embodiments.
FIG. 2 shows a flowchart of a data storage method provided according to an embodiment of this specification, which specifically includes the following steps:
Step 202: Cluster the data set to be stored, and determine a plurality of cluster center points.
Specifically, the data set to be stored is the set formed by all of the data to be stored, and it includes a plurality of data items to be stored.
In practice, the data set to be stored may be clustered by the K-means clustering algorithm to determine the plurality of cluster center points. In a specific implementation, K data items may be randomly selected from the data set to be stored, and the selected K data items are determined as the K cluster center points.
Step 204: For each cluster center point of the plurality of cluster center points, determine the corresponding neighbor cluster center points according to the cluster center point, and determine the space formed by the cluster center point and a corresponding neighbor cluster center point as a cluster subspace.
Specifically, after the data set to be stored has been clustered and the plurality of cluster center points determined, the corresponding neighbor cluster center points are determined for each cluster center point according to that cluster center point, and the space formed by the cluster center point and a corresponding neighbor cluster center point is determined as a cluster subspace.
In practice, for each cluster center point Ci, the n neighbor cluster center points close to it are found; each cluster center point Ci and its n corresponding neighbor cluster center points can form multiple cluster subspaces.
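To make the relationship between neighbor lists and cluster subspaces concrete, the short sketch below enumerates one subspace per (Ci, neighbor) pair. Treating a subspace as an ordered pair of center ids, and the helper name `cluster_subspaces`, are assumptions made for illustration; the text itself only states that Ci and its n neighbors form multiple subspaces.

```python
def cluster_subspaces(neighbors):
    """List one (center id, neighbor center id) subspace per neighbor relation."""
    return [(i, j) for i, nbrs in neighbors.items() for j in nbrs]

# Hypothetical neighbor lists for three centers with n = 2 neighbors each.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
subspaces = cluster_subspaces(neighbors)
```

Under this reading, K centers with n neighbors each yield K * n subspaces, here 3 * 2 = 6, without any additional clustering pass.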
In one or more implementations of this embodiment, the corresponding neighbor cluster center points may be determined according to the cluster center point as follows:
determine the cluster center point as a first cluster center point, and determine the cluster center points of the plurality of cluster center points other than the first cluster center point as second cluster center points;
calculate a first distance between the first cluster center point and each second cluster center point;
sort the first distances from nearest to farthest, select a first preset number of top-ranked first distances, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
It should be noted that, for each cluster center point Ci, several neighbor cluster center points close to it need to be found. The distances between the cluster center point Ci and the other cluster center points therefore need to be calculated so that the closest ones, namely the first preset number of neighbor cluster center points, can be selected. The first preset number can be set in advance, for example to 2, 3 or 4. After the calculated first distances are sorted from nearest to farthest, the first preset number is used to select the neighbor cluster center points closest to the cluster center point.
For example, suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is determined as the first cluster center point and C1, C2, ..., Ck are determined as the second cluster center points, and the distances between C0 and each of C1, C2, ..., Ck are calculated. Assume the distance between C0 and each of C1, C2, ..., C7 is a, the distance between C0 and each of C8, C9, ..., C20 is b, and the distance between C0 and each of C21, C22, ..., Ck is c, where a is less than b and b is less than c, and the first preset number is 7. The neighbor cluster center points determined for C0 are then C1, C2, ..., C7. The same steps are performed for C1, giving the neighbor cluster center points C0, C2, C7, C8, C9, C10 and C11; the same steps are performed for C2, giving the neighbor cluster center points C0, C1, C3, C12, C13, C14 and C15; and so on, until the neighbor cluster center points corresponding to Ck have been determined.
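The top-n neighbor selection described above can be sketched in a few lines of Python. The function name `neighbor_centers`, the parameter `n` (standing in for the first preset number), and the use of Euclidean distance are illustrative assumptions; the specification does not fix a distance metric at this point.

```python
import math

def neighbor_centers(centers, n):
    """For every center, return the ids of its n closest other centers."""
    neighbors = {}
    for i, ci in enumerate(centers):
        # Every other center plays the role of a "second cluster center point".
        others = [j for j in range(len(centers)) if j != i]
        # "First distance": sort the other centers from nearest to farthest.
        others.sort(key=lambda j: math.dist(ci, centers[j]))
        neighbors[i] = others[:n]
    return neighbors

# Hypothetical toy centers in two dimensions.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = neighbor_centers(centers, n=2)
```

Note that the relation is not symmetric: in this toy data center 3 lists centers 2 and 1 as neighbors, while neither of them lists center 3.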
In one or more implementations of this embodiment, the neighbor cluster center points may instead be selected by setting a distance threshold rather than by the first preset value. In this case, for each of the plurality of cluster center points, the corresponding neighbor cluster center points may be determined as follows:
determining the cluster center point as the first cluster center point, and determining the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
calculating the first distance between the first cluster center point and each second cluster center point;
determining the first distances that are smaller than a first distance threshold, and determining the second cluster center points corresponding to those first distances as the neighbor cluster center points of the first cluster center point.
The first distance threshold may be set in advance.
For example, suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is taken as the first cluster center point and C1, C2, ..., Ck as the second cluster center points, and the distances between C0 and each of C1, C2, ..., Ck are calculated. Assume that the distance from C0 to each of C1, C2, ..., C7 is 1, the distance to each of C8, C9, ..., C20 is 2, and the distance to each of C21, C22, ..., Ck is 3, and that the first distance threshold is 1.5. The neighbor cluster center points determined for C0 are then C1, C2, ..., C7. The same steps are performed for C1, C2, ..., Ck to determine their corresponding neighbor cluster center points.
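The threshold-based variant can be sketched in the same way. The function name `neighbors_within`, the toy coordinates, and the Euclidean metric are assumptions made for illustration.

```python
import math

def neighbors_within(centroids, threshold):
    """For each centroid index i, return the other centroids closer than threshold."""
    result = {}
    for i, ci in enumerate(centroids):
        result[i] = [
            j for j, cj in enumerate(centroids)
            if j != i and math.dist(ci, cj) < threshold  # first distance < first distance threshold
        ]
    return result

# Toy 2-D cluster center points (illustrative values, not from the specification).
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (0.0, 3.0)]
near = neighbors_within(centroids, threshold=1.5)
```

Unlike selection by a preset count, a threshold can yield a different number of neighbors for each center point, possibly none, so the threshold would need to be chosen with the spacing of the center points in mind.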
In practical applications, once the neighbor cluster center points of every cluster center point have been obtained, the result is referred to as the neighbor graph relationship among the cluster center points. Each cluster center point Ci together with one of its neighbor cluster center points Cj forms a cluster subspace B(i, j), so each cluster center point Ci and its multiple neighbor cluster center points form multiple cluster subspaces.
For example, suppose the neighbor cluster center points determined for C0 are C1, C2, ..., C7, and those determined for C1 are C0, C2, C7, C8, C9, C10, and C11. C0 then forms the cluster subspaces B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), and B(0,7) with C1 through C7 respectively, and C1 forms the cluster subspaces B(1,0), B(1,2), B(1,7), B(1,8), B(1,9), B(1,10), and B(1,11) with C0, C2, C7, C8, C9, C10, and C11 respectively.
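Enumerating the cluster subspaces from the neighbor graph can be sketched as below. The function name `build_subspaces` and the representation of B(i, j) as an ordered pair `(i, j)` are assumptions; the pairs are kept ordered because the example above lists B(0,1) and B(1,0) as separate subspaces.

```python
def build_subspaces(neighbors):
    """Return the ordered pairs (i, j) that name each cluster subspace B(i, j)."""
    return [(i, j) for i, js in neighbors.items() for j in js]

# Neighbor lists of C0 and C1 taken from the example above.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    1: [0, 2, 7, 8, 9, 10, 11],
}
subspaces = build_subspaces(neighbors)
```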
Step 206: Store the data to be stored in the data set to be stored according to the cluster subspaces.
Specifically, after the corresponding neighbor cluster center points have been determined for each of the plurality of cluster center points, and the space formed by each cluster center point and a corresponding neighbor cluster center point has been determined as a cluster subspace, the data to be stored in the data set to be stored is stored according to the cluster subspaces.
In practical applications, each item of data to be stored in the data set must be stored in its corresponding cluster subspace to facilitate subsequent retrieval. A calculation is therefore performed once for every item of data to be stored to determine which cluster subspace it is stored in.
In one or more implementations of this embodiment, storing the data to be stored according to the cluster subspaces may be implemented as follows:
determining a first target cluster center point corresponding to first data to be stored in the data set to be stored;
determining, according to the first data to be stored and the first target cluster center point, a first target neighbor cluster center point corresponding to the first target cluster center point;
storing the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighbor cluster center point.
Specifically, the first data to be stored may be any item of data to be stored in the data set. The above steps are performed once for every item of data to be stored in the data set, so that its corresponding cluster subspace is determined and the item is stored there. In other words, every item of data to be stored in the data set takes the role of the first data to be stored once.
In practical applications, for each item of data to be stored, the cluster center point Ci closest to it is computed, the neighbor cluster center point Cj closest to the item is then selected from the neighbor cluster center points of Ci, and finally the item is encoded and written into the storage space of B(i, j). PQ (product quantization) encoding may be used, as may other encoding methods; this specification places no limitation on this.
To determine the first target cluster center point corresponding to the first data to be stored, a second distance between the first data to be stored and each of the plurality of cluster center points may be calculated. The second distances are then sorted from nearest to farthest, the top-ranked second distances are selected, their number being a second preset value, and the cluster center points corresponding to the selected second distances are determined as the first target cluster center points corresponding to the first data to be stored.
Specifically, when the first data to be stored is inserted, it must be decided which cluster subspace it is inserted into. The single closest cluster subspace may be chosen, or the two closest cluster subspaces may be chosen. The first target cluster center point can therefore be selected by means of the second preset value, which may be set in advance, for example to 1, 2, 3, or 4.
For example, suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set are C0, C1, C2, ..., C7. The distances between X1 and each of C0, C1, C2, ..., C7 are calculated. Assume the distances from X1 to C0, C1, C2, C3, C4, C5, C6, and C7 are 1, 1.5, 2, 3, 4, 5, 6, and 7 respectively. These 8 distances are sorted from nearest to farthest. If only the top-ranked second distance (1) is selected, C0 is determined as the first target cluster center point corresponding to the first data to be stored X1. The first target cluster center points of the other data to be stored {X2, X3, ..., Xm} are determined in the same way.
In one or more implementations of this embodiment, the first target cluster center point closest to the first data to be stored may instead be selected by setting a distance threshold rather than by the second preset value. In this case, the first target cluster center point corresponding to the first data to be stored may be determined as follows:
calculating the second distance between the first data to be stored and each of the plurality of cluster center points, determining the second distances that are smaller than a second distance threshold, and determining the cluster center points corresponding to those second distances as the first target cluster center points corresponding to the first data to be stored.
Specifically, the second distance threshold may be set in advance.
For example, suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set are C0, C1, C2, ..., C7. The distances between X1 and each of C0, C1, C2, ..., C7 are calculated. Assume the distances from X1 to C0, C1, C2, C3, C4, C5, C6, and C7 are 1, 1.5, 2, 3, 4, 5, 6, and 7 respectively, and the second distance threshold is 1.3. Only the distance between X1 and C0 (1) is smaller than the second distance threshold, so C0 is determined as the first target cluster center point corresponding to the first data to be stored X1. The first target cluster center points of the other data to be stored {X2, X3, ..., Xm} are determined in the same way.
To determine the first target neighbor cluster center point corresponding to the first target cluster center point according to the first data to be stored and the first target cluster center point, the neighbor cluster center points of the first target cluster center point may first be obtained. A third distance between the first data to be stored and each of these neighbor cluster center points is then calculated, the third distances are sorted from nearest to farthest, the top-ranked third distances are selected, their number being a third preset value, and the neighbor cluster center points corresponding to the selected third distances are determined as the first target neighbor cluster center points corresponding to the first target cluster center point.
Specifically, after the first target cluster center point has been determined for each item of data to be stored in the data set, the first target neighbor cluster center point of that first target cluster center point is also determined for each item, that is, the neighbor cluster center point of the first target cluster center point that is closest to the first data to be stored. The selection may be made by means of the third preset value, which may be set in advance, for example to 1, 2, 3, or 4.
For example, suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points of C0 are C1, C2, ..., C7. The distances between X1 and each of C1, C2, ..., C7 are calculated (if they were calculated earlier, the results can simply be reused). Assume the distances from X1 to C1, C2, C3, C4, C5, C6, and C7 are 1.5, 2, 3, 4, 5, 6, and 7 respectively. These 7 distances are sorted from nearest to farthest. If only the top-ranked third distance (1.5) is selected, C1 is determined as the first target neighbor cluster center point corresponding to C0. The target neighbor cluster center points for the target cluster center points of the other data to be stored are determined in the same way.
In one or more implementations of this embodiment, the first target neighbor cluster center point that corresponds to the first target cluster center point and is closest to the first data to be stored may instead be selected by setting a distance threshold rather than by the third preset value. In this case, the first target neighbor cluster center point may be determined as follows:
the neighbor cluster center points of the first target cluster center point are first obtained, the third distance between the first data to be stored and each of these neighbor cluster center points is calculated, the third distances that are smaller than a third distance threshold are determined, and the neighbor cluster center points corresponding to those third distances are determined as the first target neighbor cluster center points corresponding to the first target cluster center point.
Specifically, the third distance threshold may be set in advance.
For example, suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points of C0 are C1, C2, ..., C7. The distances between X1 and each of C1, C2, ..., C7 are calculated. Assume the distances from X1 to C1, C2, C3, C4, C5, C6, and C7 are 1.5, 2, 3, 4, 5, 6, and 7 respectively, and the third distance threshold is 1.8. Only the distance between X1 and C1 (1.5) is smaller than the third distance threshold, so C1 is determined as the first target neighbor cluster center point corresponding to C0. The target neighbor cluster center points for the target cluster center points of the other data to be stored are determined in the same way.
In practical applications, once the first target cluster center point and the first target neighbor cluster center point corresponding to the first data to be stored have been determined through the above steps, the first data to be stored can be stored in the cluster subspace formed by the first target cluster center point and the first target neighbor cluster center point.
For example, suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the first target neighbor cluster center point is C1. X1 is then stored in the cluster subspace B(0,1) formed by C0 and C1. The other data to be stored in the data set are stored one by one in the same way.
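The assignment of one item to a cluster subspace can be sketched as below for the case where the second and third preset values are both 1. The function name `assign_subspace` and the one-dimensional coordinates are assumptions; the coordinates are chosen only so that the distances from X1 match the example values (1, 1.5, 2, 3, 4, 5, 6, 7), and Euclidean distance is assumed.

```python
import math

def assign_subspace(x, centroids, neighbors):
    """Return (i, j): the closest centroid Ci and the closest neighbor centroid Cj of Ci."""
    # Second distances: x to every cluster center point; keep the closest as Ci.
    i = min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
    # Third distances: x to each neighbor cluster center point of Ci; keep the closest as Cj.
    j = min(neighbors[i], key=lambda c: math.dist(x, centroids[c]))
    return (i, j)

# C0..C7 placed on a line so that X1 = 0 is at distance 1, 1.5, 2, 3, ..., 7 from them.
centroids = [(1.0,), (1.5,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,)]
neighbors = {0: [1, 2, 3, 4, 5, 6, 7]}  # neighbor cluster center points of C0 from the example
x1 = (0.0,)
subspace = assign_subspace(x1, centroids, neighbors)  # (0, 1), i.e. B(0,1)
```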
In practical applications, after all the data to be stored in the data set have been assigned, all the encoded data are stored in an inverted structure according to the cluster subspace B(i, j) to which they belong and are then written to an index file. Inverted-structure storage means that each item of data to be stored is arranged and stored by the number (id) of its cluster subspace and written into the index file.
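The grouping behind such an inverted structure can be sketched as follows. This is an in-memory illustration only: the function name `build_inverted_lists` is hypothetical, record identifiers stand in for the encoded data, and the actual encoding and index file layout are not specified here.

```python
from collections import defaultdict

def build_inverted_lists(assignments):
    """Group record ids by subspace id and return the lists ordered by subspace id."""
    lists = defaultdict(list)
    for record_id, subspace in assignments:
        lists[subspace].append(record_id)
    # Ordering by subspace id mirrors arranging the data by cluster subspace number.
    return dict(sorted(lists.items()))

# (record id, subspace id) pairs; the values are illustrative.
assignments = [("X1", (0, 1)), ("X2", (2, 0)), ("X3", (0, 1)), ("X4", (0, 3))]
index = build_inverted_lists(assignments)
```

With this layout, reading one cluster subspace at query time amounts to reading one contiguous list.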
It should be noted that, after all the data to be stored in the data set have been stored according to steps 202 to 206 above, data to be queried can be searched for in the cluster subspaces according to steps 208 to 212 below.
Step 208: Determine, from the plurality of cluster center points, the second target cluster center point corresponding to the data to be queried.
In practical applications, for each item of data to be queried, the second target cluster center points close to it are determined from the plurality of cluster center points. This may be implemented as follows:
calculating a fourth distance between the data to be queried and each of the plurality of cluster center points;
sorting the fourth distances from nearest to farthest, selecting the top-ranked fourth distances, their number being a fourth preset value, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
It should be noted that, for each item of data to be queried, the distance between it and each cluster center point Ci is calculated in order to determine the cluster center points close to the data to be queried, that is, the second target cluster center points, whose number is the fourth preset value. The fourth preset value may be set in advance, for example to 2, 3, or 4. After the calculated fourth distances are sorted from nearest to farthest, the fourth preset value is used to select the second target cluster center points closest to the data to be queried.
For example, as shown in Figure 3, the cluster center points are C0, C1, C2, ..., C7. For the data to be queried q, the distances between q and each of C0, C1, C2, ..., C7 are calculated. Assume the distances from q to C0, C1, C2, C3, C4, C5, C6, and C7 are 0.5, 3, 1, 0.8, 3.2, 5, 7, and 5.5 respectively, and the fourth preset value is 3. The second target cluster center points determined for q are then C0, C2, and C3.
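Using the fourth distances from this example, the selection can be sketched as below. The function name `top_n_centroids` and the dictionary representation of the distances are assumptions made for illustration.

```python
def top_n_centroids(distances, n):
    """Return the n centroid indices with the smallest distance, nearest first."""
    return sorted(distances, key=distances.get)[:n]

# Fourth distances from q to C0..C7, as given in the example.
fourth_distances = {0: 0.5, 1: 3.0, 2: 1.0, 3: 0.8, 4: 3.2, 5: 5.0, 6: 7.0, 7: 5.5}
targets = top_n_centroids(fourth_distances, n=3)  # n plays the role of the fourth preset value
```

The result is ordered by distance (C0, C3, C2), which is the same set as the C0, C2, and C3 named in the example.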
In one or more implementations of this embodiment, the target cluster center points close to the data to be queried may instead be selected by setting a distance threshold rather than by the fourth preset value. In this case, the second target cluster center points corresponding to the data to be queried may be determined from the plurality of cluster center points as follows:
calculating the fourth distance between the data to be queried and each of the plurality of cluster center points;
determining the fourth distances that are smaller than a fourth distance threshold, and determining the cluster center points corresponding to those fourth distances as the second target cluster center points corresponding to the data to be queried.
The fourth distance threshold may be set in advance.
For example, as shown in Figure 3, the cluster center points are C0, C1, C2, ..., C7. For the data to be queried q, the distances between q and each of C0, C1, C2, ..., C7 are calculated. Assume the distances from q to C0, C1, C2, C3, C4, C5, C6, and C7 are 0.5, 3, 1, 0.8, 3.2, 5, 7, and 5.5 respectively, and the fourth distance threshold is 1.5. The second target cluster center points determined for q are then C0, C2, and C3.
Step 210: Determine, according to the data to be queried and the second target cluster center points, the target cluster subspaces corresponding to the second target cluster center points.
Specifically, after the second target cluster center points corresponding to the data to be queried have been determined from the plurality of cluster center points, the target cluster subspaces corresponding to the second target cluster center points are determined according to the data to be queried and the second target cluster center points.
In practical applications, the cluster subspaces in which the data to be queried may be stored can be selected according to the data to be queried and the second target cluster center points from the previous step. This may be implemented as follows:
obtaining the neighbor cluster center points of each second target cluster center point, and determining them as the second target neighbor cluster center points corresponding to that second target cluster center point;
determining the cluster subspaces formed by the second target cluster center point and its second target neighbor cluster center points as target cluster subspaces.
It should be noted that more than one second target cluster center point may be determined, and each has its own neighbor cluster center points. For every second target cluster center point, the corresponding second target neighbor cluster center points are therefore determined, and from them the corresponding target cluster subspaces.
For example, suppose the second target cluster center points determined for the data to be queried are C0, C2, and C3. The neighbor cluster center points of C0 are C1, C2, ..., C7, so these are the second target neighbor cluster center points of C0, and the target cluster subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), and B(0,7). The neighbor cluster center points of C2 are C0, C1, C3, C12, C13, C14, and C15, so these are the second target neighbor cluster center points of C2, and the target cluster subspaces corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), and B(2,15). The neighbor cluster center points of C3 are C0, C2, C4, C16, C17, C18, and C19, so these are the second target neighbor cluster center points of C3, and the target cluster subspaces corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), and B(3,19).
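Collecting the target cluster subspaces for this example can be sketched as below. The function name `candidate_subspaces` and the pair representation of B(i, j) are assumptions; the neighbor lists are those given in the example.

```python
def candidate_subspaces(target_centroids, neighbors):
    """Return every subspace (i, j) formed by a target centroid i and one of its neighbors j."""
    return [(i, j) for i in target_centroids for j in neighbors[i]]

# Neighbor cluster center points of C0, C2, and C3 from the example.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    2: [0, 1, 3, 12, 13, 14, 15],
    3: [0, 2, 4, 16, 17, 18, 19],
}
candidates = candidate_subspaces([0, 2, 3], neighbors)  # 21 target cluster subspaces
```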
Step 212: Determine, according to the data to be queried and the target cluster subspaces, the search space corresponding to the data to be queried.
Specifically, the search space is the space in which the data to be queried is finally searched for.
In one or more implementations of this embodiment, determining the search space corresponding to the data to be queried according to the data to be queried and the target cluster subspaces may be implemented as follows:
calculating a fifth distance between each target cluster subspace and the data to be queried;
sorting the fifth distances from nearest to farthest, selecting the top-ranked fifth distances, their number being a fifth preset value, and determining the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
Here, the midpoint between the second target cluster center point and the second target neighbor cluster center point may be determined as a target point. A sixth distance between the target point and the data to be queried is then calculated, and this sixth distance is taken as the fifth distance between the target cluster subspace and the data to be queried. In other words, the distance from a target cluster subspace to the data to be queried is represented by the distance from the mean point of the cluster center point Ci and the neighbor cluster center point Cj to the data to be queried.
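Written as a formula, and assuming Euclidean distance (the specification does not fix the metric), the fifth distance of a target cluster subspace B(i, j) to the data to be queried q is:

```latex
d_5\bigl(B(i,j),\, q\bigr) \;=\; d_6 \;=\; \left\lVert\, q - \frac{C_i + C_j}{2} \,\right\rVert_2
```

Here C_i and C_j denote the coordinate vectors of the second target cluster center point and the second target neighbor cluster center point.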
It should be noted that the above steps may yield more than one target cluster subspace. For each target cluster subspace, the distance between it and the data to be queried is determined in order to decide whether that target cluster subspace is included in the search space for the data to be queried.
For example, suppose the target cluster subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), and B(0,7), those corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), and B(2,15), and those corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), and B(3,19). The distances between the data to be queried q and all of these target cluster subspaces are calculated, and several of the closest target cluster subspaces are selected as the final search space. As shown in Figure 3, B(0,2), B(0,3), B(2,0), and B(3,0) are selected as the final search space, and the data to be queried q is searched for in this search space.
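Ranking the target cluster subspaces by midpoint distance can be sketched as follows. The function name `select_search_space`, the toy coordinates, and the Euclidean metric are assumptions made for illustration.

```python
import math

def select_search_space(q, centroids, candidates, n):
    """Return the n candidate subspaces (i, j) whose midpoint of Ci and Cj is closest to q."""
    def midpoint_distance(pair):
        ci, cj = centroids[pair[0]], centroids[pair[1]]
        mid = tuple((a + b) / 2 for a, b in zip(ci, cj))  # target point of B(i, j)
        return math.dist(q, mid)  # sixth distance, used as the fifth distance
    # sorted() is stable, so candidates at equal distance keep their input order.
    return sorted(candidates, key=midpoint_distance)[:n]

# Toy 2-D cluster center points and query (illustrative values, not from the specification).
centroids = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 2.0), 3: (2.0, 0.0)}
candidates = [(0, 1), (0, 2), (0, 3), (2, 0)]
q = (0.4, 0.6)
search_space = select_search_space(q, centroids, candidates, n=3)  # n: fifth preset value
```

Under this distance, B(i, j) and B(j, i) share the same midpoint and therefore rank together, which is consistent with the example above selecting both B(0,2) and B(2,0).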
The data storage method provided in this specification can cluster the data set to be stored and determine a plurality of cluster center points. Then, for each of the cluster center points, the corresponding neighbor cluster center points are determined according to that cluster center point, and the space formed by the cluster center point and a corresponding neighbor cluster center point is determined as a cluster subspace. The data to be stored in the data set is then stored according to the cluster subspaces. Afterwards, when data to be queried needs to be looked up, the second target cluster center point corresponding to the data to be queried can be determined from the plurality of cluster center points; the target cluster subspaces corresponding to the second target cluster center point are then determined according to the data to be queried and the second target cluster center point; and the search space corresponding to the data to be queried is further determined according to the data to be queried and the target cluster subspaces, and the search is performed in that search space.
In this case, after a cluster center point is determined, the data to be stored is not stored directly in the first-level space centered on that cluster center point. Instead, the neighbor cluster center points close to that cluster center point are further determined, and the data to be stored is stored in the cluster subspace formed by the cluster center point and a corresponding neighbor cluster center point. This combines the clustering algorithm with the idea of a neighbor graph: on the basis of K-means clustering, a neighbor-graph relationship of second-level subspaces is introduced, and the first-level space is further divided into finer second-level cluster subspaces, thereby improving the retrieval accuracy of the entire index. Moreover, compared with a two-level hierarchical clustering structure, this two-level subspace partitioning avoids the overhead of multi-level clustering, so the index is built faster; compared with a single-level K-means clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
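As a non-limiting illustration, the first-level clustering step may be sketched with plain Lloyd-style K-means iterations. The specification only names K-means; the seeding with the first k points, the fixed iteration cap, and the name `kmeans` are assumptions of this sketch.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd-style K-means; returns k cluster center points."""
    # Assumption: seed with the first k points (practical systems seed more carefully).
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                dim = len(group[0])
                new_centers.append(
                    tuple(sum(p[d] for p in group) / len(group) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]
print(kmeans(points, k=2))  # [(0.0, 1.0), (10.0, 11.0)]
```

The resulting center points are the input to the neighbor-center determination; no second round of clustering is run to obtain the second-level subspaces.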
Next, the beneficial effects of the data storage method provided in this specification are illustrated with reference to FIG. 1 and FIG. 3:
Assume that the data to be queried is q. With the prior art shown in FIG. 1, the data to be stored is clustered using the K-means clustering algorithm alone, yielding eight different K-means cluster center points C0 to C7, each of which defines a Voronoi space. When the data to be queried q lies on a boundary between spaces, the spaces of C0, C2, and C3 all have to be searched. In contrast, as shown in FIG. 3, the data storage method provided in this specification, after determining the eight K-means cluster center points C0 to C7, further determines the neighbor cluster center points corresponding to C0 to C7, thereby partitioning the space a second time into second-level cluster subspaces. For the same data to be queried q, the method provided in this specification only needs to search the cluster subspaces B(0,2), B(0,3), B(2,0), and B(3,0), which greatly reduces the search range and in turn improves search efficiency and accuracy.
FIG. 4 shows a flowchart of a data query method provided according to an embodiment of this specification, which specifically includes the following steps:
Step 402: From a plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried.
It should be noted that the specific implementation of step 402 is the same as that of step 208 above and is not repeated here.
Step 404: According to the data to be queried and the second target cluster center point, determine the target cluster subspace corresponding to the second target cluster center point.
It should be noted that the specific implementation of step 404 is the same as that of step 210 above and is not repeated here.
Step 406: According to the data to be queried and the target cluster subspace, determine the search space corresponding to the data to be queried.
It should be noted that the specific implementation of step 406 is the same as that of step 212 above and is not repeated here.
The data query method provided in this specification can determine, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried; then determine, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point; and further determine, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried, and perform the search in that search space. In this case, the clustering algorithm is combined with the idea of a neighbor graph: on the basis of K-means clustering, a neighbor-graph relationship of second-level subspaces is introduced, and the first-level space is further divided into finer second-level cluster subspaces. When data is subsequently queried, the search range can be narrowed, thereby improving the retrieval efficiency and accuracy of the entire index.
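As a non-limiting illustration, searching only within the selected search space may be sketched as a brute-force scan over the data stored under the selected subspaces. The dictionary layout of `index`, the subspace keys, and the stored vectors are hypothetical; the specification does not fix how the search inside the search space is carried out.

```python
import math

def search(query, index, search_space, top_k=1):
    """Scan only the items stored in the selected subspaces; return the top_k nearest."""
    seen = set()
    candidates = []
    for key in search_space:
        for item in index.get(key, []):
            if item not in seen:  # an item may be stored under several subspaces
                seen.add(item)
                candidates.append(item)
    candidates.sort(key=lambda item: math.dist(query, item))
    return candidates[:top_k]

# Hypothetical index: subspace key (i, j) -> stored vectors.
index = {
    (0, 2): [(1.0, 0.5), (1.5, 0.2)],
    (0, 3): [(0.8, 1.0)],
    (2, 0): [(3.0, 0.1)],
    (5, 6): [(9.0, 9.0)],  # outside the search space, never scanned for this query
}
q = (1.0, 1.0)
print(search(q, index, [(0, 2), (0, 3), (2, 0)]))  # [(0.8, 1.0)]
```

Only four of the five stored vectors are compared with q in this example, which is the narrowing effect the paragraph above describes.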
Corresponding to the above method embodiments, this specification also provides an embodiment of a data storage apparatus. FIG. 5 shows a schematic structural diagram of a data storage apparatus provided by an embodiment of this specification. As shown in FIG. 5, the apparatus includes:
a first determining module 502, configured to cluster a data set to be stored and determine a plurality of cluster center points;
a second determining module 504, configured to, for each cluster center point of the plurality of cluster center points, determine the corresponding neighbor cluster center points according to that cluster center point, and determine the space formed by the cluster center point and a corresponding neighbor cluster center point as a cluster subspace;
a storage module 506, configured to store the data to be stored in the data set to be stored according to the cluster subspace.
In one or more implementations of this embodiment, the second determining module 504 is further configured to:
determine the cluster center point as a first cluster center point, and determine the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
calculate a first distance between the first cluster center point and each second cluster center point;
sort the first distances from nearest to farthest, select a first preset number of the top-ranked first distances, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
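As a non-limiting illustration, the neighbor-center determination performed by the second determining module may be sketched as follows. The name `neighbor_centers`, the Euclidean distance, and the sample coordinates are assumptions; `num_neighbors` stands for the first preset number.

```python
import math

def neighbor_centers(centers, num_neighbors):
    """For each center index i, return the indices of its num_neighbors closest other centers."""
    table = {}
    for i, ci in enumerate(centers):
        # Every center except i itself is a "second cluster center point" for i.
        others = [j for j in range(len(centers)) if j != i]
        # Sort by the "first distance", nearest first.
        others.sort(key=lambda j: math.dist(ci, centers[j]))
        table[i] = others[:num_neighbors]
    return table

# Hypothetical centers on a line.
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(neighbor_centers(centers, 2))
# {0: [1, 2], 1: [0, 2], 2: [3, 1], 3: [2, 1]}
```

Each pair (i, j) in the resulting table identifies one cluster subspace B(i, j). The relation need not be symmetric: in this example 2 is a neighbor of 0, but 0 is not a neighbor of 2.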
In one or more implementations of this embodiment, the storage module 506 is further configured to:
determine a first target cluster center point corresponding to first data to be stored in the data set to be stored;
determine, according to the first data to be stored and the first target cluster center point, a first target neighbor cluster center point corresponding to the first target cluster center point;
store the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighbor cluster center point.
In one or more implementations of this embodiment, the storage module 506 is further configured to:
calculate a second distance between the first data to be stored and each cluster center point of the plurality of cluster center points;
sort the second distances from nearest to farthest, select a second preset number of the top-ranked second distances, and determine the cluster center points corresponding to the selected second distances as the first target cluster center points corresponding to the first data to be stored.
In one or more implementations of this embodiment, the storage module 506 is further configured to:
obtain the neighbor cluster center points corresponding to the first target cluster center point;
calculate a third distance between the first data to be stored and each of these neighbor cluster center points;
sort the third distances from nearest to farthest, select a third preset number of the top-ranked third distances, and determine the neighbor cluster center points corresponding to the selected third distances as the first target neighbor cluster center points corresponding to the first target cluster center point.
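As a non-limiting illustration, assigning data to be stored to cluster subspaces may be sketched as follows. The name `assign_subspaces`, the dictionary used as the index, and the sample neighbor table are hypothetical; `num_centers` and `num_neighbors` stand for the second and third preset numbers and default to 1 only for brevity.

```python
import math

def assign_subspaces(item, centers, neighbors, num_centers=1, num_neighbors=1):
    """Return the subspace keys (i, j) under which item would be stored."""
    # Second distances: item to every cluster center, nearest first.
    by_distance = sorted(range(len(centers)), key=lambda i: math.dist(item, centers[i]))
    keys = []
    for i in by_distance[:num_centers]:  # first target cluster center(s)
        # Third distances: item to each neighbor center of i, nearest first.
        ranked = sorted(neighbors[i], key=lambda j: math.dist(item, centers[j]))
        for j in ranked[:num_neighbors]:  # first target neighbor center(s)
            keys.append((i, j))
    return keys

centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (-4.0, 0.0)]
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 2]}  # hypothetical neighbor table

index = {}
for item in [(1.0, 0.5), (0.5, 1.0), (3.5, 0.2)]:
    for key in assign_subspaces(item, centers, neighbors):
        index.setdefault(key, []).append(item)

print(index)
# {(0, 1): [(1.0, 0.5)], (0, 2): [(0.5, 1.0)], (1, 0): [(3.5, 0.2)]}
```

The first two items share the nearest center C0 but land in different second-level subspaces, which is the finer partitioning the storage module is meant to produce.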
In one or more implementations of this embodiment, the apparatus further includes:
a third determining module, configured to determine, from the plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
a fourth determining module, configured to determine, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
a fifth determining module, configured to determine, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
In this specification, after a cluster center point is determined, the data to be stored is not stored directly in the first-level space centered on that cluster center point. Instead, the neighbor cluster center points close to that cluster center point are further determined, and the data to be stored is stored in the cluster subspace formed by the cluster center point and a corresponding neighbor cluster center point. This combines the clustering algorithm with the idea of a neighbor graph: on the basis of K-means clustering, a neighbor-graph relationship of second-level subspaces is introduced, and the first-level space is further divided into finer second-level cluster subspaces, thereby improving the retrieval accuracy of the entire index. Moreover, compared with a two-level hierarchical clustering structure, this two-level subspace partitioning avoids the overhead of multi-level clustering, so the index is built faster; compared with a single-level K-means clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
The above is an illustrative scheme of a data storage apparatus of this embodiment. It should be noted that the technical solution of the data storage apparatus and the technical solution of the data storage method described above belong to the same concept; for details not described in the technical solution of the data storage apparatus, refer to the description of the technical solution of the data storage method.
Corresponding to the above method embodiments, this specification also provides an embodiment of a data query apparatus. FIG. 6 shows a schematic structural diagram of a data query apparatus provided by an embodiment of this specification. As shown in FIG. 6, the apparatus includes:
a third determining module 602, configured to determine, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
a fourth determining module 604, configured to determine, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
a fifth determining module 606, configured to determine, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
In one or more implementations of this embodiment, the third determining module 602 is further configured to:
calculate a fourth distance between the data to be queried and each cluster center point of the plurality of cluster center points;
sort the fourth distances from nearest to farthest, select a fourth preset number of the top-ranked fourth distances, and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
In one or more implementations of this embodiment, the fourth determining module 604 is further configured to:
obtain the neighbor cluster center points corresponding to the second target cluster center point, and determine each of these neighbor cluster center points as a second target neighbor cluster center point corresponding to the second target cluster center point;
determine the cluster subspace formed by the second target cluster center point and a second target neighbor cluster center point as a target cluster subspace.
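As a non-limiting illustration, the work of the third and fourth determining modules (finding the second target cluster center points and enumerating their target cluster subspaces) may be sketched as follows. The name `target_subspaces`, the sample coordinates, and the neighbor table are hypothetical; `num_centers` stands for the fourth preset number.

```python
import math

def target_subspaces(query, centers, neighbors, num_centers):
    """Return the candidate subspace keys (i, j) for a query."""
    # Fourth distances: query to every cluster center, nearest first.
    ranked = sorted(range(len(centers)), key=lambda i: math.dist(query, centers[i]))
    targets = ranked[:num_centers]  # second target cluster center points
    # Every neighbor of a target center yields one target cluster subspace.
    return [(i, j) for i in targets for j in neighbors[i]]

centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (-4.0, 0.0)]
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 2]}  # hypothetical neighbor table

print(target_subspaces((1.0, 0.5), centers, neighbors, num_centers=2))
# [(0, 1), (0, 2), (0, 3), (1, 0), (1, 2)]
```

Unlike the storage path, the query path keeps all neighbors of each target center as candidates; narrowing them down is left to the fifth determining module.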
In one or more implementations of this embodiment, the fifth determining module 606 is further configured to:
calculate a fifth distance between each target cluster subspace and the data to be queried;
sort the fifth distances from nearest to farthest, select a fifth preset number of the top-ranked fifth distances, and determine the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
In one or more implementations of this embodiment, the fifth determining module 606 is further configured to:
determine the midpoint between the second target cluster center point and the second target neighbor cluster center point as a target point;
calculate a sixth distance between the target point and the data to be queried, and determine the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.
The data query apparatus provided in this specification can determine, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried; then determine, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point; and further determine, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried, and perform the search in that search space. In this case, the clustering algorithm is combined with the idea of a neighbor graph: on the basis of K-means clustering, a neighbor-graph relationship of second-level subspaces is introduced, and the first-level space is further divided into finer second-level subspaces. When data is queried, the search range can be narrowed, thereby improving the retrieval efficiency and accuracy of the entire index.
The above is an illustrative scheme of a data query apparatus of this embodiment. It should be noted that the technical solution of the data query apparatus and the technical solution of the data query method described above belong to the same concept; for details not described in the technical solution of the data query apparatus, refer to the description of the technical solution of the data query method.
FIG. 7 shows a structural block diagram of a computing device 700 provided according to an embodiment of this specification.
The components of the computing device 700 include, but are not limited to, a memory 710 and a processor 720. The processor 720 is connected to the memory 710 through a bus 730, and a database 750 is used to store data.
The computing device 700 also includes an access device 740, which enables the computing device 700 to communicate via one or more networks 760. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.
In one embodiment of this specification, the above components of the computing device 700 and other components not shown in FIG. 7 may also be connected to one another, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, or netbook), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch or smart glasses) or another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 700 may also be a mobile or stationary server.
The processor 720 is configured to execute the following computer-executable instructions to implement the following method:
clustering a data set to be stored and determining a plurality of cluster center points;
for each cluster center point of the plurality of cluster center points, determining the corresponding neighbor cluster center points according to that cluster center point, and determining the space formed by the cluster center point and a corresponding neighbor cluster center point as a cluster subspace;
storing the data to be stored in the data set to be stored according to the cluster subspace.
The above is an illustrative scheme of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the data storage method described above belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the data storage method.
FIG. 8 shows a structural block diagram of a computing device 800 provided according to an embodiment of this specification.
The components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is connected to the memory 810 through a bus 830, and a database 850 is used to store data.
The computing device 800 also includes an access device 840, which enables the computing device 800 to communicate via one or more networks 860. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.
In one embodiment of this specification, the above components of the computing device 800 and other components not shown in FIG. 8 may also be connected to one another, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, or netbook), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch or smart glasses) or another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute the following computer-executable instructions to implement the following method:
determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
The above is an illustrative scheme of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the data query method described above belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the data query method.
An embodiment of this specification further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the data storage method described above.
An embodiment of this specification further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the data query method described above.
The above is an illustrative scheme of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solutions of the data storage method and data query method described above belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solutions of the data storage method and data query method.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be expanded or narrowed as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that this specification is not limited by the described order of actions, because according to this specification some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by this specification.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the relevant descriptions of the other embodiments.
The preferred embodiments of this specification disclosed above are intended only to help explain this specification. The optional embodiments do not describe all details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made in light of the content of this specification. These embodiments were selected and described in detail in order to better explain the principles and practical applications of this specification, so that those skilled in the art can understand and make good use of it. This specification is limited only by the claims and their full scope and equivalents.

Claims (17)

  1. A data storage method, the method comprising:
    clustering a data set to be stored to determine a plurality of cluster center points;
    for each cluster center point of the plurality of cluster center points, determining a corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a cluster subspace;
    storing the data to be stored in the data set to be stored according to the cluster subspace.
  2. The data storage method according to claim 1, wherein determining the corresponding neighbor cluster center point according to the cluster center point comprises:
    determining the cluster center point as a first cluster center point, and determining the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
    calculating a first distance between the first cluster center point and each of the second cluster center points;
    sorting the first distances from nearest to farthest, selecting the top-ranked first preset number of first distances, and determining the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
  3. The data storage method according to claim 1, wherein storing the data to be stored in the data set to be stored according to the cluster subspace comprises:
    determining a first target cluster center point corresponding to first data to be stored in the data set to be stored;
    determining, according to the first data to be stored and the first target cluster center point, a first target neighbor cluster center point corresponding to the first target cluster center point;
    storing the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  4. The data storage method according to claim 3, wherein determining the first target cluster center point corresponding to the first data to be stored in the data set to be stored comprises:
    calculating a second distance between the first data to be stored and each cluster center point of the plurality of cluster center points;
    sorting the second distances from nearest to farthest, selecting the top-ranked second preset number of second distances, and determining the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
  5. The data storage method according to claim 3, wherein determining, according to the first data to be stored and the first target cluster center point, the first target neighbor cluster center point corresponding to the first target cluster center point comprises:
    obtaining each neighbor cluster center point corresponding to the first target cluster center point;
    calculating a third distance between the first data to be stored and each of the neighbor cluster center points;
    sorting the third distances from nearest to farthest, selecting the top-ranked third preset number of third distances, and determining the neighbor cluster center points corresponding to the selected third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.
  6. The data storage method according to claim 1, wherein, after storing the data to be stored in the data set to be stored according to the cluster subspace, the method further comprises:
    determining, from the plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  7. A data query method, the method comprising:
    determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  8. The data query method according to claim 7, wherein determining, from the plurality of cluster center points, the second target cluster center point corresponding to the data to be queried comprises:
    calculating a fourth distance between the data to be queried and each cluster center point of the plurality of cluster center points;
    sorting the fourth distances from nearest to farthest, selecting the top-ranked fourth preset number of fourth distances, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center point corresponding to the data to be queried.
  9. The data query method according to claim 7, wherein determining, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point comprises:
    obtaining each neighbor cluster center point corresponding to the second target cluster center point, and determining each of those neighbor cluster center points as a second target neighbor cluster center point corresponding to the second target cluster center point;
    determining the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point as the target cluster subspace.
  10. The data query method according to claim 9, wherein determining, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried comprises:
    calculating a fifth distance between the target cluster subspace and the data to be queried;
    sorting the fifth distances from nearest to farthest, selecting the top-ranked fifth preset number of fifth distances, and determining the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  11. The data query method according to claim 10, wherein calculating the fifth distance between the target cluster subspace and the data to be queried comprises:
    determining the midpoint between the second target cluster center point and the second target neighbor cluster center point as a target point;
    calculating a sixth distance between the target point and the data to be queried, and determining the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.
  12. A data storage apparatus, the apparatus comprising:
    a first determining module, configured to cluster a data set to be stored and determine a plurality of cluster center points;
    a second determining module, configured to, for each cluster center point of the plurality of cluster center points, determine a corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a cluster subspace;
    a storage module, configured to store the data to be stored in the data set to be stored according to the cluster subspace.
  13. A data query apparatus, the apparatus comprising:
    a third determining module, configured to determine, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    a fourth determining module, configured to determine, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    a fifth determining module, configured to determine, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  14. A computing device, comprising:
    a memory and a processor;
    wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
    clustering a data set to be stored to determine a plurality of cluster center points;
    for each cluster center point of the plurality of cluster center points, determining a corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a cluster subspace;
    storing the data to be stored in the data set to be stored according to the cluster subspace.
  15. A computing device, comprising:
    a memory and a processor;
    wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
    determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  16. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the data storage method according to any one of claims 1 to 6.
  17. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the data query method according to any one of claims 7 to 11.
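
The storage steps of claims 1 to 5 and the query steps of claims 7 to 11 can be illustrated with a minimal Python sketch. It is not the implementation described in the specification. The function names (nearest, build_neighbors, store, search_space), the use of Euclidean distance, the dictionary-of-lists storage, and the choice of one target center and one neighbor per stored vector (second and third preset numbers equal to 1) are all assumptions made for illustration. The cluster center points are taken as given, for example from a k-means run.

```python
import math
from collections import defaultdict


def nearest(point, candidates, centers, k):
    """Return the k candidate center indices closest to point, nearest first."""
    return sorted(candidates, key=lambda i: math.dist(point, centers[i]))[:k]


def build_neighbors(centers, n_neighbors):
    """Claim 2: for each center, find its n_neighbors nearest other centers."""
    return {
        i: nearest(c, [j for j in range(len(centers)) if j != i], centers, n_neighbors)
        for i, c in enumerate(centers)
    }


def store(data, centers, neighbors):
    """Claims 3-5: file each vector under the subspace (target center, target neighbor)."""
    buckets = defaultdict(list)
    for vec in data:
        # Assumption: second preset number = 1 (a single target center).
        target = nearest(vec, range(len(centers)), centers, 1)[0]
        # Assumption: third preset number = 1 (a single target neighbor).
        neighbor = nearest(vec, neighbors[target], centers, 1)[0]
        buckets[(target, neighbor)].append(vec)
    return buckets


def search_space(query, centers, neighbors, n_targets, n_subspaces):
    """Claims 7-11: rank candidate subspaces by query-to-midpoint distance."""
    targets = nearest(query, range(len(centers)), centers, n_targets)
    # Claim 9: every neighbor of a target center yields one candidate subspace.
    candidates = [(t, n) for t in targets for n in neighbors[t]]

    def midpoint_distance(subspace):
        a, b = centers[subspace[0]], centers[subspace[1]]
        midpoint = [(x + y) / 2 for x, y in zip(a, b)]
        return math.dist(query, midpoint)

    return sorted(candidates, key=midpoint_distance)[:n_subspaces]


# Hypothetical toy data: four centers on the corners of a square.
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
neighbors = build_neighbors(centers, n_neighbors=2)
buckets = store([(2, 1), (9, 8)], centers, neighbors)
selected = search_space((3, 1), centers, neighbors, n_targets=1, n_subspaces=1)
print(selected)  # [(0, 1)]
```

In this toy run the query (3, 1) is routed to the subspace keyed (0, 1), which is the bucket holding the stored vector (2, 1), so only that bucket would need to be scanned. Keying a subspace by the ordered pair (center, neighbor) is one possible reading of "the space formed by the cluster center point and the neighbor cluster center point"; the claims do not fix a particular data structure.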
PCT/CN2021/119760 2020-09-27 2021-09-23 Data storage method and device, and data query method and device WO2022063150A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011035973.0 2020-09-27
CN202011035973.0A CN113297331B (en) 2020-09-27 2020-09-27 Data storage method and device and data query method and device

Publications (1)

Publication Number Publication Date
WO2022063150A1 true WO2022063150A1 (en) 2022-03-31

Family

ID=77318246

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119760 WO2022063150A1 (en) 2020-09-27 2021-09-23 Data storage method and device, and data query method and device

Country Status (2)

Country Link
CN (1) CN113297331B (en)
WO (1) WO2022063150A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297331B (en) * 2020-09-27 2022-09-09 阿里云计算有限公司 Data storage method and device and data query method and device
CN115357609B (en) * 2022-10-24 2023-01-13 深圳比特微电子科技有限公司 Method, device, equipment and medium for processing data of Internet of things

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049514A (en) * 2012-12-14 2013-04-17 杭州淘淘搜科技有限公司 Balanced image clustering method based on hierarchical clustering
CN103324705A (en) * 2013-06-17 2013-09-25 中国科学院深圳先进技术研究院 Large-scale vector field data processing method
CN105912611A (en) * 2016-04-05 2016-08-31 中国科学技术大学 CNN based quick image search method
US20180032579A1 (en) * 2016-07-28 2018-02-01 Fujitsu Limited Non-transitory computer-readable recording medium, data search method, and data search device
CN107818147A (en) * 2017-10-19 2018-03-20 大连大学 Distributed temporal index system based on Voronoi diagram
CN109271427A (en) * 2018-10-17 2019-01-25 辽宁大学 A kind of clustering method based on neighbour's density and manifold distance
CN110909197A (en) * 2019-11-04 2020-03-24 深圳力维智联技术有限公司 High-dimensional feature processing method and device
CN111310809A (en) * 2020-02-04 2020-06-19 重庆亿创西北工业技术研究院有限公司 Data clustering method and device, computer equipment and storage medium
CN113297331A (en) * 2020-09-27 2021-08-24 阿里云计算有限公司 Data storage method and device and data query method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868414B (en) * 2016-05-03 2019-03-26 湖南工业大学 A kind of distributed index method that cluster is isolated
CN108629345B (en) * 2017-03-17 2021-07-30 北京京东尚科信息技术有限公司 High-dimensional image feature matching method and device
CN110889424B (en) * 2018-09-11 2023-06-30 阿里巴巴集团控股有限公司 Vector index establishing method and device and vector retrieving method and device

Also Published As

Publication number Publication date
CN113297331B (en) 2022-09-09
CN113297331A (en) 2021-08-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871519

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21871519

Country of ref document: EP

Kind code of ref document: A1