WO2022063150A1 - Data storage method and device, and data query method and device

Info

Publication number: WO2022063150A1
Application number: PCT/CN2021/119760
Authority: WO
Inventors: 楼仁杰; 李飞飞; 占超群; 魏闯先
Original assignee: 阿里云计算有限公司
Priority date: 2020-09-27
Filing date: 2021-09-23
Publication date: 2022-03-31
Also published as: CN113297331B; CN113297331A

Abstract

A data storage method and device, and a data query method and device. The data storage method comprises: clustering a data set to be stored, and determining a plurality of clustering center points (202); for each clustering center point in the plurality of clustering center points, determining a corresponding neighbor clustering center point according to the clustering center point, and determining a space formed by the clustering center point and the corresponding neighbor clustering center point as a clustering subspace (204); and storing, according to the clustering subspace, data to be stored in said data set (206). In the data storage method, a clustering algorithm is combined with a proximity graph concept, and a proximity graph relationship of a two-layer subspace is introduced on the basis of one-layer clustering to further divide a one-layer space into more detailed two-layer clustering subspaces, to improve the retrieval accuracy of the whole index.

Description

Data storage method and device, data query method and device

This application claims priority to Chinese Patent Application No. 202011035973.0, filed on September 27, 2020 and entitled “Data Storage Method and Device, Data Query Method and Device”, the entire contents of which are incorporated herein by reference.

Technical Field

The present specification relates to the technical field of data processing, and in particular, to a data storage method and device, and a data query method and device.

Background Art

With the rapid development of computer and network technology, large amounts of data are being generated, which puts great pressure on data storage and data query. The K-means clustering algorithm clusters data by distance (similarity), thereby dividing all the data into multiple spaces. During a data query, the query vector only needs to be compared against the data points in the space it falls into to ensure high accuracy.

However, when the query vector falls on the boundary between multiple spaces, the system must search several spaces adjacent to the query vector to ensure accuracy. Moreover, because the vectors are high-dimensional, each space is adjacent to a large number of other spaces, which further aggravates the problem. Simpler and more convenient methods are therefore needed for data storage and data query operations or processing.

SUMMARY OF THE INVENTION

In view of this, the embodiments of this specification provide a data storage method. This specification also relates to a data storage device, a data query method and device, two kinds of computing devices, and two kinds of computer-readable storage media, so as to solve the technical defects existing in the prior art.

According to a first aspect of the embodiments of this specification, there is provided a data storage method, the method comprising:

Cluster the data set to be stored, and determine multiple cluster center points;

For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

According to the clustering subspace, the to-be-stored data in the to-be-stored data set is stored.

Optionally, determining the corresponding neighbor cluster center point according to the cluster center point includes:

Determining the cluster center point as the first cluster center point, and determining the cluster center point other than the first cluster center point among the plurality of cluster center points as the second cluster center point;

Calculate the first distance between the first cluster center point and each of the second cluster center points;

Sort the first distances from nearest to farthest, select the top-ranked first distances, the number of which equals the first preset value, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.

Optionally, storing the data to be stored in the data set to be stored according to the clustering subspace includes:

determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set;

According to the first data to be stored and the first target cluster center point, determine the first target neighbor cluster center point corresponding to the first target cluster center point;

The first data to be stored is stored in a clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.

Optionally, the determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set includes:

Calculate the second distance between the first data to be stored and each cluster center point in the plurality of cluster center points;

Sort the second distances from nearest to farthest, select the top-ranked second distances, the number of which equals the second preset value, and determine the cluster center point corresponding to each selected second distance as the first target cluster center point corresponding to the first data to be stored.

Optionally, the determining the first target neighbor cluster center point corresponding to the first target cluster center point according to the first data to be stored and the first target cluster center point includes:

Obtain each neighbor cluster center point corresponding to the first target cluster center point;

Calculate the third distance between the first data to be stored and the center points of each neighboring cluster;

Sort the third distances from nearest to farthest, select the top-ranked third distances, the number of which equals the third preset value, and determine the neighbor cluster center point corresponding to each selected third distance as the first target neighbor cluster center point corresponding to the first target cluster center point.

Optionally, after storing the to-be-stored data in the to-be-stored data set according to the clustering subspace, the method further includes:

From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried;

According to the data to be queried and the second target cluster center point, determine the target cluster subspace corresponding to the second target cluster center point;

According to the data to be queried and the target clustering subspace, a search space corresponding to the data to be queried is determined.

According to a second aspect of the embodiments of this specification, a data query method is provided, the method comprising:

Optionally, determining the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points includes:

Calculate the fourth distance between the data to be queried and each cluster center point in the plurality of cluster center points;

Sort the fourth distances from nearest to farthest, select the top-ranked fourth distances, the number of which equals the fourth preset value, and determine the cluster center point corresponding to each selected fourth distance as the second target cluster center point corresponding to the data to be queried.

Optionally, according to the data to be queried and the second target cluster center point, determining the target cluster subspace corresponding to the second target cluster center point includes:

Obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each of them as a second target neighbor cluster center point corresponding to the second target cluster center point;

The cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point is determined as the target cluster subspace.

Optionally, determining the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace includes:

Calculate the fifth distance between the target cluster subspace and the data to be queried;

Sort the fifth distances from nearest to farthest, select the top-ranked fifth distances, the number of which equals the fifth preset value, and determine the target clustering subspace corresponding to each selected fifth distance as the search space corresponding to the data to be queried.

Optionally, the calculating the fifth distance between the target cluster subspace and the data to be queried includes:

Determining the midpoint of the second target cluster center point and the second target neighbor cluster center point as the target point;

A sixth distance between the target point and the data to be queried is calculated, and the sixth distance is determined as a fifth distance between the target cluster subspace and the data to be queried.
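
As a non-limiting illustration, the optional query steps above can be sketched in Python. The function and parameter names (`select_search_space`, `num_centers`, `num_subspaces`, `neighbor_graph`) are hypothetical and do not appear in this specification; `num_centers` stands in for the fourth preset value and `num_subspaces` for the fifth preset value, and Euclidean distance is assumed.

```python
import math

def midpoint(a, b):
    """Midpoint of two center points, used as the target point of B(i, j)."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def select_search_space(query, centers, neighbor_graph, num_centers=1, num_subspaces=2):
    # Fourth distances: query to every cluster center; keep the closest ones
    # as the second target cluster center points.
    ranked = sorted(range(len(centers)), key=lambda i: math.dist(query, centers[i]))
    targets = ranked[:num_centers]
    # Every (target, neighbor) pair is a candidate target clustering subspace.
    candidates = [(i, j) for i in targets for j in neighbor_graph[i]]
    # Fifth distance: query to the midpoint of the two centers of the subspace.
    candidates.sort(
        key=lambda ij: math.dist(query, midpoint(centers[ij[0]], centers[ij[1]]))
    )
    return candidates[:num_subspaces]

# Hypothetical toy data: four 2-D centers and a hand-written neighbor graph.
centers = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
neighbor_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [1, 2]}
search_space = select_search_space((0.9, 0.1), centers, neighbor_graph, 1, 1)
```

In this toy case the query lies in the space of center 0 but close to center 1, so only the subspace between centers 0 and 1 needs to be searched rather than the whole space of center 0.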

According to a third aspect of the embodiments of the present specification, there is provided a data storage device, the device comprising:

a first determining module, configured to cluster the data set to be stored, and determine a plurality of cluster center points;

a second determining module, configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

The storage module is configured to store the to-be-stored data in the to-be-stored data set according to the clustering subspace.

According to a fourth aspect of the embodiments of the present specification, there is provided a data query device, the device comprising:

The third determination module is configured to determine the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points;

a fourth determination module, configured to determine a target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;

A fifth determining module is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.

According to a fifth aspect of the embodiments of the present specification, a computing device is provided, including:

memory and processor;

The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the following methods:

According to a sixth aspect of the embodiments of the present specification, a computing device is provided, including:

memory and processor;

According to a seventh aspect of the embodiments of the present specification, a computer-readable storage medium is provided, which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the data storage method.

According to an eighth aspect of the embodiments of the present specification, a computer-readable storage medium is provided, which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the data query method.

The data storage method provided in this specification clusters the data set to be stored and determines multiple cluster center points; then, for each cluster center point in the multiple cluster center points, determines the corresponding neighbor cluster center point according to the cluster center point, and determines the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace; and then stores the to-be-stored data in the to-be-stored data set according to the clustering subspace. In this case, after the cluster center points are determined, the data to be stored is not stored directly in the one-layer space centered on a cluster center point. Instead, the neighbor cluster center points are further determined according to the cluster center point, and the data to be stored is stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point. The clustering algorithm is thus combined with the idea of a neighbor graph: on the basis of one-layer clustering, the neighbor graph relationship of two-layer subspaces is introduced, and each one-layer space is further divided into finer two-layer clustering subspaces, thereby improving the retrieval accuracy of the entire index. Compared with a two-layer hierarchical clustering structure, this two-layer subspace division saves the cost of multi-layer clustering, so the index is built faster. Compared with a single-layer clustering structure, it improves retrieval accuracy without introducing additional index construction or storage cost.

Description of Drawings

FIG. 1 is a structure diagram of an ANN index based on K-means clustering provided by an embodiment of this specification;

FIG. 2 is a flowchart of a data storage method provided by an embodiment of this specification;

FIG. 3 is a structure diagram of an index combining K-means clustering and a neighbor graph provided by an embodiment of this specification;

FIG. 4 is a flowchart of a data query method provided by an embodiment of this specification;

FIG. 5 is a schematic structural diagram of a data storage device provided by an embodiment of this specification;

FIG. 6 is a schematic structural diagram of a data query device provided by an embodiment of this specification;

FIG. 7 is a structural block diagram of a computing device provided by an embodiment of this specification;

FIG. 8 is a structural block diagram of another computing device provided by an embodiment of this specification.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of this specification. However, this specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of this specification. Therefore, this specification is not limited by the specific implementations disclosed below.

The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this specification. As used in one or more embodiments of this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used in one or more embodiments of this specification refers to and includes any and all possible combinations of one or more of the associated listed items.

It will be understood that although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other. For example, a first could be termed a second, and similarly, a second could be termed a first, without departing from the scope of one or more embodiments of this specification. Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining."

First, the terminology involved in one or more embodiments of the present specification is explained.

ANN retrieval: Approximate Nearest Neighbor search exploits the fact that data tends to form clustered distributions as its volume grows. The data in the database is classified or encoded by analyzing and clustering it; for target data, the category to which it belongs is predicted from its features, and some or all of that category is returned as the retrieval result. That is, the N nearest neighbors (i.e., vectors) in a high-dimensional space are retrieved quickly through pre-built indexes, but only approximately accurate results can be returned, and absolute accuracy cannot be guaranteed. The core idea of approximate nearest neighbor retrieval is to search for data items that are likely to be nearest neighbors rather than being limited to returning only the most likely items, improving retrieval efficiency at the expense of accuracy within an acceptable range.

Vector index: a type of index structure that provides ANN retrieval capabilities for high-dimensional vector data.

K-means clustering algorithm: also written k-means clustering algorithm, an iterative clustering analysis algorithm. Its steps are to randomly select K objects as the initial cluster center points, then calculate the distance between each object and each cluster center point and assign each object to the cluster center point closest to it; a cluster center point and the objects assigned to it represent a cluster. The cluster center points are then recomputed from the objects assigned to them, and the assignment is repeated until a termination condition is met. The K-means clustering algorithm originated from a vector quantization method in signal processing, and is also a classic clustering analysis method in the field of data mining.
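
For orientation only, the K-means procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used by the embodiments: the function name `kmeans`, the fixed iteration count used as the termination condition, the seeded random generator, and the Euclidean distance are all assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Return k cluster center points for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Randomly select K objects as the initial cluster center points.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each object to the cluster center point closest to it.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of the objects assigned to it;
        # an empty cluster keeps its previous center.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])
        centers = new_centers
    return centers
```

A production index would typically stop when the centers no longer move and would use a more careful initialization; both are omitted here for brevity.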

Voronoi space: a partition of a high-dimensional space into subspaces, characterized in that any position in a subspace is closer to the center point of that subspace than to the center point of any adjacent subspace, and each subspace contains one and only one center point.

Next, the basic concepts of the data storage method and data query method provided in this specification are briefly described:

The K-means clustering algorithm clusters data by distance (similarity), thereby dividing all the data into multiple spaces. During a data query, the query vector only needs to be compared against the data points in the space it falls into. However, when a vector index relies on the K-means clustering algorithm alone for space division, retrieval accuracy drops severely if the data clusters poorly, and when the query vector falls on the boundary between multiple spaces, the system must search several spaces adjacent to the query vector to ensure accuracy. Moreover, because the vectors are high-dimensional, each space is adjacent to a large number of other spaces, which further exacerbates the problem.

For example, FIG. 1 is a structure diagram of an ANN index based on K-means clustering. As shown in FIG. 1, the K-means clustering algorithm is used to cluster the data to be stored, and 8 different K-means cluster center points C0 to C7 are determined, each of which defines a Voronoi space. When the data q to be queried lies at a space boundary, the spaces of C0, C2 and C3 all have to be searched at the same time, which greatly reduces retrieval efficiency.

Therefore, in order to improve retrieval efficiency and accuracy, this specification proposes a data storage method and device, and a data query method and device. After the to-be-stored data set is clustered and multiple cluster center points are determined, for each cluster center point the corresponding neighbor cluster center point is determined, and the space formed by the cluster center point and the corresponding neighbor cluster center point is determined as a clustering subspace; the data to be stored is then stored in the clustering subspaces. The clustering algorithm is combined with the idea of a neighbor graph: on the basis of K-means clustering, the neighbor graph relationship of two-layer subspaces is introduced, and each one-layer space Ci is further divided into two-layer clustering subspaces, thereby improving the retrieval efficiency and accuracy of the entire index.

In this specification, a data storage method is provided. This specification also relates to a data storage device, a data query method and device, two computing devices, and two computer-readable storage media, which are described in detail one by one in the following embodiments.

FIG. 2 shows a flowchart of a data storage method provided according to an embodiment of the present specification, which specifically includes the following steps:

Step 202: Cluster the data set to be stored, and determine a plurality of cluster center points.

Specifically, the to-be-stored data set is a set composed of all to-be-stored data, and the to-be-stored data set includes a plurality of to-be-stored data.

In practical applications, the K-means clustering algorithm can be used to cluster the data set to be stored so as to determine the multiple cluster center points. In a specific implementation, K pieces of data can be randomly selected from the data set to be stored, and the selected K pieces of data are determined as the K cluster center points.

Step 204: For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace.

Specifically, after the data set to be stored has been clustered and multiple cluster center points have been determined, for each cluster center point in the multiple cluster center points, the corresponding neighbor cluster center point is determined according to the cluster center point, and the space formed by the cluster center point and the corresponding neighbor cluster center point is determined as a clustering subspace.

In practical applications, for each cluster center point Ci, the n neighbor cluster center points closest to it are found; each cluster center point Ci and its n corresponding neighbor cluster center points can form multiple clustering subspaces.

In one or more implementations of this embodiment, the corresponding neighbor cluster center point is determined according to the cluster center point, and the specific implementation process may be as follows:

Determining the cluster center point as the first cluster center point, and determining the cluster center point of the plurality of cluster center points except the first cluster center point as the second cluster center point;

Calculate the first distance between the first cluster center point and each second cluster center point;

Sort the first distances from nearest to farthest, select the top-ranked first distances, the number of which equals the first preset value, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.

It should be noted that, for each cluster center point Ci, multiple neighbor cluster center points close to it need to be found, so the distances between Ci and the other cluster center points must be calculated in order to select the neighbor cluster center points at the shortest distances, the number of which equals the first preset value. The first preset value can be set in advance, for example to 2, 3 or 4. After the calculated first distances are sorted from near to far, that many neighbor cluster center points closest to the cluster center point can be selected.

For example, the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is determined as the first cluster center point, and C1, C2, ..., Ck are determined as the second cluster center points. The distances between C0 and each of C1, C2, ..., Ck are calculated. Assume that the distance from C0 to each of C1, C2, ..., C7 is a, the distance to each of C8, C9, ..., C20 is b, and the distance to each of C21, C22, ..., Ck is c, where a < b < c, and that the first preset value is 7. The neighbor cluster center points determined for C0 are then C1, C2, ..., C7. The same steps are performed for C1, determining its neighbor cluster center points to be C0, C2, C7, C8, C9, C10 and C11; the same steps are performed for C2, determining its neighbor cluster center points to be C0, C1, C3, C12, C13, C14 and C15; and so on, until the neighbor cluster center points corresponding to Ck are determined.
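
As a non-limiting illustration, the selection of neighbor cluster center points by the first preset value can be sketched as follows. The name `top_n_neighbors` and the parameter `n` (standing in for the first preset value) are hypothetical, cluster center points are identified by their index, and Euclidean distance is assumed.

```python
import math

def top_n_neighbors(centers, n):
    """Map each center index to the indices of its n nearest other centers."""
    graph = {}
    for i, ci in enumerate(centers):
        # All other centers play the role of "second cluster center points".
        others = [j for j in range(len(centers)) if j != i]
        # First distances, sorted from nearest to farthest.
        others.sort(key=lambda j: math.dist(ci, centers[j]))
        graph[i] = others[:n]
    return graph

# Hypothetical toy data: four centers on a line.
graph = top_n_neighbors([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)], 2)
```

Note that the resulting relationship is not necessarily symmetric: in the toy data, center 1 is a neighbor of center 3, but center 3 is not among the two nearest neighbors of center 1.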

In one or more implementations of this embodiment, instead of using the first preset value to select the neighbor cluster center points close to the cluster center point, the corresponding neighbor cluster center points can be selected by setting a distance threshold. In this case, for each cluster center point in the plurality of cluster center points, the corresponding neighbor cluster center point is determined according to the cluster center point, and the specific implementation process may be as follows:

Among the first distances, the first distances smaller than a first distance threshold are determined, and the second cluster center points corresponding to the first distances smaller than the first distance threshold are determined as the neighbor cluster center points corresponding to the first cluster center point.

The first distance threshold may be set in advance.

For example, the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is determined as the first cluster center point, and C1, C2, ..., Ck are determined as the second cluster center points. The distances between C0 and each of C1, C2, ..., Ck are calculated. Assume that the distance from C0 to each of C1, C2, ..., C7 is 1, the distance to each of C8, C9, ..., C20 is 2, and the distance to each of C21, C22, ..., Ck is 3, and that the first distance threshold is 1.5. The neighbor cluster center points determined for C0 are then C1, C2, ..., C7. The above steps are also performed for C1, C2, .
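
The threshold-based variant can be sketched in the same way. The name `neighbors_within_threshold` is hypothetical, and Euclidean distance is again an assumption. Unlike selection by a preset count, the number of neighbors per center is not fixed here and may even be zero for an isolated center.

```python
import math

def neighbors_within_threshold(centers, threshold):
    """Map each center index to the other centers closer than the threshold."""
    return {
        i: [
            j
            for j, cj in enumerate(centers)
            if j != i and math.dist(ci, cj) < threshold
        ]
        for i, ci in enumerate(centers)
    }

# Hypothetical toy data: three nearby centers and one isolated center.
graph = neighbors_within_threshold(
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)], threshold=1.5
)
```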

In practical applications, the multiple neighbor cluster center points obtained for each cluster center point are referred to as the neighbor graph relationship between the cluster center points. Each cluster center point Ci and one of its neighbor cluster center points Cj constitute a clustering subspace B(i, j); that is, each cluster center point Ci and its multiple neighbor cluster center points can constitute multiple clustering subspaces.

For example, the determined neighbor cluster center points corresponding to C0 are C1, C2, ..., C7, and the neighbor cluster center points corresponding to C1 are C0, C2, C7, C8, C9, C10 and C11. Then C0 and C1 can form a clustering subspace B(0, 1), C0 and C2 can form B(0, 2), C0 and C3 can form B(0, 3), C0 and C4 can form B(0, 4), C0 and C5 can form B(0, 5), C0 and C6 can form B(0, 6), and C0 and C7 can form B(0, 7). Likewise, C1 and C0 can form a clustering subspace B(1, 0), C1 and C2 can form B(1, 2), C1 and C7 can form B(1, 7), C1 and C8 can form B(1, 8), C1 and C9 can form B(1, 9), C1 and C10 can form B(1, 10), and C1 and C11 can form B(1, 11).
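
As a non-limiting illustration, the clustering subspaces in this example can be enumerated from the neighbor graph relationship. The name `cluster_subspaces` is hypothetical, and B(i, j) is represented as the ordered pair `(i, j)`, which follows the example in listing B(0, 1) and B(1, 0) separately.

```python
def cluster_subspaces(neighbor_graph):
    """List the identifiers (i, j) of all clustering subspaces B(i, j)."""
    return [(i, j) for i, neighbors in neighbor_graph.items() for j in neighbors]

# Neighbor graph relationship for C0 and C1 taken from the example.
neighbor_graph = {
    0: [1, 2, 3, 4, 5, 6, 7],
    1: [0, 2, 7, 8, 9, 10, 11],
}
subspaces = cluster_subspaces(neighbor_graph)
```

With the first preset value n and K cluster center points, this yields at most n × K subspaces, obtained without running a second round of clustering.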

Step 206: According to the clustering subspace, store the data to be stored in the data set to be stored.

Specifically, after the corresponding neighbor cluster center point has been determined for each cluster center point in the plurality of cluster center points and the space formed by the cluster center point and the corresponding neighbor cluster center point has been determined as a clustering subspace, the to-be-stored data in the to-be-stored data set is stored according to the clustering subspace.

In practical applications, each piece of data to be stored in the data set to be stored needs to be stored in its corresponding clustering subspace to facilitate subsequent retrieval and query. Therefore, for each piece of data to be stored, a calculation is performed once to determine the clustering subspace in which it is to be stored.

In one or more implementations of this embodiment, according to the clustering subspace, the data to be stored in the data set to be stored is stored, and the specific implementation process may be as follows:

The first data to be stored is stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.

Specifically, the first data to be stored may be any data to be stored in the data set to be stored, and each of the data to be stored in the data set to be stored needs to perform the above operation steps once, so as to determine its corresponding clustering subspace, to store. That is, each to-be-stored data in the to-be-stored data set is to be used as the above-mentioned first to-be-stored data once.

In practical applications, for each piece of data to be stored, the nearest cluster center point Ci is found, then the neighbor cluster center point Cj closest to the data to be stored is selected from the neighbor cluster center points corresponding to Ci, and finally the data to be stored is encoded and written into the storage space of B(i, j). It should be noted that PQ (product quantization) encoding or another encoding method can be used, which is not limited in this specification.
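
A minimal sketch of this storage step is given below, assuming that exactly one cluster center point and one neighbor cluster center point are selected per piece of data. The names `build_index` and `neighbor_graph` are hypothetical, Euclidean distance is assumed, and the encoding step is deliberately omitted: the raw vector is appended where an implementation would write, for example, a PQ code.

```python
import math
from collections import defaultdict

def build_index(vectors, centers, neighbor_graph):
    """Store each vector in the subspace B(i, j) keyed by the pair (i, j)."""
    index = defaultdict(list)
    for v in vectors:
        # Nearest cluster center point Ci among all centers.
        i = min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
        # Nearest neighbor cluster center point Cj among the neighbors of Ci.
        j = min(neighbor_graph[i], key=lambda c: math.dist(v, centers[c]))
        # Encoding is omitted in this sketch; the raw vector is stored.
        index[(i, j)].append(v)
    return index

# Hypothetical toy data.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
neighbor_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
index = build_index([(1.0, 0.0), (0.0, 1.0), (3.5, 0.0)], centers, neighbor_graph)
```

Here the two vectors closest to center 0 end up in different subspaces, B(0, 1) and B(0, 2), depending on which neighbor they lean towards.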

The first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set is determined, and in specific implementation, the first to-be-stored data and each cluster center point of the plurality of cluster center points may be calculated Then, sort the second distances from near to far, select the second preset value and the second distance in the front of the sorting, and put the cluster corresponding to the second preset value and the second distance. The class center point is determined as the first target cluster center point corresponding to the first data to be stored.

Specifically, when inserting the first data to be stored, it is necessary to determine into which cluster subspace the first data to be stored is inserted, and the closest cluster subspace can be selected, or the two closest cluster subspaces can be selected. Therefore, when determining the center point of the first target cluster corresponding to the first data to be stored, it can be filtered by the second preset value, wherein the second preset value can be set in advance, such as 1 or 2 , 3, 4, etc.

For example, the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set to be stored are C0, C1, C2, ..., C7. The distances between X1 and C0, C1, C2, ..., C7 are calculated; assume they are 1, 1.5, 2, 3, 4, 5, 6, and 7, respectively. These 8 second distances are sorted from nearest to farthest. Assuming that only the top-ranked second distance (1) is selected, C0 is determined as the first target cluster center point corresponding to the first data to be stored X1. By analogy, for other to-be-stored data {X2, X3, .

In one or more implementations of this embodiment, the first target cluster center point closest to the first data to be stored may be selected not by the second preset value but by a distance threshold. In this case, the first target cluster center point corresponding to the first data to be stored in the data set to be stored may be determined as follows:

Calculate the second distance between the first data to be stored and each of the plurality of cluster center points, identify the second distances that are smaller than a second distance threshold, and determine the cluster center points corresponding to those second distances as the first target cluster center point corresponding to the first data to be stored.

Specifically, the second distance threshold may be set in advance.
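The threshold-based variant can be sketched in the same way. The helper name centers_within is hypothetical; the distances are those assumed for X1 in this section, and 1.3 is used as the second distance threshold. The specification does not say what happens when no cluster center point falls below the threshold; this sketch simply returns an empty list in that case.

```python
def centers_within(distances, threshold):
    """Return the labels of all centers strictly closer than threshold, nearest first."""
    hits = [(d, label) for label, d in distances.items() if d < threshold]
    hits.sort()
    return [label for _, label in hits]

# Second distances between X1 and C0..C7.
second_distances = {"C0": 1, "C1": 1.5, "C2": 2, "C3": 3,
                    "C4": 4, "C5": 5, "C6": 6, "C7": 7}

selected = centers_within(second_distances, threshold=1.3)
print(selected)
```

Unlike selection by a preset value, the number of selected cluster center points is not fixed in advance and depends on how the data lies relative to the threshold.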

For example, the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set to be stored are C0, C1, C2, ..., C7. The distances between X1 and C0, C1, C2, ..., C7 are calculated; assume they are 1, 1.5, 2, 3, 4, 5, 6, and 7, respectively. Assuming that the second distance threshold is 1.3, the distance between X1 and C0 (1) is smaller than the second distance threshold, so C0 is determined as the first target cluster center point corresponding to the first data to be stored X1. By analogy, for other to-be-stored data {X2, X3, .

To determine the first target neighbor cluster center point corresponding to the first target cluster center point according to the first data to be stored and the first target cluster center point, in a specific implementation, the neighbor cluster center points corresponding to the first target cluster center point may first be obtained. The third distance between the first data to be stored and each of these neighbor cluster center points is then calculated, the third distances are sorted from nearest to farthest, the top-ranked third distances are selected, their number being given by a third preset value, and the neighbor cluster center points corresponding to the selected third distances are determined as the first target neighbor cluster center point corresponding to the first target cluster center point.

Specifically, after the first target cluster center point has been determined for each item of data to be stored in the data set to be stored, the first target neighbor cluster center point corresponding to that first target cluster center point is also determined for each item, that is, the neighbor cluster center point of the first target cluster center point that is closest to the first data to be stored. In implementation, the candidates may be filtered by a third preset value, which can be set in advance, for example to 1, 2, 3, or 4.

For example, the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7. The distances between X1 and C1, C2, ..., C7 are calculated (if they have been calculated before, the results can be reused directly); assume they are 1.5, 2, 3, 4, 5, 6, and 7, respectively. These 7 third distances are sorted from nearest to farthest. Assuming that only the top-ranked third distance (1.5) is selected, C1 is determined as the first target neighbor cluster center point corresponding to C0. By analogy, for the target cluster center points corresponding to other to-be-stored data in the to-be-stored data set, the corresponding target neighbor cluster center points are determined according to the above method.
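The second-layer assignment described above can be sketched as follows. The names nearest_neighbor_center and neighbors, the coordinates, and the Euclidean metric are illustrative assumptions; the neighbor list of each cluster center point is taken as already computed.

```python
import math

def nearest_neighbor_center(x, target, centers, neighbors, k=1):
    """Among the neighbor centers of `target`, return the k closest to x."""
    # "Third distance": distance from x to each neighbor center of the target.
    ranked = sorted(neighbors[target], key=lambda j: math.dist(x, centers[j]))
    return ranked[:k]

# Hypothetical 2-D coordinates for C0..C3 and a hypothetical neighbor list of C0.
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 5.0), 3: (6.0, 6.0)}
neighbors = {0: [1, 2, 3]}
x1 = (1.0, 0.0)

j = nearest_neighbor_center(x1, 0, centers, neighbors)[0]
print(f"X1 is assigned to B(0,{j})")
```

Only the neighbor cluster center points of the first target cluster center point are ranked here, so the cost of this step depends on the length of the neighbor list rather than on the total number of cluster center points.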

In one or more implementations of this embodiment, the first target neighbor cluster center point that corresponds to the first target cluster center point and is closest to the first data to be stored may be selected not by the third preset value but by a distance threshold. In this case, the first target neighbor cluster center point corresponding to the first target cluster center point may be determined according to the first data to be stored and the first target cluster center point as follows:

First obtain the neighbor cluster center points corresponding to the first target cluster center point, then calculate the third distance between the first data to be stored and each of these neighbor cluster center points, identify the third distances that are smaller than a third distance threshold, and determine the neighbor cluster center points corresponding to those third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.

Specifically, the third distance threshold may be set in advance.

For example, the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7. The distances between X1 and C1, C2, ..., C7 are calculated; assume they are 1.5, 2, 3, 4, 5, 6, and 7, respectively. Assuming that the third distance threshold is 1.8, the distance between X1 and C1 (1.5) is smaller than the third distance threshold, so C1 is determined as the first target neighbor cluster center point corresponding to C0. By analogy, for the target cluster center points corresponding to other to-be-stored data in the to-be-stored data set, the corresponding target neighbor cluster center points are determined according to the above method.

In practical applications, after the first target cluster center point and the first target neighbor cluster center point corresponding to the first data to be stored have been determined through the above steps, the first data to be stored can be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.

For example, the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the first target neighbor cluster center point is C1; X1 is then stored in B(0,1), the clustering subspace formed by C0 and C1. By analogy, other to-be-stored data in the to-be-stored data set are stored in sequence according to the above method.

In practical applications, after all the data to be stored in the data set to be stored has been assigned and encoded, the encoded data is stored in an inverted structure according to the clustering subspace B(i, j) to which each item belongs and is then written to the index file. Storage in the inverted structure means arranging and storing the data to be stored by the number (id) of its clustering subspace before writing it to the index file.
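One possible in-memory form of this inverted structure is sketched below. The function build_inverted_index, the tuple layout, and the placeholder byte strings standing in for encoded vectors are all assumptions of the sketch; the on-disk layout of the index file is not described here and is not modeled.

```python
from collections import defaultdict

def build_inverted_index(assignments):
    """Group encoded items by clustering-subspace id (i, j), ordered by id."""
    buckets = defaultdict(list)
    for item_id, subspace, code in assignments:
        buckets[subspace].append((item_id, code))
    # Arrange the buckets by subspace id so each B(i, j) is stored contiguously.
    return dict(sorted(buckets.items()))

# (item id, subspace (i, j), encoded payload); payloads are placeholder bytes,
# standing in for PQ codes or any other encoding.
assignments = [
    ("X1", (0, 1), b"\x01"),
    ("X2", (2, 0), b"\x02"),
    ("X3", (0, 1), b"\x03"),
]

index = build_inverted_index(assignments)
for subspace, items in index.items():
    print(subspace, [item_id for item_id, _ in items])
```

At query time, only the buckets of the selected clustering subspaces need to be read, which is what makes the arrangement by subspace id useful.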

It should be noted that, after all the data to be stored in the to-be-stored data set is stored according to the above steps 202-206, if a certain data to be queried needs to be queried, the clustering subspace can be searched according to the following steps 208-212.

Step 208: From the plurality of cluster center points, determine a second target cluster center point corresponding to the data to be queried.

In practical applications, for each data to be queried, a second target cluster center point close to the data to be queried needs to be determined from a plurality of cluster center points. The specific implementation process can be as follows:

Calculate the fourth distance between the data to be queried and each of the plurality of cluster center points, sort the fourth distances from nearest to farthest, select the top-ranked fourth distances, their number being given by a fourth preset value, and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.

It should be noted that, for each item of data to be queried, the distance between it and each cluster center point Ci is calculated in order to determine the cluster center points close to the data to be queried, that is, a number of second target cluster center points equal to the fourth preset value. The fourth preset value can be set in advance, for example to 2, 3, or 4. After the calculated fourth distances are sorted from nearest to farthest, the fourth preset value is used to select multiple second target cluster center points close to the data to be queried.

For example, as shown in Figure 3, the cluster center points are C0, C1, C2, ..., C7. For the data q to be queried, the distances between q and C0, C1, C2, ..., C7 are calculated; assume that the distance between q and C0 is 0.5, the distance between q and C1 is 3, the distance between q and C2 is 1, the distance between q and C3 is 0.8, the distance between q and C4 is 3.2, the distance between q and C5 is 5, the distance between q and C6 is 7, and the distance between q and C7 is 5.5. Assuming that the fourth preset value is 3, the second target cluster center points corresponding to q are determined to be C0, C2, and C3.
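The selection in this example can be checked with a few lines of code. The sketch below uses the fourth distances stated above and the standard-library function heapq.nsmallest; the variable names are illustrative.

```python
import heapq

# Fourth distances between q and C0..C7, as assumed in the example.
fourth_distances = {"C0": 0.5, "C1": 3, "C2": 1, "C3": 0.8,
                    "C4": 3.2, "C5": 5, "C6": 7, "C7": 5.5}

# Fourth preset value = 3: keep the three closest cluster center points.
top3 = heapq.nsmallest(3, fourth_distances, key=fourth_distances.get)
print(top3)
```

The result is ordered from nearest to farthest (C0, C3, C2), which is the same set of second target cluster center points as in the example.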

In one or more implementations of this embodiment, the multiple target cluster center points close to the data to be queried may be selected not by the fourth preset value but by a distance threshold. In this case, the second target cluster center points corresponding to the data to be queried may be determined from the plurality of cluster center points as follows:

Calculate the fourth distance between the data to be queried and each of the plurality of cluster center points, identify the fourth distances that are smaller than a fourth distance threshold, and determine the cluster center points corresponding to those fourth distances as the second target cluster center points corresponding to the data to be queried.

The fourth distance threshold may be set in advance.

For example, as shown in Figure 3, the cluster center points are C0, C1, C2, ..., C7. For the data q to be queried, the distances between q and C0, C1, C2, ..., C7 are calculated; assume that the distance between q and C0 is 0.5, the distance between q and C1 is 3, the distance between q and C2 is 1, the distance between q and C3 is 0.8, the distance between q and C4 is 3.2, the distance between q and C5 is 5, the distance between q and C6 is 7, and the distance between q and C7 is 5.5. Assuming that the fourth distance threshold is 1.5, the second target cluster center points corresponding to q are determined to be C0, C2, and C3.

Step 210: Determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point.

Specifically, after the second target cluster center point corresponding to the data to be queried has been determined from the plurality of cluster center points, the target clustering subspace corresponding to the second target cluster center point is further determined according to the data to be queried and the second target cluster center point.

In practical applications, the clustering subspaces in which the data to be queried may be stored can be selected according to the data to be queried and the second target cluster center point determined in the previous step. The specific implementation process is as follows:

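Obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each of them as a second target neighbor cluster center point corresponding to the second target cluster center point; then determine the clustering subspace formed by the second target cluster center point and each second target neighbor cluster center point as a target clustering subspace corresponding to the second target cluster center point.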

It should be noted that multiple second target cluster center points may be determined, and each second target cluster center point has its own neighbor cluster center points. Therefore, for each second target cluster center point, the corresponding second target neighbor cluster center points are determined first, and the corresponding target clustering subspaces are determined from them.

For example, it is determined that the second target cluster center points corresponding to the data to be queried are C0, C2, and C3. The neighbor cluster center points corresponding to C0 are C1, C2, ..., C7, so the second target neighbor cluster center points corresponding to the second target cluster center point C0 are C1, C2, ..., C7, and the target clustering subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), and B(0,7). The neighbor cluster center points corresponding to C2 are C0, C1, C3, C12, C13, C14, and C15, so the second target neighbor cluster center points corresponding to the second target cluster center point C2 are C0, C1, C3, C12, C13, C14, and C15, and the target clustering subspaces corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), and B(2,15). The neighbor cluster center points corresponding to C3 are C0, C2, C4, C16, C17, C18, and C19, so the second target neighbor cluster center points corresponding to the second target cluster center point C3 are C0, C2, C4, C16, C17, C18, and C19, and the target clustering subspaces corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), and B(3,19).
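The enumeration of target clustering subspaces in this example can be sketched as follows. The function name candidate_subspaces and the dictionary representation of the neighbor lists are assumptions; the neighbor lists themselves are those given in the example.

```python
def candidate_subspaces(target_centers, neighbors):
    """List every subspace B(i, j) formed by a target center i and each neighbor j."""
    return [(i, j) for i in target_centers for j in neighbors[i]]

# Neighbor cluster center points of C0, C2 and C3, as given in the example.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    2: [0, 1, 3, 12, 13, 14, 15],
    3: [0, 2, 4, 16, 17, 18, 19],
}

subspaces = candidate_subspaces([0, 2, 3], neighbors)
print(len(subspaces))
print(subspaces[:3])
```

Three second target cluster center points with seven neighbors each give 21 target clustering subspaces, which the next step narrows down to the final search space.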

Step 212: According to the data to be queried and the target clustering subspace, determine a search space corresponding to the data to be queried.

Specifically, the search space refers to the space in which the data to be queried is finally queried.

In one or more implementations of this embodiment, the search space corresponding to the data to be queried is determined according to the data to be queried and the target clustering subspace, and the specific implementation process may be as follows:

Calculate the fifth distance between each target clustering subspace and the data to be queried, sort the fifth distances from nearest to farthest, select the top-ranked fifth distances, their number being given by a fifth preset value, and determine the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.

Wherein, the midpoint between the second target cluster center point and the second target neighbor cluster center point can be determined as the target point; then the sixth distance between the target point and the data to be queried is calculated, and the sixth distance is determined as the target The fifth distance between the clustering subspace and the data to be queried. That is, the distance from the target cluster subspace to the data to be queried is represented by the distance from the mean center point of the cluster center point Ci and the adjacent cluster center point Cj to the data to be queried.
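The midpoint rule above can be written compactly as a formula. The symbol m_{ij} for the target point and the choice of the Euclidean norm are assumptions made here for illustration; the specification does not fix the distance measure.

```latex
m_{ij} = \frac{C_i + C_j}{2}, \qquad
d\bigl(B(i,j),\, q\bigr) = \lVert q - m_{ij} \rVert_2
```

Because m_{ij} = m_{ji}, the subspaces B(i, j) and B(j, i) are at the same distance from q under this measure, so they are ranked together when the search space is selected.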

It should be noted that multiple target clustering subspaces may be obtained by the above steps. For each target clustering subspace, the distance between it and the data to be queried needs to be determined in order to decide whether that target clustering subspace is used as a search space for the data to be queried.

For example, the target clustering subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), and B(0,7); the target clustering subspaces corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), and B(2,15); and the target clustering subspaces corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), and B(3,19). The distances between the data q to be queried and all of the above target clustering subspaces are calculated, and the several closest target clustering subspaces are selected as the final search space. As shown in Figure 3, B(0,2), B(0,3), B(2,0), and B(3,0) are selected as the final search space, and the data q to be queried is queried in this search space.
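The final selection can be sketched as a ranking of candidate subspaces by midpoint distance. The function select_search_space, the 2-D coordinates, and the Euclidean metric are assumptions; the coordinates are hypothetical and are not taken from Figure 3, although they are chosen so that the outcome has the same shape as the example.

```python
import math

def select_search_space(q, centers, subspaces, n):
    """Return the n subspaces (i, j) whose center midpoint lies closest to q."""
    def midpoint_distance(pair):
        i, j = pair
        mid = [(a + b) / 2 for a, b in zip(centers[i], centers[j])]
        return math.dist(q, mid)  # "fifth distance" of subspace B(i, j)
    return sorted(subspaces, key=midpoint_distance)[:n]

# Hypothetical 2-D cluster centers and a query point near C0.
centers = {0: (0.0, 0.0), 1: (-4.0, 0.0), 2: (4.0, 0.0), 3: (0.0, 4.0)}
subspaces = [(0, 1), (0, 2), (0, 3), (2, 0), (3, 0)]
q = (1.0, 1.0)

search_space = select_search_space(q, centers, subspaces, n=4)
print(search_space)
```

Here n plays the role of the fifth preset value. With these coordinates, B(0,1) lies on the far side of C0 from q and is the only candidate excluded.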

The data storage method provided in this specification clusters the data set to be stored and determines a plurality of cluster center points; then, for each cluster center point in the plurality of cluster center points, determines the corresponding neighbor cluster center point according to the cluster center point, and determines the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace; and then stores the data to be stored in the data set to be stored according to the clustering subspace. Subsequently, when data to be queried needs to be queried, the second target cluster center point corresponding to the data to be queried can be determined from the plurality of cluster center points; the target clustering subspace corresponding to the second target cluster center point is then determined according to the data to be queried and the second target cluster center point; and the search space corresponding to the data to be queried is further determined according to the data to be queried and the target clustering subspace, and the search is performed in that search space.

In this case, after the cluster center point is determined, the data to be stored is not stored directly in the one-layer space centered on the cluster center point. Instead, the neighbor cluster center point is further determined according to the cluster center point, and the data to be stored is then stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point. The clustering algorithm is thus combined with the idea of the nearest neighbor graph: on the basis of K-means clustering, the neighbor graph relationship of the second-layer subspaces is introduced, and the first-layer space is further divided into finer two-layer clustering subspaces, thereby improving the retrieval accuracy of the entire index. In addition, compared with a two-layer hierarchical clustering structure, this two-layer subspace division saves the cost of multi-layer clustering and builds the index faster; compared with a single-layer K-means clustering structure, it improves retrieval accuracy without introducing additional index construction and storage overhead.

Next, in conjunction with accompanying drawing 1 and accompanying drawing 3, the beneficial effect that the data storage method provided in this specification can bring is illustrated by example:

Assume that the data to be queried is q. If the prior art shown in Figure 1 is adopted, the stored data is clustered simply by the K-means clustering algorithm, 8 different K-means cluster center points C0-C7 are determined, and each cluster center point defines a Voronoi space. When the data q to be queried lies at a space boundary, the spaces of C0, C2, and C3 all have to be searched. As shown in Figure 3, with the data storage method provided in this specification, after the 8 K-means cluster center points C0-C7 are determined, the neighbor cluster center points corresponding to C0-C7 are further determined, so that each first-layer space is divided a second time into second-layer clustering subspaces. For the same data q to be queried, the method provided in this specification only needs to search the clustering subspaces B(0,2), B(0,3), B(2,0), and B(3,0), which greatly reduces the search range and improves the search efficiency and accuracy.

FIG. 4 shows a flowchart of a data query method provided according to an embodiment of the present specification, which specifically includes the following steps:

Step 402: From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried. It should be noted that, the specific implementation process of step 402 is the same as the specific implementation process of the above-mentioned step 208, and details are not described herein again in this specification.

Step 404: Determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point.

It should be noted that the specific implementation process of step 404 is the same as the specific implementation process of the above-mentioned step 210, and details are not described herein again in this specification.

Step 406: Determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.

It should be noted that the specific implementation process of step 406 is the same as the specific implementation process of the above-mentioned step 212, and details are not described herein again in this specification.

The data query method provided in this specification can determine the second target cluster center point corresponding to the data to be queried from a plurality of cluster center points, then determine the target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, and further determine the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, so that the search is performed in that search space. In this case, the clustering algorithm is combined with the idea of the nearest neighbor graph: on the basis of K-means clustering, the neighbor graph relationship of the second-layer subspaces is introduced, and the first-layer space is further divided into finer two-layer clustering subspaces. This narrows the search range when the data to be queried is subsequently queried, thereby improving the retrieval efficiency and accuracy of the entire index.

Corresponding to the foregoing method embodiments, this specification also provides an embodiment of a data storage apparatus, and FIG. 5 shows a schematic structural diagram of a data storage apparatus provided by an embodiment of this specification. As shown in Figure 5, the device includes:

The first determining module 502 is configured to cluster the data set to be stored, and determine a plurality of cluster center points;

The second determination module 504 is configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

The storage module 506 is configured to store the to-be-stored data in the to-be-stored data set according to the cluster subspace.

In one or more implementations of this embodiment, the second determining module 504 is further configured to:

In one or more implementations of this embodiment, the storage module 506 is further configured to:

Sort the second distances from nearest to farthest, select the top-ranked second distances, their number being given by a second preset value, and determine the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.

Obtain the center points of each neighboring cluster corresponding to the center point of the first target cluster;

Sort the third distances from nearest to farthest, select the top-ranked third distances, their number being given by a third preset value, and determine the neighbor cluster center points corresponding to the selected third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.

In one or more implementations of this embodiment, the apparatus further includes:

a third determination module, configured to determine a second target cluster center point corresponding to the data to be queried from among the plurality of cluster center points;

a fourth determining module, configured to determine a target clustering subspace corresponding to the second target clustering center point according to the data to be queried and the second target clustering center point;

The fifth determining module is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.

In this specification, after the cluster center point is determined, the data to be stored is not stored directly in the one-layer space centered on the cluster center point. Instead, the neighbor cluster center point is further determined according to the cluster center point, and the data to be stored is then stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point. The clustering algorithm is thus combined with the idea of the nearest neighbor graph: on the basis of K-means clustering, the neighbor graph relationship of the second-layer subspaces is introduced, and the first-layer space is further divided into finer two-layer clustering subspaces, thereby improving the retrieval accuracy of the entire index. In addition, compared with a two-layer hierarchical clustering structure, this two-layer subspace division saves the cost of multi-layer clustering and builds the index faster; compared with a single-layer K-means clustering structure, it improves retrieval accuracy without introducing additional index construction and storage overhead.

The above is a schematic solution of a data storage device according to this embodiment. It should be noted that the technical solution of the data storage device and the technical solution of the above-mentioned data storage method belong to the same concept; for details not described in the technical solution of the data storage device, refer to the description of the technical solution of the above-mentioned data storage method.

Corresponding to the foregoing method embodiments, the present specification also provides an embodiment of a data query apparatus, and FIG. 6 shows a schematic structural diagram of a data query apparatus provided by an embodiment of the present specification. As shown in Figure 6, the device includes:

The third determination module 602 is configured to determine the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points;

The fourth determination module 604 is configured to determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;

The fifth determination module 606 is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.

In one or more implementations of this embodiment, the third determining module 602 is further configured to:

In one or more implementations of this embodiment, the fourth determining module 604 is further configured to:

In one or more implementations of this embodiment, the fifth determining module 606 is further configured to:

Sort the fifth distances from nearest to farthest, select the top-ranked fifth distances, their number being given by a fifth preset value, and determine the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.

Determining the midpoint between the second target cluster center point and the second target neighbor cluster center point as the target point;

Calculate the sixth distance between the target point and the data to be queried, and determine the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.

The data query device provided in this specification can determine the second target cluster center point corresponding to the data to be queried from a plurality of cluster center points, then determine the target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, and further determine the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, so that the search is performed in that search space. In this case, the clustering algorithm is combined with the idea of the nearest neighbor graph: on the basis of K-means clustering, the neighbor graph relationship of the second-layer subspaces is introduced, and the first-layer space is further divided into finer second-layer subspaces. This narrows the search range when the data to be queried is queried, thereby improving the retrieval efficiency and accuracy of the entire index.

The above is a schematic solution of a data query apparatus according to this embodiment. It should be noted that the technical solution of the data query device and the technical solution of the above-mentioned data query method belong to the same concept; for details not described in the technical solution of the data query device, refer to the description of the technical solution of the above-mentioned data query method.

FIG. 7 shows a structural block diagram of a computing device 700 according to an embodiment of the present specification.

Components of the computing device 700 include, but are not limited to, memory 710 and processor 720 . The processor 720 is connected with the memory 710 through the bus 730, and the database 750 is used for storing data.

Computing device 700 also includes an access device 740 that enables computing device 700 to communicate via one or more networks 760. Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. Access device 740 may include one or more of any type of network interface (e.g., a network interface card (NIC)), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.

In one embodiment of the present specification, the above-described components of computing device 700 and other components not shown in FIG. 7 may also be connected to each other, such as through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is only for the purpose of example, rather than limiting the scope of the present specification. Those skilled in the art can add or replace other components as required.

Computing device 700 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smart phones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs. Computing device 700 may also be a mobile or stationary server.

The processor 720 is configured to execute the following computer-executable instructions to implement the following method:

For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

The above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned data storage method belong to the same concept. For details not described in detail in the technical solution of the computing device, refer to the description of the technical solution of the above-mentioned data storage method.

FIG. 8 shows a structural block diagram of a computing device 800 according to an embodiment of the present specification.

Components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is connected to the memory 810 through the bus 830, and the database 850 is used for storing data.

Computing device 800 also includes an access device 840 that enables computing device 800 to communicate via one or more networks 860. Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of network interface (e.g., a network interface card (NIC)), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.

In one embodiment of the present specification, the above-described components of computing device 800 and other components not shown in FIG. 8 may also be connected to each other, such as through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is only for the purpose of example, rather than limiting the scope of this specification. Those skilled in the art can add or replace other components as required.

Computing device 800 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smart phones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs. Computing device 800 may also be a mobile or stationary server.

The processor 820 is configured to execute the following computer-executable instructions to implement the following method:

The above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above data query method belong to the same concept. For details not described in the technical solution of the computing device, refer to the description of the technical solution of the above data query method.

An embodiment of the present specification further provides a computer-readable storage medium, which stores computer instructions, and when the instructions are executed by a processor, is used to implement the operation steps of the above data storage method.

An embodiment of the present specification further provides a computer-readable storage medium, which stores computer instructions, which, when executed by a processor, are used to implement the operation steps of the above data query method.

The above is a schematic solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solutions of the above data storage method and data query method belong to the same concept. For details not described in the technical solution of the storage medium, refer to the description of the technical solutions of the above data storage method and data query method.

The foregoing describes specific embodiments of the present specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. Additionally, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

It should be noted that, for convenience of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that this specification is not limited by the described order of actions, because, in accordance with this specification, certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by this specification.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are provided only to help explain the present specification. The optional embodiments do not set out every detail exhaustively, nor do they limit the invention to only the described embodiments. Obviously, many modifications and variations are possible in light of the contents of this specification. These embodiments are selected and described in this specification to better explain the principles and practical applications of this specification, so that those skilled in the art can well understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

A data storage method, the method comprising:

Cluster the data set to be stored, and determine multiple cluster center points;

For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

According to the clustering subspace, the to-be-stored data in the to-be-stored data set is stored.
The data storage method according to claim 1, wherein determining the corresponding neighbor cluster center point according to the cluster center point comprises:

Determining the cluster center point as the first cluster center point, and determining the cluster center point other than the first cluster center point among the plurality of cluster center points as the second cluster center point;

Calculate the first distance between the first cluster center point and each of the second cluster center points;

Sort the first distances from nearest to farthest, select a number of top-ranked first distances equal to the first preset value, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
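As a minimal illustrative sketch of the neighbor determination recited above (not the claimed implementation), the following Python code builds, for every cluster center point, the list of its nearest other cluster center points. The function name `neighbor_centers`, the parameter `m` (standing in for the first preset value), the Euclidean metric, and the sample coordinates are assumptions.

```python
import math

def neighbor_centers(centers, m):
    """For each center, return the indices of its m closest other centers."""
    neighbors = []
    for i, c in enumerate(centers):
        # First distance: from center i to every other ("second") center.
        dists = [(math.dist(c, other), j)
                 for j, other in enumerate(centers) if j != i]
        dists.sort()  # nearest to farthest
        # Keep the m top-ranked distances; their centers are the neighbors.
        neighbors.append([j for _, j in dists[:m]])
    return neighbors

# Hypothetical example data: four cluster centers on a line.
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(neighbor_centers(centers, 2))  # [[1, 2], [0, 2], [3, 1], [2, 1]]
```

Each (center, neighbor) pair in the resulting table corresponds to one clustering subspace.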
The data storage method according to claim 1, wherein storing the data to be stored in the to-be-stored data set according to the clustering subspace comprises:

determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set;

According to the first data to be stored and the first target cluster center point, determine the first target neighbor cluster center point corresponding to the first target cluster center point;

The first data to be stored is stored in a clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
The data storage method according to claim 3, wherein said determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set comprises:

calculating the second distance between the first data to be stored and each cluster center point in the plurality of cluster center points;

Sort the second distances from nearest to farthest, select a number of top-ranked second distances equal to the second preset value, and determine the cluster center point corresponding to each selected second distance as a first target cluster center point corresponding to the first data to be stored.
The data storage method according to claim 3, wherein determining the first target neighbor cluster center point corresponding to the first target cluster center point according to the first data to be stored and the first target cluster center point comprises:

Obtain each neighbor cluster center point corresponding to the first target cluster center point;

Calculate the third distance between the first data to be stored and each neighbor cluster center point;

Sort the third distances from nearest to farthest, select a number of top-ranked third distances equal to the third preset value, and determine the neighbor cluster center point corresponding to each selected third distance as a first target neighbor cluster center point corresponding to the first target cluster center point.
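The storage steps of the preceding claims can be sketched as follows, under the simplifying assumption that the second and third preset values are both 1, so each data point lands in exactly one subspace. The function name `assign_subspace`, the dictionary `index`, the Euclidean metric, and the sample data are hypothetical and are not taken from this specification.

```python
import math
from collections import defaultdict

def assign_subspace(x, centers, neighbors):
    """Return the (center, neighbor) index pair whose subspace stores x."""
    # Second distance: x to every cluster center; keep the nearest one.
    c = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
    # Third distance: x to each neighbor of that center; keep the nearest one.
    n = min(neighbors[c], key=lambda j: math.dist(x, centers[j]))
    return (c, n)

# Hypothetical example data: four centers and their two nearest neighbors.
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
neighbors = [[1, 2], [0, 2], [3, 1], [2, 1]]

# Subspace key (center, neighbor) -> list of data points stored there.
index = defaultdict(list)
for x in [(0.2, 0.1), (0.9, -0.1), (5.4, 0.0)]:
    index[assign_subspace(x, centers, neighbors)].append(x)
```

In this sketch the pair is ordered: a point nearest to center 0 and then to neighbor 1 is stored under `(0, 1)`, which is kept distinct from `(1, 0)`.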
The data storage method according to claim 1, after storing the to-be-stored data in the to-be-stored data set according to the clustering subspace, further comprising:

From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried;

According to the data to be queried and the second target cluster center point, determine the target cluster subspace corresponding to the second target cluster center point;

According to the data to be queried and the target clustering subspace, a search space corresponding to the data to be queried is determined.
A data query method, the method includes:

From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried;

According to the data to be queried and the second target cluster center point, determine the target cluster subspace corresponding to the second target cluster center point;

According to the data to be queried and the target clustering subspace, a search space corresponding to the data to be queried is determined.
The data query method according to claim 7, wherein determining the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points, comprising:

Calculate the fourth distance between the data to be queried and each cluster center point in the plurality of cluster center points;

Sort the fourth distances from nearest to farthest, select a number of top-ranked fourth distances equal to the fourth preset value, and determine the cluster center point corresponding to each selected fourth distance as a second target cluster center point corresponding to the data to be queried.
The data query method according to claim 7, wherein determining the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, comprising:

Obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each neighbor cluster center point as the second target neighbor cluster center point corresponding to the second target cluster center point;

The cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point is determined as the target cluster subspace.
The data query method according to claim 9, wherein determining the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, comprising:

Calculate the fifth distance between the target cluster subspace and the data to be queried;

Sort the fifth distances from nearest to farthest, select a number of top-ranked fifth distances equal to the fifth preset value, and determine the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
The data query method according to claim 10, wherein the calculating the fifth distance between the target cluster subspace and the data to be queried comprises:

Determining the midpoint of the second target cluster center point and the second target neighbor cluster center point as the target point;

A sixth distance between the target point and the data to be queried is calculated, and the sixth distance is determined as a fifth distance between the target cluster subspace and the data to be queried.
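Written as a formula, and assuming a Euclidean metric (the claims do not fix the metric), the distance recited above can be expressed as follows. The symbols $q$ (the data to be queried), $c_i$ (the second target cluster center point), $c_j$ (the second target neighbor cluster center point), $p_{ij}$ (the target point), and $S_{ij}$ (the target clustering subspace) are introduced here only for illustration.

```latex
p_{ij} = \frac{c_i + c_j}{2}, \qquad
d_5(q, S_{ij}) = d_6(q, p_{ij}) = \lVert q - p_{ij} \rVert_2
```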
A data storage device comprising:

a first determining module, configured to cluster the data set to be stored, and determine a plurality of cluster center points;

The second determination module is configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

The storage module is configured to store the to-be-stored data in the to-be-stored data set according to the clustering subspace.
A data query device, the device includes:

The third determining module is configured to determine the second target cluster center point corresponding to the data to be queried from among the plurality of cluster center points;

a fourth determination module, configured to determine a target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;

A fifth determining module is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
A computing device comprising:

memory and processor;

The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the following methods:

Cluster the data set to be stored, and determine multiple cluster center points;

For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;

According to the clustering subspace, the to-be-stored data in the to-be-stored data set is stored.
A computing device comprising:

memory and processor;

The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the following methods:

From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried;

According to the data to be queried and the second target cluster center point, determine the target cluster subspace corresponding to the second target cluster center point;

According to the data to be queried and the target clustering subspace, a search space corresponding to the data to be queried is determined.
A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the data storage method according to any one of claims 1 to 6.
A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the data query method according to any one of claims 7 to 11.