CN110825894A

CN110825894A - Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium

Info

Publication number: CN110825894A
Application number: CN201910883196.6A
Authority: CN
Inventors: 张艳; 孙太武; 周超勇; 刘玉宇
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2020-02-21

Abstract

The invention discloses a data index establishing method, a data index retrieving method, a data index establishing device, a data index retrieving device, data index establishing equipment and a storage medium. In the process of establishing the data index, firstly, selecting different segmented clustering models to perform primary clustering on data samples in the data set according to data sets with different data volume levels to obtain different first-class clustering centers, secondly, performing secondary clustering by using a quantizer associated with the first-class clustering centers to obtain different second-class centers, and obtaining an index table based on the different second-class centers; and in the data retrieval process, the index table obtained in the data index establishing process is utilized to carry out image data retrieval. The invention carries out multiple segmented clustering on massive sample data in advance and establishes indexes, thereby improving the clustering effect and the precision of a clustering center; meanwhile, in the data retrieval process, based on the pre-established index, the image data retrieval with high precision and high efficiency is realized.

Description

Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium

Technical Field

The invention relates to the field of data processing, in particular to a data index establishing method, a data index retrieving method, a data index establishing device, a data index retrieving device, data index establishing equipment and a storage medium.

Background

The existing K-means clustering algorithm is easily influenced by its initial points: when the initial points are not properly selected, a correct clustering result cannot be obtained, so the stability of the algorithm is poor. In addition, the existing K-means clustering algorithm obtains its quantization result by clustering the whole vector a single time, so the number of clustering centers that must be represented is large.

The feature vector retrieval method based on the K-means clustering algorithm is a static learning process: regardless of the data volume, a rough clustering center is obtained directly through a single round of training, so the method is not suitable for retrieving massive feature vector data. When the clustering center is not properly selected, the result of 1vN retrieval (finding, among N feature vectors, the feature vector most similar to a query vector) deviates greatly, which lowers both retrieval precision and retrieval efficiency. Meanwhile, as retrieval objects (for example, unstructured data such as images, videos and music) become more complex, retrieval difficulty increases exponentially. Complex retrieval objects are generally represented by multidimensional vectors, and when the retrieval library holds massive data, a brute-force search that performs multidimensional vector operations on every retrieval sample (including a plurality of comparison objects) requires a very large amount of calculation, so the retrieval time is too long to meet user requirements.

Disclosure of Invention

The embodiment of the invention provides a data index establishing method, a data retrieval method, a data index establishing device, data retrieval equipment and a storage medium, wherein multiple segmented clustering is performed on mass sample data in advance, and indexes are established, so that the clustering effect and the precision of a clustering center are improved; meanwhile, in the data retrieval process, based on the pre-established index, the image data retrieval with high precision and high efficiency is realized.

In a first aspect, a data index establishing method is provided, including:

acquiring a segmented clustering model associated with the data volume level according to the data volume level of the data set;

inputting all data samples in the data set into the segmented clustering model, and receiving N first-class clustering centers output by the segmented clustering model; N is a positive integer;

reading the data samples from the data set, classifying each read data sample into the nearest first-class clustering center, and associating N quantizers with the N first-class clustering centers in one-to-one correspondence;

performing secondary clustering on the data samples associated with each of the quantizers, determining class II centers and the data samples associated with each of the class II centers;

establishing N index tables corresponding to the N quantizers; each index table comprises at least one index, and each index comprises a class II center and all the data samples related to the class II center.

In a second aspect, a data retrieval method is provided, where the data retrieval method performs image data retrieval using an index table obtained by the data index establishing method, and includes:

receiving a query request containing an image query sample, and acquiring a query vector of the image query sample;

acquiring, from all the second-class centers contained in the index table, the second-class centers whose distances to the query vector meet a first preset condition, and determining the data samples corresponding to the second-class centers meeting the first preset condition as image comparison samples;

determining a sample distance between the image comparison sample and the image query sample;

and displaying the image comparison sample with the sample distance meeting a second preset condition at the client as a query result of the query request.

In a third aspect, an apparatus for establishing a data index is provided, including:

the model matching module is used for acquiring a segmented clustering model associated with the data volume level according to the data volume level of the data set;

the primary clustering module is used for inputting all data samples in the data set into the segmented clustering model and receiving N first-class clustering centers output by the segmented clustering model; N is a positive integer;

the data adding module is used for reading the data samples from the data set, classifying each read data sample into the nearest first-class clustering center, and associating N quantizers with the N first-class clustering centers in one-to-one correspondence;

a secondary clustering module for performing secondary clustering on the data samples associated with each of the quantizers to determine class two centers and the data samples associated with each of the class two centers;

the index establishing module is used for establishing N index tables corresponding to the N quantizers; each index table comprises at least one index, and each index comprises a class II center and all the data samples related to the class II center.

In a fourth aspect, a data retrieval apparatus is provided, including:

the receiving module is used for receiving a query request containing an image query sample and acquiring a query vector of the image query sample;

the data retrieval module is used for acquiring the second-class centers of which the distances between the second-class centers and the query vectors meet a first preset condition from all the second-class centers contained in the index table, and determining data samples corresponding to the second-class centers meeting the first preset condition as image comparison samples;

a calculation module to determine a sample distance between the image comparison sample and the image query sample;

and the display module is used for displaying the image comparison sample of which the sample distance meets a second preset condition as a query result of the query request on the client.

In a fifth aspect, a computer device is provided, which includes a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the above data index establishing method and the above data retrieving method when executing the computer readable instructions.

In a sixth aspect, a computer-readable storage medium is provided, which stores computer-readable instructions, and the computer-readable instructions, when executed by a processor, implement the above data index establishing method and the above data retrieval method.

According to the data index establishing and data retrieving method, device, equipment and storage medium, in the data index establishing process, firstly, different segmented clustering models are selected to perform primary clustering on data samples in the data set according to data sets with different data volume levels to obtain different class-one clustering centers, secondly, a quantizer associated with the class-one clustering centers is used for performing secondary clustering to obtain different class-two centers, and an index table is obtained based on the different class-two centers, so that the clustering effect is improved, and the precision of the clustering centers is improved; in the data retrieval process, the index table obtained in the data index establishing process is utilized to perform image data retrieval, and the fragments of some data samples can be quickly positioned through coarse query with low cost, so that the times of sample calculation are greatly reduced, the operation speed of the server is improved, and the image data retrieval with high precision and high efficiency is realized.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without inventive labor.

FIG. 1 is a schematic diagram of an application environment of a data index building and data retrieval method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a data index building method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating step S20 of the data index creating method according to an embodiment of the present invention;

FIG. 4 is a flow chart of a data retrieval method in another embodiment of the present invention;

FIG. 5 is a schematic block diagram of a data index creating apparatus according to an embodiment of the present invention;

FIG. 6 is a functional block diagram of a data retrieval device in accordance with an embodiment of the present invention;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The cluster retrieval method of the data samples provided by the invention can be applied to the application environment shown in figure 1, wherein a client communicates with a server through a network. The client includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, cameras, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.

In an embodiment, as shown in fig. 2, a data index establishing method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:

S10, according to the data volume level of the data set, obtaining a segmented clustering model associated with the data volume level.

The data volume level can be set according to requirements. Optionally, the data volume levels of the data sets may be set to an initial level, a first level, a second level, a third level, a fourth level and a fifth level, where the data volumes of the data sets corresponding to these levels respectively reach 100,000, 200,000, 400,000, 800,000, 1,600,000 and 3,200,000. It can be understood that, when training on data to obtain clustering centers, a higher data volume level means that the system occupies more memory and spends more time in training, so a data set with a suitable data volume level should be selected by weighing system memory against time consumption.

Specifically, when a data set to be trained is acquired, each data sample in the data set is counted to determine the data volume of the data set, the data volume level corresponding to the data volume is matched in the database, and then the segmented clustering model associated with the data volume level is acquired from the database. Preferably, a segmented clustering model is stored in the database in advance in association with each data volume level.

S20, inputting all data samples in the data set into the segmented clustering model, and receiving N clustering centers output by the segmented clustering model; n is a positive integer.

That is, when clustering centers are obtained by training on data, different segmented clustering models are used for data sets with different data volume levels to perform primary clustering on the data samples in the data sets, so as to obtain different first-class clustering centers.

Preferably, the method for clustering the data samples in the data set by using the segmented clustering model comprises the following steps:

Firstly, according to a preselected quantity S contained in the segmented clustering model, S data samples for primary clustering are randomly selected from all input data samples; the preselected quantity is the number of data samples that the segmented clustering model selects from all input data samples for primary clustering. For example, for a data set at the initial level, which contains at least 100,000 data samples, 100,000 data samples may be selected for primary clustering; for a data set at the first level, which contains at least 200,000 data samples, 200,000 data samples may be selected for primary clustering; and so on.

Secondly, the feature vector of each of the S data samples is acquired; the feature vectors of all the data samples have the same dimensionality.

Thirdly, according to the number of segments M contained in the segmented clustering model, the feature vector of each of the S data samples is divided according to a preset dimension into M segment vectors, each of which has the preset dimension; the number of segments is the number of pieces into which the segmented clustering model divides the feature vector of each of the S data samples, and it is preset in the segmented clustering model. It will be appreciated that the amount of data contained in each segment vector may be determined from the preset dimension and the dimension of each feature vector.

Finally, the segment vectors corresponding to the same segment across the feature vectors of the S data samples are clustered, K first-class clustering centers are determined for each segment, and M × K first-class clustering centers, that is, N first-class clustering centers, are determined and output. It can be understood that each feature vector of the S data samples is divided into M segments and each segment yields K first-class clustering centers, so M × K first-class clustering centers are obtained in total; these M × K clustering centers can represent K^M (K to the power M) clustering centers. The number K of first-class clustering centers per segment and the number N of first-class clustering centers output by the segmented clustering model are related by: N = M × K.
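The segmented clustering steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names (`split_into_segments`, `kmeans`, `segmented_clustering`), the fixed iteration count, the seeded random initialization and the basic K-means routine are all hypothetical choices made for the example.

```python
import random


def split_into_segments(vector, num_segments):
    """Split one feature vector into num_segments equal-length segment vectors."""
    if len(vector) % num_segments != 0:
        raise ValueError("vector length must be divisible by num_segments")
    seg_len = len(vector) // num_segments
    return [tuple(vector[i * seg_len:(i + 1) * seg_len]) for i in range(num_segments)]


def squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, seed=0):
    """Basic K-means (assumed routine): returns k centers for a list of tuples."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers


def segmented_clustering(feature_vectors, num_segments, k):
    """Cluster each segment position separately; returns M lists of K centers."""
    segmented = [split_into_segments(v, num_segments) for v in feature_vectors]
    return [
        kmeans([s[m] for s in segmented], k, seed=m)
        for m in range(num_segments)
    ]
```

Called with M = 2 segments and K = 2 centers per segment, the sketch returns M × K = 4 first-class clustering centers in total, whose combinations can represent K^M = 4 distinct full-length vectors.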

S30, reading the data samples from the data set, classifying each read data sample into the nearest first-class clustering center, and associating N quantizers with the N first-class clustering centers in one-to-one correspondence.

Preferably, data samples are read from the data set according to a preset addition amount, the first-class clustering center nearest to the read data samples is found and added into a quantizer, and all the data samples classified into that first-class clustering center are associated with the quantizer; the addition amount is the number of data samples read in one batch, that is, the number of data samples classified into first-class clustering centers in one batch. For example, when the addition amount is 10,000, the distance between the 10,000 data samples and each first-class clustering center is calculated, that is, the sum of squared differences between the 10,000 data samples and each first-class clustering center is calculated; the first-class clustering center with the smallest sum of squared differences is recorded as the nearest first-class clustering center, and the 10,000 data samples are then classified into that nearest first-class clustering center and added into the quantizer.
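The batch assignment described above can be illustrated with a short Python sketch. It assumes, for simplicity, that each sample is assigned individually and that the clustering centers have the same dimension as the samples; `assign_in_batches` and its parameters are hypothetical names that do not appear in the embodiment.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_in_batches(samples, centers, batch_size):
    """Read samples batch by batch and attach each one to its nearest center.

    Returns a dict mapping a center index (one quantizer per center, an
    assumption of this sketch) to the list of samples assigned to it.
    """
    quantizers = {i: [] for i in range(len(centers))}
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]  # one "addition amount" of samples
        for sample in batch:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(sample, centers[i]),
            )
            quantizers[nearest].append(sample)
    return quantizers
```

Reading in batches bounds how many samples are held in memory at once; the assignment result does not depend on the batch size.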

S40, performing secondary clustering on the data samples associated with the quantizers, and determining class II centers and the data samples associated with the class II centers.

That is, all the data samples associated with each of the N quantizers are clustered again. For example, when X data samples are associated with a quantizer, the feature vector of each of the X data samples is obtained; the feature vector of each of the X data samples is then divided by dimension into M' segments to form M' segment vectors, the segment vectors corresponding to the same segment across the X data samples are clustered, and K' second-class centers are determined for each segment, so that M' × K' second-class centers are determined. It is understood that the secondary clustering method in step S40 can refer to the clustering method in step S30.

S50, establishing N index tables corresponding to the N quantizers; each index table comprises at least one index, and each index comprises a class II center and all the data samples related to the class II center.

Understandably, N index tables can be built by performing secondary clustering on all the data samples associated with each of the N quantizers. After the M' × K' second-class centers are determined in step S40, the distance between each data sample associated with each quantizer and each second-class center is calculated, that is, the distance between the original feature vector and each segment center vector (one segment center vector for each second-class center) is calculated, and each data sample associated with each quantizer is classified into the second-class center closest to its original feature vector. An index table is then built from each second-class center and the original feature vectors classified into it.

Preferably, to further improve retrieval efficiency and accuracy, clustering may be performed three or more times; that is, after the secondary clustering is performed and the index is constructed, a further round of clustering is performed.
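One possible in-memory layout for such index tables is sketched below. The dictionary layout, the function names `build_index_table` and `build_index_tables`, and the assumption that each second-class center has the same dimension as the sample vectors are illustrative choices only; the embodiment does not specify a storage format.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index_table(samples, second_class_centers):
    """Build one index table for the samples associated with one quantizer.

    Each index pairs one second-class center with the samples nearest to it.
    """
    members = [[] for _ in second_class_centers]
    for sample in samples:
        nearest = min(
            range(len(second_class_centers)),
            key=lambda i: squared_distance(sample, second_class_centers[i]),
        )
        members[nearest].append(sample)
    return [
        {"center": center, "samples": assigned}
        for center, assigned in zip(second_class_centers, members)
    ]


def build_index_tables(samples_per_quantizer, centers_per_quantizer):
    """Build N index tables, one per quantizer."""
    return [
        build_index_table(samples, centers)
        for samples, centers in zip(samples_per_quantizer, centers_per_quantizer)
    ]
```

Keeping the samples grouped under their center means that a later query only has to scan the groups of the selected centers rather than the whole data set.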

In summary, in the data index establishment process of this embodiment, different segmented clustering models are first selected for data sets with different data volume levels to perform primary clustering on the data samples in the data sets and obtain different first-class clustering centers; secondary clustering is then performed using the quantizers associated with the first-class clustering centers to obtain different second-class centers, and an index table is obtained based on the different second-class centers. Experimental verification shows that, for data sets whose data volume reaches several million or more, the clustering effect is better and the precision of the clustering centers can be improved by 5% to 10%. In addition, this embodiment can further improve the clustering effect through multiple rounds of segmented clustering and can represent a very large set of clustering centers with a small set of clustering centers, so that the retrieval time is shortened by at least half and the retrieval speed is improved.

In an embodiment, before the step S10, that is, before the obtaining of the segmented clustering model associated with the data volume level according to the data volume level of the data set, the method includes the following steps:

firstly, acquiring the data volume of a data sample contained in a data set, and inputting the data volume into a preset output model; and then receiving the data volume level output by the output model, and determining the data volume level of the data set.

Preferably, the output model is:

n = log_λ(X_n / X₀)

wherein n is the data volume level; λ is a level coefficient; X_n is the data volume; X₀ is the initial number corresponding to the initial level.

Optionally, λ is 1 and X₀ is 100000.

In summary, the present embodiment automatically calculates the data volume level of the data set through the output model, and provides output with higher precision and higher efficiency.
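As a hedged illustration of the output model, the sketch below evaluates n = log_λ(X_n / X₀) in Python. Because a logarithm with base 1 is undefined, the sketch rejects λ = 1 and instead defaults to λ = 2, an assumption chosen only because it reproduces the doubling levels listed earlier (100,000 for the initial level up to 3,200,000 for the fifth level). Rounding down to an integer level and clamping small data sets to level 0 are also assumptions, and `data_volume_level` is a hypothetical name.

```python
import math


def data_volume_level(data_volume, initial_volume=100_000, level_coefficient=2.0):
    """Return the data volume level n = floor(log_lambda(X_n / X_0)).

    level_coefficient=2.0 is an assumption of this sketch, not a value
    taken from the embodiment.
    """
    if data_volume <= 0 or initial_volume <= 0:
        raise ValueError("volumes must be positive")
    if level_coefficient <= 0 or level_coefficient == 1:
        raise ValueError("the logarithm base must be positive and not equal to 1")
    ratio = data_volume / initial_volume
    if ratio < 1:
        return 0  # assumption: data sets below X_0 are treated as the initial level
    # A small epsilon guards against floating-point results such as 2.9999999999.
    return math.floor(math.log(ratio, level_coefficient) + 1e-9)
```

Under these assumptions, a data set of 350,000 samples falls into the first level, because it has reached 200,000 but not yet 400,000 samples.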

In an embodiment, as shown in fig. 3, the step S20, namely, inputting all data samples in the data set into the segmented clustering model, and receiving N clustering centers output by the segmented clustering model, specifically includes the following steps:

S201, determining a preselected number of data samples and corresponding feature vectors from all the data samples by using the segmented clustering model.

In this embodiment, S data samples for one-time clustering are determined from all the data samples according to a preselected number included in the segmented clustering model, for example, the preselected number is S. The pre-selected number refers to the number of data samples selected by the segmented clustering model from all input data samples for one-time clustering.

Further, identifying each data sample in the S data samples by using a feature extraction model contained in the segmented clustering model, obtaining a plurality of feature elements of each data sample, and taking the feature elements as vector elements to form a feature vector of each data sample; for example, when the data sample is an image data sample, the image identification model included in the segmented clustering model is used to identify the image sample, so as to obtain the feature elements, such as the pixel number, the gray average value, the gray median value, the sub-region number, the sub-region gray average value, and the like, included in each image sample, and form a feature vector according to the feature elements.

S202, segmenting each feature vector according to the dimension by using a segmented clustering model to form segmented vectors corresponding to all segments in each feature vector.

In this embodiment, according to the number of segments included in the segment clustering model, for example, the number of segments is M, the feature vector of each of the S data samples is segmented according to the dimension to form M segment vectors corresponding to the segmented feature vector.

S203, clustering the segmented vectors corresponding to the same segments in each feature vector by using a segmented clustering model to determine K first-class clustering centers corresponding to the segments; k is a positive integer.

Preferably, first, it is detected whether the number of segment vectors corresponding to each same segment (the number of segment vectors is M) is greater than the number of clusters (the number of clusters is K); when it is, it is further detected whether the number of segment vectors corresponding to each same segment is greater than a first threshold, wherein the first threshold is the product of the number of clusters and the minimum storage number contained in each cluster (for example, the minimum storage number is 39); otherwise, alarm information is sent.

Secondly, when the number of segment vectors corresponding to each same segment is greater than the first threshold, it is detected whether that number is greater than a second threshold, wherein the second threshold is the product of the number of clusters and the maximum storage number contained in each cluster (for example, the maximum storage number is 256); otherwise, alarm information is sent.

Thirdly, when the number of segment vectors corresponding to each same segment is greater than the second threshold, the segmented clustering model randomly selects a preset number of the segment vectors for clustering, then randomly selects K segment vectors from the preset number of segment vectors as first-class clustering centers, and adds the K first-class clustering centers into the quantizer; otherwise, K segment vectors are directly selected from the segment vectors corresponding to the same segment as first-class clustering centers, and the K first-class clustering centers are added into the quantizer.

Finally, iteration is performed using the K first-class clustering centers in the quantizer. That is, for each segment vector used for clustering by the segmented clustering model, the nearest first-class clustering center is found and the distance is calculated; the K first-class clustering centers are updated according to the calculated distances, and the quantizer is updated accordingly, until the number of iterations is greater than the preset iteration threshold, at which point the K first-class clustering centers corresponding to each segment are determined. It can be understood that, when the number of iterations is greater than the preset iteration threshold, the K first-class clustering centers no longer change, or the sum of squared errors between the segment vectors classified into each first-class clustering center and that center is minimal, or no segment vector is reassigned to a different first-class clustering center.
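The threshold checks that precede the iteration can be sketched as follows. The example assumes that the "preset number" of segment vectors selected for clustering equals the second threshold (K times the maximum storage number) and that "alarm information" can be modelled as a list of messages; both are assumptions, and `select_training_vectors` is a hypothetical name.

```python
import random


def select_training_vectors(segment_vectors, k, min_per_cluster=39,
                            max_per_cluster=256, seed=0):
    """Apply the size checks for one segment and pick the vectors to cluster.

    Returns (training_vectors, alarms), where alarms is a list of messages.
    """
    count = len(segment_vectors)
    first_threshold = k * min_per_cluster
    second_threshold = k * max_per_cluster
    alarms = []

    if count <= k:
        alarms.append("number of segment vectors does not exceed the number of clusters")
    elif count <= first_threshold:
        alarms.append("number of segment vectors does not exceed the first threshold")

    if count > second_threshold:
        # Too many vectors: cluster a random subset (subset size is an assumption).
        rng = random.Random(seed)
        training = rng.sample(segment_vectors, second_threshold)
    else:
        training = list(segment_vectors)
    return training, alarms
```

With K = 2 and the example values 39 and 256, fewer than 79 segment vectors would raise an alarm and more than 512 would trigger random subsampling.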

S204, determining N first-class clustering centers output by the segmented clustering model according to the K first-class clustering centers corresponding to each segment.

That is, the N first-class clustering centers output by the segmented clustering model can be obtained by accumulating the K first-class clustering centers corresponding to each segment.

In conclusion, the segmented clustering model is used for clustering the input data samples once, so that the data volume of a class-I clustering center set is reduced, the occupied memory is reduced, and the running speed of the server is improved; meanwhile, the method has the clustering effects of no isolated point, high compactness and high separation degree.

In an embodiment, as shown in fig. 4, a data retrieval method is provided, where the data retrieval method utilizes an index table obtained by the data index establishment method in the foregoing embodiment to perform image data retrieval, and when the data retrieval method is applied to the server in fig. 1, the data retrieval method and the data index establishment method may be applied to the same server; when the data retrieval method is applied to other servers, the data retrieval method and the data index establishing method can be applied to different servers. The data retrieval method comprises the following steps:

S60, receiving a query request containing an image query sample, and acquiring a query vector of the image query sample.

That is, after the server receives a query request containing an image query sample, it identifies the image query sample by using a preset image identification model, obtains the characteristic elements contained in the image query sample, such as the number of pixels, the gray mean value, the gray median value, the number of sub-regions and the sub-region gray mean value, and forms a query vector from these characteristic elements. The query request is generated when a user inputs an image sample in a preset search bar of the client, and is then sent to the server.

S70, obtaining the second-class centers of which the distances to the query vectors meet a first preset condition from all the second-class centers contained in the index table, and determining the data samples corresponding to the second-class centers meeting the first preset condition as image comparison samples.

In this embodiment, if the index tables obtained by using the data index establishing method described in the above embodiment are directly stored in the database, the index tables are directly obtained from the database; if the index tables obtained by the data index establishing method in the above embodiment are stored in the external server, the configuration parameters of the index tables are first obtained, and then the index tables are obtained from the external server according to the access paths included in the configuration parameters.

After the index tables are obtained, by inquiring the index tables, one second-class center, the distance between which and the inquiry vector meets the first preset condition, can be determined from all the second-class centers recorded by the index tables, and the second-class center is recorded as the selected cluster center.

In an embodiment, the first predetermined condition is that the distance is less than a first distance threshold. At this time, if the total number of the class two centers is N ', the distance between each class two center (i.e., the segment center vector) in the index table and the query vector is calculated, and for the calculated N' distances, one or more class two centers smaller than the first distance threshold are determined and recorded as the selected cluster center.

Further, before recording the selected clustering center, determining the nearest one of the two clustering centers as the selected clustering center; at this time, when the number of the determined second-class centers is one, the determined second-class centers are the closest second-class centers and are recorded as the final selected cluster centers; and when the number of the determined second-class centers is multiple, determining the second-class center closest to the second-class center according to the multiple distances obtained by calculation, and recording the second-class center as the final selected cluster center.

In another embodiment, the first predetermined condition is that the distance is among a preset number or a preset proportion of the smallest distances. In this case, if the preset number is 3, the N' calculated distances are sorted from smallest to largest, and the 3 class-two centers with the smallest distances are determined and recorded as the final selected cluster centers; if the preset proportion is 3% and N' is 100, the 3 class-two centers with the smallest distances are likewise determined in order from smallest to largest.

Understandably, after the selected cluster center (the class-two center whose distance from the query vector satisfies the first predetermined condition) is determined from the index table, the data samples associated with the selected cluster center in the index table are determined as the image comparison samples.
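The coarse query described above can be illustrated with a minimal sketch. The index-table layout (a dictionary mapping a center id to a center vector and a list of sample ids), the function names `select_centers` and `comparison_samples`, and the use of Euclidean distance are assumptions made for illustration only; they are not specified by this embodiment.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_centers(index_table, query, threshold=None, top_k=None, proportion=None):
    """Return ids of class-two centers meeting the first predetermined condition.

    index_table: {center_id: (center_vector, [sample_id, ...])}  (hypothetical layout)
    Pass one of: threshold (distance below it), top_k (preset number),
    or proportion (preset proportion of all centers).
    """
    # Rank all N' centers by distance to the query vector, smallest first.
    ranked = sorted(
        (euclidean(center, query), cid) for cid, (center, _) in index_table.items()
    )
    if threshold is not None:
        return [cid for dist, cid in ranked if dist < threshold]
    if proportion is not None:
        # Keep at least one center; the rounding rule is an assumption.
        top_k = max(1, round(proportion * len(ranked)))
    return [cid for _, cid in ranked[:top_k]]


def comparison_samples(index_table, selected_ids):
    """Collect the data samples associated with the selected cluster centers."""
    samples = []
    for cid in selected_ids:
        samples.extend(index_table[cid][1])
    return samples
```

With a preset proportion of 3% and 100 class-two centers, `select_centers` keeps the 3 nearest centers, matching the example given above.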

It should be noted that, if image data is retrieved by using the index table obtained by the data index establishing method described in the above embodiment, the data set used in the index establishing process is a data set including image data samples, and each index in the index table includes one class-two center and all image data samples associated with that class-two center.

S80, determining a sample distance between the image comparison sample and the image query sample.

That is, the Euclidean distance between the feature vector of the image comparison sample and the query vector of the image query sample may be determined as the sample distance; alternatively, the sample distance may be determined from the cosine similarity between the feature vector of the image comparison sample and the query vector of the image query sample.
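The two options can be sketched as follows. Because cosine similarity grows as vectors become more alike, the sketch converts it to a distance as 1 minus the similarity; this conversion is an assumption, since the embodiment does not state how the similarity is mapped to a distance.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between a feature vector and a query vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller values mean more similar (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```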

In another embodiment, to reduce the amount of computation and increase the computation speed, the segment distance between the image comparison sample and the image query sample may be calculated for each segment, that is, the distance between each segment vector of the image comparison sample and the corresponding segment query vector of the image query sample is calculated (each segment vector of an image comparison sample corresponds to one segment query vector of the image query sample), and the sample distance is determined according to the segment distances.
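A minimal sketch of the segment-wise computation is given below. The embodiment does not state how the segment distances are combined; summing the squared segment distances and taking the square root is an assumption, chosen because it reproduces the Euclidean distance over the whole vector. Equal-length contiguous segments and the function names are likewise illustrative.

```python
import math


def split_segments(vector, num_segments):
    """Split a vector into equal-length contiguous segments by dimension."""
    if len(vector) % num_segments != 0:
        raise ValueError("vector length must be divisible by num_segments")
    size = len(vector) // num_segments
    return [vector[i * size:(i + 1) * size] for i in range(num_segments)]


def segment_sq_distances(sample, query, num_segments):
    """Squared distance between each segment vector and its matching segment query vector."""
    pairs = zip(split_segments(sample, num_segments), split_segments(query, num_segments))
    return [sum((x - y) ** 2 for x, y in zip(s, q)) for s, q in pairs]


def sample_distance(sample, query, num_segments):
    """Combine the segment distances into one sample distance (assumed rule: root of the sum)."""
    return math.sqrt(sum(segment_sq_distances(sample, query, num_segments)))
```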

S90, displaying, on the client, the image comparison sample whose sample distance satisfies the second predetermined condition as the query result of the query request.

In an embodiment, the second predetermined condition is that the sample distance is less than a second distance threshold. In this case, after the sample distance between the image query sample and each image comparison sample is calculated in step S80, at least one image comparison sample whose sample distance is less than the second distance threshold is determined from all the image comparison samples and is displayed on the client as the query result of the query request.

In another embodiment, the second predetermined condition is that the sample distance is among a preset number or a preset proportion of the smallest sample distances. In this case, after the sample distance between the image query sample and each image comparison sample is calculated in step S80, the preset number or the preset proportion of image comparison samples with the smallest sample distances is determined from all the image comparison samples, in ascending order of sample distance, and is displayed on the client as the query result of the query request.

In summary, in the data retrieval process, the index table obtained in the data index establishing process is used for image data retrieval. The partitions holding the relevant data samples can be located quickly through a low-cost coarse query (only a small number of class-two centers need to be compared), so that the number of sample distance calculations is greatly reduced, the operation speed of the server is improved, and high-precision, high-efficiency image data retrieval is realized.

In an embodiment, the data retrieval method may also perform text data retrieval by using an index table obtained by the data index establishing method in the above embodiment, and specifically includes the following steps:

receiving a query request containing a text query sample, and acquiring a query vector of the text query sample (the query vector can be generated from a combination of features of the text query sample such as word count, word frequency, unigrams and n-grams); acquiring, from all the class-two centers contained in the index table, a class-two center whose distance from the query vector satisfies a third predetermined condition, and determining the data samples corresponding to the class-two center satisfying the third predetermined condition as text comparison samples; determining a sample distance between the text comparison sample and the text query sample; and displaying, on the client, the text comparison sample whose sample distance satisfies a fourth predetermined condition as the query result of the query request.
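One possible way to build such a query vector is sketched below, purely as an illustration. Whitespace tokenization, the hashing of unigrams and bigrams into a fixed number of buckets, and the name `text_query_vector` are all assumptions; the embodiment does not prescribe a particular feature construction, and text without whitespace word boundaries would need a word segmenter first.

```python
import zlib


def text_query_vector(text, dim=16):
    """Hash unigram and bigram counts into a fixed-length relative-frequency vector."""
    tokens = text.lower().split()  # assumed tokenizer: split on whitespace
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    vector = [0.0] * dim
    for gram in grams:
        # crc32 gives a hash that is stable across runs, unlike the built-in hash().
        vector[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    total = len(grams)
    # Normalize counts to frequencies; an empty text yields the zero vector.
    return [v / total for v in vector] if total else vector
```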

It should be noted that, if text data is retrieved by using the index table obtained by the data index establishing method in the foregoing embodiment, the data set used in the index establishing process is a data set including text data samples, and each index in the index table includes one class-two center and all text data samples associated with that class-two center.

In an embodiment, the data retrieval method may further use the index table obtained by the data index establishing method in the above embodiment to retrieve other data such as music data or video data; for details, reference may be made to the above description of retrieving image data or text data by using the index table.

In an embodiment, as shown in fig. 5, a data index creating apparatus is provided, where the data index creating apparatus corresponds to the data index creating method in the foregoing embodiment one to one. The data index establishing device comprises a model matching module 110, a primary clustering module 120, a data adding module 130, a secondary clustering module 140 and an index establishing module 150. The functional modules are explained in detail as follows:

The model matching module 110 is configured to obtain a segmented clustering model associated with the data volume level according to the data volume level of the data set.

A primary clustering module 120, configured to input all data samples in the data set into the segmented clustering model, and receive N first-class clustering centers output by the segmented clustering model; N is a positive integer.

A data adding module 130, configured to read the data sample from the data set, classify the read data sample into the one-class clustering center closest to the data sample, and associate the N quantizers with the N one-class clustering centers one to one.

A secondary clustering module 140 configured to perform secondary clustering on the data samples associated with each of the quantizers, and determine class two centers and the data samples associated with each of the class two centers.

An index establishing module 150, configured to establish N index tables corresponding to the N quantizers; each index table comprises at least one index, and each index comprises a class II center and all the data samples related to the class II center.
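The flow through modules 120 to 150 can be sketched as follows. This is a simplified illustration under stated assumptions: plain k-means (Lloyd's algorithm) is swapped in for the segmented clustering model, each quantizer is represented simply by the bucket of samples assigned to its first-class clustering center, and the index layout and all names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Position of the center closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means, standing in for the segmented clustering model."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def build_index_tables(samples, n_primary, k_secondary):
    """samples: {sample_id: feature_vector}. Returns one index table per first-class center."""
    # Primary clustering: N first-class clustering centers.
    primary = kmeans(list(samples.values()), n_primary)
    # Data adding: each sample joins the bucket (quantizer) of its nearest center.
    buckets = [[] for _ in primary]
    for sid, vec in samples.items():
        buckets[nearest(primary, vec)].append(sid)
    tables = []
    for bucket in buckets:
        table = []
        if bucket:
            # Secondary clustering inside the bucket yields the class-two centers.
            centers = kmeans([samples[s] for s in bucket], min(k_secondary, len(bucket)))
            members = [[] for _ in centers]
            for sid in bucket:
                members[nearest(centers, samples[sid])].append(sid)
            # Each index pairs one class-two center with its associated samples.
            table = [{"center": c, "samples": ids} for c, ids in zip(centers, members) if ids]
        tables.append(table)
    return tables
```

Every data sample ends up in exactly one index of exactly one index table, which is what allows the retrieval stage to compare a query against class-two centers first and against individual samples only afterwards.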

In one embodiment, the model matching module 110 includes the following elements, each of which is described in detail as follows:

The input unit is configured to acquire the data volume of the data samples contained in the data set and input the data volume into a preset output model.

The output unit is configured to receive the data volume level output by the output model, thereby determining the data volume level of the data set.

In an embodiment, in the data index establishing apparatus, the output model is:

n = log_λ[X_n / X_0]
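As a worked illustration of the output model, the formula can be evaluated with a change of base. The symbols are not defined in this passage, so the sketch assumes that X_n is the data volume of the data set, X_0 is a reference data volume, and λ is the base of the logarithm; it returns the unrounded value, because no rounding rule is stated here. The function name is hypothetical.

```python
import math


def data_volume_level(x_n, x_0, lam):
    """Evaluate n = log_lam(x_n / x_0) using natural logarithms (change of base)."""
    if x_n <= 0 or x_0 <= 0:
        raise ValueError("data volumes must be positive")
    if lam <= 0 or lam == 1:
        raise ValueError("the logarithm base must be positive and not equal to 1")
    return math.log(x_n / x_0) / math.log(lam)
```

Under these assumptions, a data set of 1,000,000 samples with a reference volume of 1,000 and a base of 10 gives a level of approximately 3.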

In one embodiment, the primary clustering module 120 includes the following units, and each functional unit is described in detail as follows:

The vector acquisition unit is configured to determine, from all the data samples, a preselected number of the data samples and their corresponding feature vectors by using the segmented clustering model.

The vector segmentation unit is configured to segment each feature vector by dimension by using the segmented clustering model, so as to form a segment vector corresponding to each segment of each feature vector.

The segmentation clustering unit is configured to cluster the segment vectors corresponding to the same segment of each feature vector by using the segmented clustering model, so as to determine K first-class clustering centers corresponding to that segment; K is a positive integer.

The result determining unit is configured to determine the N first-class clustering centers output by the segmented clustering model according to the K first-class clustering centers corresponding to each segment.
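The segmentation and per-segment clustering performed by these units can be sketched as follows. The sketch assumes equal-length contiguous segments and uses plain k-means as the clustering step; both choices, and all names, are illustrative. How the N first-class clustering centers are derived from the K centers of each segment is not detailed in this passage, so that step is left out.

```python
import random


def split_segments(vector, num_segments):
    """Segment a feature vector by dimension into equal-length parts."""
    size = len(vector) // num_segments
    return [vector[i * size:(i + 1) * size] for i in range(num_segments)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means, used here as the assumed clustering step."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            best = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], p)),
            )
            groups[best].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def segment_centers(feature_vectors, num_segments, k):
    """For each segment position, cluster that segment of every vector into K centers."""
    per_vector = [split_segments(v, num_segments) for v in feature_vectors]
    return [
        kmeans([segments[j] for segments in per_vector], k)
        for j in range(num_segments)
    ]
```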

In one embodiment, as shown in fig. 6, a data retrieval apparatus is provided, and the data retrieval apparatus corresponds to the data retrieval method in the above embodiment one to one. The data retrieval device includes a receiving module 160, a data retrieval module 170, a calculation module 180, and a display module 190. The functional modules are explained in detail as follows:

The receiving module 160 is configured to receive a query request including an image query sample, and obtain a query vector of the image query sample.

The data retrieval module 170 is configured to obtain, from all the class-two centers included in the index table, a class-two center whose distance from the query vector satisfies a first predetermined condition, and determine the data samples corresponding to the class-two center satisfying the first predetermined condition as image comparison samples.

The calculation module 180 is configured to determine a sample distance between the image comparison sample and the image query sample.

The display module 190 is configured to display, at the client, the image comparison sample whose sample distance satisfies the second predetermined condition as the query result of the query request.

In an embodiment, in the data retrieval apparatus, the first predetermined condition is that the distance is smaller than a first distance threshold; the second predetermined condition is that the sample distance is less than a second distance threshold.

For specific limitations of the data index establishing device and the data retrieval device, reference may be made to the above limitations of the data index establishing method and the data retrieval method, which are not described herein again. The modules in the data index establishing device and the data retrieval device can be wholly or partially realized by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-volatile storage medium. The computer readable instructions, when executed by the processor, implement the data index establishing method and the data retrieval method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, and when the processor executes the computer readable instructions, the data index establishing method in the above embodiments and the data retrieving method in the above embodiments are implemented.

In one embodiment, a computer readable storage medium is provided, on which computer readable instructions are stored, and the computer readable instructions, when executed by a processor, implement the data index establishing method in the above embodiments and the data retrieving method in the above embodiments.

It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a non-volatile computer readable storage medium, and when executed, can include processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of each functional unit or module is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to requirements, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A data index establishing method is characterized by comprising the following steps:

2. The method for building a data index according to claim 1, wherein before obtaining the segmented clustering model associated with the data volume level according to the data volume level of the data set, the method comprises:

acquiring the data volume of a data sample contained in a data set, and inputting the data volume into a preset output model;

and receiving the data volume level output by the output model, and determining the data volume level of the data set.

3. The data index building method of claim 1, wherein the output model is:

n = log_λ[X_n / X_0]

4. The method of claim 1, wherein the inputting all data samples in the data set into the piecewise clustering model, receiving N class-one clustering centers output by the piecewise clustering model, comprises:

determining a preselected number of the data samples and corresponding feature vectors from all the data samples by using the segmented clustering model;

segmenting each feature vector according to the dimension by utilizing a segmented clustering model to form a segmented vector corresponding to each segment in each feature vector;

clustering the segmented vectors corresponding to the same segments in each feature vector by using a segmented clustering model to determine K first-class clustering centers corresponding to the segments; K is a positive integer;

and determining N first-class clustering centers output by the segmented clustering model according to the K first-class clustering centers corresponding to each segment.

5. A data retrieval method for performing image data retrieval using the index table obtained by the data index creating method according to any one of claims 1 to 4, comprising:

6. The data retrieval method of claim 5 wherein the first predetermined condition is that the distance is less than a first distance threshold; the second predetermined condition is that the sample distance is less than a second distance threshold.

7. A data index creation apparatus, comprising:

8. A data retrieval device, comprising:

9. A computer device comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the data index establishing method according to any one of claims 1 to 4 and the data retrieval method according to any one of claims 5 to 6.

10. A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the data index establishing method according to any one of claims 1 to 4 and the data retrieval method according to any one of claims 5 to 6.