CN113609313A - Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN113609313A
Application number: CN202110832835.3A
Authority: CN
Inventors: 谢超; 程倩雅; 易小萌; 郭人通; 李盛俊
Original assignee: Shanghai Xuyu Intelligent Technology Co ltd
Current assignee: Shanghai Xuyu Intelligent Technology Co ltd
Priority date: 2021-07-22
Filing date: 2021-07-22
Publication date: 2021-11-05

Abstract

Embodiments of the present application provide a data processing method, apparatus, electronic device, and storage medium, and relate to the field of computer technology. Through the embodiments of the present application, when data is stored, an index creation mode suitable for the original data can be determined based on the data type and data characteristics of the original data; an index of the original data is then created, and the original data is stored, according to that index creation mode. In this way, when a data query request is received, the embodiments of the present application can query the target data according to an adapted index mode, thereby improving the efficiency of data processing.

Description

Data processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.

Background

With the development of computer technology, data volumes keep growing and data processing becomes increasingly difficult. This is especially true of unstructured data, that is, data that a computer cannot directly understand, such as pictures, videos, and text.

At present, the traditional data processing mode cannot process unstructured data efficiently.

Disclosure of Invention

In view of this, embodiments of the present application provide a data processing method, an apparatus, an electronic device, and a storage medium, so as to implement efficient processing of unstructured data.

In a first aspect, a data query method is provided, where the method is applied to an electronic device, and the method includes:

receiving a data query request, where the data query request includes at least reference data;

determining a target index mode corresponding to the data query request based on the reference data; and

determining at least one target data corresponding to the data query request based on the target index mode.

In a second aspect, a data storage method is provided, and the method is applied to an electronic device, and the method includes:

receiving an index build request, where the index build request includes a data type and data characteristics of original data;

determining a corresponding index creation mode according to the data type and the data characteristics; and

constructing an index of the original data according to the index creation mode, and correspondingly storing the original data.

In a third aspect, a data query apparatus is provided, where the apparatus is applied to an electronic device, and the apparatus includes:

a first receiving module, configured to receive a data query request, where the data query request includes at least reference data;

a first determining module, configured to determine a target index mode corresponding to the data query request based on the reference data; and

a query module, configured to determine at least one target data corresponding to the data query request based on the target index mode.

In a fourth aspect, there is provided a data storage device, the device being applied to an electronic apparatus, the device comprising:

a second receiving module, configured to receive an index build request, where the index build request includes a data type and data characteristics of original data;

a second determining module, configured to determine a corresponding index creation mode according to the data type and the data characteristics; and

a storage module, configured to construct an index of the original data according to the index creation mode and correspondingly store the original data.

In a fifth aspect, embodiments of the present application provide an electronic device, comprising a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to the first and second aspects.

In a sixth aspect, embodiments of the present application provide a computer storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first and second aspects.

According to the embodiments of the present application, when data is stored, an index creation mode suitable for the original data can be determined based on the data type and the data characteristics of the original data; the index of the original data is then created, and the original data is stored, according to that index creation mode. Therefore, when a data query request is received, the embodiments of the present application can query the target data according to the adapted index mode, which improves data processing efficiency.

Drawings

The foregoing and other objects, features and advantages of the embodiments of the present application will be apparent from the following description of the embodiments of the present application with reference to the accompanying drawings in which:

FIG. 1 is a flowchart of a data query method according to an embodiment of the present application;

FIG. 2 is a flow chart illustrating the determination of target data based on a routing index according to an embodiment of the present disclosure;

FIG. 3 is a flow chart illustrating the determination of target data based on a routing index and a compression index according to an embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating the determination of target data based on compressed index and full index according to an embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating the determination of target data based on a route index, a compression index, and a full index according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data query device according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a data storage device according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of another electronic device according to an embodiment of the present application.

Detailed Description

The present application is described below based on examples, but the present application is not limited to only these examples. In the following detailed description of the present application, certain specific details are set forth in detail. It will be apparent to one skilled in the art that the present application may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present application.

Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".

In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.

In the related art, for unstructured data, the unstructured data is generally queried through an Approximate Nearest Neighbor Search (ANNS), wherein the ANNS determines one or more target data most similar to a reference data according to a distance (i.e., similarity) between the reference data and data in a database, so as to query the unstructured data. Specifically, the unstructured data can be represented as vectors in a high-dimensional space through the ANNS, and then the distance between the vectors is calculated in the high-dimensional space, so that the query of the unstructured data is realized.
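As a point of reference, the distance computation that ANNS approximates can be sketched as an exhaustive scan. This is a minimal illustration, assuming Euclidean distance as the similarity measure and a small in-memory list as the database; the function name `nearest` and the sample vectors are hypothetical and do not come from the application.

```python
import math
from typing import List, Sequence, Tuple


def nearest(reference: Sequence[float],
            database: List[Sequence[float]],
            k: int = 1) -> List[Tuple[int, float]]:
    """Return the k (position, distance) pairs closest to the reference vector."""
    # Exhaustive scan: one distance computation per stored vector.
    scored = [(i, math.dist(reference, v)) for i, v in enumerate(database)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical two-dimensional vectors standing in for embedded images or text.
database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(nearest((0.9, 1.2), database, k=2))
```

The scan costs one distance computation per stored vector, which is the cost that the index structures discussed in this application aim to avoid.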

In the related art, the index modes mainly used by ANNS include cluster-based indexes, graph- or tree-based indexes, quantization-based indexes, and hash-based indexes. Among them, the space complexity of a graph- or tree-based index and the time complexity of searching it grow exponentially with the dimension, so such indexes are difficult to apply to high-dimensional vector spaces. Hash-based indexes are not suitable for massive data sets because the hash table structure requires a large amount of storage space.

The clustering index mode and the quantization-based index mode cannot simultaneously meet the requirements of low query time and high query precision, so that the traditional data processing mode cannot efficiently process unstructured data.

In order to solve the above problem, embodiments of the present application provide a data query method, which may be applied to an electronic device. The electronic device may be a terminal or a server, the terminal may be a smartphone, a tablet Computer, a Personal Computer (PC), or the like, and the server may be a single server, a server cluster configured in a distributed manner, or a cloud server.

The following describes in detail a data query method according to an embodiment of the present application with reference to a specific implementation. As shown in FIG. 1, the specific steps are as follows:

In step 11, a data query request is received.

The data query request includes at least reference data. The reference data is the data referred to when the data query is performed. For example, if an image query is performed, the reference data may be an image A; the electronic device may then query the database according to image A and determine a target image (the target data) in the database that matches image A.

In step 12, a target index mode corresponding to the data query request is determined based on the reference data.

In the embodiment of the present application, different query purposes may be achieved through indexes of multiple levels, where the target index mode may include one or more of a routing index, a compressed index, and a full index.

Specifically, the routing index may be used to determine a target index tag corresponding to the reference data, the compressed index may be used to determine target compressed data corresponding to the reference data in each compressed data set, and the full index may be used to determine target full data corresponding to the reference data in each full data set.

In the embodiment of the application, quick query can be performed through the routing index and the compression index, and accurate query can be performed through the full-scale index. Therefore, the quick and accurate query can be realized through a query mode combining various indexes. That is, the multi-level index structure constructed by one or more of the above indexing methods can be used to cope with different application scenarios.

In step 13, at least one target data corresponding to the data query request is determined based on the target index mode.

According to the embodiment of the present application, the data in the database can be queried in the manner given by the target index mode so as to determine at least one target data. Because the query is performed with an index mode suited to the request, it can be both fast and accurate.

In a preferred implementation, if the target index mode determined based on the reference data is a routing index, step 13 may be performed as follows: at least one target index tag corresponding to the reference data is determined based on a preset first indexing algorithm and the reference data; the target data set corresponding to each target index tag is determined; and at least one target data is determined according to the similarity between the reference data and the data in each target data set.

For example, FIG. 2 is a flowchart of determining target data based on a routing index according to an embodiment of the present application. The data query request 21 includes reference data 211, and the index tags include index tag 221, index tag 222, index tag 223, and index tag 224, where each index tag may correspond to one target data set (index tag 221 corresponds to target data set A, index tag 222 corresponds to target data set B, index tag 223 corresponds to target data set C, and index tag 224 corresponds to target data set D).

In the process of determining the target data 23, the embodiment of the present application may determine the similarity between the reference data 211 and each index tag based on the first indexing algorithm, and then determine one or more index tags with higher similarity to the reference data 211. In FIG. 2, the index tags with higher similarity to the reference data 211 are index tag 221 and index tag 223.

In a preferred embodiment, the first indexing algorithm may be a Hierarchical Navigable Small World (HNSW) algorithm, an Inverted File (IVF) algorithm, or an Inverted Multi-Index (IMI) algorithm.

HNSW is a graph-based approximate nearest neighbor search algorithm. It builds all vectors in a D-dimensional space into a connected graph and searches the graph for the points nearest to a given vertex.

IVF is an algorithm that queries by clustering. Specifically, IVF partitions the vector space into clusters and searches only within the relevant clusters, which reduces the amount of computation per query and increases query speed.

IMI is a variant based on the product quantization algorithm. It splits the data dimensions into two halves and uses a clustering algorithm to assign each half to its own set of center points. Compared with IVF, IMI uses a much larger number of center points and is more suitable for data-intensive scenarios.
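The reason IMI ends up with more center points than IVF can be illustrated with a short sketch: if each half of the dimensions has K center points, the pair of nearest center points identifies one of K x K cells. The code below is a sketch of that general idea, not of any implementation in the application; the function names and the sample center points are hypothetical, and the center points are assumed to have been produced by a clustering step beforehand.

```python
import math
from typing import Sequence, Tuple


def nearest_center(part: Sequence[float],
                   centers: Sequence[Sequence[float]]) -> int:
    """Position of the center point closest to one half of a vector."""
    return min(range(len(centers)), key=lambda i: math.dist(part, centers[i]))


def imi_cell(vector: Sequence[float],
             first_centers: Sequence[Sequence[float]],
             second_centers: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Assign a vector to a cell identified by one center point per half."""
    half = len(vector) // 2
    return (nearest_center(vector[:half], first_centers),
            nearest_center(vector[half:], second_centers))


# Hypothetical center points: two per half, so 2 x 2 = 4 cells in total.
first_centers = [(0.0, 0.0), (5.0, 5.0)]
second_centers = [(0.0, 0.0), (9.0, 9.0)]
print(imi_cell((0.1, 0.2, 9.0, 9.1), first_centers, second_centers))
```

With the same clustering budget, an IVF-style index over the whole vector would have K cells, whereas the pairwise combination here yields K x K.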

After determining index tag 221 and index tag 223, the embodiment of the present application may determine the target data sets corresponding to them (i.e., target data set A and target data set C).

After determining target data set A and target data set C, the embodiment of the present application may determine the target data 23 according to the similarity between the reference data 211 and the data in each target data set, where the target data 23 in FIG. 2 may represent a single target data or a set of target data.

With this embodiment, the index tag is determined first, and the target data is then determined according to the index tag. In this process, the similarity between the reference data and the data in every data set does not need to be calculated; only the similarity between the reference data and the data in the target data sets is calculated, which saves computation and improves the efficiency of data query.

In addition, in a preferred embodiment, the index tags may be used to characterize the clustering characteristics of the corresponding data sets.

Furthermore, the process of determining the target index tag based on the first indexing algorithm and the reference data may be performed as follows: the similarity between the reference data and the index tag of each data set is determined based on the preset first indexing algorithm, and at least one target index tag whose similarity satisfies a preset first similarity condition is then determined.

The first similarity condition may be that the similarity between the reference data and the index tag of a data set is greater than a predetermined threshold, where the threshold may be set according to actual conditions. When the similarity between the reference data and the index tags of one or more data sets is greater than the predetermined threshold, those index tags may be determined as the target index tags.
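The routing-index query described above can be sketched as follows. This is an illustrative sketch under stated assumptions: each index tag is represented by a cluster center vector, similarity is taken to be 1 / (1 + Euclidean distance), and the threshold value and sample vectors are hypothetical. None of these specifics are given in the application.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def similarity(a: Vector, b: Vector) -> float:
    """Map Euclidean distance into (0, 1]; larger means more similar (assumed measure)."""
    return 1.0 / (1.0 + math.dist(a, b))


def select_tags(reference: Vector, tags: Dict[str, Vector],
                threshold: float) -> List[str]:
    """First similarity condition: keep tags whose similarity exceeds the threshold."""
    return [name for name, center in tags.items()
            if similarity(reference, center) > threshold]


def routing_query(reference: Vector, tags: Dict[str, Vector],
                  data_sets: Dict[str, List[Vector]],
                  threshold: float, k: int) -> List[Vector]:
    """Compare the reference only with data in the selected target data sets."""
    candidates: List[Vector] = []
    for name in select_tags(reference, tags, threshold):
        candidates.extend(data_sets[name])
    candidates.sort(key=lambda v: similarity(reference, v), reverse=True)
    return candidates[:k]


# Hypothetical layout mirroring FIG. 2: four tags, each with one data set.
tags = {"221": (0.0, 0.0), "222": (10.0, 0.0), "223": (0.0, 2.0), "224": (10.0, 10.0)}
data_sets = {
    "221": [(0.0, 0.5), (-1.0, 0.0)],
    "222": [(9.0, 1.0)],
    "223": [(0.0, 1.2), (1.0, 3.0)],
    "224": [(11.0, 9.0)],
}
print(select_tags((0.0, 1.0), tags, threshold=0.4))
print(routing_query((0.0, 1.0), tags, data_sets, threshold=0.4, k=2))
```

In this example only the data sets of tags 221 and 223 are scanned, so three of the six stored vectors are never compared with the reference.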

In another preferred implementation, if the target index mode determined based on the reference data is a routing index and a compressed index, step 13 may be performed as follows: at least one target index tag corresponding to the reference data is determined based on a preset first indexing algorithm and the reference data; at least one target compressed data is then determined, in the compressed data set corresponding to each target index tag, based on a preset second indexing algorithm and the reference data; and the target data is then determined according to the target compressed data.

Wherein, each index tag corresponds to a compressed data set respectively.

For example, FIG. 3 is a flowchart of determining target data based on a routing index and a compressed index according to an embodiment of the present application. The data query request 31 includes reference data 311, the index tags include index tag 321, index tag 322, index tag 323, and index tag 324, and each index tag may correspond to one compressed data set (index tag 321 corresponds to compressed data set E, index tag 322 corresponds to compressed data set F, index tag 323 corresponds to compressed data set G, and index tag 324 corresponds to compressed data set H).

In the process of determining the target data 34, the embodiment of the present application may determine the similarity between the reference data 311 and each index tag based on the first indexing algorithm, and then determine one or more index tags with higher similarity to the reference data 311. In FIG. 3, the index tags with higher similarity to the reference data 311 are index tag 321 and index tag 323.

After determining the index tags 321 and 323, embodiments of the present application may determine the compressed data sets (i.e., the compressed data set E and the compressed data set G) corresponding to the index tags 321 and 323.

Then, the embodiment of the present application may determine, based on the second indexing algorithm, a similarity between the reference data 311 and each of the compressed data in the compressed data set E and the compressed data set G, and then determine, according to the similarity between the reference data 311 and each of the compressed data in the compressed data set E and the compressed data set G, the target compressed data 33 with a higher similarity to the reference data 311, where the target compressed data 33 may be used to characterize a single target compressed data or may be used to characterize a set of target compressed data.

In a preferred embodiment, the second indexing algorithm may be a Scalar Quantization (SQ) algorithm or a Product Quantization (PQ) algorithm.

SQ and PQ store quantized data in the compressed index, which effectively reduces the memory requirement and increases retrieval speed. With a compressed index built on such a quantized index structure, all quantized data and index structures can be kept in memory.
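To make the memory saving concrete, a minimal scalar quantization sketch is given below. It assumes 8-bit codes and a per-dimension minimum and maximum learned from sample vectors; the function names `sq_train`, `sq_encode`, and `sq_decode` are hypothetical, and the application does not specify the bit width or the training procedure.

```python
from typing import List, Sequence, Tuple


def sq_train(vectors: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Record the per-dimension minimum and maximum of the sample vectors."""
    dims = range(len(vectors[0]))
    lo = [min(v[d] for v in vectors) for d in dims]
    hi = [max(v[d] for v in vectors) for d in dims]
    return lo, hi


def sq_encode(vector: Sequence[float], lo: List[float], hi: List[float]) -> bytes:
    """Quantize each component to one byte (0..255) within its dimension's range."""
    codes = []
    for x, low, high in zip(vector, lo, hi):
        span = (high - low) or 1.0                       # guard constant dimensions
        ratio = min(max((x - low) / span, 0.0), 1.0)     # clamp out-of-range values
        codes.append(round(ratio * 255))
    return bytes(codes)


def sq_decode(code: bytes, lo: List[float], hi: List[float]) -> List[float]:
    """Reconstruct an approximation of the original vector from its byte codes."""
    return [low + (c / 255) * (high - low) for c, low, high in zip(code, lo, hi)]


lo, hi = sq_train([(0.0, 10.0), (1.0, 20.0), (0.5, 15.0)])
code = sq_encode((0.25, 12.0), lo, hi)
print(len(code), sq_decode(code, lo, hi))
```

Each component is stored in one byte instead of a multi-byte floating-point value, at the cost of a reconstruction error of at most half a quantization step per dimension for in-range values.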

After determining the target compressed data 33, the embodiment of the present application may determine the target data 34 according to the target compressed data 33.

In this embodiment, according to the correspondence between the compressed data and the full data, the full data corresponding to the target compressed data 33 is determined first, and then according to the similarity between the reference data 311 and the full data corresponding to the target compressed data 33, the full data with a higher similarity to the reference data 311 is determined as the target data 34.

With this embodiment, the index tag is determined first, the target compressed data is then determined according to the index tag, and the target data is then determined according to the target compressed data (for example, by determining the full data corresponding to the compressed data). In this process, the similarity between the reference data and the data in every data set does not need to be calculated, and the compressed index is searched over quantized compressed data. Combining the routing index with the compressed index therefore allows fast retrieval within the memory of the electronic device, which saves computation and improves the efficiency of data query.

In addition, the process of determining the target compressed data based on the second indexing algorithm and the reference data according to the embodiment of the present application may be performed as follows: determining a compressed data set corresponding to each target index tag, then determining the similarity between the reference data and the compressed data in each compressed data set based on a preset second index algorithm, and then determining at least one target compressed data with the similarity meeting a second similarity condition based on a preset second similarity condition.

The second similarity condition may be that the similarity between the reference data and the compressed data in each compressed data set is greater than a predetermined threshold, and the threshold may be a value set according to actual conditions. When the similarity between the reference data and one or more compressed data is greater than a predetermined threshold, the embodiment of the present application may determine the compressed data with the similarity greater than the predetermined threshold as the target compressed data.

In another preferred implementation, if the target index mode determined based on the reference data is a compressed index and a full index, step 13 may be performed as follows: at least one target compressed data in the compressed data sets is determined based on a preset second indexing algorithm and the reference data, and at least one target data is then determined, in the full data corresponding to each target compressed data, based on a preset third indexing algorithm and the reference data.

Each compressed data corresponds to one full data.

For example, as shown in fig. 4, fig. 4 is a flowchart of determining target data based on a compressed index and a full index according to an embodiment of the present application. Wherein the data query request 41 includes the reference data 411.

In determining the target data 44, the embodiment of the present application may first determine the similarity between the reference data 411 and the compressed data in each compressed data set (the compressed data set E, the compressed data set F, the compressed data set G, and the compressed data set H) based on the second indexing algorithm, and then determine the target compressed data 42 with higher similarity to the reference data 411 according to the similarity between the reference data 411 and the compressed data in each compressed data set. The target compressed data 42 may be used to characterize a single target compressed data, or may be used to characterize a collection of target compressed data.

After determining the target compressed data 42, the embodiment of the present application may determine the target full data 43 corresponding to the target compressed data 42 according to the correspondence between the compressed data and the full data.

After determining the target full data 43, the embodiment of the present application may determine the similarity between the reference data 411 and the target full data 43 based on the third indexing algorithm, and then determine the target full data 43 with higher similarity to the reference data 411 as the target data 44.

In a preferred embodiment, the third indexing algorithm may be an exact search algorithm over the full data. For example, it may be an algorithm that calculates the distance between full (uncompressed) vectors, with the calculated distance taken as the measure of data similarity.

With this embodiment, the target compressed data similar to the reference data is determined first, and the target data is then determined according to the full data corresponding to the target compressed data. In this process, the similarity between the reference data and the data in every data set does not need to be calculated, and the compressed index is searched over quantized compressed data. Combining the compressed index with the full index therefore preserves precision while saving computation and improving the efficiency of data query.

In addition, in a preferred embodiment, the compressed data set may further include a corresponding codebook, where the codebook may be used to record a correspondence between each compressed data in the compressed data set and each full amount of data in the storage medium.

Furthermore, the process of determining the target data based on the third indexing algorithm and the reference data may be performed as follows: the full data corresponding to each target compressed data is determined based on the codebook corresponding to each compressed data set; the similarity between the reference data and the full data corresponding to each target compressed data is then determined based on a preset third indexing algorithm; and at least one target data whose similarity satisfies a preset third similarity condition is then determined.

The third similarity condition may be that the similarity between the reference data and the full data corresponding to each target compressed data is greater than a predetermined threshold, and the threshold may be a value set according to an actual situation. When the similarity between the reference data and one or more than one full amount data is larger than a predetermined threshold, the embodiment of the present application may determine the full amount data with the similarity larger than the predetermined threshold as the target data.
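The two-stage query over a compressed index and a full index can be sketched as below. The sketch makes several assumptions that the application does not state: compression is simulated by rounding each component to an integer (a stand-in for SQ or PQ), the codebook is modelled as a mapping from a compressed-data identifier to the key of its full data, and similarity is 1 / (1 + Euclidean distance). All names and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def similarity(a: Vector, b: Vector) -> float:
    """Assumed similarity measure: larger means closer."""
    return 1.0 / (1.0 + math.dist(a, b))


def build(full_data: Dict[str, Vector]):
    """Create compressed data and a codebook linking each code to its full data."""
    compressed: Dict[int, Tuple[int, ...]] = {}
    codebook: Dict[int, str] = {}
    for cid, (key, vector) in enumerate(full_data.items()):
        compressed[cid] = tuple(round(x) for x in vector)  # stand-in compression
        codebook[cid] = key
    return compressed, codebook


def query(reference: Vector, full_data: Dict[str, Vector],
          compressed: Dict[int, Tuple[int, ...]], codebook: Dict[int, str],
          second_threshold: float, third_threshold: float) -> List[str]:
    # Stage 1 (compressed index): cheap comparison against compressed data.
    hits = [cid for cid, code in compressed.items()
            if similarity(reference, code) > second_threshold]
    # Stage 2 (full index): exact comparison against the full data that the
    # codebook associates with each target compressed data.
    scored = []
    for cid in hits:
        key = codebook[cid]
        score = similarity(reference, full_data[key])
        if score > third_threshold:
            scored.append((key, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [key for key, _ in scored]


full_data = {"img-1": (0.9, 1.1), "img-2": (1.4, 0.6), "img-3": (8.0, 8.0)}
compressed, codebook = build(full_data)
print(query((1.0, 1.0), full_data, compressed, codebook, 0.5, 0.7))
```

In this example "img-1" and "img-2" compress to the same code, so the compressed stage cannot tell them apart; the exact comparison in the full stage is what separates them, which is the precision benefit the text attributes to the full index.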

In another preferred implementation, if the target index mode determined based on the reference data is a routing index, a compressed index, and a full index, step 13 may be performed as follows: at least one target index tag corresponding to the reference data is determined based on a preset first indexing algorithm and the reference data; at least one target compressed data is then determined, in the compressed data set corresponding to each target index tag, based on a preset second indexing algorithm and the reference data; and at least one target data is then determined, in the full data corresponding to each target compressed data, based on a preset third indexing algorithm and the reference data.

Each index tag corresponds to one compressed data set, and each compressed data corresponds to one full-volume data.

For example, FIG. 5 is a flowchart of determining target data based on a routing index, a compressed index, and a full index according to an embodiment of the present application. The data query request 51 includes reference data 511, the index tags include index tag 521, index tag 522, index tag 523, and index tag 524, and each index tag may correspond to one compressed data set (index tag 521 corresponds to compressed data set E, index tag 522 corresponds to compressed data set F, index tag 523 corresponds to compressed data set G, and index tag 524 corresponds to compressed data set H).

In determining the target data 55, the embodiment of the present application may determine the similarity between the reference data 511 and each index tag based on the first indexing algorithm, and then determine one or more index tags with higher similarity to the reference data 511. In FIG. 5, the index tags with higher similarity to the reference data 511 are index tag 521 and index tag 523.

After determining the index tags 521 and 523, the embodiments of the present application may determine the compressed data sets (i.e., the compressed data set E and the compressed data set G) corresponding to the index tags 521 and 523 respectively.

Then, the embodiment of the present application may determine the similarity between the reference data 511 and each of the compressed data in the compressed data set E and the compressed data set G based on the second indexing algorithm, and then determine the target compressed data 53 with higher similarity to the reference data 511 according to the similarity between the reference data 511 and each of the compressed data in the compressed data set E and the compressed data set G, where the target compressed data 53 may be used to characterize a single target compressed data or may be used to characterize a set of target compressed data.

After determining the target compressed data 53, the embodiment of the present application may determine the target full data 54 corresponding to the target compressed data 53 according to the correspondence between the compressed data and the full data.

After determining the target full data 54, the embodiment of the present application may determine the similarity between the reference data 511 and the target full data 54 based on the third indexing algorithm, and then determine the target full data 54 with higher similarity to the reference data 511 as the target data 55. The target data 55 may represent a single target data or a set of target data.

With this embodiment, the index tag is determined first, the target compressed data is then determined according to the index tag, and the target data is then determined according to the full data corresponding to the target compressed data. In this process, the similarity between the reference data and the data in every data set does not need to be calculated, so precision is preserved while computation is saved and the efficiency of data query is improved.
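The three levels can be chained as in the following sketch. It is an illustration only: it uses top-n cutoffs in place of the similarity thresholds described earlier (an assumption), Euclidean distance as the measure, integer-rounded vectors as stand-in compressed data, and hypothetical names and values throughout.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

Vector = Tuple[float, ...]
T = TypeVar("T")


def closest(items: Sequence[T], reference: Vector,
            vector_of: Callable[[T], Vector], n: int) -> List[T]:
    """Return the n items whose associated vectors are nearest to the reference."""
    return sorted(items, key=lambda item: math.dist(reference, vector_of(item)))[:n]


def three_level_query(reference: Vector,
                      tags: Dict[str, Vector],
                      compressed_sets: Dict[str, Dict[str, Vector]],
                      full_data: Dict[str, Vector],
                      n_tags: int, n_codes: int, k: int) -> List[str]:
    # Level 1 (routing index): pick the most similar index tags.
    target_tags = closest(list(tags), reference, lambda name: tags[name], n_tags)
    # Level 2 (compressed index): search only the compressed sets of those tags.
    candidates = [(key, code) for name in target_tags
                  for key, code in compressed_sets[name].items()]
    target_codes = closest(candidates, reference, lambda pair: pair[1], n_codes)
    # Level 3 (full index): exact comparison on the corresponding full data.
    keys = [key for key, _ in target_codes]
    return closest(keys, reference, lambda key: full_data[key], k)


# Hypothetical layout mirroring FIG. 5: four tags, one compressed set per tag.
tags = {"521": (0.0, 0.0), "522": (10.0, 0.0), "523": (0.0, 3.0), "524": (10.0, 10.0)}
full_data = {"a": (0.2, 0.4), "b": (0.4, 1.4), "c": (0.3, 2.6),
             "d": (9.8, 0.1), "e": (10.2, 9.9)}
compressed_sets = {
    "521": {"a": (0, 0), "b": (0, 1)},
    "522": {"d": (10, 0)},
    "523": {"c": (0, 3)},
    "524": {"e": (10, 10)},
}
print(three_level_query((0.3, 1.8), tags, compressed_sets, full_data, 2, 2, 1))
```

Each level narrows the candidate set handed to the next, so the exact distance computation in the last level touches only the few full vectors that survived the two cheaper levels.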

The foregoing embodiment explains the embodiment of the present application from the perspective of data query. The following embodiment explains it from the perspective of data storage with reference to a specific implementation. Specifically, the embodiment of the present application provides a data storage method, as shown in fig. 6, with the following specific steps:

At step 61, an index build request is received.

The index build request includes a data type and data characteristics of the raw data.

In the embodiment of the present application, the data type may be used to characterize a file format of the raw data, for example, the raw data may be data in a picture format, data in a video format, or data in a text format. The data characteristics can be used for characterizing data distribution, data sparsity, dimension characteristics and the like of the original data.

At step 62, a corresponding index creation mode is determined based on the data type and the data characteristics.

In the embodiment of the application, the original data with different data types and data characteristics may correspond to different optimal index creation modes, so that the efficiency of subsequent data query can be effectively improved by determining the optimal index creation mode.

For example, a query structure as shown in fig. 3 or fig. 5 may be created for raw data with a large data volume and dense data, and a query structure as shown in fig. 4 may be created for raw data with a large data volume and sparse data.

At step 63, an index of the original data is constructed according to the index creation mode, and the original data is correspondingly stored.
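The selection in step 62 can be sketched as a simple decision function. The thresholds `LARGE_VOLUME` and `SPARSE_FROM`, the `DataCharacteristics` type, and the rule for small data sets are assumptions made for illustration; the embodiment only distinguishes large, dense data from large, sparse data.

```python
from dataclasses import dataclass

@dataclass
class DataCharacteristics:
    num_items: int    # data volume
    sparsity: float   # fraction of zero entries: 0.0 (dense) to 1.0 (sparse)

# Hypothetical thresholds; the source text does not give numeric values.
LARGE_VOLUME = 1_000_000
SPARSE_FROM = 0.5

def choose_index_mode(c: DataCharacteristics) -> tuple:
    """Return the index stages to build, in query order."""
    if c.num_items < LARGE_VOLUME:
        # Assumption: a small data set needs only an exact (full) scan.
        return ("full",)
    if c.sparsity >= SPARSE_FROM:
        # Large and sparse: compressed + full (the fig. 4 structure).
        return ("compressed", "full")
    # Large and dense: routing + compressed + full (the fig. 5 structure).
    return ("routing", "compressed", "full")
```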

After the index creation mode is determined, the index of the original data can be constructed based on the determined index creation mode and the original data can be correspondingly stored. After the original data are stored, the embodiment of the application can realize efficient data query according to the data query request and the established index mode.

In a preferred embodiment, if the index creation mode includes routing index creation and compressed index creation, step 63 may be performed as follows: in response to the index creation mode including routing index creation and compressed index creation, clustering the original data to obtain at least one data subset and creating a routing index of each data subset, then performing quantization compression on each data subset and creating a compressed index of the data in each compressed data subset.

In this embodiment of the application, if the routing index and the compressed index are created for the original data, then after receiving a subsequent data query request, the electronic device may perform a routing index lookup and a compressed index lookup according to the reference data in the data query request, and then obtain the target data. For the process of performing data query based on the routing index and the compressed index, reference may be made to the content described for fig. 3, which is not repeated here.

In another preferred embodiment, if the index creation mode includes compressed index creation and full index creation, step 63 may be performed as follows: in response to the index creation mode including compressed index creation and full index creation, performing quantization compression on the original data and creating a compressed index of the compressed data, then creating the full index according to the correspondence between the compressed data and the original data.

In the embodiment of the application, if the compressed index and the full index are created for the original data, then after receiving a subsequent data query request, the electronic device may perform a compressed index lookup and a full index lookup according to the reference data in the data query request, and then obtain the target data. For the process of performing data query based on the compressed index and the full index, reference may be made to the content described for fig. 4, which is not repeated here.

In another preferred embodiment, if the index creation mode includes routing index creation, compressed index creation and full index creation, step 63 may be performed as follows: in response to the index creation mode including routing index creation, compressed index creation and full index creation, clustering the original data to obtain at least one data subset and creating a routing index of each data subset, then performing quantization compression on each data subset and creating a compressed index of the data in each compressed data subset, and then creating a full index according to the correspondence between the compressed data and the original data.

In this embodiment of the application, if the routing index, the compressed index, and the full index are created for the original data, then after receiving a subsequent data query request, the electronic device may perform a routing index lookup, a compressed index lookup, and a full index lookup according to the reference data in the data query request, and then obtain the target data. For the process of performing data query based on the routing index, the compressed index and the full index, reference may be made to the content described for fig. 5, which is not repeated here.

Further, the process of creating the routing index may specifically be performed as follows: based on the original data, training a first indexing algorithm corresponding to the routing index, and determining n data subsets corresponding to the original data and an index tag of each data subset.

Here, n is a natural number greater than or equal to 1, and the index tag is used for representing the clustering characteristics of the corresponding data subset.

In the embodiment of the application, the original data can be clustered into the data subsets through the trained first indexing algorithm, and the index tag corresponding to each data subset is determined. The index tag can be used for the routing index during data query.
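As an illustration of routing index creation, the following sketch clusters the original data into n subsets and uses each cluster centroid as the index tag. Plain k-means with a naive initialization is used here as a stand-in; the embodiment does not prescribe a particular clustering procedure, and the function names are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]

def build_routing_index(data, n, iterations=10):
    """Cluster `data` into n subsets; return (index_tags, subsets)."""
    tags = [list(v) for v in data[:n]]  # naive init: the first n vectors
    subsets = []
    for _ in range(iterations):
        subsets = [[] for _ in range(n)]
        for v in data:
            nearest = min(range(n), key=lambda i: dist2(v, tags[i]))
            subsets[nearest].append(v)
        # Recompute each tag as its subset's centroid; keep the old tag
        # if a subset happens to be empty.
        tags = [mean(s) if s else t for s, t in zip(subsets, tags)]
    return tags, subsets
```

At query time, the reference data would be compared against the n index tags only, rather than against every vector.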

After determining the index tag of each data subset, the embodiment of the present application may further create a compressed index according to the index tag. Specifically, the process may be performed as follows: training a second indexing algorithm corresponding to the compressed index based on the index tag of each data subset, determining the mapping lookup tables corresponding to the n data subsets, and then performing quantization compression on the data in each data subset to determine n compressed data sets.

A data subset that has been quantized and compressed is a compressed data set. After creating the compressed index, the electronic device may determine, according to the trained second indexing algorithm and the mapping lookup table, the compressed data in each compressed data set that corresponds to the reference data in the data query request.

Further, the process of training the second indexing algorithm and determining the mapping lookup table may be performed as follows: determining the residual vectors between each index tag and the data in the corresponding data subset, or the original vectors of that data, then training the second indexing algorithm corresponding to the compressed index based on the residual vectors or the original vectors, and then decomposing the residual vectors or the original vectors to determine the mapping lookup tables corresponding to the n compressed data sets.

In the process of decomposing the residual vectors, the embodiment of the application may decompose a single d-dimensional residual vector into m groups of sub-vectors of d/m dimensions each, and then generate, based on these sub-vectors, a lookup table that maps each group of d/m-dimensional sub-vectors to an n-bit binary string.
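The residual computation and the decomposition into m sub-vectors can be shown with a short sketch. The function names are hypothetical, and the sketch assumes that the index tag is a centroid with the same dimension d as the data and that d is divisible by m.

```python
def residual(vector, index_tag):
    """Residual of a vector relative to its subset's index tag (centroid)."""
    return [x - c for x, c in zip(vector, index_tag)]

def split(vector, m):
    """Split a d-dimensional vector into m sub-vectors of d/m dimensions."""
    d = len(vector)
    if d % m != 0:
        raise ValueError("d must be divisible by m")
    step = d // m
    return [vector[i:i + step] for i in range(0, d, step)]

r = residual([5.0, 7.0, 1.0, 3.0], [4.0, 4.0, 2.0, 2.0])
print(r)            # [1.0, 3.0, -1.0, 1.0]
print(split(r, 2))  # [[1.0, 3.0], [-1.0, 1.0]]
```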

After determining the mapping lookup tables corresponding to the n data subsets, the embodiment of the present application may further create a full index based on these mapping lookup tables. Specifically, the process may be performed as follows: determining the codebooks corresponding to the n compressed data sets based on the mapping lookup tables, and then storing each piece of original data to a storage medium based on each compressed data set.

The codebook is used for recording the full data corresponding to each piece of compressed data in the compressed data set. In the subsequent query process, after the electronic device determines the target compressed data corresponding to the data query request, the full data corresponding to the target compressed data can be determined according to the codebook generated when the full index was created, and the target data can then be determined within that full data, so that computing power is saved and the efficiency of data query is improved.

In addition, the process of determining the codebook may specifically be performed as follows: encoding each residual vector according to the mapping lookup table, and determining the codebooks corresponding to the n compressed data sets.

For example, the electronic device may divide the d-dimensional residual vector into m groups of sub-vectors, and then encode each group of sub-vectors into n-bit data according to a mapping lookup table, and then the original d-dimensional residual vector may be represented as an m × n-bit codebook.
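The encoding in this example can be sketched as follows. The lookup tables here are small hand-written toy tables rather than trained ones, and the names `encode`, `tables` and `n_bits` are hypothetical; `n_bits` corresponds to the n bits per sub-vector in the text (the text also uses n for the number of data subsets, which is a separate quantity).

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(residual, tables, n_bits):
    """Encode a residual as m codes of n_bits each, returned as a bit string.

    `tables[j]` is the lookup table for sub-space j: a list of at most
    2**n_bits representative sub-vectors. The code for a sub-vector is
    the position of its nearest table entry.
    """
    m = len(tables)
    step = len(residual) // m
    bits = ""
    for j, table in enumerate(tables):
        sub = residual[j * step:(j + 1) * step]
        code = min(range(len(table)), key=lambda i: dist2(sub, table[i]))
        bits += format(code, "0{}b".format(n_bits))
    return bits

# Toy tables: m = 2 sub-spaces, n_bits = 2 (four entries each).
tables = [
    [[0.0, 0.0], [1.0, 3.0], [4.0, 4.0], [-2.0, 0.0]],
    [[0.0, 0.0], [2.0, 2.0], [-1.0, 1.0], [5.0, 5.0]],
]
print(encode([1.1, 2.9, -0.8, 1.2], tables, 2))  # prints 0110
```

The 4-dimensional residual is thus stored as m × n = 2 × 2 = 4 bits.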

In a preferred embodiment, the electronic device may further perform a batch insertion operation on the data. Specifically, the process may be performed as follows: receiving a batch insertion request, where the batch insertion request includes at least one piece of data to be processed, deleting the constructed index, and then constructing an index based on each piece of data to be processed and the original data.

In practical applications, if a large amount of data insertion is required, creating an index for each piece of data to be processed may reduce the efficiency of data processing. Therefore, the original index can be deleted firstly, and then the index is created uniformly for all data (to-be-processed data and original data), so that the data processing efficiency is improved.
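The batch insertion flow can be sketched with a toy store. The class `IndexedStore` and its sorted-list "index" are stand-ins introduced for illustration only; the point shown is that the index is rebuilt once per batch rather than once per inserted item.

```python
class IndexedStore:
    """Toy store whose 'index' is just a sorted copy of the items."""

    def __init__(self, items):
        self.items = list(items)
        self.build_count = 0   # how many times the index has been built
        self.index = None
        self._build_index()

    def _build_index(self):
        self.index = sorted(self.items)
        self.build_count += 1

    def batch_insert(self, pending):
        self.index = None            # delete the existing index
        self.items.extend(pending)   # combine pending data with original data
        self._build_index()          # a single rebuild for the whole batch
```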

In another preferred embodiment, the electronic device may further perform an additional insertion operation on the data. Specifically, the process may be performed as follows: receiving an additional insertion request, then performing a query operation based on at least one piece of data to be processed to determine at least one compressed data set to be updated, then compressing the at least one piece of data to be processed and inserting it into the corresponding compressed data set to be updated, then determining the codebook corresponding to the updated compressed data set, and then storing the data to be processed to a storage medium.

In the embodiment of the application, the additional insertion is suitable for the insertion of the data to be processed with small data volume, and the additional insertion request comprises at least one data to be processed.

In addition, if the data volume of the compressed data set after the data insertion is too large, the following steps may be performed: in response to the data volume of the updated compressed data set being larger than a predetermined data volume upper limit threshold, splitting the updated compressed data set into at least 2 compressed data sets, and then determining the index tags corresponding to the split compressed data sets.

The predetermined data volume upper limit threshold value can be a reasonable numerical value set according to actual conditions, and each compressed data set can be guaranteed to have a proper data volume size through splitting of the compressed data sets, so that excessive calculation power does not need to be occupied during data processing, and the data processing efficiency is improved.
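The additional insertion with splitting can be sketched as below. Compression and the codebook update are omitted for brevity, and the split rule (sort the items and cut the list in half) is an assumption, since the embodiment does not specify how a set is split. The sketch also assumes `upper_limit` is at least 1.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def append_insert(subsets, vector, upper_limit):
    """Insert `vector` into the nearest subset; split that subset if it
    grows beyond `upper_limit`.

    `subsets` is a hypothetical list of dicts {"tag": ..., "items": [...]}.
    """
    # Query step: find the subset whose index tag is nearest to the vector.
    target = min(subsets, key=lambda s: dist2(vector, s["tag"]))
    target["items"].append(vector)
    if len(target["items"]) > upper_limit:
        # Assumed split rule: order the items and cut the list in half.
        items = sorted(target["items"])
        half = len(items) // 2
        subsets[:] = [s for s in subsets if s is not target]
        for part in (items[:half], items[half:]):
            # Each new subset gets its own index tag.
            subsets.append({"tag": centroid(part), "items": part})
    return subsets
```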

In another preferred embodiment, the electronic device may further perform a deletion operation on the data. Specifically, the process may be performed as follows: receiving a data deletion request, then performing a query operation based on the data identifier of the data to be deleted to determine each piece of data to be deleted, then deleting each piece of data to be deleted and its corresponding compressed data, and then updating the codebook and the index tag corresponding to each piece of data to be deleted.

The data deletion request includes at least a data identifier of the data to be deleted.

In addition, if the data volume of the compressed data set after the data deletion is too small, the following steps may be performed: in response to the data volume of the compressed data set after the deletion being smaller than a predetermined data volume lower limit threshold, merging that compressed data set with another compressed data set to determine a merged data set, deleting the index information corresponding to the compressed data set from which the data was deleted, and updating the index information corresponding to the merged data set.

The data volume of the merged data set is smaller than the predetermined data volume upper threshold, and the predetermined data volume lower threshold may be a reasonable value set according to actual conditions. By combining the compressed data sets, each compressed data set can be ensured to have a proper data size, redundant storage paths are not required to be occupied, and the storage space is saved.
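The deletion with merging can be sketched in the same style. Compressed data and codebooks are again omitted, and the choice of merge partner (the nearest subset by index tag whose merged size stays below the upper limit) is an assumption made for illustration. The function expects `item_id` to exist in one of the subsets.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def delete_item(subsets, item_id, lower_limit, upper_limit):
    """Delete `item_id`; merge its subset into another if it gets too small.

    `subsets` is a hypothetical list of dicts
    {"tag": ..., "items": {item_id: vector}}.
    """
    source = next(s for s in subsets if item_id in s["items"])
    del source["items"][item_id]
    if source["items"]:
        # Update the index tag of the subset that lost an item.
        source["tag"] = centroid(list(source["items"].values()))
    if len(source["items"]) >= lower_limit:
        return subsets
    # Merge partners must keep the merged size below the upper limit.
    partners = [s for s in subsets
                if s is not source
                and len(s["items"]) + len(source["items"]) < upper_limit]
    if not partners:
        return subsets  # no admissible partner; leave the small subset
    partner = min(partners, key=lambda s: dist2(s["tag"], source["tag"]))
    partner["items"].update(source["items"])
    partner["tag"] = centroid(list(partner["items"].values()))
    # Drop the index information of the subset that was merged away.
    subsets[:] = [s for s in subsets if s is not source]
    return subsets
```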

Based on the same technical concept, the embodiment of the present application further provides a data query apparatus, as shown in fig. 7, the apparatus includes: a first receiving module 71, a first determining module 72 and a querying module 73.

The first receiving module 71 is configured to receive a data query request, where the data query request includes at least reference data.

The first determining module 72 is configured to determine, based on the reference data, a target index mode corresponding to the data query request.

The query module 73 is configured to determine at least one target data corresponding to the data query request based on the target index mode.

According to the embodiment of the application, the data in the database can be queried in a specific mode through a target index mode so as to determine at least one target data. In this process, the embodiment of the present application can query the unstructured data based on a suitable index manner, so that the embodiment of the present application can efficiently process the unstructured data.

Based on the same technical concept, the embodiment of the present application further provides a data storage device, as shown in fig. 8, the device including: a second receiving module 81, a second determining module 82 and a storing module 83.

The second receiving module 81 is configured to receive an index build request, where the index build request includes a data type and data characteristics of the original data.

The second determining module 82 is configured to determine a corresponding index creation mode according to the data type and the data characteristics.

The storage module 83 is configured to construct an index of the original data according to the index creation mode, and to store the original data correspondingly.

According to the method and the device, the index of the original data can be constructed based on the determined index creation mode, and the original data can be correspondingly stored. After the original data are stored, the embodiment of the application can realize efficient data query according to the data query request and the established index mode.

Fig. 9 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device is a general address query device with a general computer hardware structure, which includes at least a processor 91 and a memory 92. The processor 91 and the memory 92 are connected by a bus 93. The memory 92 is adapted to store instructions or programs executable by the processor 91. The processor 91 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 91 implements processing of data and control of other devices by executing instructions stored in the memory 92, so as to perform the method flows of the embodiments of the present application as described above. The bus 93 connects the above components together, and also connects them to a display controller 94, a display device, and an input/output (I/O) device 95. The input/output (I/O) device 95 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, or other device known in the art. Typically, the input/output device 95 is coupled to the system through an input/output (I/O) controller 96.

In a preferred implementation manner, the electronic device may further include a storage medium and a coprocessor, as shown in fig. 10, fig. 10 is a schematic structural diagram of another electronic device according to an embodiment of the present application, where the schematic structural diagram includes: processor 101, memory 102, bus 103, display controller 104, input/output (I/O) device 105, input/output (I/O) controller 106, storage medium 107, and coprocessor 108.

In an embodiment of the present application, the coprocessor 108 may be configured to determine a target index tag corresponding to the reference data (i.e., to perform the steps related to the routing index), and the storage medium 107 may be configured to store the full data. The coprocessor 108 may be a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), or a custom chip. The storage medium 107 may be a Hard Disk Drive (HDD), a Solid State Drive (SSD), a flash memory, or a new type of nonvolatile memory device, such as a Storage Class Memory (SCM).

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.

These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.

These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.

Another embodiment of the present application is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.

That is, as can be understood by those skilled in the art, all or part of the steps in the methods of the embodiments described above may be accomplished by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A data query method, characterized in that the method comprises:

receiving a data query request, the data query request including at least reference data;

determining, based on the reference data, a target index mode corresponding to the data query request; and

determining, based on the target index mode, at least one target data corresponding to the data query request.

2. The method according to claim 1, wherein the determining at least one target data corresponding to the data query request based on the target indexing method comprises:

determining at least one target index label corresponding to the reference data based on the preset first index algorithm and the reference data;

determining the target data set corresponding to each target index label; and

determining at least one target data according to the similarity between the reference data and the data in each target data set.

3. The method according to claim 1, wherein the determining at least one target data corresponding to the data query request based on the target indexing method comprises:

determining, based on the preset first index algorithm and the reference data, at least one target index label corresponding to the reference data, each index label corresponding to a compressed data set;

determining at least one target compressed data in the compressed data set corresponding to each target index label based on the preset second index algorithm and the reference data; and

determining the target data according to the target compressed data.

4. The method according to claim 1, wherein the determining at least one target data corresponding to the data query request based on the target indexing method comprises:

determining, based on the preset second index algorithm and the reference data, at least one target compressed data in the compressed data set, each compressed data corresponding to a full amount of data; and

determining, based on the preset third index algorithm and the reference data, at least one target data from the full amount of data corresponding to each target compressed data.

5. The method according to claim 1, wherein the determining at least one target data corresponding to the data query request based on the target indexing method comprises:

determining, based on the preset second index algorithm and the reference data, at least one target compressed data in the compressed data set corresponding to each target index label, each compressed data corresponding to a full amount of data; and

6. The method according to any one of claims 2, 3 and 5, wherein the index label is used to characterize the clustering feature of the corresponding data set;

the determining at least one target index label corresponding to the reference data based on the preset first index algorithm and the reference data includes:

determining the similarity between the reference data and the index label of each data set based on a preset first indexing algorithm; and

determining, based on a preset first similarity condition, at least one target index label whose similarity satisfies the first similarity condition.

7. The method according to any one of claims 3-5, wherein the determining, based on the preset second index algorithm and the reference data, at least one target compressed data in the compressed data set corresponding to each target index label comprises:

determining the compressed data set corresponding to each target index label;

determining the similarity between the reference data and the compressed data in each compressed data set based on a preset second indexing algorithm; and

determining, based on a preset second similarity condition, at least one target compressed data whose similarity meets the second similarity condition.

8. The method according to claim 4 or 5, wherein the compressed data set further comprises a corresponding codebook, and the codebook is used to record the correspondence between each compressed data in the compressed data set and each full amount of data in the storage medium;

the determining at least one target data in the full amount of data corresponding to each target compressed data based on the preset third index algorithm and the reference data includes:

determining the full amount of data corresponding to each target compressed data based on the codebook corresponding to each compressed data set;

determining the similarity between the reference data and the full amount of data corresponding to each target compressed data based on a preset third index algorithm; and

determining, based on a preset third similarity condition, at least one target data whose similarity meets the third similarity condition.

9. The method according to claim 5, wherein the first indexing algorithm is a hierarchical navigable small world (HNSW) algorithm, an inverted file algorithm or an inverted multi-index algorithm, the second indexing algorithm is a scalar quantization algorithm or a product quantization algorithm, and the third indexing algorithm is a full exact search algorithm.

10. A data storage method, wherein the method comprises:

receiving an index construction request, the index construction request including the data type and data characteristics of the original data;

determining a corresponding index creation method according to the data type and the data characteristics; and

constructing an index of the original data according to the index creation method, and storing the original data correspondingly.

11. The method according to claim 10, wherein the constructing the index of the original data according to the index creation method comprises:

in response to the index creation method including routing index creation and compressed index creation, clustering the original data to obtain at least one data subset, and creating a routing index for each of the data subsets; and

performing quantization compression on each of the data subsets, and creating a compressed index of the data in each compressed data subset.

12. The method according to claim 10, wherein the constructing the index of the original data according to the index creation method comprises:

in response to the index creation method including compressed index creation and full index creation, performing quantization compression on the original data, and creating a compressed index of the compressed data; and

creating a full index according to the correspondence between the compressed data and the original data.

13. The method according to claim 10, wherein the constructing the index of the original data according to the index creation method comprises:

in response to the index creation method including routing index creation, compressed index creation and full index creation, clustering the original data to obtain at least one data subset, and creating a routing index for each of the data subsets;

performing quantization compression on each of the data subsets to create a compressed index of the data in each of the compressed data subsets; and

14. The method according to claim 13, wherein the performing clustering on the original data, obtaining at least one data subset, and creating a routing index for each of the data subsets comprises:

training, based on the original data, the first indexing algorithm corresponding to the routing index, and determining n data subsets corresponding to the original data and the index label of each data subset, where n is a natural number greater than or equal to 1, and the index labels are used to characterize the clustering features of the corresponding data subsets.

15. The method according to claim 14, wherein the performing quantization compression on each of the data subsets, and creating a compressed index of the data in the compressed data subsets, comprises:

training, based on the index label of each data subset, the second indexing algorithm corresponding to the compressed index, to determine the mapping lookup table corresponding to the n data subsets; and

performing quantization compression on the data in each data subset to determine n compressed data sets.

16. The method according to claim 15, wherein the training, based on the index label of each data subset, the second indexing algorithm corresponding to the compressed index, to determine the mapping lookup table corresponding to the n data subsets, comprises:

determining the residual vector or original vector of each index label and the data in the corresponding data subset;

training the second indexing algorithm corresponding to the compressed index based on each residual vector or each original vector; and

decomposing each residual vector or each original vector, and determining the mapping lookup table corresponding to the n compressed data sets.

17. The method according to claim 15, wherein the creating a full index according to the corresponding relationship between the compressed data and the original data comprises:

determining a codebook corresponding to the n compressed data sets based on the mapping lookup table, where the codebook is used to record the full amount of data corresponding to each compressed data in the compressed data set; and

storing each original data to a storage medium based on each compressed data set.

18. The method according to claim 17, wherein the determining the codebook corresponding to the n compressed data sets based on the mapping lookup table comprises:

encoding each residual vector according to the mapping lookup table, and determining the codebook corresponding to the n compressed data sets.

19. The method of claim 10, wherein the method further comprises:

receiving a batch insert request, the batch insert request including at least one data to be processed;

deleting the constructed index; and

constructing an index based on each data to be processed and the original data.

20. The method of claim 17, wherein the method further comprises:

receiving an additional insert request, the additional insert request including at least one data to be processed;

performing a query operation based on the at least one data to be processed, and determining at least one compressed data set to be updated;

compressing the at least one data to be processed and inserting it into the corresponding compressed data set to be updated;

determining the codebook corresponding to the updated compressed data set; and

storing the data to be processed to a storage medium.

21. The method of claim 20, wherein the method further comprises:

in response to the data volume of the updated compressed data set being greater than a predetermined data volume upper threshold, splitting the updated compressed data set into at least two compressed data sets; and

determining the index labels corresponding to the split compressed data sets.
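The insert-then-split behaviour of claims 20 and 21 can be sketched as follows, assuming that each compressed data set is a list keyed by an integer index label and that an oversized set is split into two halves. Halving is only one possible policy, since the claim requires at least two resulting sets. `MAX_SIZE`, `split_if_oversized`, and `append_insert` are hypothetical names.

```python
MAX_SIZE = 4  # assumed upper threshold on the size of a compressed data set


def split_if_oversized(sets, label, max_size=MAX_SIZE):
    """Split sets[label] in two when it exceeds max_size.

    Returns the index labels that now hold the items of the original set.
    """
    items = sets[label]
    if len(items) <= max_size:
        return [label]
    mid = len(items) // 2
    new_label = max(sets) + 1  # assumed labelling scheme: next free integer
    sets[label] = items[:mid]
    sets[new_label] = items[mid:]
    return [label, new_label]


def append_insert(sets, label, item, max_size=MAX_SIZE):
    """Insert one compressed item into its target set, splitting if needed."""
    sets[label].append(item)
    return split_if_oversized(sets, label, max_size)


sets = {0: [1, 2, 3, 4], 1: [9]}
labels_after_split = append_insert(sets, 0, 5)
labels_no_split = append_insert(sets, 1, 10)
```

The returned labels are the ones whose index labels would need to be determined again after the split.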

22. The method of claim 17, wherein the method further comprises:

receiving a data deletion request, where the data deletion request at least includes a data identifier of the data to be deleted;

performing a query operation based on the data identifier of the data to be deleted to determine each piece of data to be deleted;

deleting each piece of data to be deleted and the compressed data corresponding to it; and

updating the codebook and index label corresponding to each piece of data to be deleted.

23. The method of claim 22, wherein the method further comprises:

in response to the data volume of the compressed data set from which the compressed data was deleted being less than a predetermined data volume lower threshold, merging that compressed data set with other compressed data sets to determine a merged data set, wherein the data volume of the merged data set is less than the predetermined data volume upper threshold;

deleting the index information corresponding to the compressed data set from which the compressed data was deleted; and

updating the index information corresponding to the merged data set.
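The delete-then-merge behaviour of claims 22 and 23 can be sketched in the same spirit, again assuming that compressed data sets are lists keyed by integer index labels. The function `delete_and_merge` and the parameters `min_size` and `max_size` are hypothetical, and choosing the first partner set that fits is an assumption rather than the application's stated policy.

```python
def delete_and_merge(sets, label, item_id, min_size, max_size):
    """Delete item_id from sets[label], then merge the set if it is too small.

    Returns the index label that holds the remaining items afterwards.
    """
    sets[label] = [x for x in sets[label] if x != item_id]
    if len(sets[label]) >= min_size:
        return label
    # Look for a partner set such that the merged size stays below max_size.
    for other in list(sets):
        if other != label and len(sets[other]) + len(sets[label]) < max_size:
            sets[other].extend(sets.pop(label))  # drop the undersized set
            return other
    return label  # no suitable partner; keep the small set as it is


sets_after_delete = {0: ["a", "b"], 1: ["c", "d", "e"], 2: ["f"]}
merged_label = delete_and_merge(
    sets_after_delete, 0, "a", min_size=2, max_size=4
)
```

Here set 1 is skipped because merging would reach the upper threshold, so the remaining item moves to set 2 and label 0 disappears, matching the step that deletes the old index information.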

24. A data query device, characterized in that the device comprises:

a first receiving module, configured to receive a data query request, where the data query request at least includes reference data;

a first determining module, configured to determine a target index mode corresponding to the data query request based on the reference data; and

a query module, configured to determine at least one target data corresponding to the data query request based on the target index mode.

25. A data storage device, characterized in that the device comprises:

a second receiving module, configured to receive an index construction request, where the index construction request includes the data type and data characteristics of the original data;

a second determining module, configured to determine a corresponding index creation method according to the data type and data characteristics; and

a storage module, configured to construct an index of the original data according to the index creation method, and store the original data correspondingly.

26. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method of any one of claims 1-23.

27. The electronic device according to claim 26, wherein the electronic device further comprises a storage medium and a coprocessor;

the storage medium is configured to store the full amount of data; and

the coprocessor is configured to determine a target index tag corresponding to reference data, wherein the reference data is at least the data included in the data query request.

28. A computer storage medium, wherein a computer program is stored in the computer storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1-23 is implemented.