CN113297264A - Method and device for massively parallel processing of database - Google Patents

Method and device for massively parallel processing of database

Info

Publication number
CN113297264A
Authority
CN
China
Prior art keywords: vector, retrieved, retrieval, feature vector, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010281397.1A
Other languages
Chinese (zh)
Inventor
楼仁杰
魏闯先
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority claimed from application CN202010281397.1A
Publication of CN113297264A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries

Abstract

The embodiments of this specification provide a method and a device for massively parallel processing of a database. The method comprises the following steps: receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved; determining, by an approximate nearest neighbor retrieval method, at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table, and acquiring the retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and the computing node corresponding to each feature vector; retrieving, in each retrieval computing node, an approximate vector corresponding to the vector to be retrieved; and determining a target vector corresponding to the vector to be retrieved from the at least one approximate vector. The method guarantees retrieval accuracy while greatly reducing computation.

Description

Method and device for massively parallel processing of database
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to a method for massively parallel processing of a database. One or more embodiments of this specification also relate to a second method for massively parallel processing of a database, two apparatuses for massively parallel processing of a database, two computing devices, and a computer-readable storage medium.
Background
With the development of computer technology, distributed systems have also developed rapidly. The MPP architecture distributes tasks across multiple servers and computing nodes in parallel; after the computation on each computing node is completed, the partial results are gathered together to obtain the final result. A database adopting the MPP architecture is called an MPP database.
With the advance of technology and the rapid growth of data volume, the traditional MPP database can no longer meet the low-latency and high-efficiency requirements of current projects. For example, in a face recognition project, the project system extracts feature vectors from face pictures captured by surveillance cameras and then performs approximate nearest neighbor retrieval in a vector database built on the MPP database. However, when the database stores data on the order of billions of records, the retrieval process is very long and consumes a great deal of computing resources, which is very inconvenient.
Therefore, how to solve the above problems has become an urgent issue for those skilled in the art.
Disclosure of Invention
In view of the above, this specification provides a method for massively parallel processing of a database. One or more embodiments of this specification also relate to a second method for massively parallel processing of a database, two apparatuses for massively parallel processing of a database, two computing devices, and a computer-readable storage medium, to address the technical shortcomings of the prior art.
According to a first aspect of embodiments herein, there is provided a method for massively parallel processing of a database, comprising:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
Optionally, determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, including:
sequentially calculating the difference value between the vector to be retrieved and each feature vector in the feature vector index table;
and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
Optionally, determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors includes:
calculating the difference value between each approximate vector and the vector to be retrieved;
and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
Optionally, after determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors, the method further includes:
acquiring attribute information corresponding to the target vector according to the target vector;
and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
Optionally, the method further includes:
receiving a deleting instruction;
executing the delete instruction in each compute node of the massively parallel processing database.
Optionally, the feature vector index table is obtained through training in the following steps, including:
acquiring a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data;
clustering a plurality of training vectors in the training samples;
determining at least two feature vectors according to the clustering result, and determining a computing node corresponding to each feature vector;
and generating a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
Optionally, determining a computing node corresponding to each feature vector includes:
and determining the computing node corresponding to each feature vector according to the number of computing nodes in the massively parallel processing database.
According to a second aspect of embodiments herein, there is provided a method for massively parallel processing of a database, comprising:
receiving a write-in instruction, wherein the write-in instruction comprises a vector to be written corresponding to project data to be written;
determining a writing characteristic vector corresponding to the vector to be written in a pre-generated characteristic vector index table by using an approximate nearest neighbor retrieval method, and acquiring a writing calculation node corresponding to the writing characteristic vector, wherein the characteristic vector index table records a plurality of characteristic vectors and a calculation node corresponding to each characteristic vector;
and writing the vector to be written into the writing calculation node.
Optionally, determining, by using an approximate nearest neighbor search method, a write feature vector corresponding to the to-be-written vector in a pre-generated feature vector index table, includes:
sequentially calculating the difference value between the vector to be written and each feature vector in the feature vector index table;
and determining the feature vector with the minimum difference value as the writing feature vector corresponding to the vector to be written.
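The write-routing step above can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the index-table layout (a list of dicts with `feature_vector` and `node` keys) and the use of Euclidean distance as the "difference value" are assumptions.

```python
import math

def route_write(vector, index_table):
    """Pick the feature vector with the smallest difference (here: Euclidean
    distance, an assumed metric) and return its computing node, so the
    to-be-written vector is sent to that node.

    index_table: list of {"feature_vector": tuple, "node": int}
    (hypothetical layout, not specified by the patent).
    """
    entry = min(index_table,
                key=lambda e: math.dist(vector, e["feature_vector"]))
    return entry["node"]
```

A write instruction would then simply store the vector on the returned node, so later retrievals for similar vectors only need to visit that node's neighborhood.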
Optionally, the method further includes:
receiving a deleting instruction;
executing the delete instruction in each compute node of the massively parallel processing database.
Optionally, the method further includes:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval characteristic vector corresponding to the vector to be retrieved in the characteristic vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval characteristic vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
Optionally, determining at least one retrieval feature vector corresponding to the vector to be retrieved in the feature vector index table by using an approximate nearest neighbor retrieval method includes:
sequentially calculating the difference value between the vector to be retrieved and each feature vector in the feature vector index table;
and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
Optionally, determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors includes:
calculating the difference value between each approximate vector and the vector to be retrieved;
and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
Optionally, after determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors, the method further includes:
acquiring attribute information corresponding to the target vector according to the target vector;
and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
Optionally, the feature vector index table is obtained through training in the following steps, including:
acquiring a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data;
clustering a plurality of training vectors in the training samples;
determining at least two feature vectors according to the clustering result, and determining a computing node corresponding to each feature vector;
and generating a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
Optionally, determining a computing node corresponding to each feature vector includes:
and determining the computing node corresponding to each feature vector according to the number of computing nodes in the massively parallel processing database.
According to a third aspect of embodiments herein, there is provided an apparatus for massively parallel processing of a database, comprising:
a first receiving module configured to receive a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
the first node determining module is configured to determine at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquire a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and computing nodes corresponding to each feature vector;
a first retrieval vector module configured to retrieve an approximate vector corresponding to the vector to be retrieved in each of the retrieval computing nodes;
and the first vector determining module is configured to determine a target vector corresponding to the vector to be retrieved from at least one approximate vector.
According to a fourth aspect of embodiments herein, there is provided an apparatus for massively parallel processing of a database, comprising:
a second receiving module configured to receive a write instruction, wherein the write instruction comprises a vector to be written corresponding to the item data to be written;
a second node determining module configured to determine, by an approximate nearest neighbor search method, a writing feature vector corresponding to the to-be-written vector in a pre-generated feature vector index table, and acquire the writing computation node corresponding to the writing feature vector, wherein the feature vector index table records a plurality of feature vectors and the computation node corresponding to each feature vector;
a write module configured to write the vector to be written to the write compute node.
According to a fifth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
According to a sixth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
receiving a write-in instruction, wherein the write-in instruction comprises a vector to be written corresponding to project data to be written;
determining a writing characteristic vector corresponding to the vector to be written in a pre-generated characteristic vector index table by using an approximate nearest neighbor retrieval method, and acquiring a writing calculation node corresponding to the writing characteristic vector, wherein the characteristic vector index table records a plurality of characteristic vectors and a calculation node corresponding to each characteristic vector;
and writing the vector to be written into the writing calculation node.
According to a seventh aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the method for massively parallel processing of a database.
According to the method for massively parallel processing of a database, when retrieval is performed in the massively parallel processing database, the retrieval computing nodes that need to be searched are determined through the preset feature vector index table, retrieval is performed only in those nodes, and the target vector corresponding to the vector to be retrieved is determined from the at least one approximate vector obtained, which guarantees the accuracy of the target vector. Because the method provided by this specification searches only the retrieval computing nodes rather than all computing nodes, it guarantees retrieval accuracy while greatly reducing the time of each retrieval, avoids useless computation, improves retrieval capability, and improves user experience.
Drawings
FIG. 1 is a process flow diagram of a method for massively parallel processing of a database provided by a first embodiment of the present specification;
FIG. 2 is a flowchart illustrating a training process of a feature vector index table according to a second embodiment of the present disclosure;
FIG. 3 is a process flow diagram of a method for massively parallel processing of a database provided by a third embodiment of the present specification;
FIG. 4a is a diagram illustrating a structure of a massively parallel processing database provided by a third embodiment of the present specification;
FIG. 4b shows a schematic diagram of vector writing provided by a third embodiment of the present description;
FIG. 4c is a schematic diagram illustrating vector retrieval provided by a third embodiment of the present specification;
FIG. 4d is a schematic diagram illustrating vector deletion provided by a third embodiment of the present specification;
FIG. 5 is a schematic diagram illustrating an apparatus for massively parallel processing of a database according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an apparatus for massively parallel processing of a database according to another embodiment of the present disclosure;
FIG. 7 is a block diagram of a computing device, according to an embodiment of the present disclosure;
fig. 8 is a block diagram of a computing device according to another embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of this specification. This specification may, however, be embodied in many forms other than those described herein, and those skilled in the art can make similar modifications without departing from its spirit and scope; the specification is therefore not limited to the specific embodiments disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of this specification, a first may also be referred to as a second and, similarly, a second may also be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Massively parallel processing (MPP) database: a database adopting the MPP architecture, which distributes tasks across multiple servers and computing nodes in parallel; after the computation on each computing node is completed, the partial results are gathered together to obtain the final result.
Partition pruning: for a partitioned table or partitioned index, the optimizer can automatically determine the partitions to be accessed from the FROM and WHERE clauses according to the partition key, thereby avoiding a scan of all partitions, reducing IO (input/output) requests, and reducing the pressure on the server's CPU (central processing unit).
Approximate nearest neighbor retrieval: Approximate Nearest Neighbor (ANN) search quickly retrieves the nearest N neighboring vectors in a high-dimensional space through a pre-constructed index, returning approximately accurate results.
Clustering: the process of dividing a collection of physical or abstract objects into classes composed of similar objects.
In the present specification, a method for massively parallel processing of a database is provided, and the present specification also relates to a method for massively parallel processing of a database, two apparatuses for massively parallel processing of a database, two computing devices, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
FIG. 1 shows a process flow diagram of a method for massively parallel processing of a database, including steps 102 through 108, provided in accordance with one embodiment of the present description.
Step 102: receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved.
The retrieval instruction is an instruction for retrieving in a database, and generally includes data content to be retrieved, a retrieval database, retrieval conditions, and the like.
The data of the item to be retrieved is item data used by the retrieval instruction, and the data of the item to be retrieved is different in different items, for example, in a face recognition item, the data of the item to be retrieved is a face image to be retrieved, and in a voice recognition item, the data of the item to be retrieved is voice information to be retrieved.
The vector to be retrieved is a vector obtained after the item data to be retrieved is subjected to feature vectorization, when the item data is a face image, the vector to be retrieved is a face feature vector obtained after the face image is subjected to feature vectorization, and when the item data is voice data, the vector to be retrieved is a voice feature vector obtained after the voice data is subjected to feature vectorization.
In the first embodiment provided in this specification, taking speech recognition as an example, the item data to be retrieved is the speech information to be retrieved, and the goal is to retrieve the person attribute information corresponding to that speech information; the person attribute information may include name, age, native place, educational background, and the like. The speech information is subjected to feature vectorization to obtain the vector to be retrieved V0, and the retrieval instruction retrieves the speech vector corresponding to the vector to be retrieved in the massively parallel processing database (MPP database).
Step 104: determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector.
Approximate Nearest Neighbor (ANN) search quickly finds the nearest N neighboring vectors in a high-dimensional space through a pre-constructed index. The pre-generated feature vector index table is one representation of the index pre-constructed for ANN retrieval; it records a plurality of feature vectors and the computing node corresponding to each feature vector. Using the ANN retrieval method, N retrieval feature vectors corresponding to the vector to be retrieved are found in the feature vector index table, and then the retrieval computing node corresponding to each retrieval feature vector is determined according to the correspondence between feature vectors and computing nodes. A retrieval feature vector is a feature vector in the index table that corresponds to the vector to be retrieved during vector retrieval, and a retrieval computing node is the computing node corresponding to a retrieval feature vector.
Optionally, determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, including: sequentially calculating the difference value between the vector to be retrieved and each feature vector in the feature vector index table; and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
And respectively calculating a difference value between the vector to be retrieved and each feature vector in the feature vector index table, wherein the difference value is used for identifying the difference between the two vectors, the smaller the difference value is, the more similar the two vectors are, and the feature vector of which the difference value is smaller than a preset threshold value is used as the retrieval feature vector corresponding to the vector to be retrieved.
When calculating the difference between two vectors, the difference may be calculated by using a distance algorithm, such as euclidean distance, manhattan distance, minkowski distance, or may be calculated by using a similarity algorithm, such as cosine similarity, pearson correlation coefficient, log likelihood similarity, or the like.
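The distance and similarity measures named above can be sketched in plain Python. This is a minimal illustration of the metrics themselves; the patent does not prescribe a particular implementation.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity: larger (closer to 1) means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note the opposite orientations: with distances, a smaller value indicates a smaller difference between vectors, while with cosine similarity a larger value does.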
In practical application, the difference values may also be sorted, a preset number N of retrieval feature vectors is obtained, and the first N feature vectors with the smallest difference values are obtained as the retrieval feature vectors corresponding to the vector to be retrieved.
In the first embodiment provided in the present specification, following the above example, the feature vector index table generated in advance is shown in table 1 below.
Serial number | Feature vector | Computing node
------------- | -------------- | --------------------
1             | V1             | No. 1 computing node
2             | V2             | No. 5 computing node
3             | V3             | No. 11 computing node
…             | …              | …
N             | VN             | No. 7 computing node

TABLE 1
The cosine similarity between the vector to be retrieved V0 and each feature vector V1, V2, V3, …, VN in the feature vector index table is calculated in turn; the larger the cosine value, the smaller the difference between the vectors. The feature vectors V1, V2, and V3, whose cosine values are greater than a preset threshold, are selected as the retrieval feature vectors corresponding to the vector to be retrieved V0, and the corresponding retrieval computing nodes, determined through the feature vector index table, are computing nodes No. 1, No. 5, and No. 11.
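The threshold-based lookup in this example can be sketched as follows. The index-table layout (a list of dicts) and the function names are assumptions for illustration only; the cosine values in the test data are chosen to mirror the "above threshold" selection described here.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def lookup_retrieval_nodes(query, index_table, threshold):
    """Return the computing nodes whose feature vectors have cosine
    similarity to the query above the preset threshold."""
    return [e["node"] for e in index_table
            if cosine_similarity(query, e["feature_vector"]) > threshold]
```

Only the returned nodes then receive the retrieval instruction; all other computing nodes are skipped, which is the source of the savings the patent claims.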
Step 106: and retrieving the approximate vector corresponding to the vector to be retrieved in each retrieval computing node.
An approximate vector is the vector stored in a computing node that has the smallest difference from the vector to be retrieved. The retrieval instruction is issued to each retrieval computing node and executed there separately; among the vectors stored in each retrieval computing node, the vector with the smallest difference from the vector to be retrieved is obtained as the approximate vector corresponding to the vector to be retrieved in that node.
In the first embodiment provided in this specification, following the above example, the approximate vector of the vector to be retrieved V0 in computing node No. 1 is V11, the approximate vector in computing node No. 5 is V21, and the approximate vector in computing node No. 11 is V31.
Step 108: and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
Optionally, determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors includes: calculating the difference value between each approximate vector and the vector to be retrieved; and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
The target vector is the vector in the massively parallel processing database with the smallest difference from the vector to be retrieved. After at least one approximate vector is obtained, the difference between each approximate vector and the vector to be retrieved is calculated, and the approximate vector with the smallest difference from the vector to be retrieved is determined as the target vector corresponding to the vector to be retrieved.
In the first embodiment provided in this specification, following the above example, the cosine similarity between each of the approximate vectors V11, V21, and V31 and the vector to be retrieved V0 is calculated separately, and the approximate vector with the maximum cosine similarity, V21, is determined from V11, V21, and V31 as the target vector corresponding to the vector to be retrieved V0.
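Steps 106 and 108, fanning the search out to the retrieval nodes and then picking the global best among the per-node approximate vectors, can be sketched as below. The `node_store` dict mapping node numbers to their stored vectors is a hypothetical stand-in for the per-node storage; cosine similarity is used as the (assumed) difference measure, consistent with the example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_in_node(query, vectors):
    """Approximate vector inside one retrieval computing node: the stored
    vector with the highest cosine similarity to the query."""
    return max(vectors, key=lambda v: cosine_similarity(query, v))

def retrieve_target(query, node_store, retrieval_nodes):
    """Search only the retrieval nodes (not all nodes), then pick the
    global best among the per-node approximate vectors."""
    candidates = [nearest_in_node(query, node_store[n])
                  for n in retrieval_nodes]
    return max(candidates, key=lambda v: cosine_similarity(query, v))
```

In a real MPP deployment the per-node search would run in parallel on each node and only the candidates would be shipped back, but the selection logic is the same.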
Optionally, after determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors, the method further includes: acquiring attribute information corresponding to the target vector according to the target vector; and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
In practical application, after the target vector is determined, attribute information corresponding to the target vector needs to be acquired, and the attribute information corresponding to the target vector is used as a retrieval result corresponding to item data to be retrieved in a retrieval instruction.
In the first embodiment provided in this specification, following the above example, the attribute information corresponding to the target vector V21, such as name, age, native place, and educational background, is obtained and used as the person attribute information corresponding to the vector V0 to be retrieved.
Optionally, the method further includes: receiving a deleting instruction; executing the delete instruction in each compute node of the massively parallel processing database.
In practical application, when information needs to be deleted in the MPP database, the received deletion instruction is sent to each computing node of the MPP database, and the deletion instruction is executed in each computing node.
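A rough sketch of this delete fan-out follows; the `ComputeNode` class is a hypothetical stand-in for a compute-node connection, not an API from this specification.

```python
class ComputeNode:
    # Hypothetical stand-in for a computing-node connection.
    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)

def broadcast_delete(statement, nodes):
    # Deletes bypass the index table: the instruction is forwarded to,
    # and executed on, every computing node of the MPP database.
    for node in nodes:
        node.execute(statement)
```

Unlike writes and retrievals, no routing decision is needed, because the rows matching the delete condition may live on any node.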
When the method for massively parallel processing of a database provided in the first embodiment of this specification is used to perform retrieval in the massively parallel processing database, the retrieval computing nodes that need to be searched are determined through the preset feature vector index table, so retrieval does not have to be performed in all computing nodes; useless computing consumption is avoided and retrieval efficiency is improved.
Next, with reference to fig. 2, a training obtaining step of the feature vector index table in the embodiment of the present specification is explained, and fig. 2 shows a flowchart of a training process of the feature vector index table provided in an embodiment of the present specification, where the specific steps include 202 to 208.
Step 202: obtaining a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data.
The training sample is sample data used to train and generate the feature vector index table, and comprises a plurality of training vectors corresponding to project data. Project data is data related to a project. For example, in a face recognition project the project data is face images, and the plurality of training vectors corresponding to the project data are the feature vectors of a plurality of face images; in a speech recognition project the project data is speech data, and the plurality of training vectors corresponding to the project data are the feature vectors of the speech data.
In a second embodiment provided in this specification, taking speech recognition as an example, training data is obtained, where the training data includes training vectors corresponding to 100 pieces of speech data.
Step 204: clustering a plurality of training vectors in the training samples.
Clustering is the process of dividing a set of physical or abstract objects into a plurality of classes composed of similar objects. A clustering operation is performed on the plurality of training vectors through a clustering algorithm; many such algorithms exist, for example the k-means clustering algorithm and the mean-shift method, and the clustering algorithm is not limited in this specification.
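As an illustration of the clustering step, a minimal pure-Python k-means is sketched below. The function signature and parameters are assumptions made for illustration; a production system would use a library implementation.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    # Minimal k-means for illustration: returns k cluster center vectors.
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each training vector to its nearest center
            # (squared Euclidean distance suffices for comparison).
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, centers[c])))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster emptied out
                dim = len(members[0])
                centers[c] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers
```

The resulting cluster center vectors are exactly the "feature vectors" that the following steps enter into the index table.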
In the second embodiment provided in the present specification, the above example is used, and 100 training vectors in the training samples are clustered by a k-means clustering algorithm.
Step 206: and determining at least two characteristic vectors according to the clustering result, and determining a computing node corresponding to each characteristic vector.
K sets are determined according to the clustering result, and the clustering center vector of each set is taken as the feature vector of that set. The computing node corresponding to each feature vector is then determined according to the number of computing nodes in the massively parallel processing database. There are many methods for doing so, for example encoding each feature vector and taking the code modulo the number of computing nodes, or obtaining an assignment through an Expectation-Maximization (EM) algorithm; the method for determining the computing node corresponding to each feature vector is not limited in this specification.
In the second embodiment provided in this specification, following the above example, the clustering result defines 9 sets whose clustering center vectors are B1, B2, B3, B4, B5, B6, B7, B8, B9; these 9 vectors are the feature vectors. The massively parallel processing database in this embodiment has 3 computing nodes, and the nodes are determined by the modulo method: the computing node corresponding to feature vectors B1, B4, B7 is computing node No. 1, the computing node corresponding to feature vectors B2, B5, B8 is computing node No. 2, and the computing node corresponding to feature vectors B3, B6, B9 is computing node No. 3.
Step 208: and generating a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
Each feature vector and its corresponding computing node are stored in one-to-one correspondence to generate the feature vector index table.
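The table-generation step can be sketched as follows, assuming the modulo assignment used in the example; the function name and the row layout (feature vector, computing node) are illustrative.

```python
def build_index_table(feature_vectors, node_count):
    # One row per feature vector: (feature_vector, computing_node).
    # Nodes are assigned by the modulo method from the example,
    # numbering computing nodes from 1.
    return [(v, i % node_count + 1) for i, v in enumerate(feature_vectors)]
```

With 9 feature vectors B1..B9 and 3 computing nodes this reproduces the assignment of the example: B1, B4, B7 to node No. 1, B2, B5, B8 to node No. 2, and B3, B6, B9 to node No. 3.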
In the second embodiment provided in the present specification, the above example is followed, and referring to table 2 below, table 2 shows the feature vector index table generated by the present embodiment.
Feature vector    Computing node
B1                No. 1
B2                No. 2
B3                No. 3
B4                No. 1
B5                No. 2
B6                No. 3
B7                No. 1
B8                No. 2
B9                No. 3

TABLE 2
In the training step of the feature vector index table provided in the second embodiment of this specification, the feature vector index table can be generated from a small amount of training data, realizing a partition cut of the massively parallel processing database by vector. In subsequent database operations it is then convenient to determine, for a vector to be processed, the feature vector with a small difference value, and to process the vector in the computing node corresponding to that feature vector without accessing every computing node in the massively parallel processing database; this reduces IO requests, reduces the pressure on the processor, and improves processing efficiency.
The following describes a method for massively parallel processing a database, taking an application of the method for massively parallel processing a database provided in this specification in the field of face recognition as an example, with reference to fig. 3. Fig. 3 shows a flowchart of a processing procedure of a method for massively parallel processing a database according to a third embodiment of the present specification, where specific steps include step 302 to step 306.
Step 302: receiving a write instruction, wherein the write instruction comprises a vector to be written corresponding to the item data to be written.
The write instruction is used to write data into the database, and the item data to be written is the item data corresponding to the write instruction. In the field of face recognition, the item data to be written is a face image to be written into the database, and the vector to be written is the vector obtained after feature vectorization of the face image.
In the third embodiment provided in this specification, the item data to be written is a face image, and the vector C0 to be written is the vector obtained after feature vectorization of the face image.
Step 304: determining a writing characteristic vector corresponding to the vector to be written in a pre-generated characteristic vector index table by using an approximate nearest neighbor retrieval method, and acquiring a writing calculation node corresponding to the writing characteristic vector, wherein the characteristic vector index table records a plurality of characteristic vectors and a calculation node corresponding to each characteristic vector.
The writing characteristic vector is a characteristic vector corresponding to the vector to be written in the characteristic vector index table in the vector writing process. And the writing calculation node is a calculation node corresponding to the writing characteristic vector.
Optionally, determining, by using an approximate nearest neighbor search method, a write feature vector corresponding to the to-be-written vector in a pre-generated feature vector index table, includes: sequentially calculating the difference value between the vector to be written and each eigenvector in the eigenvector index table; and determining the characteristic vector with the minimum difference value as a writing characteristic vector corresponding to the vector to be written.
A difference value between the vector to be written and each feature vector in the feature vector index table is calculated; the difference value identifies how different two vectors are, and the smaller the difference value, the more similar the two vectors. The feature vector with the smallest difference value from the vector to be written is taken as the write feature vector corresponding to the vector to be written.
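The write-routing step can be sketched as follows; the function name and the (feature_vector, computing_node) row layout of the index table are illustrative assumptions.

```python
import math

def route_write(vector, index_table):
    # index_table: list of (feature_vector, computing_node) rows.
    # The write computing node is the node of the feature vector with
    # the smallest difference value (Euclidean distance) from the
    # vector to be written.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, node = min(index_table, key=lambda row: dist(vector, row[0]))
    return node
```

Because every write is routed through the same argmin over the index table, vectors with small mutual difference values land on the same computing node, which is what makes the later partition-pruned retrieval possible.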
In a third embodiment provided in the present specification, a feature vector index table generated in advance is shown in table 3 below.
Feature vector    Computing node
V1                No. 1
V2                No. 3
…                 …
VN                …

TABLE 3
The Euclidean distance between the vector C0 to be written and each feature vector V1, V2, V3 … VN in the feature vector index table is calculated separately; the smaller the Euclidean distance, the smaller the difference value between the vectors. The feature vector V2 with the smallest difference value is selected as the write feature vector corresponding to the vector C0 to be written, and through the feature vector index table the write computing node corresponding to the vector C0 to be written is determined to be computing node No. 3.
Step 306: and writing the vector to be written into the writing calculation node.
In the third embodiment provided in this specification, the vector C0 to be written is written into computing node No. 3, completing the process of writing a new vector into the massively parallel processing database.
In the method for massively parallel processing of a database provided in the embodiments of this specification, in the process of writing a vector into the massively parallel processing database, the feature vector with the smallest difference value from the vector to be written is obtained by the approximate nearest neighbor retrieval method, and from it the write computing node corresponding to the vector to be written is determined. The pre-generated feature vector index table thereby performs partition cutting for the massively parallel processing database: it determines the corresponding write computing node for each vector to be written, so that vectors with small difference values are written into the same computing node, which facilitates subsequent query and use.
Optionally, the method further includes: receiving a deleting instruction; executing the delete instruction in each compute node of the massively parallel processing database.
After the delete instruction is received, it is sent directly to each computing node of the massively parallel processing database and executed in each computing node.
Optionally, the method further includes the following steps S3080 to S3088.
Step S3080: receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved.
Step S3080 is the same as the method of step 102, and for the specific explanation of step S3080, refer to the details of step 102 in the foregoing embodiment, which are not repeated herein.
In the third embodiment provided in this specification, a retrieval instruction is received; the retrieval instruction comprises the vector C1 to be retrieved corresponding to the face image to be retrieved.
Step S3082: and determining at least one retrieval characteristic vector corresponding to the vector to be retrieved in the characteristic vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval characteristic vector.
Step S3082 is the same as the method of step 104, and for the specific explanation of step S3082, refer to the details of step 104 in the foregoing embodiment, which are not repeated herein.
In the third embodiment provided in this specification, the Euclidean distance between the vector C1 to be retrieved and each feature vector V1, V2, V3 … VN in the feature vector index table shown in Table 3 is calculated separately; the smaller the Euclidean distance, the smaller the difference value between the vectors. The feature vectors V1 and V2 whose Euclidean distance is smaller than the preset threshold are determined as the retrieval feature vectors corresponding to the vector C1 to be retrieved, and through the feature vector index table the retrieval computing nodes corresponding to the vector C1 to be retrieved are determined to be computing nodes No. 1 and No. 3.
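The threshold-based routing used here can be sketched as follows; the function name, the row layout of the index table, and the threshold parameter are illustrative assumptions.

```python
import math

def route_retrieval(vector, index_table, threshold):
    # Return the computing nodes of every feature vector whose difference
    # value from the vector to be retrieved is below the threshold; the
    # search then runs only on these nodes rather than on all of them.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nodes = {node for fv, node in index_table if dist(vector, fv) < threshold}
    return sorted(nodes)
```

Unlike the write path, which keeps only the single closest feature vector, the retrieval path keeps every feature vector under the threshold, so a query near a cluster boundary still reaches all of the candidate nodes.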
Step S3084: and retrieving the approximate vector corresponding to the vector to be retrieved in each retrieval computing node.
Step S3084 is identical to the method of step 106, and for a detailed explanation of step S3084, refer to the details of step 106 in the foregoing embodiment, which are not repeated herein.
In the third embodiment provided in this specification, the approximate vector of the vector C1 to be retrieved found in computing node No. 1 is C11, and the approximate vector found in computing node No. 3 is C31.
Step S3086: and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
Step S3086 is the same as the method of step 108, and for the specific explanation of step S3086, refer to the details of step 108 in the foregoing embodiment, which will not be repeated herein.
In the third embodiment provided in this specification, the Euclidean distances between the approximate vectors C11, C31 and the vector C1 to be retrieved are calculated separately, and from C11 and C31 the approximate vector C31 with the smallest Euclidean distance is determined as the target vector corresponding to the vector C1 to be retrieved.
Step S3088: and acquiring attribute information corresponding to the target vector according to the target vector, and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
In practical application, after the target vector is determined, attribute information corresponding to the target vector needs to be acquired, and the attribute information corresponding to the target vector is used as a retrieval result corresponding to item data to be retrieved in a retrieval instruction.
In the third embodiment provided in this specification, the attribute information corresponding to the target vector C31, such as name, age, home address, and work unit, is obtained and used as the person attribute information corresponding to the vector C1 to be retrieved.
For convenience of understanding, referring to fig. 4a to 4d, a face recognition scenario in the third embodiment is further explained with reference to fig. 4a to 4d, where fig. 4a shows a structural schematic diagram of a massively parallel processing database provided by the third embodiment of the present specification, fig. 4b shows a schematic diagram of vector writing provided by the third embodiment of the present specification, fig. 4c shows a schematic diagram of vector retrieval provided by the third embodiment of the present specification, and fig. 4d shows a schematic diagram of vector deletion provided by the third embodiment of the present specification.
Referring to fig. 4a, a schematic structural diagram of a massively parallel processing database is shown, wherein a client is connected with front-end nodes of the massively parallel processing database, and the front-end nodes are respectively connected with computing nodes 1 to 6 in the massively parallel processing database.
1000 face images are acquired in advance, and the feature vector of each face image is extracted as a training vector to generate a training sample of face-image feature vectors. K-means clustering is performed on the 1000 training vectors to obtain N cluster sets, and the clustering center vector of each set is taken as the feature vector of that set, giving N feature vectors V1, V2, V3 … VN. The massively parallel processing database has 6 computing nodes; the computing node corresponding to each feature vector is determined by the modulo method, and each feature vector and its corresponding computing node are stored in one-to-one correspondence to generate the feature vector index table shown in Table 3.
Referring to fig. 4b, the image capturing apparatus captures a face image, extracts through feature extraction the vector C0 to be written corresponding to the face image, and writes the vector C0 to be written into a front-end node of the massively parallel processing database through a write instruction.
The front-end node performs partition routing for the vector C0 to be written through the feature vector index table: the Euclidean distance between the vector C0 to be written and each feature vector V1, V2, V3 … VN in the feature vector index table is calculated separately, the feature vector V2 with the smallest Euclidean distance is determined as the write feature vector corresponding to the vector C0 to be written, and through the feature vector index table the write computing node corresponding to the vector C0 to be written is determined to be computing node No. 3.
Through the partition routing, the front-end node writes the vector C0 to be written into computing node No. 3, completing the write process for the feature vector corresponding to the face image.
Referring to fig. 4c, when performing face recognition, the face image captured by the monitoring camera needs to be retrieved to obtain corresponding person attribute information such as name, age, native place, and the like.
The monitoring camera performs feature vectorization on the captured face image to obtain the vector C1 to be retrieved, and the vector C1 to be retrieved is input to the front-end node.
The front-end node performs vector query and partition cutting for the vector C1 to be retrieved through the feature vector index table: the Euclidean distance between the vector C1 to be retrieved and each feature vector in the feature vector index table is calculated separately, the feature vectors V1 and V2 whose Euclidean distance is smaller than the preset threshold are selected as the retrieval feature vectors corresponding to the vector C1 to be retrieved, and through the feature vector index table the retrieval computing nodes corresponding to the vector to be retrieved are determined to be computing nodes No. 1 and No. 3, respectively.
The approximate vector C11 of the vector C1 to be retrieved is found in computing node No. 1, and the approximate vector C31 of the vector C1 to be retrieved is found in computing node No. 3. The Euclidean distances between the approximate vectors C11, C31 and the vector C1 to be retrieved are calculated separately, and the approximate vector C31 in computing node No. 3 is determined as the target vector of the vector C1 to be retrieved.
The person information stored in correspondence with the target vector, such as name, age, and native place, is used as the retrieval result corresponding to the vector C1 to be retrieved.
Referring to fig. 4d, the client delete instruction is sent to the front-end node, where it is forwarded to each compute node via partition routing, and executed in each compute node.
In the method for massively parallel processing of a database provided in the embodiments of this specification, in the process of writing a vector into the massively parallel processing database, the feature vector with the smallest difference value from the vector to be written is obtained by the approximate nearest neighbor retrieval method, and from it the write computing node corresponding to the vector to be written is determined. The pre-generated feature vector index table thereby performs partition cutting for the massively parallel processing database: it determines the corresponding write computing node for each vector to be written, so that vectors with small difference values are written into the same computing node, which facilitates subsequent query and use.
Secondly, when retrieval is performed in the massively parallel processing database, the retrieval computing nodes that need to be searched are determined through the preset feature vector index table, so retrieval does not have to be performed in all computing nodes; useless computing consumption is avoided and retrieval efficiency is improved. Retrieval is then performed in the retrieval computing nodes, and the target vector corresponding to the vector to be retrieved is determined from the at least one approximate vector obtained, which guarantees the accuracy of the target vector.
Corresponding to the above method embodiments, the present specification further provides an embodiment of an apparatus for massively parallel processing a database, and fig. 5 shows a schematic structural diagram of an apparatus for massively parallel processing a database provided by an embodiment of the present specification. As shown in fig. 5, the apparatus includes:
the first receiving module 502 is configured to receive a retrieval instruction, where the retrieval instruction includes a vector to be retrieved corresponding to item data to be retrieved.
A first node determining module 504, configured to determine at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and obtain a retrieval computing node corresponding to each retrieval feature vector, where the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector.
A first retrieve vector module 506 configured to retrieve, in each of the retrieve computing nodes, an approximate vector corresponding to the vector to be retrieved.
A first vector determining module 508 configured to determine a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors.
Optionally, the first determining node module 504 is further configured to sequentially calculate a difference value between the vector to be retrieved and each feature vector in the feature vector index table; and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
Optionally, the first vector determining module 508 is further configured to calculate a difference value between each of the approximate vectors and the vector to be retrieved; and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
Optionally, the apparatus further comprises:
the first matching result module is configured to acquire attribute information corresponding to the target vector according to the target vector; and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
Optionally, the apparatus further comprises:
a second receiving module configured to receive a deletion instruction;
a first execution instruction module configured to execute the delete instruction in each compute node of the massively parallel processing database.
Optionally, the apparatus further includes a training module, where the training module is configured to train and generate the feature vector index table, and the training module includes:
the acquisition sample submodule is configured to acquire a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data;
a clustering submodule configured to cluster a plurality of training vectors in the training samples;
the determining submodule is configured to determine at least two feature vectors according to a clustering result and determine a computing node corresponding to each feature vector;
and the generation submodule is configured to generate a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
Optionally, the determining submodule is configured to determine a computing node corresponding to each feature vector according to the number of computing nodes in the massively parallel processing database.
With the apparatus for massively parallel processing of a database provided in this specification, when retrieval is performed in the massively parallel processing database, the retrieval computing nodes that need to be searched are determined through the preset feature vector index table; retrieval is performed in those nodes, and the target vector corresponding to the vector to be retrieved is determined from the at least one approximate vector obtained, which guarantees the accuracy of the target vector. The method provided in this specification does not require searching for the vector to be retrieved in all computing nodes, so while retrieval accuracy is guaranteed, the time of each retrieval is greatly reduced, useless computing consumption is greatly reduced, retrieval capability is improved, and user experience is improved.
The above is an illustrative scheme of an apparatus for massively parallel processing of a database according to the present embodiment. It should be noted that the technical solution of the apparatus for massively parallel processing a database belongs to the same concept as the technical solution of the method for massively parallel processing a database described above, and details of the technical solution of the apparatus for massively parallel processing a database, which are not described in detail, can be referred to the description of the technical solution of the method for massively parallel processing a database described above.
Corresponding to the above method embodiments, the present specification further provides an embodiment of an apparatus for massively parallel processing a database, and fig. 6 shows a schematic structural diagram of an apparatus for massively parallel processing a database provided by an embodiment of the present specification. As shown in fig. 6, the apparatus includes:
a third receiving module 602 configured to receive a write instruction, where the write instruction includes a to-be-written vector corresponding to the to-be-written item data.
A second determining node module 604, configured to determine, by using an approximate nearest neighbor search method, a write feature vector corresponding to the to-be-written vector in a pre-generated feature vector index table, and obtain a write computation node corresponding to the write feature vector, where the feature vector index table records a plurality of feature vectors and a computation node corresponding to each feature vector.
A write module 606 configured to write the vector to be written to the write compute node.
Optionally, the second determining node module 604 is further configured to sequentially calculate an approximation value between the vector to be written and each feature vector in the feature vector index table by using an approximate nearest neighbor retrieval method; and determine the feature vector with the largest approximation value as the write feature vector corresponding to the vector to be written.
Optionally, the apparatus further comprises:
a fourth receiving module configured to receive a deletion instruction;
a second execution instruction module configured to execute the delete instruction in each compute node of the massively parallel processing database.
Optionally, the apparatus further comprises:
a fifth receiving module, configured to receive a retrieval instruction, where the retrieval instruction includes a vector to be retrieved corresponding to the item data to be retrieved;
a third determining node module, configured to determine at least one retrieval feature vector corresponding to the vector to be retrieved in the feature vector index table by using an approximate nearest neighbor retrieval method, and obtain a retrieval computing node corresponding to each retrieval feature vector;
a second retrieval vector module configured to retrieve an approximate vector corresponding to the vector to be retrieved in each of the retrieval computing nodes;
and the second vector determining module is configured to determine a target vector corresponding to the vector to be retrieved from at least one approximate vector.
Optionally, the third determining node module is further configured to sequentially calculate a difference value between the vector to be retrieved and each feature vector in the feature vector index table; and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
Optionally, the second vector determining module is further configured to calculate a difference value between each of the approximate vectors and the vector to be retrieved; and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
Optionally, the apparatus further comprises:
the second matching result module is configured to obtain attribute information corresponding to the target vector according to the target vector; and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
Optionally, the apparatus further includes a training module, where the training module is configured to train and generate the feature vector index table, and the training module includes:
the acquisition sample submodule is configured to acquire a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data;
a clustering submodule configured to cluster a plurality of training vectors in the training samples;
the determining submodule is configured to determine at least two feature vectors according to a clustering result and determine a computing node corresponding to each feature vector;
and the generation submodule is configured to generate a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
Optionally, the determining submodule is configured to determine a computing node corresponding to each feature vector according to the number of computing nodes in the massively parallel processing database.
The apparatus for massively parallel processing of a database provided in an embodiment of this specification obtains, in the massively parallel processing database, the feature vector with the smallest difference value from the vector to be written by the approximate nearest neighbor retrieval method, and from it determines the write computing node corresponding to the vector to be written. The pre-generated feature vector index table performs a partition cutting function for the massively parallel processing database: it determines the corresponding write computing node for each vector to be written and writes vectors with small difference values into the same computing node, which facilitates subsequent query. When retrieval is performed in the massively parallel processing database, the retrieval computing nodes that need to be searched are determined through the preset feature vector index table, retrieval is performed in those nodes, and the target vector corresponding to the vector to be retrieved is determined from the at least one approximate vector obtained, which guarantees the accuracy of the target vector. The method provided in this specification does not require searching for the vector to be retrieved in all computing nodes, so while retrieval accuracy is guaranteed, the time of each retrieval is greatly reduced, useless computing consumption is greatly reduced, retrieval capability is improved, and user experience is improved.
The above is an illustrative scheme of an apparatus for massively parallel processing of a database according to the present embodiment. It should be noted that the technical solution of the apparatus for massively parallel processing a database belongs to the same concept as the technical solution of the method for massively parallel processing a database described above, and details of the technical solution of the apparatus for massively parallel processing a database, which are not described in detail, can be referred to the description of the technical solution of the method for massively parallel processing a database described above.
FIG. 7 illustrates a block diagram of a computing device 700 provided in accordance with one embodiment of the present description. The components of the computing device 700 include, but are not limited to, memory 710 and a processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.
Computing device 700 also includes an access device 740 that enables computing device 700 to communicate via one or more networks 760. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 740 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 7 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.
Wherein the memory 710 is configured to store computer-executable instructions and the processor 720 is configured to execute the following computer-executable instructions:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
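The retrieval steps above can be sketched as follows. Assumptions not stated in the text: the "difference value" is squared Euclidean distance, `index_table` maps a feature vector (as a tuple) to its compute node id, and `node_data` maps a node id to the list of vectors stored on that node; all names are hypothetical and this is not the claimed implementation.

```python
def squared_distance(a, b):
    """Assumed 'difference value': squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def retrieve(query, index_table, node_data, threshold, top_k=1):
    # Step 1: feature vectors whose difference value from the query is below
    # the preset threshold select the retrieval compute nodes to visit.
    nodes = {node for fv, node in index_table.items()
             if squared_distance(query, fv) < threshold}
    # Step 2: search only those nodes for approximate vectors,
    # instead of scanning every compute node.
    candidates = [v for n in nodes for v in node_data.get(n, [])]
    # Step 3: the target vector is the candidate with the smallest
    # difference value from the query.
    return sorted(candidates, key=lambda v: squared_distance(query, v))[:top_k]
```

Pruning to the matching nodes in step 1 is what avoids retrieving the vector to be retrieved in all computing nodes while keeping the nearest candidates reachable.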
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the method for massively parallel processing of the database described above belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the method for massively parallel processing of the database described above.
FIG. 8 shows a block diagram of a computing device 800 according to an embodiment of the present application. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes access device 840, access device 840 enabling computing device 800 to communicate via one or more networks 860.
The access mode of the computing device 800 is the same as that of the computing device 700, and is not described herein again.
Wherein, the processor 820 is configured to execute the following computer-executable instructions:
receiving a write-in instruction, wherein the write-in instruction comprises a vector to be written corresponding to project data to be written;
determining a write feature vector corresponding to the vector to be written in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a write computation node corresponding to the write feature vector, wherein the feature vector index table records a plurality of feature vectors and a computation node corresponding to each feature vector;
and writing the vector to be written into the write computation node.
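The write path above can be sketched as follows, under the same assumptions used throughout: the "difference value" is squared Euclidean distance, `index_table` maps a feature vector (tuple) to a node id, and `node_data` maps a node id to the vectors stored there. These names are hypothetical illustrations, not the patented implementation.

```python
def squared_distance(a, b):
    """Assumed 'difference value': squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def write_vector(vector, index_table, node_data):
    # The write feature vector is the one with the minimum difference value
    # from the vector to be written...
    _, node = min(index_table.items(),
                  key=lambda item: squared_distance(vector, item[0]))
    # ...and the vector is written to that feature vector's compute node,
    # so similar vectors accumulate on the same node.
    node_data.setdefault(node, []).append(vector)
    return node
```

Routing writes this way is what makes later partition pruning valid: a query near a given feature vector only needs to visit the node that received vectors near that feature vector.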
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the method for massively parallel processing of the database described above belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the method for massively parallel processing of the database described above.
An embodiment of the present specification also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method for massively parallel processing of a database described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the method for massively parallel processing of the database, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the method for massively parallel processing of the database.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content encompassed by the computer-readable medium may be expanded or restricted as required by legislation and patent practice in each jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity, the foregoing method embodiments are described as a series of acts; however, those skilled in the art will understand that the embodiments are not limited by the order of the acts described, because some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art will also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules referred to are not necessarily required by every embodiment.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in describing the specification. Alternative embodiments are not described exhaustively, and the specification is not limited to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to best understand and utilize them. The specification is limited only by the claims and their full scope and equivalents.

Claims (21)

1. A method for massively parallel processing of a database, comprising:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
2. The method for massively parallel processing of a database as claimed in claim 1, wherein determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method comprises:
sequentially calculating the difference value between the vector to be retrieved and each feature vector in the feature vector index table;
and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
3. The method for massively parallel processing of a database as claimed in claim 1, wherein determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors comprises:
calculating the difference value between each approximate vector and the vector to be retrieved;
and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
4. The method for massively parallel processing of a database as claimed in claim 1, further comprising, after determining a target vector corresponding to said vector to be retrieved from at least one of said approximation vectors:
acquiring attribute information corresponding to the target vector according to the target vector;
and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
5. The method for massively parallel processing of a database as claimed in claim 1, further comprising:
receiving a deleting instruction;
executing the delete instruction in each compute node of the massively parallel processing database.
6. The method for massively parallel processing of a database as claimed in claim 1, wherein the feature vector index table is trained by steps comprising:
acquiring a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data;
clustering a plurality of training vectors in the training samples;
determining at least two feature vectors according to the clustering result, and determining a computing node corresponding to each feature vector;
and generating a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
7. The method for massively parallel processing of a database as claimed in claim 6, wherein determining a computing node corresponding to each feature vector comprises:
and determining the computing node corresponding to each feature vector according to the number of the computing nodes in the massive parallel processing database.
8. A method for massively parallel processing of a database, comprising:
receiving a write-in instruction, wherein the write-in instruction comprises a vector to be written corresponding to project data to be written;
determining a write feature vector corresponding to the vector to be written in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a write computation node corresponding to the write feature vector, wherein the feature vector index table records a plurality of feature vectors and a computation node corresponding to each feature vector;
and writing the vector to be written into the write computation node.
9. The method for massively parallel processing of a database as claimed in claim 8, wherein determining a write feature vector corresponding to the vector to be written in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method comprises:
sequentially calculating the difference value between the vector to be written and each feature vector in the feature vector index table;
and determining the feature vector with the minimum difference value as the write feature vector corresponding to the vector to be written.
10. The method for massively parallel processing of a database as claimed in claim 8, further comprising:
receiving a deleting instruction;
executing the delete instruction in each compute node of the massively parallel processing database.
11. The method for massively parallel processing of a database as claimed in claim 8 or 10, further comprising:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval feature vector corresponding to the vector to be retrieved in the feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
12. The method for massively parallel processing of a database as claimed in claim 11, wherein determining at least one retrieval feature vector corresponding to the vector to be retrieved in the feature vector index table by using an approximate nearest neighbor retrieval method comprises:
sequentially calculating the difference value between the vector to be retrieved and each feature vector in the feature vector index table;
and determining the feature vector of which the difference value is smaller than a preset threshold value as a retrieval feature vector corresponding to the vector to be retrieved.
13. The method for massively parallel processing of a database as claimed in claim 11, wherein determining a target vector corresponding to the vector to be retrieved from at least one of the approximate vectors comprises:
calculating the difference value between each approximate vector and the vector to be retrieved;
and selecting the approximate vector with the minimum difference value as a target vector corresponding to the vector to be retrieved.
14. The method for massively parallel processing of a database as claimed in claim 11, after determining a target vector corresponding to said vector to be retrieved from at least one of said approximation vectors, further comprising:
acquiring attribute information corresponding to the target vector according to the target vector;
and taking the attribute information corresponding to the target vector as a retrieval result corresponding to the item data to be retrieved.
15. The method for massively parallel processing of a database as claimed in claim 8, wherein the feature vector index table is trained by steps comprising:
acquiring a training sample, wherein the training sample comprises a plurality of training vectors corresponding to the project data;
clustering a plurality of training vectors in the training samples;
determining at least two feature vectors according to the clustering result, and determining a computing node corresponding to each feature vector;
and generating a feature vector index table according to each feature vector and the computing node corresponding to each feature vector.
16. The method for massively parallel processing of a database as claimed in claim 15, wherein determining a computing node corresponding to each feature vector comprises:
and determining the computing node corresponding to each feature vector according to the number of the computing nodes in the massive parallel processing database.
17. An apparatus for massively parallel processing of a database, comprising:
a first receiving module configured to receive a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to item data to be retrieved;
a first node determining module configured to determine at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and to acquire a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector;
a first retrieval vector module configured to retrieve an approximate vector corresponding to the vector to be retrieved in each of the retrieval computing nodes;
and the first vector determining module is configured to determine a target vector corresponding to the vector to be retrieved from at least one approximate vector.
18. An apparatus for massively parallel processing of a database, comprising:
the third receiving module is configured to receive a write instruction, wherein the write instruction comprises a vector to be written corresponding to the item data to be written;
a second determining node module configured to determine, by using an approximate nearest neighbor search method, a writing feature vector corresponding to the to-be-written vector in a pre-generated feature vector index table, and acquire a writing computation node corresponding to the writing feature vector, where the feature vector index table records a plurality of feature vectors and a computation node corresponding to each feature vector;
a write module configured to write the vector to be written to the write compute node.
19. A computing device, comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
receiving a retrieval instruction, wherein the retrieval instruction comprises a vector to be retrieved corresponding to the item data to be retrieved;
determining at least one retrieval feature vector corresponding to the vector to be retrieved in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a retrieval computing node corresponding to each retrieval feature vector, wherein the feature vector index table records a plurality of feature vectors and a computing node corresponding to each feature vector;
retrieving an approximate vector corresponding to the vector to be retrieved in each retrieval computing node;
and determining a target vector corresponding to the vector to be retrieved from at least one approximate vector.
20. A computing device, comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
receiving a write-in instruction, wherein the write-in instruction comprises a vector to be written corresponding to project data to be written;
determining a write feature vector corresponding to the vector to be written in a pre-generated feature vector index table by using an approximate nearest neighbor retrieval method, and acquiring a write computation node corresponding to the write feature vector, wherein the feature vector index table records a plurality of feature vectors and a computation node corresponding to each feature vector;
and writing the vector to be written into the write computation node.
21. A computer readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the method for massively parallel processing of a database as claimed in any of claims 1-7 or 8-16.
CN202010281397.1A 2020-04-10 2020-04-10 Method and device for massively parallel processing of database Pending CN113297264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010281397.1A CN113297264A (en) 2020-04-10 2020-04-10 Method and device for massively parallel processing of database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010281397.1A CN113297264A (en) 2020-04-10 2020-04-10 Method and device for massively parallel processing of database

Publications (1)

Publication Number Publication Date
CN113297264A true CN113297264A (en) 2021-08-24

Family

ID=77317934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010281397.1A Pending CN113297264A (en) 2020-04-10 2020-04-10 Method and device for massively parallel processing of database

Country Status (1)

Country Link
CN (1) CN113297264A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023030184A1 (en) * 2021-08-31 2023-03-09 华为技术有限公司 Data retrieval method and related device


Similar Documents

Publication Publication Date Title
CN105912611B (en) A kind of fast image retrieval method based on CNN
JP2018527656A (en) Method and device for comparing similarity of high-dimensional features of images
US10482146B2 (en) Systems and methods for automatic customization of content filtering
WO2022142027A1 (en) Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
JP2013519138A (en) Join embedding for item association
CN110795527B (en) Candidate entity ordering method, training method and related device
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
CN111125469B (en) User clustering method and device of social network and computer equipment
CN111859004A (en) Retrieval image acquisition method, device, equipment and readable storage medium
CN106570173B (en) Spark-based high-dimensional sparse text data clustering method
Xu et al. Discriminative analysis for symmetric positive definite matrices on lie groups
CN113297269A (en) Data query method and device
CN113128557A (en) News text classification method, system and medium based on capsule network fusion model
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN113297264A (en) Method and device for massively parallel processing of database
CN109977286A (en) Content-based information retrieval method
CN110209895B (en) Vector retrieval method, device and equipment
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
WO2021012691A1 (en) Method and device for image retrieval
CN114510564A (en) Video knowledge graph generation method and device
Dhoot et al. Efficient Dimensionality Reduction for Big Data Using Clustering Technique
Gao et al. Efficient view-based 3-D object retrieval via hypergraph learning
CN112115281A (en) Data retrieval method, device and storage medium
Gao et al. Efficient neural network compression inspired by compressive sensing
CN114168589A (en) Index construction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40058634

Country of ref document: HK