CN110046062B - Distributed data processing method and system - Google Patents

Distributed data processing method and system

Info

Publication number
CN110046062B
CN110046062B (application CN201910173569.0A)
Authority
CN
China
Prior art keywords
data
data node
node
retrieval
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910173569.0A
Other languages
Chinese (zh)
Other versions
CN110046062A (en)
Inventor
郑轩
贾志忠
曲家朋
段和枫
Current Assignee
PCI Technology Group Co Ltd
Original Assignee
PCI Suntek Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by PCI Suntek Technology Co Ltd filed Critical PCI Suntek Technology Co Ltd
Priority to CN201910173569.0A priority Critical patent/CN110046062B/en
Publication of CN110046062A publication Critical patent/CN110046062A/en
Application granted granted Critical
Publication of CN110046062B publication Critical patent/CN110046062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1479Generic software techniques for error detection or fault masking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a distributed data processing method and system, comprising: a metadata node receives a retrieval instruction sent by a client, the metadata node belonging to a metadata cluster; the metadata node feeds back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node. Each group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all the data node groups form a feature index server cluster, each data node group has one first target data node, and data within a data node group is synchronized. The first target data node receives the retrieval request and obtains the feature index carried in it; it then determines retrieval data according to the feature index and feeds the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data. This realizes fast retrieval without storing metadata at the client.

Description

Distributed data processing method and system
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a distributed data processing method and system.
Background
With the development of computer technology, many fields increasingly depend on it. For example, in the field of intelligent security, various electronic devices are typically networked to form a security system, in which cameras are an indispensable component. To facilitate subsequent evidence collection, pictures collected by a camera are usually stored by processor-level electronic equipment in the security system, while the corresponding metadata is stored on the client used by the user, so that the user can search through the metadata. In this arrangement, the stored data only achieves disk-level fault tolerance: when the client is abnormal or damaged, the user can no longer retrieve the stored data, which may cause irreparable loss to the user.
Disclosure of Invention
The invention provides a distributed data processing method and a distributed data processing system, which realize rapid retrieval without the client storing metadata.
In a first aspect, an embodiment of the present invention provides a distributed data processing method, including:
a metadata node receives a retrieval instruction sent by a client, wherein the metadata node belongs to a metadata cluster;
the metadata node feeds back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node, wherein each group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all the data node groups form a feature index server cluster, each data node group has one data node serving as the first target data node, and data within a data node group is synchronized;
the first target data node receives a retrieval request and acquires a feature index in the retrieval request;
and the first target data node determines retrieval data according to the feature index and feeds the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data.
Further, the determining, by the first target data node, of the retrieval data according to the feature index includes:
the first target data node determines whether retrieval data corresponding to the feature index exists in its own cache manager;
if yes, acquiring the retrieval data;
if not, searching for the retrieval data corresponding to the feature index in its own storage manager.
Further, the feature index includes feature data obtained by performing deep learning on the data to be retrieved.
Further, the retrieval data comprises a data unique identifier and/or data location information.
Further, the data node group comprises a master data node and a plurality of slave data nodes, and the master data node is used for controlling data synchronization of the data node group.
Further, the method also comprises the following steps:
the metadata node receives a write-in instruction sent by the client;
the metadata node determines a second target data node in all the data nodes according to the write-in instruction;
the metadata node feeds back the metadata of the second target data node to the client, so that the client sends a write request to the second target data node according to the metadata;
the second target data node receives the write request;
and the second target data node performs write response according to the write request.
Further, the second target data node is a master data node in the data node group,
the second target data node performing a write response according to the write request includes:
the second target data node performs a write operation in its own storage manager according to the write request and creates a feature index;
the second target data node updates the data written in the storage manager and the feature index to its own cache manager;
and the second target data node feeds back write response information to the client.
Further, the second target data node is a slave data node in the data node group,
the second target data node performing a write response according to the write request includes:
the second target data node sends the write request to a third target data node, wherein the third target data node is the master data node in the data node group to which the second target data node belongs;
the third target data node performs a write operation in its own storage manager according to the write request and creates a feature index;
the third target data node updates the data written in the storage manager and the feature index to its own cache manager;
the third target data node feeds back write-in response information to the second target data node;
and the second target data node feeds back the write response information to the client.
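The write-response flow above (a slave forwarding the write to its group's master, which writes to its storage manager, records the feature index, updates its cache manager, and responds) can be sketched as follows. This is a minimal illustration only; the class, method and field names are assumptions for the sketch and are not taken from the patent.

```python
class DataNode:
    """Minimal sketch of the master/slave write-response flow."""

    def __init__(self, is_master, master=None):
        self.is_master = is_master
        self.master = master   # for a slave node, its group's master node
        self.storage = {}      # stands in for the storage manager
        self.cache = {}        # stands in for the cache manager

    def handle_write(self, key, data, feature_index):
        if not self.is_master:
            # A slave forwards the write request to the master of its group
            # and relays the master's write-response back to the caller.
            return self.master.handle_write(key, data, feature_index)
        # The master writes in its own storage manager, creates the feature
        # index, then updates the written data and index to its cache manager.
        self.storage[key] = (data, feature_index)
        self.cache[key] = (data, feature_index)
        return {"status": "ok", "key": key}

master = DataNode(is_master=True)
slave = DataNode(is_master=False, master=master)
response = slave.handle_write("pic-001", b"jpeg-bytes", "feat-abc")
```

A client that happens to contact a slave still receives a normal write response; only the master's storage and cache are updated here, with intra-group synchronization left out of the sketch.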
In a second aspect, an embodiment of the present invention further provides a distributed data processing system, including: the system comprises a metadata cluster and a characteristic index server cluster, wherein the metadata cluster comprises at least one metadata node, the characteristic index server cluster comprises a plurality of data node groups, each data node group comprises a plurality of data nodes, and data in the data node groups are synchronized;
the metadata nodes are used for receiving retrieval instructions sent by a client and feeding back all stored metadata to the client according to the retrieval instructions so that the client can select a first target data node according to the metadata and send a retrieval request to the first target data node, a group of metadata corresponds to one data node, and one data node exists in each data node group and serves as the first target data node;
the first target data node is used for receiving a retrieval request and acquiring a feature index in the retrieval request; and is further used for determining retrieval data according to the feature index and feeding the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data.
Further, the hash value of each data node in the data node group is within a set range.
According to the distributed data processing method and system, a metadata node in the metadata cluster receives the retrieval instruction sent by the client and feeds back the metadata of the data nodes of all data node groups in the feature index server cluster to the client, so that the client selects a first target data node in each data node group according to the metadata and sends a retrieval request to each first target data node. Each first target data node determines a feature index according to the retrieval request, retrieves the corresponding retrieval data according to the feature index, and feeds the retrieval data back to the client, so that the client can determine the retrieval result. Meanwhile, node-level fault tolerance is ensured by setting up data node groups whose data is synchronized within the group: even if any data node in a data node group fails, the other data nodes can still provide the retrieval service, ensuring the safety and stability of the data.
Drawings
Fig. 1 is a flowchart of a distributed data processing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of data flow during client retrieval according to an embodiment of the present invention;
FIG. 3 is a flow chart of another distributed data processing method according to an embodiment of the present invention;
FIG. 4 is a data flow diagram illustrating a first target data node processing a search request according to an embodiment of the present invention;
FIG. 5 is a flowchart of another distributed data processing method according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating data flows among clusters when a client writes data according to an embodiment of the present invention;
fig. 7 is a data flow diagram illustrating a write response performed by a master data node according to an embodiment of the present invention;
FIG. 8 is a block diagram of a distributed data processing system according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a distributed data processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a metadata node according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a data node according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration and not limitation. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a flowchart of a distributed data processing method according to an embodiment of the present invention. The distributed data processing method is suitable for the condition of searching the data stored in a distributed mode through the characteristic indexes. The distributed data processing method may be performed by a distributed data processing system.
The distributed data processing system comprises a metadata cluster and a feature index server cluster. The metadata cluster includes a plurality of metadata nodes, each of which may be considered a server. The metadata nodes store the metadata of each data node in the feature index server cluster, and each metadata node stores the metadata of all the data nodes. The feature index server cluster includes a plurality of data node groups, each of which includes a plurality of data nodes. Each data node stores data written by a user and a feature index generated based on that data. Data is synchronized among the data nodes in each data node group. As a result, a user can obtain the required data by accessing just one of the data nodes, and when any data node in a group is abnormal, the user can still retrieve the required data from the group. Furthermore, the way data nodes are grouped can be set according to the actual situation. For example, data nodes in the same geographical area may be divided into the same data node group. As another example, a consistent hashing algorithm may be applied to set data of the data nodes, such as their device feature codes, and data nodes whose hash values fall within the same set range are divided into the same data node group.
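The hash-based grouping described above can be sketched as follows. This is a simplified hash-bucket illustration (a full consistent-hashing implementation would place nodes on a hash ring); the node names and the use of MD5 are assumptions for the sketch, not part of the embodiment.

```python
import hashlib

def group_for_node(device_feature_code: str, num_groups: int) -> int:
    """Map a data node to a group index by hashing its device feature code.

    Nodes whose hash values fall in the same set range (here, the same
    modulo bucket) are divided into the same data node group.
    """
    digest = hashlib.md5(device_feature_code.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups

# Hypothetical node identifiers; nodes hashing into the same bucket
# end up in the same data node group.
nodes = ["cam-node-01", "cam-node-02", "cam-node-03", "cam-node-04"]
groups = {}
for node in nodes:
    groups.setdefault(group_for_node(node, 2), []).append(node)
```

Because the hash depends only on the device feature code, every component of the system computes the same grouping without coordination.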
Specifically, referring to fig. 1, the distributed data processing method specifically includes:
and 110, receiving a retrieval instruction sent by a client by a metadata node, wherein the metadata node belongs to a metadata cluster.
Illustratively, the client is an intelligent electronic device used by a user, which can access the metadata nodes in the metadata cluster and the data nodes in the feature index server cluster. Optionally, the client may include at least one user-held device such as a mobile phone, tablet computer, notebook computer, or desktop computer. Further, when the user has a retrieval requirement, a retrieval instruction can be issued through the client. The embodiment does not limit the process by which the client generates the retrieval instruction.
Specifically, after generating the retrieval instruction, the client sends it to the metadata node. The embodiment does not limit the communication method adopted between the client and the metadata node. Optionally, the metadata node may be a set node in the metadata cluster, or any node in the metadata cluster. For example, a client range corresponding to each metadata node may be preset; in this case, clients in the range communicate with the set metadata node. As another example, the client may send the retrieval instruction to any metadata node, or send it to the metadata nodes in a broadcast manner.
Further, the retrieval instruction is an instruction for prompting data retrieval. The embodiment does not limit the content included in the retrieval instruction. For example, the retrieval instruction may include the identity information of the client and a set retrieval code. The metadata node determines that a retrieval instruction has been received through the retrieval code and confirms the client through the identity information.
Step 120, the metadata node feeds back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node.
Each group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all the data node groups form the feature index server cluster, each data node group has one data node serving as a first target data node, and data within a data node group is synchronized.
Specifically, after the set metadata node receives the retrieval instruction, it feeds back all metadata stored inside it to the client. The metadata node stores the metadata of each data node in the feature index server cluster. The metadata describes attribute information of a data node; in the embodiment, the metadata at least includes the location information and group membership of the data node, so that the client can determine the location of each data node through the location information and determine the data node group in which each data node resides through the group membership, and thereby access the data nodes. Generally, when a data node starts up, it can report its own metadata to the metadata nodes. The data node may send the metadata to all metadata nodes so that each metadata node receives it; alternatively, the data node reports the metadata to any metadata node or a set metadata node, and after receiving the metadata, that metadata node synchronizes it within the metadata cluster to ensure that every metadata node acquires it. The embodiment does not limit how the set metadata node is determined.
For example, after receiving the metadata, the client determines the data node group to which each data node belongs based on the metadata. Generally, each data node group includes a plurality of data nodes; data within a data node group is synchronized, while data between different data node groups may be the same or different. Further, after the data node groups are determined, the client selects one data node in each data node group as a first target data node. Because data is synchronized among the data nodes in the same group, their data storage is identical, so the client may select any one data node in each group as the first target data node. It is understood that, in practical applications, the client may also select a set data node from the group according to communication conditions and other constraints. Typically, each data node group contributes one first target data node, which ensures that the client retrieves across all data stored in the feature index server cluster.
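The per-group selection above can be sketched as follows. The metadata field names (`node`, `group`, `location`) are assumptions for illustration; the embodiment only requires that the metadata carry location information and group membership.

```python
def pick_first_targets(metadata):
    """Select one data node per group as the first target data node.

    Since data within a group is synchronized, any member of a group
    can serve; here the first node listed for each group is taken.
    """
    targets = {}
    for entry in metadata:
        targets.setdefault(entry["group"], entry["node"])
    return list(targets.values())

# Hypothetical metadata as fed back by a metadata node.
metadata = [
    {"node": "10.0.0.1", "group": "g1", "location": "dc-a"},
    {"node": "10.0.0.2", "group": "g1", "location": "dc-b"},
    {"node": "10.0.0.3", "group": "g2", "location": "dc-a"},
]
first_targets = pick_first_targets(metadata)
```

One retrieval request would then be sent to each node in `first_targets`, covering every data node group exactly once.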
Further, after selecting the first target data nodes, the client sends a retrieval request to each first target data node according to the corresponding metadata. The embodiment does not limit how the retrieval request is generated. Generally, the retrieval request at least includes a feature index. Optionally, the feature index includes feature data obtained by performing deep learning on the data to be retrieved; the feature data can be understood as the key of the feature index. The data format of the data to be retrieved is not limited; it may be image data, text data and/or audio data, and different data correspond to different feature data. The feature data can be obtained through deep learning by the corresponding data node, or by another data system. In the embodiment, the case where the feature data is obtained through deep learning by another data system is taken as an example; in this case, the hardware device hosting the other data system and the data node are different devices, and the other data system can communicate with both the data node and the client. Generally, when a data node or another storage node stores data, the other data system performs deep learning on the stored data to obtain the corresponding feature data. The specific deep learning approach can be set according to the actual situation; the feature data is a character string of a certain length, and the data node has no authority to modify it. After the data node acquires the feature data, it stores the feature data.
Optionally, the client obtains and stores the feature data. In that case, to retrieve certain data the client only needs to input the corresponding feature data. For example, to retrieve a picture containing object A, the client only needs to generate a feature index based on the feature data of object A, and the retrieval can proceed; that is, the picture of object A is the data to be retrieved. Optionally, the client does not obtain the feature data. In that case, to retrieve certain data the client inputs reference data as the data to be retrieved, and the feature data generated by the other data system is fed back to the client so that the client can generate the feature index. For example, when the client wants to search for a picture containing object B, it may use a certain picture containing object B as the data to be retrieved, have the other data system generate the feature data, and then generate a feature index based on the feature data of object B.
Step 130, the first target data node receives the retrieval request and obtains the feature index in the retrieval request.
For example, the processing procedure of each first target data node for the retrieval request is the same; in the embodiment, one first target data node is taken as an example for description. Specifically, after receiving the retrieval request, the first target data node parses the retrieval request to obtain the feature index in it. Alternatively, when a user needs to retrieve several pieces of data, several feature indexes may be written in the retrieval request.
Step 140, the first target data node determines retrieval data according to the feature index and feeds the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data.
Specifically, the retrieval data is data obtained according to the feature index, and its specific content can be set according to the actual situation. In the embodiment, the retrieval data includes a data unique identifier and/or data location information, and optionally includes a data writing time and/or a feature index similarity. The data writing time is the time at which the data was stored. The data unique identifier is the unique identity of the data; the embodiment does not limit the rule by which it is generated. The data location information is the location at which the data is stored. The data writing time, data location information and data unique identifier can be recorded as characteristic information of the data. The feature index similarity is the similarity between the feature data in the feature index and the feature data found in the data node. In general, the higher the similarity, the more alike the two pieces of feature data are, and the more accurate the retrieved data is.
Generally, when a data node or another storage node stores data, it synchronously records, in addition to the feature data, the data location information, data unique identifier, data writing time and so on, and establishes the association among them. When another storage node stores data, it reports this information to the data node after storage completes, and the data node synchronizes it within its group. After the first target data node receives the feature index, it obtains the feature data in the feature index and computes the similarity between each piece of feature data it has stored and the feature data in the feature index. It then takes the feature data whose similarity is higher than a set similarity, selects from these the single feature data with the highest similarity or a set number of feature data, obtains the data location information, data unique identifier and other information associated with them, and generates the corresponding retrieval data. In this case, the retrieval data corresponds to the value of the feature index. It is understood that the set similarity and the way similarity is calculated can be set according to the actual situation.
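The similarity-based matching above can be sketched as follows, assuming feature data is represented as numeric vectors and cosine similarity is the measure; the embodiment deliberately leaves the similarity calculation open, so both assumptions are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_feature, stored, threshold=0.9):
    """Return records above the set similarity, best match first.

    `stored` maps a data unique identifier to a
    (feature_vector, location_info) pair, mirroring the association the
    node records between feature data and characteristic information.
    """
    hits = []
    for uid, (feat, location) in stored.items():
        sim = cosine_similarity(query_feature, feat)
        if sim >= threshold:
            hits.append({"uid": uid, "location": location, "similarity": sim})
    return sorted(hits, key=lambda h: h["similarity"], reverse=True)

stored = {"img-1": ([1.0, 0.0], "/data/a"), "img-2": ([0.0, 1.0], "/data/b")}
hits = retrieve([1.0, 0.1], stored)
```

The returned records correspond to the retrieval data: identifier, location, and the feature index similarity used to rank them.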
Optionally, during retrieval the first target data node first calculates the similarity between each piece of feature data in its own cache manager and the feature data in the feature index, that is, it retrieves in the cache manager; if feature data above the set similarity exists in the cache manager, the retrieval is confirmed as successful and the retrieval data is generated. Otherwise, the retrieval is performed in its own storage manager, where the retrieval process is the same as in the cache manager and is not repeated here. Further, if that retrieval succeeds, the retrieval data is generated; otherwise the retrieval is confirmed as failed. Illustratively, if retrieval data is generated it is fed back to the client; otherwise the client is notified of the retrieval failure, so that the client can decide whether to regenerate the retrieval request.
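The cache-first, storage-second order can be sketched generically; here `matcher` stands in for whichever similarity search the node applies to a tier, and all names are assumptions for illustration.

```python
def lookup(feature_index, cache, storage, matcher):
    """Try the cache manager first; fall back to the storage manager.

    `matcher` returns the retrieval data found in a tier, or None if
    nothing meets the set similarity there.
    """
    hit = matcher(feature_index, cache)
    if hit is not None:
        return hit           # retrieval succeeded in the cache manager
    return matcher(feature_index, storage)  # may still be None: failure

# A toy matcher: exact key lookup stands in for similarity search.
cache = {"feat-a": "record-a"}
storage = {"feat-a": "record-a", "feat-b": "record-b"}
toy_matcher = lambda key, tier: tier.get(key)
```

With this ordering, recently written data is answered from the cache, and only misses pay the cost of the larger storage manager.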
Optionally, when obtaining the retrieval data, the first target data node performs asynchronous retrieval and merges the results after all asynchronous retrievals complete, thereby obtaining the retrieval data. Asynchronous retrieval ensures the loading speed of the retrieval data when there is a large amount of content to retrieve.
Further, after receiving the retrieval data, the client determines the retrieval result according to it. Once the client confirms that the retrieval data returned by all first target data nodes has been received, it merges the retrieval data to obtain the final retrieval data. The client then displays the retrieval data so that the user can view it, and further obtains the retrieval result according to the retrieval data. For example, the client accesses the data location information in the retrieval data and uses the accessed data as the retrieval result; or it obtains the required data according to the data unique identifier and uses that data as the retrieval result. Optionally, after receiving the retrieval data, the client verifies it and obtains the retrieval result only after verification succeeds; if verification fails, the retrieval is confirmed as failed. The verification mode can be set according to the actual situation.
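The client-side merging step can be sketched as follows, assuming each first target data node returns a list of records carrying the data unique identifier (`uid`) and a feature index similarity. The de-duplication rule shown (keep the highest-similarity copy of each identifier) is an assumption, since the embodiment does not specify the merge rule.

```python
def merge_results(per_node_results):
    """Merge retrieval data from every first target data node.

    De-duplicates on the data unique identifier, keeping the record
    with the highest feature index similarity, then ranks the result.
    """
    merged = {}
    for results in per_node_results:
        for record in results:
            best = merged.get(record["uid"])
            if best is None or record["similarity"] > best["similarity"]:
                merged[record["uid"]] = record
    return sorted(merged.values(), key=lambda r: r["similarity"], reverse=True)

# Hypothetical responses from two first target data nodes.
node_a = [{"uid": "img-1", "similarity": 0.95}, {"uid": "img-2", "similarity": 0.90}]
node_b = [{"uid": "img-1", "similarity": 0.97}]
final = merge_results([node_a, node_b])
```

The merged, ranked list is what the client would display and resolve into the retrieval result via the data location information or unique identifier.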
The following describes the technical solution provided in this embodiment by way of example; fig. 2 is a schematic diagram of the data flow when a client performs retrieval according to an embodiment of the present invention. Referring to fig. 2, the distributed data processing method specifically includes: the client sends a retrieval instruction to the metadata cluster, so that a certain metadata node in the metadata cluster identifies the retrieval instruction and feeds back the metadata of the feature index server cluster, that is, the metadata of all data nodes, to the client. The client then selects a first target data node in each data node group according to the metadata and sends a retrieval request to each first target data node; each first target data node feeds retrieval data back to the client according to the retrieval request. After confirming that all first target data nodes have fed back retrieval data, the client confirms that they have finished responding, merges the retrieval data, and completes the retrieval. The client can then obtain the retrieval result according to the retrieval data. Since the processing rules of the first target data nodes are the same, fig. 2 shows only one first target data node as an example.
In this way, a metadata node in the metadata cluster receives the retrieval instruction sent by the client and feeds back the metadata of the data nodes of all data node groups in the feature index server cluster to the client, so that the client selects a first target data node in each data node group according to the metadata and sends a retrieval request to each first target data node. Each first target data node determines the feature index according to the retrieval request, retrieves the corresponding retrieval data according to the feature index, and feeds the retrieval data back to the client, so that the client can confirm the retrieval result. Meanwhile, node-level fault tolerance is ensured by setting up data node groups whose data is synchronized within the group: even if any data node in a group fails, the other data nodes can still provide the retrieval service, ensuring the safety and stability of the data.
Fig. 3 is a flowchart of another distributed data processing method according to an embodiment of the present invention. The present embodiment is embodied on the basis of the above-described embodiments. Specifically, referring to fig. 3, the distributed data processing method includes:
step 210, a metadata node receives a retrieval instruction sent by a client, and the metadata node belongs to a metadata cluster.
Step 220, the metadata node feeds back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node.
Each group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all the data node groups form the feature index server cluster, each data node group has one data node serving as a first target data node, and data within a data node group is synchronized.
Step 230, the first target data node receives the retrieval request and obtains the feature index in the retrieval request.
In particular, the data node includes a service layer and a retrieval manager. The service layer is configured to communicate with the client, for example, the service layer is configured to receive a retrieval request from the client and send the retrieval request to the retrieval manager. Further, the retrieval manager is used to implement the retrieval function, for example, after receiving the retrieval request, the retrieval manager parses the retrieval request and obtains the feature index. In an embodiment, the set feature index includes feature data.
Step 240, the first target data node determines whether retrieval data corresponding to the feature index exists in its own cache manager. If so, go to step 250; if not, go to step 260.
Specifically, the data node further includes a cache manager with a caching function. Generally, the cache manager stores data written within a set duration, where the written data includes feature data and feature information having an association relationship. The set duration can be configured according to actual conditions, for example, 90 days. Optionally, there may be one or more cache managers. For example, after the retrieval manager determines the feature index, it accesses the corresponding cache manager(s) according to the feature data in the feature index to determine whether feature data satisfying the set similarity exists there.
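As a minimal sketch of the similarity check described above, assuming the feature data are numeric vectors compared by cosine similarity (the patent does not fix a similarity measure, and the class and method names are illustrative):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class CacheManager:
    """Holds recently written (feature data, feature info) pairs."""
    def __init__(self):
        self.entries = []  # list of (feature_vector, feature_info)

    def write(self, vector, info):
        self.entries.append((vector, info))

    def lookup(self, query, min_similarity):
        """Return the info of entries meeting the set similarity, or None."""
        hits = [info for vec, info in self.entries
                if cosine_similarity(vec, query) >= min_similarity]
        return hits or None

cache = CacheManager()
cache.write([1.0, 0.0], {"id": "face-001"})
print(cache.lookup([0.9, 0.1], 0.9))   # nearby vector: hit
print(cache.lookup([0.0, 1.0], 0.9))   # orthogonal vector: None
```
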
Step 250, acquire the retrieval data and feed it back to the client so that the client determines a retrieval result according to the retrieval data.
Specifically, after the cache manager acquires the retrieval data, it sends the retrieval data to the retrieval manager. The retrieval manager may receive retrieval data fed back by the cache manager and/or the storage manager; these lookups are performed as asynchronous data retrievals, and after all asynchronous retrievals complete, the results are merged to obtain the final retrieval data. The retrieval data is then fed back to the service layer, which feeds it back to the client.
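The asynchronous lookups and final merge could be sketched as follows. This is a toy model using Python's `asyncio` with placeholder cache and storage lookups, not the patent's actual implementation:

```python
import asyncio

async def query_cache(index):
    await asyncio.sleep(0)          # stands in for a cache manager lookup
    return [f"cache-hit:{index}"]

async def query_storage(index):
    await asyncio.sleep(0)          # stands in for a storage manager lookup
    return [f"storage-hit:{index}"]

async def retrieve(indexes):
    """Issue one asynchronous lookup per feature index, then merge results."""
    tasks = [query_cache(i) for i in indexes] + \
            [query_storage(i) for i in indexes]
    partials = await asyncio.gather(*tasks)   # wait for all lookups
    return [hit for part in partials for hit in part]

print(asyncio.run(retrieve(["f1", "f2"])))
```
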
Step 260, search for the retrieval data corresponding to the feature index in its own storage manager, and feed the retrieval data back to the client so that the client determines a retrieval result according to the retrieval data.
Illustratively, the data node further includes a storage manager with a storage function. In this embodiment, the storage manager and the cache manager store the same data types but different data volumes. Generally, the amount of data stored in the storage manager is greater than that in the cache manager: the cache manager stores only the data written within the most recent set duration, whereas the storage manager can store data spanning a longer period. Optionally, the storage manager manages its stored data with an LRU algorithm, that is, it uses LRU to evict some stored data from memory to make room for loading other data.
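A compact sketch of the LRU eviction mentioned above, built on `collections.OrderedDict` (the capacity and keys are illustrative):

```python
from collections import OrderedDict

class LRUStore:
    """Evicts the least-recently-used entry when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

store = LRUStore(2)
store.put("a", 1)
store.put("b", 2)
store.get("a")           # touch "a", so "b" becomes least recently used
store.put("c", 3)        # exceeds capacity, evicts "b"
print(list(store.data))  # → ['a', 'c']
```
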
Specifically, after receiving the feature index, the storage manager obtains the feature data from the feature index and determines whether it holds feature data satisfying the set similarity. If so, it acquires the retrieval data and feeds it back to the client so that the client determines a retrieval result according to the retrieval data. In this case, the storage manager first sends the retrieval data to the cache manager, thereby updating the cache manager's data, and the cache manager then feeds the retrieval data back to the retrieval manager; from this point, the processing of the retrieval manager is the same as in step 250 and is not repeated here. If not, the storage manager determines that the retrieval data cannot be acquired and feeds a retrieval failure message back to the client: the storage manager may send the retrieval failure message to the retrieval manager, which feeds it back to the client through the service layer.
It should be noted that steps 240 to 260, in which a feature index is sent from the retrieval manager to the cache manager and the result is returned to the retrieval manager for asynchronous data retrieval, form a loop. By repeating this loop, all the feature indexes can be retrieved.
Generally, the retrieval data needed by the user exists in the storage manager.
The following describes an exemplary process by which the first target data node handles the retrieval request in this embodiment. Fig. 4 is a schematic diagram of the data flow when a first target data node processes a retrieval request according to an embodiment of the present invention. Referring to fig. 4, the first target data node includes a service layer, a retrieval manager, a cache manager, and a storage manager. Specifically, the service layer receives the retrieval request sent by the client and forwards it to the retrieval manager. The retrieval manager splits the retrieval request, that is, parses the feature index out of it, and then sends a data acquisition indication, i.e. the feature index, to the cache manager. The cache manager checks whether retrieval data corresponding to the feature index exists; if so, it obtains the retrieval data and feeds it back to the retrieval manager. If not, it sends the feature index to the storage manager. The storage manager retrieves the retrieval data corresponding to the feature index and feeds it back to the cache manager, which in turn feeds it back to the retrieval manager. When the retrieval manager receives the retrieval data, asynchronous data lookups are executed, and after all asynchronous lookups finish, the results are merged to obtain the final retrieval data. Finally, the retrieval manager sends the retrieval data to the service layer, which feeds it back to the client.
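The cache-first, storage-fallback flow of fig. 4 can be sketched as a simple read-through cache. The classes below are illustrative stand-ins for the patent's components, not their actual implementation:

```python
class StorageManager:
    def __init__(self, data):
        self.data = data                 # feature key -> retrieval data

    def lookup(self, key):
        return self.data.get(key)

class CacheManager:
    def __init__(self, backing: StorageManager):
        self.cache = {}
        self.backing = backing

    def lookup(self, key):
        if key in self.cache:                 # cache hit
            return self.cache[key]
        value = self.backing.lookup(key)      # fall back to the storage manager
        if value is not None:
            self.cache[key] = value           # storage result refreshes the cache
        return value

storage = StorageManager({"feat-1": {"id": "rec-42"}})
cache = CacheManager(storage)
print(cache.lookup("feat-1"))   # miss, served from storage and cached
print(cache.lookup("feat-1"))   # now served from the cache
print(cache.lookup("feat-9"))   # absent everywhere → None
```
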
By providing both a cache manager and a storage manager, retrieval data can be found quickly while the data remains easy to manage.
Fig. 5 is a flowchart of another distributed data processing method according to an embodiment of the present invention. The present embodiment is embodied on the basis of the above-described embodiments and describes a write request from a user. Specifically, the data node group includes a master data node and a plurality of slave data nodes, with the master data node controlling data synchronization within the group.
Specifically, each data node group is set to include a master data node and a plurality of slave data nodes. The master data node controls data synchronization among the data nodes in the group. When any data node in the group receives a write request, the master data node responds to the write request and performs data synchronization after the data is written.
Optionally, this embodiment does not limit the manner in which the master data node is selected. For example, the master data node is elected within the data node group by a distributed consensus protocol (e.g., the RAFT algorithm). In general, a data node group always has exactly one master data node. If the current master data node becomes abnormal or unavailable, another data node in the group is elected as the new master, determined in the same manner as before. If the formerly abnormal or unavailable master data node becomes available again, it is demoted to a slave data node, and the current master data node controls it to perform data synchronization.
Referring to fig. 5, the distributed data processing method further includes:
step 310, the metadata node receives a write instruction sent by the client.
Specifically, when a user needs to write data, a write instruction is generated through the client and sent to the metadata node. The manner and rules by which the client sends the write instruction to the metadata node are the same as those for sending the retrieval instruction and are not repeated here. The specific content of the write instruction can be set according to actual conditions; for example, the write instruction includes the identity information of the data node to which the data is to be written. The identity information is unique and may be, for example, a number. Generally, the user can obtain, through the client, the identity information of all data nodes to which data can currently be written.
Step 320, the metadata node determines a second target data node among all the data nodes according to the write instruction.
Specifically, the metadata node parses the write instruction and confirms the identity information of the data node. Generally, the metadata node stores the identity information of every data node, so the corresponding data node can be located among all data nodes by its identity information and determined as the second target data node. There may be one or more second target data nodes.
Step 330, the metadata node feeds back the metadata of the second target data node to the client, so that the client sends a write request to the second target data node according to the metadata.
For example, after the metadata node determines the second target data node, it feeds the metadata of the second target data node back to the client so that the client determines the location information of the second target data node from the metadata. The client then generates a write request and sends it to the second target data node. The write request includes at least the data the user wishes to write to the data node; the written data includes at least the feature data and feature information of the currently stored data, and optionally the currently stored data itself. Optionally, after obtaining the metadata of the second target data node, the client computes the storage shard of the written data through a consistent hash algorithm based on the location information, and then generates the write request according to the sharding result.
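The consistent-hash shard computation might be sketched as a hash ring. This is one common construction (virtual nodes on an MD5 ring); the patent does not mandate these details, and the group names are invented for illustration:

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps a write key onto one of the data node groups."""
    def __init__(self, groups, vnodes=64):
        # each group owns several virtual points on the ring
        self.ring = sorted(
            (h(f"{g}#{i}"), g) for g in groups for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def locate(self, key: str) -> str:
        # first ring point clockwise from the key's hash
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["group-1", "group-2", "group-3"])
shard = ring.locate("feature-record-123")
print(shard)                                       # one of the three groups
assert shard == ring.locate("feature-record-123")  # placement is stable
```
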
Step 340, the second target data node receives the write request.
Optionally, the service layer of the second target data node receives the write request and sends the write request to the storage manager.
Step 350, the second target data node performs a write response according to the write request.
Specifically, the second target data node responds to the write request. This embodiment provides at least two write-response schemes:
In the first scheme, the second target data node is the master data node in its data node group. The scheme specifically comprises steps 351 to 353:
Step 351, the second target data node performs the write operation in its own storage manager according to the write request and creates the feature index.
Specifically, when the second target data node is the master data node, it responds to the write request directly. Optionally, after receiving the write request, the storage manager of the second target data node encapsulates the write request into a RAFT proposal and submits it to the local RAFT node; the local RAFT node then sends the RAFT proposal to the RAFT cluster. After confirming the RAFT proposal, the RAFT cluster feeds the approved proposal back to the local RAFT node. Here, the RAFT cluster refers to the RAFT nodes of the slave data nodes in the data node group. By setting up RAFT nodes and passing writes through RAFT proposals, the data nodes can remain consistent during subsequent data writes. Further, the storage manager receives the proposal callback, confirming that the data write can proceed. The storage manager then writes the data and generates a feature index from it: when the data is written, the associated feature data and feature information are recorded, and the feature data serves as the key of the feature index.
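A heavily simplified, single-process sketch of this write path: wrap the write in a proposal, wait for the group to confirm it, then write the data and key the feature index by the feature data. The toy `propose` below always reaches a majority; a real RAFT implementation differs substantially, and all class names here are invented:

```python
class ToyRaftGroup:
    """Stands in for the RAFT cluster: a proposal is committed once a
    majority of group members acknowledge it (here every member acks)."""
    def __init__(self, members):
        self.members = members
        self.log = []

    def propose(self, entry):
        acks = len(self.members) + 1              # slaves plus the master
        if acks > (len(self.members) + 1) // 2:   # majority reached
            self.log.append(entry)
            return True
        return False

class MasterStorageManager:
    def __init__(self, raft):
        self.raft = raft
        self.store = {}
        self.feature_index = {}

    def handle_write(self, feature_data, feature_info):
        entry = {"feature_data": feature_data, "feature_info": feature_info}
        if not self.raft.propose(entry):   # 1. encapsulate write as a proposal
            return None                    #    reject if not committed
        key = tuple(feature_data)          # 2. feature data keys the index
        self.store[key] = feature_info     # 3. write after the commit callback
        self.feature_index[key] = feature_info
        return key

raft = ToyRaftGroup(members=["slave-1", "slave-2"])
mgr = MasterStorageManager(raft)
key = mgr.handle_write([0.3, 0.7], {"id": "rec-7"})
print(key, mgr.feature_index[key])
```
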
Step 352, the second target data node updates the data and the feature index written in the storage manager to its own cache manager.
Specifically, after the storage manager completes the write operation, it updates the written data and the feature index to its own cache manager, so that the cache manager stores them. When a retrieval request is subsequently received, the cache manager is searched first, improving retrieval speed.
Step 353, the second target data node feeds the write response information back to the client.
For example, after the storage manager of the second target data node sends the written data and the feature index to the cache manager, it feeds write response information back to the service layer. The write response information may include the key of the feature index, a write completion flag, and the like. The service layer then feeds the write response information back to the client, so that the client can confirm that the data was written successfully and can note the relevant feature index for subsequent data retrieval.
In the second scheme, the second target data node is a slave data node in its data node group. The scheme specifically comprises steps 354 to 358:
Step 354, the second target data node sends the write request to a third target data node, where the third target data node is the master data node in the data node group to which the second target data node belongs.
Specifically, since the master data node controls the data synchronization of the data node group, it can write data directly and, after the write operation completes, synchronize the data to every data node in the group; after synchronization, the second target data node also stores the written data and the feature index. Accordingly, when the second target data node is a slave data node, it forwards the write request to the master data node, which executes the write operation. The master data node is then denoted the third target data node. When the third target data node receives the write request, it records the second target data node so that it can later feed the write response information back to that node.
Step 355, the third target data node performs the write operation in its own storage manager according to the write request and creates a feature index.
Step 356, the third target data node updates the data and the feature index written in the storage manager to its own cache manager.
Step 357, the third target data node feeds back the write response information to the second target data node.
Step 358, the second target data node feeds back the write response information to the client.
The specific implementation process of step 355 to step 358 is the same as the specific implementation process of step 351 to step 353, and is not described herein again.
It should be noted that, after the third target data node completes the write operation, data synchronization is performed on each slave data node in the group.
The technical solution provided by this embodiment is described below by example. Fig. 6 is a schematic diagram of the data flow between clusters when a client writes data according to an embodiment of the present invention. Fig. 7 is a schematic data flow diagram of a write response performed by a master data node according to an embodiment of the present invention. In this example, the client is assumed to write data to the master data node, that is, the second target data node is the master data node.
Referring to figs. 6 and 7, after generating the write instruction, the client sends it to the metadata service cluster so that a metadata node in the cluster resolves the write instruction and feeds back to the client the metadata corresponding to the second target data node. After receiving the metadata, the client computes the storage shard of the written data through a consistent hash algorithm based on the location information, generates a write request from the sharding result, and sends the write request to the second target data node.
The service layer of the second target data node receives the write request and sends it to the storage manager. The storage manager encapsulates the write request into a RAFT proposal and submits it to the local RAFT node, which sends the RAFT proposal to the RAFT cluster. After confirming the RAFT proposal, the RAFT cluster feeds the approved proposal back to the local RAFT node. The storage manager then receives the proposal callback, writes the data, and generates a feature index from the written data. After the storage manager finishes the write operation, it updates the written data and feature index to its own cache manager. Once the storage manager has sent the written data and feature index to the cache manager, it feeds write response information back to the service layer, and the service layer feeds the write response information back to the client so that the client can confirm the data was written successfully. At the same time, the cache manager refreshes the cache.
As described above, distributed storage is achieved through data synchronization within the data node group. Moreover, writing through master and slave data nodes makes data synchronization during writes convenient. At the same time, fault tolerance is achieved at the level of the entire data node group, not just at the single-data-node level. The client also need not store the metadata of each data node, since it can reach the data nodes through the metadata node.
Fig. 8 is a schematic structural diagram of a distributed data processing system according to an embodiment of the present invention. Referring to fig. 8, the distributed data processing system includes a metadata cluster 41 and a feature index server cluster 42, the metadata cluster 41 includes at least one metadata node 411, the feature index server cluster 42 includes a plurality of data node groups 421, each data node group 421 includes a plurality of data nodes 422, and data in the data node groups 421 are synchronized.
The metadata node 411 is configured to receive a retrieval instruction sent by a client (not shown), and is further configured to feed back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node; a group of metadata corresponds to one data node, and each data node group 421 has one data node 422 serving as the first target data node;
the first target data node is configured to receive the retrieval request and acquire the feature index in the retrieval request; and is further configured to determine retrieval data according to the feature index and feed the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data.
A metadata node in the metadata cluster receives the retrieval instruction sent by the client and feeds back to the client the metadata of the data nodes in every data node group of the feature index server cluster. The client then selects a first target data node in each data node group according to the metadata and sends a retrieval request to each first target data node. Each first target data node determines the feature index from the retrieval request, retrieves the corresponding retrieval data according to the feature index, and feeds the retrieval data back to the client so that the client can confirm the retrieval result. Meanwhile, because data within each data node group is kept synchronized, node-level fault tolerance is ensured: even if any data node in a group fails, the other data nodes in that group can still provide the retrieval service, guaranteeing the safety and stability of the data.
On the basis of the above embodiment, the hash value of each data node 422 in the data node group 421 is within a set range.
Specifically, a set parameter (such as a feature code) of each data node is processed through consistent hash calculation, and data nodes whose hash values fall in the same range are grouped together; different data node groups have different hash value ranges. The hash value ranges may be set according to actual conditions.
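The grouping by hash value range could be sketched as follows, assuming the set parameter is a feature-code string and the ranges are disjoint intervals over a 0-99 hash space (both assumptions are for illustration only):

```python
import hashlib

def node_hash(feature_code: str) -> int:
    """Consistent hash of a node's set parameter (here, a feature code)."""
    return int(hashlib.md5(feature_code.encode()).hexdigest(), 16) % 100

def group_nodes(nodes, ranges):
    """Assign each node to the group whose hash range contains its hash."""
    groups = {i: [] for i in range(len(ranges))}
    for node in nodes:
        hv = node_hash(node)
        for i, (lo, hi) in enumerate(ranges):
            if lo <= hv < hi:
                groups[i].append(node)
                break
    return groups

ranges = [(0, 34), (34, 67), (67, 100)]   # disjoint range per group
nodes = [f"node-{i}" for i in range(6)]
groups = group_nodes(nodes, ranges)
assert sum(len(v) for v in groups.values()) == len(nodes)  # every node placed
print(groups)
```
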
Further, when determining the retrieval data according to the feature index, the first target data node is specifically configured to:
determine whether retrieval data corresponding to the feature index exists in its own cache manager; if yes, acquire the retrieval data; if not, search for the retrieval data corresponding to the feature index in its own storage manager.
Further, the feature index includes feature data obtained by performing deep learning on the data to be retrieved.
Further, the retrieval data comprises a unique data identifier and/or data location information.
Further, the data node group comprises a master data node and a plurality of slave data nodes, and the master data node is used for controlling data synchronization of the data node group.
Further, the data node group includes a second target data node, and the metadata node is further configured to: receive a write instruction sent by the client; determine the second target data node among all data nodes according to the write instruction; and feed back the metadata of the second target data node to the client, so that the client sends a write request to the second target data node according to the metadata. The second target data node is configured to: receive the write request; and perform a write response according to the write request.
Further, the second target data node is the master data node in the data node group to which it belongs, and when performing a write response according to the write request, it is specifically configured to: perform the write operation in its own storage manager according to the write request and create a feature index; update the data and the feature index written in the storage manager to its own cache manager; and feed the write response information back to the client.
Further, the second target data node is a slave data node in the data node group to which it belongs, and the data node group includes a third target data node. When performing a write response according to the write request, the second target data node is specifically configured to: send the write request to the third target data node; receive the write response information fed back by the third target data node; and feed the write response information back to the client. The third target data node is the master data node in the data node group, and is configured to: perform the write operation in its own storage manager according to the write request and create a feature index; update the data and the feature index written in the storage manager to its own cache manager; and feed the write response information back to the second target data node.
The distributed data processing system provided by the embodiment is used for executing any of the above distributed data processing methods, and has corresponding functions and beneficial effects.
Fig. 9 is a schematic structural diagram of a distributed data processing apparatus according to an embodiment of the present invention. Referring to fig. 9, the distributed data processing apparatus includes:
a retrieval instruction receiving module 501, configured at the metadata node and configured to receive a retrieval instruction sent by a client, where the metadata node belongs to a metadata cluster; a first metadata feedback module 502, configured at the metadata node and configured to feed back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node, where a group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all data node groups form a feature index server cluster, each data node group has one data node serving as a first target data node, and data within each data node group is synchronized; a retrieval request receiving module 503, configured at the first target data node and configured to receive the retrieval request and acquire the feature index in the retrieval request; and a retrieval response module 504, configured at the first target data node and configured to determine retrieval data according to the feature index and feed the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data.
On the basis of the above embodiment, the retrieval response module 504 includes: an index comparison unit, configured to determine whether retrieval data corresponding to the feature index exists in the cache manager; a first data acquisition unit, configured to acquire the retrieval data if it exists; a second data acquisition unit, configured to search for the retrieval data corresponding to the feature index in the storage manager if it does not exist in the cache manager; and a data feedback unit, configured to feed the retrieval data back to the client so that the client determines a retrieval result according to the retrieval data.
On the basis of the above embodiment, the feature index includes feature data obtained by performing deep learning on data to be retrieved.
On the basis of the above embodiment, the retrieval data includes data unique identification and/or data position information.
On the basis of the above embodiment, the data node group includes a master data node and a plurality of slave data nodes, and the master data node is configured to control data synchronization of the data node group.
On the basis of the above embodiment, the apparatus further includes: a write instruction receiving module, configured at the metadata node and configured to receive the write instruction sent by the client; a node determining module, configured at the metadata node and configured to determine a second target data node among all data nodes according to the write instruction; a second metadata feedback module, configured at the metadata node and configured to feed back the metadata of the second target data node to the client, so that the client sends a write request to the second target data node according to the metadata; a write request receiving module, configured at the second target data node and configured to receive the write request; and a write response module, configured at the second target data node and configured to perform a write response according to the write request.
On the basis of the foregoing embodiment, the second target data node is the master data node in the data node group to which it belongs, and the write response module includes: a first writing unit, configured to perform the write operation in its own storage manager according to the write request and create a feature index; a first updating unit, configured to update the data and the feature index written in the storage manager to its own cache manager; and a first feedback unit, configured to feed the write response information back to the client.
On the basis of the foregoing embodiment, the second target data node is a slave data node in the data node group to which it belongs, and the write response module specifically includes: a first sending unit, configured to send the write request to a third target data node, where the third target data node is the master data node in the data node group to which the second target data node belongs; and a second sending unit, configured to receive the write response information fed back by the third target data node and send it to the client. The distributed data processing apparatus further includes: a second writing module, configured at the third target data node and configured to perform the write operation in its own storage manager according to the write request and create a feature index; a second updating module, configured at the third target data node and configured to update the data and the feature index written in the storage manager to its own cache manager; and a second feedback module, configured at the third target data node and configured to feed the write response information back to the second target data node.
The distributed data processing device provided by the embodiment of the invention can be used for executing any distributed data processing method, and has corresponding functions and beneficial effects.
Fig. 10 is a schematic structural diagram of a metadata node according to an embodiment of the present invention. As shown in fig. 10, the metadata node includes a first processor 60, a first memory 61, a first input device 62, a first output device 63, and a first communication device 64; the number of the first processors 60 in the metadata node may be one or more, and one first processor 60 is taken as an example in fig. 10; the first processor 60, the first memory 61, the first input device 62, the first output device 63 and the first communication device 64 in the metadata node may be connected by a bus or other means, and the bus connection is taken as an example in fig. 10.
The first memory 61 is used as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as corresponding program instructions/modules executed by a metadata node in the distributed data processing method according to the embodiment of the present invention (for example, a retrieval instruction receiving module 501 and a first metadata feedback module 502 configured at the metadata node in the distributed data processing apparatus). The first processor 60 executes various functional applications of the metadata node and data processing by running software programs, instructions, and modules stored in the first memory 61, that is, implements a portion corresponding to the metadata node in the above-described distributed data processing method.
The first memory 61 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the metadata node, and the like. Further, the first memory 61 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device. In some examples, the first memory 61 may further include memory located remotely from the first processor 60, which may be connected to the metadata node via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The first input device 62 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the metadata node. The first output device 63 may include a display device such as a display screen. The first communication device 64 is used for data communication with the client and the data node.
The metadata node can be used to execute the operations performed by the metadata node in the distributed data processing method of any embodiment, and has the corresponding functions and beneficial effects.
Fig. 11 is a schematic structural diagram of a data node according to an embodiment of the present invention. As shown in fig. 11, the data node includes a second processor 70, a second memory 71, a second input device 72, a second output device 73, and a second communication device 74. The data node may include one or more second processors 70; one second processor 70 is shown in fig. 11 as an example. The second processor 70, the second memory 71, the second input device 72, the second output device 73, and the second communication device 74 in the data node may be connected by a bus or other means; a bus connection is shown in fig. 11 as an example.
The second memory 71, as a computer-readable storage medium, stores software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data node in the distributed data processing method according to the embodiment of the present invention (for example, the retrieval request receiving module 503 and the retrieval response module 504 configured at the data node in the distributed data processing apparatus). By running the software programs, instructions, and modules stored in the second memory 71, the second processor 70 executes the various functional applications and data processing of the data node, that is, implements the portion of the above-described distributed data processing method corresponding to the data node.
The second memory 71 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the data node, and the like. Further, the second memory 71 may include a high speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the second memory 71 may further include memory located remotely from the second processor 70, which may be connected to the data node via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The second input device 72 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the data node. The second output device 73 may include a display device such as a display screen. The second communication device 74 is used for data communication with the client and the metadata node.
The data node can be used to execute the operations performed by the data node in the distributed data processing method of any embodiment, and has the corresponding functions and beneficial effects.
Embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a first processor and a second processor are configured to perform a distributed data processing method, the method including:
a metadata node receives a retrieval instruction sent by a client, wherein the metadata node belongs to a metadata cluster;
the metadata node feeds back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node, wherein one group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all the data node groups form a feature index server cluster, each data node group has one data node serving as a first target data node, and data within a data node group are synchronized;
the first target data node receives the retrieval request and acquires a feature index from the retrieval request;
and the first target data node determines retrieval data according to the feature index and feeds the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data.
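As an informal illustration of the retrieval flow above (not part of the patent), the following Python sketch models a client that obtains all metadata from a metadata node, selects one first target data node per data node group, and sends it a retrieval request carrying a feature index. All class, method, and field names here are hypothetical assumptions; the patent does not prescribe an API:

```python
# Illustrative sketch only; all names (DataNode, MetadataNode, Client)
# are assumptions, not the patented implementation.

class DataNode:
    """A data node holding a feature index over its stored data."""
    def __init__(self, name, records):
        self.name = name
        self.cache = {}               # cache manager: feature index -> data
        self.storage = dict(records)  # storage manager: feature index -> data

    def retrieve(self, feature_index):
        # Check the cache manager first, then fall back to the storage manager.
        if feature_index in self.cache:
            return self.cache[feature_index]
        data = self.storage.get(feature_index)
        if data is not None:
            self.cache[feature_index] = data
        return data


class MetadataNode:
    """On a retrieval instruction, feeds back all stored metadata."""
    def __init__(self, metadata):
        self.metadata = metadata      # one group of metadata per data node

    def handle_retrieval_instruction(self):
        return list(self.metadata)


class Client:
    def retrieve(self, metadata_node, nodes, feature_index):
        # 1. Send the retrieval instruction; receive all metadata.
        metadata = metadata_node.handle_retrieval_instruction()
        # 2. Select one first target data node per data node group
        #    (the selection policy is not fixed by the patent).
        targets = {}
        for m in metadata:
            targets.setdefault(m["group"], nodes[m["node"]])
        # 3. Send the retrieval request to each target; merge the retrieval data.
        results = [t.retrieve(feature_index) for t in targets.values()]
        return [r for r in results if r is not None]


node_a = DataNode("a", {"f1": "record-1"})
node_b = DataNode("b", {"f2": "record-2"})
meta = MetadataNode([{"group": 0, "node": "a"}, {"group": 1, "node": "b"}])
client = Client()
print(client.retrieve(meta, {"a": node_a, "b": node_b}, "f1"))  # ['record-1']
```

Because data within a data node group are synchronized, any one node of a group can serve as the first target data node, which is what the per-group selection step models.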
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the distributed data processing method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by hardware alone, the former being the preferred embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (9)

1. A distributed data processing method, comprising:
a metadata node receives a retrieval instruction sent by a client, wherein the metadata node belongs to a metadata cluster;
the metadata node feeds back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node, wherein one group of metadata corresponds to one data node, a plurality of data nodes form one data node group, all the data node groups form a feature index server cluster, each data node group has one data node serving as a first target data node, and data within a data node group are synchronized;
the first target data node receives the retrieval request and acquires a feature index from the retrieval request;
the first target data node determines retrieval data according to the feature index and feeds the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data;
the method further comprises the following steps:
the metadata node receives a write instruction sent by the client;
the metadata node determines a second target data node among all the data nodes according to the write instruction;
the metadata node feeds back the metadata of the second target data node to the client, so that the client sends a write request to the second target data node according to the metadata;
the second target data node receives the write request;
and the second target data node performs a write response according to the write request.
2. The distributed data processing method of claim 1, wherein determining, by the first target data node, the retrieval data based on the feature index comprises:
the first target data node determines whether retrieval data corresponding to the feature index exists in its own cache manager;
if yes, acquiring the retrieval data;
if not, searching for retrieval data corresponding to the feature index in its own storage manager.
3. The distributed data processing method according to claim 1 or 2, wherein the feature index includes feature data obtained by performing deep learning on the data to be retrieved.
4. The distributed data processing method according to claim 1 or 2, wherein the retrieval data comprises a unique data identifier and/or data location information.
5. The distributed data processing method of claim 1, wherein the group of data nodes comprises a master data node and a plurality of slave data nodes, and the master data node is configured to control data synchronization of the group of data nodes.
6. The distributed data processing method of claim 5, wherein the second target data node is a master data node within the data node group to which the second target data node belongs, and
the second target data node performing a write response according to the write request comprises:
the second target data node performs a writing operation in its own storage manager according to the write request and creates a feature index;
the second target data node updates the data written in the storage manager and the feature index to its own cache manager;
and the second target data node feeds back write response information to the client.
7. The distributed data processing method of claim 5, wherein the second target data node is a slave data node within the data node group to which the second target data node belongs, and
the second target data node performing a write response according to the write request comprises:
the second target data node sends the write request to a third target data node, wherein the third target data node is the master data node in the data node group to which the second target data node belongs;
the third target data node performs a writing operation in its own storage manager according to the write request and creates a feature index;
the third target data node updates the data written in the storage manager and the feature index to its own cache manager;
the third target data node feeds back write response information to the second target data node;
and the second target data node feeds back the write response information to the client.
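Claims 6 and 7 together describe a master/slave write path: a master writes locally, while a slave forwards the write request to its group's master and relays the response. The following minimal Python sketch illustrates one hypothetical reading; the class name, fields, and API are assumptions, not the patented implementation:

```python
# Sketch of the write response of claims 6-7; names are illustrative only.

class GroupDataNode:
    def __init__(self, name, master=None):
        self.name = name
        self.master = master          # None => this node is the group's master
        self.storage = {}             # storage manager
        self.cache = {}               # cache manager

    def write(self, feature_index, data):
        if self.master is not None:
            # Claim 7: a slave forwards the write request to the master of
            # its group and relays the master's write response to the client.
            return self.master.write(feature_index, data)
        # Claim 6: the master performs the writing operation in its own
        # storage manager, creates the feature index, and updates its cache.
        self.storage[feature_index] = data
        self.cache[feature_index] = data
        return {"node": self.name, "feature_index": feature_index, "ok": True}


master = GroupDataNode("master")
slave = GroupDataNode("slave", master=master)
resp = slave.write("f42", "record-42")
print(resp["ok"], master.storage["f42"])  # True record-42
```

Routing all writes through the group's master is what keeps the data within a data node group synchronized, since only one node ever mutates the group's state.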
8. A distributed data processing system, comprising: the system comprises a metadata cluster and a characteristic index server cluster, wherein the metadata cluster comprises at least one metadata node, the characteristic index server cluster comprises a plurality of data node groups, each data node group comprises a plurality of data nodes, and data in the data node groups are synchronized;
the metadata node is used for receiving a retrieval instruction sent by a client and feeding back all stored metadata to the client according to the retrieval instruction, so that the client selects a first target data node according to the metadata and sends a retrieval request to the first target data node, wherein one group of metadata corresponds to one data node, and each data node group has one data node serving as the first target data node;
the first target data node is used for receiving the retrieval request and acquiring a feature index from the retrieval request, and is further used for determining retrieval data according to the feature index and feeding the retrieval data back to the client, so that the client determines a retrieval result according to the retrieval data;
the data node group further includes a second target data node,
the metadata node is further used for: receiving a write instruction sent by the client; determining a second target data node among all the data nodes according to the write instruction; and feeding back the metadata of the second target data node to the client, so that the client sends a write request to the second target data node according to the metadata;
the second target data node is used for receiving the write request and performing a write response according to the write request.
9. The distributed data processing system of claim 8, wherein the hash value of each data node in a data node group is within a set range.
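Claim 9 constrains the hash value of each data node in a group to a set range, which suggests hash-range routing of keys to data node groups. The sketch below is one hypothetical reading of that arrangement; the choice of hash function, bucket count, and range layout are all assumptions, not claimed specifics:

```python
# Hypothetical hash-range routing per claim 9; not the patented scheme.
import hashlib

def hash_value(key, buckets=1024):
    """Map a key deterministically into [0, buckets)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % buckets

def select_group(key, group_ranges):
    """group_ranges: one half-open (low, high) range per data node group."""
    h = hash_value(key)
    for group_id, (low, high) in enumerate(group_ranges):
        if low <= h < high:
            return group_id
    raise ValueError("hash ranges do not cover the key space")

ranges = [(0, 512), (512, 1024)]      # two data node groups
group = select_group("some-record-key", ranges)
print(group in (0, 1))                # True
```

Keeping every node of a group inside one set range means any key hashes to exactly one group, and any synchronized node of that group can answer for it.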
CN201910173569.0A 2019-03-07 2019-03-07 Distributed data processing method and system Active CN110046062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910173569.0A CN110046062B (en) 2019-03-07 2019-03-07 Distributed data processing method and system

Publications (2)

Publication Number Publication Date
CN110046062A CN110046062A (en) 2019-07-23
CN110046062B true CN110046062B (en) 2021-03-19

Family

ID=67274635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910173569.0A Active CN110046062B (en) 2019-03-07 2019-03-07 Distributed data processing method and system

Country Status (1)

Country Link
CN (1) CN110046062B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111258508B (en) * 2020-02-16 2020-11-10 西安奥卡云数据科技有限公司 Metadata management method in distributed object storage
CN111431748B (en) * 2020-03-20 2022-09-30 支付宝(杭州)信息技术有限公司 Method, system and device for automatically operating and maintaining cluster
CN113726827A (en) * 2020-05-25 2021-11-30 北京同邦卓益科技有限公司 Data packet processing method and device based on distributed cluster
CN113434506B (en) * 2021-06-29 2023-05-16 平安科技(深圳)有限公司 Data management and retrieval method, device, computer equipment and readable storage medium
CN114564458B (en) * 2022-03-10 2024-01-23 苏州浪潮智能科技有限公司 Method, device, equipment and storage medium for synchronizing data among clusters

Citations (6)

Publication number Priority date Publication date Assignee Title
CN101458695A (en) * 2008-12-18 2009-06-17 西交利物浦大学 Mixed picture index construct and enquiry method based on key word and content characteristic and use thereof
CN103258036A (en) * 2013-05-15 2013-08-21 广州一呼百应网络技术有限公司 Distributed real-time search engine based on p2p
CN104679897A (en) * 2015-03-18 2015-06-03 成都金本华科技股份有限公司 Data retrieval method under big data environment
CN106951185A (en) * 2017-03-01 2017-07-14 武汉爱宁智慧科技有限公司 A kind of health detection data management system and method based on block chain technology
CN107220363A (en) * 2017-06-07 2017-09-29 中国科学院信息工程研究所 It is a kind of to support the global complicated cross-region querying method retrieved and system
CN108959663A (en) * 2018-09-17 2018-12-07 贵州微讯云信息技术服务有限公司 A kind of micro- search platform based on mobile Internet

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10366065B2 (en) * 2016-04-29 2019-07-30 Netapp, Inc. Memory efficient lookup structure

Also Published As

Publication number Publication date
CN110046062A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN110046062B (en) Distributed data processing method and system
CN108810041B (en) Data writing and capacity expansion method and device for distributed cache system
EP2863310B1 (en) Data processing method and apparatus, and shared storage device
US10187255B2 (en) Centralized configuration data in a distributed file system
CN106874281B (en) Method and device for realizing database read-write separation
US11265182B2 (en) Messaging to enforce operation serialization for consistency of a distributed data structure
CN111767297B (en) Big data processing method, device, equipment and medium
US11128622B2 (en) Method for processing data request and system therefor, access device, and storage device
US8892535B2 (en) Database management method
US10853892B2 (en) Social networking relationships processing method, system, and storage medium
US8489698B2 (en) Apparatus and method for accessing a metadata
CN112866406A (en) Data storage method, system, device, equipment and storage medium
CN116049306A (en) Data synchronization method, device, electronic equipment and readable storage medium
CN111147226A (en) Data storage method, device and storage medium
CN111399753B (en) Method and device for writing pictures
CN109947727B (en) Data processing method, device, computer equipment and storage medium
CN112800066A (en) Index management method, related device and storage medium
CN112685474A (en) Application management method, device, equipment and storage medium
CN113468215A (en) Data processing method and device, electronic equipment and computer storage medium
CN106407320B (en) File processing method, device and system
CN112948406B (en) Method, system and device for storing and synchronizing configuration change data
CN111666035B (en) Management method and device of distributed storage system
CN118152185A (en) Method, system, device, computer equipment and medium for recovering backup of fragmented clusters
CN117033315A (en) File updating method, system, storage medium and electronic equipment
CN116204125A (en) Method and equipment for IO real-time aggregation of distributed storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 306, zone 2, building 1, Fanshan entrepreneurship center, Panyu energy saving technology park, No. 832 Yingbin Road, Donghuan street, Panyu District, Guangzhou City, Guangdong Province

Patentee after: Jiadu Technology Group Co.,Ltd.

Address before: Room 306, zone 2, building 1, Fanshan entrepreneurship center, Panyu energy saving technology park, No. 832 Yingbin Road, Donghuan street, Panyu District, Guangzhou City, Guangdong Province

Patentee before: PCI-SUNTEKTECH Co.,Ltd.
