CN112416889A - Distributed storage system

Info

Publication number: CN112416889A
Application number: CN202011161754.7A
Authority: CN
Inventors: 徐云龙; 王海荣; 姚伯祥; 陈辉
Original assignee: Sugon Nanjing Research Institute Co ltd
Current assignee: Sugon Nanjing Research Institute Co ltd
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2021-02-26

Abstract

The invention discloses a distributed storage system which comprises five service components, namely a main node, a backup node, a data node, a client service and an RPC service. A metadata management component in a main node constructs a double buffer area in a memory for storing an operation log, and two buffer areas are exchanged when one buffer area is full; the backup node regularly pulls the operation log from the main node, reconstructs a file directory tree in the memory and regularly performs checkpoint operation; when the client uploads the files, the main node returns information of a plurality of nodes with the minimum occupied capacity in all the data nodes, the client selects one node to upload the files, and other nodes copy information from the node for redundant backup. The invention can improve the management performance of the metadata, ensure the balance of data capacity, reduce the pressure of the main node and provide high-performance uploading and downloading services for the files.

Description

Distributed storage system

Technical Field

The invention relates to a file storage system, in particular to a high-availability and high-performance distributed storage system.

Background

In the era of big data, tens of millions of files are generated in everyday use, such as commodity pictures in e-commerce environments, personnel photos in personnel management systems, and pictures of traffic violations captured at intersections. Faced with this growth, the first problem to be solved is how to store the files effectively. Existing distributed storage technology lacks an efficient storage system and a high-performance metadata management mode, so the efficiency with which clients upload and download files is low.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a distributed storage system that can manage file metadata efficiently and upload and download files at high speed.

The technical scheme is as follows: the distributed storage system comprises a main node, a backup node, a data node, a client service and an RPC service.

The metadata management component in the main node is provided with two exchangeable buffers; when one buffer is full and no flush to disk is currently in progress, the two buffers are exchanged.

The backup node regularly pulls the operation log from the main node, restores the file directory tree in a local memory and generates a mirror image file, and the mirror image file is returned to the main node in a synchronous non-blocking mode, so that the speed of restarting and restoring the main node is increased.

The data node adopts an asynchronous operation mode when uploading the file, and adopts a synchronous waiting mode when downloading the file, thereby effectively improving the uploading and downloading speed.

The main node further comprises a data node management component. When a file is uploaded, the main node selects, through the data node management component, the 2 to 3 data nodes with the smallest occupied capacity in the cluster. One of these data nodes is used to upload the file; if the upload fails, another node is selected and the upload is retried; after the upload succeeds, the remaining nodes are used for redundant backup. On the one hand this avoids data skew when files are uploaded, and on the other hand it improves the availability of file uploading.

The main node also comprises a pull service component. When the backup node sends a request to pull the operation log, the main node judges whether the cached data has been flushed to disk. If not, the buffer has not yet been filled, and the data can be taken directly from the buffer and put into the log set to be sent. If the log has been flushed to disk, the operation log file containing the log to be pulled is located according to the synchronized transaction number, and the data is read from it and put into the log set to be sent.
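The pull decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `fetch_logs`, the in-memory list standing in for the buffer, and the dictionary standing in for flushed log files keyed by their transaction range are all hypothetical.

```python
def fetch_logs(synced_txid, buffer_entries, flushed_files):
    """Return the log entries the backup node has not yet seen.

    synced_txid    -- highest transaction number the backup node already holds
    buffer_entries -- list of (txid, op) still held in memory (assumed layout)
    flushed_files  -- dict mapping (first_txid, last_txid) to the entries of
                      one flushed log file (stands in for files on disk)
    """
    wanted = synced_txid + 1
    # If the wanted transaction was already flushed, locate the log file
    # whose transaction range contains it and read from there.
    for (first, last), entries in sorted(flushed_files.items()):
        if first <= wanted <= last:
            return [e for e in entries if e[0] >= wanted]
    # Otherwise the entries are still in the unflushed buffer.
    return [e for e in buffer_entries if e[0] >= wanted]
```

In this sketch a single pull returns entries from one source only; a backup node that is far behind would simply pull again with its updated transaction number.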

The data node comprises a storage management component for acquiring information of all disk files, an RPC client for performing network communication between the client and the main node, a heartbeat management component for sending heartbeat requests to the main node at regular time and a synchronous non-blocking service component for uploading and downloading the files of the client.

The main node also comprises a rebalancing component, the main node distributes a duplicate deletion or duplicate copy task to the data node in the rebalancing work, and the data node receives and executes the task through the heartbeat service. Time-consuming tasks are handed to the data nodes to be executed, so that the pressure of the main nodes is relieved, and the overall load is reduced.

Generation of the mirror image is handled solely by the backup node and does not occupy any thread of the main node. The mirror image checkpoint component contained in the backup node starts a checkpoint thread at regular intervals and deletes expired mirror image files.

The metadata management component of the main node builds the directory tree in memory rather than on files or a database, which further improves performance: the basic data in memory can still be used even if the disk of the main node has a problem, and operations are faster.

The distributed storage system also includes a client service for providing a client operational tool for maintaining long connections between clients and data nodes.

The distributed storage system further comprises an RPC service, which encapsulates the RPC interface definitions between the main node and the data nodes and between the main node and the client.

The asynchronous operation mode adopted by the client file uploading comprises the following steps:

(1) sequentially starting a main node, a backup node and a data node;

(2) a client sends a file uploading request, and a main node constructs a file directory tree in a memory to generate an operation log;

(3) the main node selects a plurality of nodes with least occupied capacity and feeds back information to the client;

(4) the client establishes connection with one node in a synchronous non-blocking mode according to the address information of the data node fed back, and writes the file into a channel; the data nodes read the data files through the channels and write the files into the disk;

(5) after one data node is successfully written, feeding back specific copy information to the main node; and other data nodes carry out copy operation from the node to complete the redundant copy of the whole data file.

The synchronous waiting mode adopted by the client file downloading comprises the following steps:

(1) sequentially starting a main node, a backup node and a data node;

(2) the client sends a file downloading request to the main node, the main node finds the copy nodes where the file is located according to the file directory tree in the memory, and randomly returns one of the data nodes storing the file;

(3) the client establishes connection with the node, initiates a file reading request, reads the file in a channel mode, and saves the file in a byte array mode.

Advantageous effects: compared with the prior art, the invention has the following notable advantages: core metadata management is memory-based, so management performance is high and requests are processed quickly; the pressure on the main node and the overall load are small; files are uploaded and downloaded at high speed; data skew is avoided when files are uploaded, and high availability of file uploading is ensured.

Drawings

FIG. 1 is a block diagram of the system of the present invention;

FIG. 2 is a flow chart of creating file metadata according to the present invention;

FIG. 3 is a flow chart of the backup node pulling the operation log and performing a checkpoint according to the present invention;

FIG. 4 is a flow chart of a file upload of the present invention;

FIG. 5 is a flowchart of file download according to the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

The invention provides a distributed storage system which, as shown in FIG. 1, comprises five service components: a main node, a backup node, a data node, a client service and an RPC service. The main node is responsible for metadata management of the file directory tree and for data node service management; the backup node is responsible for reconstructing the file directory tree and storing it locally; the data node is responsible for file storage; the client service is responsible for providing a client operation tool for operating on the whole file directory tree and for uploading and downloading files; the RPC service encapsulates the RPC interface definitions between the main node and the data nodes, between the main node and the backup node, and between the main node and the client.

The main node is the core of the system and comprises a metadata management component, a pull service component, a mirror image receiving and processing service component, a data node management component and a rebalancing component. The metadata management component is responsible for maintaining the file directory tree; the pull service component is responsible for serving the log information in the main node that is pulled at regular intervals to generate a mirror image file; the mirror image receiving and processing service component is responsible for recovering the file directory tree structure; the data node management component is responsible for maintaining all data nodes; the rebalancing component is responsible for redistributing files among the data nodes to prevent skew.

The metadata management component maintains the directory tree. When a client sends a request to create a directory or upload a file, the directory or file path is first created in memory, and whenever the in-memory file directory tree changes an operation log record is generated automatically. A write buffer and a synchronization buffer are constructed in memory to hold the operation log, and log records are written to the write buffer first. When the write buffer is full, a locked code block is entered and the two buffers are exchanged. In this embodiment, the flush thread writes the data in the synchronization buffer into operation log files using segmented locking, and sequentially generates files such as operator-1-200.log and operator-201-399.log on disk, where 1-200 indicates that the first transaction recorded in the log file is number 1 and the last is number 200. With the double buffer and the segmented locking mechanism, no lock is held while flushing to disk, so the write efficiency of the operation log is greatly improved while the order of the log records is preserved; all metadata operations are memory-based, which effectively improves concurrency.

The process of exchanging the two buffers is as follows: the transaction number of the current thread is first compared with the transaction number that has already been synchronized (flushed); if the current transaction number is less than or equal to the synchronized transaction number, the thread returns immediately. If the current transaction number is larger, the thread checks whether another thread is currently flushing to disk; if so, it waits; if not, the two buffers are exchanged and the threads that were previously blocked on writing to the write buffer are woken up.
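The exchange procedure can be illustrated with the following sketch, written under stated assumptions: the class `DoubleBufferLog`, its attribute names, and the in-memory list `flushed` that stands in for the on-disk operation log files are hypothetical. It shows only the control flow (compare transaction numbers, wait while another thread flushes, swap, flush without holding the lock, wake waiters), not the segmented locking of the embodiment.

```python
import threading


class DoubleBufferLog:
    """Hypothetical double-buffered operation log (illustrative names)."""

    def __init__(self, limit_bytes=100 * 1024):
        self.limit = limit_bytes
        self.write_buf = []       # receives new log entries
        self.sync_buf = []        # entries currently being flushed
        self.write_size = 0
        self.next_txid = 1
        self.synced_txid = 0      # highest transaction number already flushed
        self.flushing = False
        self.cond = threading.Condition()
        self.flushed = []         # stands in for the log files on disk

    def append(self, op):
        with self.cond:
            txid = self.next_txid
            self.next_txid += 1
            self.write_buf.append((txid, op))
            self.write_size += len(op)
            full = self.write_size >= self.limit
        if full:
            self.sync(txid)
        return txid

    def sync(self, txid):
        with self.cond:
            if txid <= self.synced_txid:
                return                      # already flushed by another thread
            while self.flushing:
                self.cond.wait()            # another thread is flushing
            if txid <= self.synced_txid or not self.write_buf:
                return
            # Exchange the two buffers while holding the lock.
            self.write_buf, self.sync_buf = self.sync_buf, self.write_buf
            self.write_size = 0
            self.flushing = True
            batch = self.sync_buf
        # The flush itself runs without the lock, so writers are not blocked.
        self.flushed.extend(batch)
        with self.cond:
            self.synced_txid = batch[-1][0]
            self.sync_buf = []
            self.flushing = False
            self.cond.notify_all()          # wake threads waiting on the flush
```

Because writers only contend for the short swap, appends can continue into the fresh write buffer while the previous batch is being written out.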

As shown in FIG. 2, creating file metadata includes the following steps:

(1) the client sends a directory creation request to the main node through the RPC framework;

(2) after receiving an RPC request from a client, the master node performs different processing according to different types of the request;

(3) the main node constructs a directory through a file directory tree component maintained in the memory, sequentially and recursively checks whether the directory is created or not, constructs a file directory tree if the directory is not created, and returns information to the client if the directory is created;

(4) sending an operation log to an operation log component;

(5) after receiving the recording request, the operation log component generates an operation log in JSON format and writes it into the double buffer, writing to the write buffer first; when the data in the write buffer exceeds 100k, the write buffer is exchanged with the synchronization buffer;

(6) the flush thread writes the data in the synchronization buffer into operation log files, sequentially generating files such as operator-1-200.log and operator-201-399.log on disk;

(7) the backup node regularly updates the check point information in the main node, and the main node generates a log cleaning thread to delete the expired log file.
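Steps (3) to (6) above can be sketched as follows. This is a minimal illustration, assuming a nested dictionary as the in-memory directory tree; the class `DirectoryTree`, the JSON field names (`txid`, `op`, `path`) and the helper `log_file_name` are hypothetical and chosen only to match the file naming pattern given in the text.

```python
import json


class DirectoryTree:
    """In-memory directory tree that records each change as a JSON log entry."""

    def __init__(self):
        self.root = {}
        self.txid = 0
        self.log = []   # stands in for the write buffer of the operation log

    def mkdir(self, path):
        """Create the path level by level; return False if it already exists."""
        node = self.root
        created = False
        for part in [p for p in path.split("/") if p]:
            if part not in node:
                node[part] = {}
                created = True
            node = node[part]
        if created:
            self.txid += 1
            # Field names are assumptions made for this sketch.
            entry = {"txid": self.txid, "op": "MKDIR", "path": path}
            self.log.append(json.dumps(entry))
        return created


def log_file_name(first_txid, last_txid):
    """Name a flushed log file after the transaction range it contains."""
    return f"operator-{first_txid}-{last_txid}.log"
```

A request for a directory that already exists changes nothing and therefore produces no log entry, which keeps the transaction numbers dense.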

The backup node is the counterpart of the pull service component and the mirror image receiving and processing service component in the main node. The backup node periodically sends an RPC request to the main node, pulls the operation log, generates a mirror image file locally and periodically performs a checkpoint operation. The backup node transmits the mirror image file back to the main node, so that on restart the main node only needs to load the mirror image file and perform a replay operation to finish starting quickly, which effectively improves the restart efficiency of the main node. Because the file directory tree structure is kept in memory, the service remains available when the main node disk is unavailable, disk I/O load is lower than when the structure is stored on disk, and the operation speed can be improved by more than 100 times.

As shown in FIG. 3, the backup node pulls the operation log and performs checkpoints, comprising the following steps:

(1) the backup node regularly pulls an operation log from the main node through an RPC service framework;

(2) the backup node applies operations to its in-memory directory tree according to the operation codes in the pulled log;

(3) the backup node simultaneously starts a check point thread, removes expired image files and writes the latest memory file directory tree into a disk in the form of image files;

(4) after the local mirror image file is generated, the mirror image file is transmitted back to the main node in a synchronous non-blocking mode;

(5) using the maximum transaction number synchronized so far, the backup node notifies the main node to write the current timestamp, the maximum transaction number and the name of the latest mirror image file to disk.
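Steps (2) and (3) above can be sketched as follows, under stated assumptions: the operation codes `MKDIR` and `DELETE`, the tuple layout of a log entry, and the image file name `image-<txid>.json` are hypothetical and serve only to show how a backup node might replay pulled logs and write a checkpoint.

```python
import json
import os
import tempfile


def replay(tree, entries):
    """Apply pulled log entries (txid, op, path) to the in-memory tree."""
    last = 0
    for txid, op, path in entries:
        parts = [p for p in path.split("/") if p]
        if op == "MKDIR":
            node = tree
            for p in parts:
                node = node.setdefault(p, {})
        elif op == "DELETE" and parts:
            node = tree
            for p in parts[:-1]:
                node = node.get(p, {})
            node.pop(parts[-1], None)
        last = txid
    return last   # highest transaction number applied


def checkpoint(tree, max_txid, image_dir):
    """Remove older image files, then write the current tree as a new image."""
    for name in os.listdir(image_dir):
        if name.startswith("image-"):
            os.remove(os.path.join(image_dir, name))
    path = os.path.join(image_dir, f"image-{max_txid}.json")
    with open(path, "w") as f:
        json.dump({"max_txid": max_txid, "tree": tree}, f)
    return path
```

In this sketch every older image counts as expired; a real deployment would more likely keep a few recent images according to a retention policy.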

The data nodes comprise a storage management component, an RPC client, a heartbeat management component and a synchronous non-blocking service component, and the data node management component in the main node maintains all the data nodes through a Map data structure. The storage management component is used for acquiring all disk file information; the RPC client is used for communicating with the main node and can send registration information, heartbeat information or reported file copy information; the heartbeat management component is responsible for sending a heartbeat request to the main node and making different treatments according to the feedback information of the main node; and the synchronous non-blocking service component is responsible for uploading and downloading the client files.

The data node sends a registration request to the main node when it is first started. In this embodiment, the data node then calls the heartbeat service of the main node through an RPC interface every 30 seconds. If the interval between the current time and the last heartbeat received from a data node exceeds 120 seconds, the data node is judged to have failed and is removed from the data node set; if a heartbeat request from that node later arrives after a network delay, the node may re-register.
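The heartbeat bookkeeping on the main node can be sketched as follows. The 30-second interval and 120-second expiry come from the embodiment; the class `DataNodeRegistry`, its method names and the reply strings `OK` and `REGISTER` are assumptions made for illustration. Timestamps are passed in explicitly so the logic can be exercised without waiting.

```python
HEARTBEAT_INTERVAL = 30   # seconds between heartbeats (from the embodiment)
EXPIRY = 120              # seconds of silence after which a node is removed


class DataNodeRegistry:
    """Tracks the last heartbeat time of every registered data node."""

    def __init__(self):
        self.last_seen = {}   # node id -> time of last heartbeat

    def register(self, node_id, now):
        self.last_seen[node_id] = now

    def heartbeat(self, node_id, now):
        if node_id not in self.last_seen:
            # The node was removed earlier; ask it to register again.
            return "REGISTER"
        self.last_seen[node_id] = now
        return "OK"

    def evict_expired(self, now):
        """Remove and return the nodes whose last heartbeat is too old."""
        dead = [n for n, t in self.last_seen.items() if now - t > EXPIRY]
        for n in dead:
            del self.last_seen[n]
        return dead
```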

When a data node is removed or newly added, the rebalancing component of the main node is invoked: the client calls the RPC service framework to send a rebalancing request to the main node. The main node first calculates the total data capacity stored across the data nodes and from it the average capacity per node. It then traverses the data nodes in turn, compares the storage size of each with the average, and divides the nodes into emigration nodes and immigration nodes. The main node generates copy deletion tasks for the emigration nodes and copy replication tasks for the immigration nodes. When a node sends a heartbeat, it picks up its tasks automatically, and copying and deletion are carried out among the data nodes, which relieves the pressure on the main node and reduces the load on the system.
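The classification step of rebalancing can be sketched as follows. The function name `plan_rebalance` and the dictionary of per-node usage are hypothetical; the sketch only computes the average and the amount each node is above or below it, and leaves the choice of which file copies to move to the task generation described above.

```python
def plan_rebalance(usage):
    """Split nodes into emigration and immigration sets around the average.

    usage maps a node id to the amount of data it stores. The result gives
    the average, the surplus of each emigration node, and the deficit of
    each immigration node.
    """
    avg = sum(usage.values()) / len(usage)
    emigrate = {n: u - avg for n, u in usage.items() if u > avg}
    immigrate = {n: avg - u for n, u in usage.items() if u < avg}
    return avg, emigrate, immigrate
```

For example, with usages of 90, 30 and 30 the average is 50, so the first node should give up 40 units and the other two should each receive 20.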

The synchronous non-blocking service component of the data node adopts a Reactor model architecture. A multiplexer monitors client events; when a client event arrives, a thread is randomly selected from a reserved thread pool and the channel is placed into a cache queue for that thread. Each thread also has its own multiplexer; the thread takes the channel out of the cache queue, registers it with its own multiplexer, and parses the data in the channel according to the header information of the request.

The header information is divided into upload header information and download header information. The upload header information is: request type (4 bytes) + file name length (4 bytes) + file name (N bytes) + file content length (8 bytes) + file content. The download header information is: request type (4 bytes) + file name length (4 bytes) + file name (N bytes).
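The two layouts can be illustrated with Python's `struct` module. This is a sketch under stated assumptions: the request-type codes `UPLOAD` and `DOWNLOAD`, big-endian byte order and UTF-8 file names are not specified in the text and are chosen here only for illustration.

```python
import struct

UPLOAD, DOWNLOAD = 1, 2   # hypothetical request-type codes


def encode_upload(name, content):
    """type (4 bytes) + name length (4) + name (N) + content length (8) + content."""
    name_b = name.encode("utf-8")
    return (struct.pack(">ii", UPLOAD, len(name_b)) + name_b
            + struct.pack(">q", len(content)) + content)


def encode_download(name):
    """type (4 bytes) + name length (4) + name (N)."""
    name_b = name.encode("utf-8")
    return struct.pack(">ii", DOWNLOAD, len(name_b)) + name_b
```

An upload of a 5-byte file named `a.txt` therefore occupies 4 + 4 + 5 + 8 + 5 = 26 bytes on the wire.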

Because the synchronous non-blocking service component reads through channels, sticky-packet and split-packet problems may occur. For the sticky-packet problem, a buffer of the exact length given by the file name length and the file content length in the header is read from the channel, yielding the file name and the actual content of the file. For the split-packet problem, a cached Request object is created and the fields of the header are stored in separate byte buffers; when the data read from the channel is incomplete, it is kept in the corresponding byte buffer until the complete data of the specified length has been read, and the data read is then set into the Request model object.
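Length-prefixed parsing of this kind can be sketched as follows. The sketch swaps the per-field byte buffers of the embodiment for a single accumulating buffer, which is simpler to show; the class `UploadParser`, the helper `frame`, big-endian byte order and the request-type code are assumptions. Incomplete data stays buffered until the rest arrives (split packets), and several requests arriving in one read are separated by their declared lengths (sticky packets).

```python
import struct


def frame(name, content, req_type=1):
    """Build one upload message in the header layout (assumed big-endian)."""
    name_b = name.encode("utf-8")
    return (struct.pack(">ii", req_type, len(name_b)) + name_b
            + struct.pack(">q", len(content)) + content)


class UploadParser:
    """Accumulates bytes read from a channel and yields complete requests."""

    def __init__(self):
        self.buf = bytearray()

    def feed(self, chunk):
        """Add newly read bytes; return every request that is now complete."""
        self.buf += chunk
        out = []
        while True:
            req = self._try_parse()
            if req is None:
                return out
            out.append(req)

    def _try_parse(self):
        b = self.buf
        if len(b) < 8:                       # type + name length not yet read
            return None
        _type, name_len = struct.unpack_from(">ii", b, 0)
        body_at = 8 + name_len + 8
        if len(b) < body_at:                 # name or content length missing
            return None
        (size,) = struct.unpack_from(">q", b, 8 + name_len)
        end = body_at + size
        if len(b) < end:                     # content only partly received
            return None
        name = bytes(b[8:8 + name_len]).decode("utf-8")
        content = bytes(b[body_at:end])
        del self.buf[:end]                   # keep any bytes of the next request
        return name, content
```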

The client service comprises a file operation component and a network management communication component, wherein the file operation component is responsible for sending a request to a main node, creating a directory, deleting the directory, uploading a file and downloading the file; the network management communication component is responsible for network communication interaction with the data node.

The file operation component communicates with the main node through the RPC service. After a client sends a request to upload a file, the main node returns the two data nodes with the smallest occupied capacity, which ensures at least one redundant copy of the data; the client selects one of the nodes to upload the file, and the redundant copy is completed by replication among the data nodes. To download a file, the client sends an RPC request to the main node, the main node finds a node holding the data file and returns the IP address and port of that node to the client, and the client connects directly to that host to download the file.
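Selecting the least-loaded nodes is a small operation, sketched here with the standard `heapq` module. The function name `pick_upload_nodes` and the usage dictionary are hypothetical; how ties are broken and how capacity is measured are not specified in the text.

```python
import heapq


def pick_upload_nodes(usage, count=2):
    """Return the `count` node ids with the smallest occupied capacity."""
    return heapq.nsmallest(count, usage, key=usage.get)
```

With usages `{"dn1": 50, "dn2": 10, "dn3": 30}` the sketch returns `dn2` and `dn3`; the client would upload to one of them and the other would copy the file for redundancy.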

The network management communication component maintains long connections between the client and the data nodes. When the client uploads a file, it first tries to connect to the data node, sets the connection state to CONNECTING, and puts the request into a request queue. A separate background thread monitors changes of the selection keys on the multiplexer; when a key becomes connectable, the connection between the client and the data node has been established, the connection state is changed from CONNECTING to CONNECTED, and the connection information for that host is cached so that the connection can be obtained quickly the next time the client connects to the host. The long connection mode reduces connection cost, improves communication efficiency and simplifies the connection between the client and the data node.
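The caching of long connections can be sketched as follows. This is a deliberately simplified, blocking illustration: the embodiment completes the connection on a background thread driven by a multiplexer, whereas here the caller-supplied `connect` function is simply invoked in place. The class `ConnectionCache` and its attributes are hypothetical; only the CONNECTING and CONNECTED states come from the text.

```python
class ConnectionCache:
    """Keeps one long-lived channel per data node host."""

    def __init__(self, connect):
        self.connect = connect   # callable that opens a channel to a host
        self.state = {}          # host -> "CONNECTING" or "CONNECTED"
        self.channels = {}       # host -> cached channel

    def get(self, host):
        # Reuse the cached channel when the host is already connected.
        if self.state.get(host) == "CONNECTED":
            return self.channels[host]
        self.state[host] = "CONNECTING"
        channel = self.connect(host)
        self.channels[host] = channel
        self.state[host] = "CONNECTED"
        return channel
```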

And after the long connection between the client and the data node is established, the client encapsulates an uploading file Request model and a downloading file Request model according to the operation type.

The upload file Request model uses asynchronous operation. An upload file header is first packaged and put into a request queue; a dedicated background thread takes the request out of the request queue and puts it into a queue of requests to be sent, marks the operation code as writable, writes the packaged header information into the channel, and marks the operation code as readable after the data has been written. The data node reads all the data from the channel and writes it to disk, and feeds SUCCESS back to the client when reading and writing are finished. On receiving the response, the client triggers its preset callback function to handle the result accordingly. If the upload fails, a retry mechanism is triggered and another data node is selected for the upload, which ensures high availability of the upload operation.
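The asynchronous upload with callback and retry can be sketched with a thread pool in place of the request queues and multiplexer of the embodiment. The function `upload_async`, the `send` callable and the `on_done` callback signature are hypothetical; only the SUCCESS response and the retry on another node come from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_async(pool, nodes, send, on_done):
    """Submit an upload that tries each candidate node in turn.

    send(node) is assumed to return "SUCCESS" or raise OSError on failure.
    on_done(node, ok) is the caller's callback, run on the worker thread.
    """
    def task():
        for node in nodes:
            try:
                if send(node) == "SUCCESS":
                    on_done(node, True)
                    return node
            except OSError:
                pass                      # fall through and retry on the next node
        on_done(None, False)
        return None

    # The caller gets a future immediately and is not blocked by the upload.
    return pool.submit(task)
```

The caller may ignore the returned future and rely on the callback alone, which matches the fire-and-notify style described above.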

The download file Request model uses a synchronous waiting mode. A download file header is first packaged and the request message is sent; after the message has been sent successfully, the current thread blocks waiting for the response, and the file content read from the channel is stored in the returned response. Once another thread has processed the response, the waiting thread continues to run. If an exception occurs during reading, a retry mechanism is triggered and another node is used to download the file again.
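The synchronous waiting mode can be sketched with a `threading.Event` that the requesting thread blocks on until another thread fills in the response. The class `PendingResponse`, the function `download` and the `send_request` callable are hypothetical names introduced for this illustration.

```python
import threading


class PendingResponse:
    """Holds the file content once another thread has received it."""

    def __init__(self):
        self._event = threading.Event()
        self._content = None

    def complete(self, content):
        self._content = content
        self._event.set()                 # release the waiting thread

    def wait(self, timeout=None):
        if not self._event.wait(timeout):
            raise TimeoutError("no response from data node")
        return self._content


def download(nodes, send_request, timeout=5.0):
    """Try each replica in turn, blocking until one returns the file."""
    for node in nodes:
        pending = PendingResponse()
        try:
            send_request(node, pending)   # assumed to raise OSError on failure
            return pending.wait(timeout)
        except (OSError, TimeoutError):
            continue                      # retry on another replica
    raise OSError("all replicas failed")
```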

As shown in fig. 4, the client uploads the file, which includes the following steps:

(1) the client sends a file uploading request to the main node by calling the RPC service framework;

(2) the main node selects two data nodes with the minimum data capacity through the data node management component, and returns the host information of the two data nodes to the client;

(3) the client side starts a multiplexer to monitor the change of the channel;

(4) the client calls the network communication management component to try to establish connection and caches the established connection; setting the connection state of the host from CONNECTING to CONNECTED;

(5) creating a file uploading request of a client, and packaging an uploading file request header;

(6) the request is put into a request queue, the channel information of the data node which has established the connection is obtained, and data is written into the channel of the data node; if abnormity occurs in the sending process, resending is needed;

(7) the multiplexer of the data node monitors that the channel has data change and transmits the processing to the processor; the processor is responsible for analyzing the Request, reading data in the channel, packaging the analyzed content into a Request object and putting the Request object into a Request queue;

(8) the I/O thread pulls specific request information from the request queue and writes the request information into a disk, informs a main node that the current data node already has the file copy, and encapsulates a response and puts the response into a response queue;

(9) the processor processes the response in the response queue and feeds success or failure information back to the client;

(10) the client monitors the change of the channel, calls a callback function, and performs different operation feedbacks according to different response information; if the file uploading is successful, recording the file uploading success, and if the file uploading failure occurs, selecting another node for uploading.

As shown in fig. 5, the client performs file downloading, which includes the following steps:

(1) the client sends a file downloading request to the main node by calling the RPC service framework;

(2) the main node randomly selects a node from the nodes with the data file copy and returns the host information of the data node to the client;

(3) the client side starts a multiplexer to monitor the change of the channel;

(5) creating a file downloading request of a client and packaging a file downloading request header;

(6) the download request is put into a request queue, the channel information of the data node with the established connection is obtained, and the download request is written into the data node channel; if the write fails, the data node is replaced with another node holding the data file and the download request is resent;

(8) the I/O thread pulls specific request information from the request queue, reads a data file from a disk into a memory, and encapsulates response information to place the response information into a response queue;

(9) the processor processes the response in the response queue and sends the file content to the client through the channel;

(10) the client synchronously waits for data to change in the channel, and when a change occurs, reads the file content from the channel to finish downloading the file.

Claims

1. A distributed storage system, characterized by: the system comprises a main node, a backup node and a data node; the metadata management component in the main node is provided with two exchangeable buffers, and when one of the buffers is full and no flush to disk is currently in progress, the two buffers are exchanged; the backup node regularly pulls the operation log from the main node, restores the file directory tree in a local memory and generates a mirror image file, and returns the mirror image file to the main node in a synchronous non-blocking mode; the data node adopts an asynchronous operation mode when uploading the file, and adopts a synchronous waiting mode when downloading the file.

2. The distributed storage system of claim 1, wherein: the main node further comprises a data node management component, when the files are uploaded, the main node selects 2 to 3 data nodes with the smallest data capacity in the cluster through the data node management component, one of the data nodes is used for uploading the files, if the uploading fails, the other node is selected for re-uploading, and after the uploading succeeds, the rest nodes are used for redundancy backup.

3. The distributed storage system of claim 1, wherein: the main node also comprises a pull service component; when the backup node sends a request to pull the operation log to the main node, the main node judges whether the cached data has been flushed to disk; if not, the buffer has not yet been filled, and the data can be taken directly from the buffer and put into the log set to be sent; if the log has been flushed to disk, the operation log file containing the log to be pulled is located according to the synchronized transaction number, and the data is read from it and put into the log set to be sent.

4. The distributed storage system of claim 1, wherein: the data node comprises a storage management component for acquiring information of all disk files, an RPC client for performing network communication between the client and the main node, a heartbeat management component for sending heartbeat requests to the main node at regular time and a synchronous non-blocking service component for uploading and downloading the files of the client.

5. The distributed storage system of claim 1, wherein: the main node also comprises a rebalancing component, the main node distributes a duplicate deletion or duplicate copy task to the data node during rebalancing, and the data node receives and executes the task through heartbeat service.

6. The distributed storage system of claim 1, wherein: generation of the mirror image is handled solely by the backup node, and the mirror image checkpoint component contained in the backup node starts a checkpoint thread at regular intervals and deletes expired mirror image files.

7. The distributed storage system of claim 1, wherein: the metadata management component of the main node constructs the directory tree based on the memory.

8. The distributed storage system of claim 1, wherein: the system also comprises a client service for providing a client operation tool and maintaining long connection between the client and the data node.

9. The distributed storage system of claim 1, wherein: the asynchronous operation mode adopted by the client file uploading comprises the following steps:

(1) sequentially starting a main node, a backup node and a data node;

10. The distributed storage system of claim 1, wherein: the synchronous waiting mode adopted by the client file downloading comprises the following steps:

(1) sequentially starting a main node, a backup node and a data node;