WO2017096942A1 - File storage system, data scheduling method, and data node - Google Patents

File storage system, data scheduling method, and data node

Info

Publication number
WO2017096942A1
WO2017096942A1 (PCT/CN2016/095532)
Authority
WO
WIPO (PCT)
Prior art keywords
data
written
node
data node
distributed storage
Prior art date
Application number
PCT/CN2016/095532
Other languages
French (fr)
Chinese (zh)
Inventor
Yang DONG
Weihua SHAN
Hui YIN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2017096942A1 publication Critical patent/WO2017096942A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0613 Improving I/O performance in relation to throughput

Definitions

  • The present invention relates to the field of file systems, and in particular, to a file storage system, a data scheduling method, and a data node.
  • The Hadoop Distributed File System (HDFS) is a distributed file system suitable for running on commodity hardware: it is highly scalable, allowing capacity to be expanded dynamically without downtime, and highly reliable, providing automatic data detection and replication as well as high-throughput access that eliminates access bottlenecks.
  • In the prior art, the system architecture of HDFS is shown in FIG. 1 and includes a client 11 and a server group 12.
  • The client 11 includes a DistributedFileSystem module 111 and a file system data output stream (FSDataOutputStream) module 112.
  • The server group adopts a master-slave structure and consists of a name node (NN) 121 and a plurality of data nodes (DN) 122.
  • The name node 121 is a master server that manages the file system namespace and regulates client access to files; the data nodes 122 store data. Generally, one data node corresponds to one server, and each data node corresponds to its own distributed storage subsystem, using distributed storage.
  • Before data is written with the above HDFS system, the client first initiates an RPC request to the remote NN node through the DistributedFileSystem module; the NN node creates a new file in the file system namespace; and the DistributedFileSystem module returns a DFSOutputStream to the HDFS client, after which the client starts writing data.
  • As the client writes data, the DFSOutputStream divides it into blocks and writes them into a data queue.
  • The data queue is read by the DataStreamer, which asks the name node to allocate data nodes for storing the data blocks (each data block corresponds to three data nodes by default).
  • The DataStreamer writes the data into the allocated data nodes sequentially through a pipeline, so that data blocks are mutually backed up among multiple data nodes: for example, a data block is written to the first data node, which sends it to the second data node, which in turn sends it to the third.
  • In addition, each data node corresponds to a distributed storage device, which in practice consists of a plurality of physical disks.
  • The data node forwards the written block data to the distributed storage device through I/O, triggering the device's write process: the distributed storage device writes the data to the primary physical disk and simultaneously sends replication requests to the standby physical disks, so that multiple backup copies (three by default) are written on the distributed storage device.
  • When the client finishes writing data, the DataStreamer closes the write stream and notifies the name node that the data has been written.
  • With this existing file read/write method, the write operation for the next data block proceeds only after all data nodes have finished writing the current data, so the data writing speed is slow.
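  • For orientation, the prior-art client-side flow described above corresponds to the standard HDFS write API. A minimal sketch (the path and payload are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Cluster address and the default replication factor (3) come from the configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // DistributedFileSystem issues the RPC to the name node, which creates the
        // file in the namespace and hands back an output stream for writing
        try (FSDataOutputStream out = fs.create(new Path("/demo/block.dat"))) {
            // Internally, DFSOutputStream splits the bytes into packets, queues them,
            // and the DataStreamer pushes them through the data-node pipeline
            out.write("example payload".getBytes(StandardCharsets.UTF_8));
        }
        // close() flushes remaining packets, waits for acks, and notifies the name node
    }
}
```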
  • The present invention provides a file storage system, a data scheduling method, and a data node that can speed up data writing.
  • In a first aspect, the present invention provides a file storage system whose server side includes:
  • a name node, a primary data node, and at least one backup data node;
  • the primary data node and the at least one backup data node share a first distributed storage subsystem, the first distributed storage subsystem including a primary storage device and at least one backup storage device;
  • The primary data node is configured to receive a write operation instruction sent by the client, the write operation instruction including data to be written; to write the data to be written into the first distributed storage subsystem; and to send an update request to the first backup data node, the update request including the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.
  • The backup data node is configured to receive the update request; to search the first distributed storage subsystem for the data to be written according to the storage location and the attribute information carried in the update request; and, when the data to be written is found, to save the attribute information of the data to be written.
  • Optionally, the primary data node's operation permission on the first distributed storage subsystem allows read and write operations, while the backup data node's operation permission on the first distributed storage subsystem allows read operations only.
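  • As a rough illustration of this permission split, the shared subsystem could expose per-node access modes. This is a minimal sketch under hypothetical names, not an interface defined by the patent:

```java
/** Hypothetical access modes a data node holds on the shared subsystem. */
enum AccessMode { READ_WRITE, READ_ONLY }

final class SubsystemHandle {
    private final String nodeId;
    private final AccessMode mode;

    SubsystemHandle(String nodeId, AccessMode mode) {
        this.nodeId = nodeId;
        this.mode = mode;
    }

    void writeBlock(byte[] block) {
        // Only the primary data node is granted READ_WRITE; backup nodes are READ_ONLY
        if (mode != AccessMode.READ_WRITE) {
            throw new IllegalStateException(nodeId + " may only read the shared subsystem");
        }
        // ... forward the block to the primary storage device ...
    }

    byte[] readBlock(long location, int length) {
        // Read access is allowed for every node, which is all a backup node needs
        return new byte[length]; // placeholder for an actual device read
    }
}
```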
  • In a second aspect, an embodiment of the present invention further provides a data scheduling method applied to the file storage system of the first aspect. The method includes:
  • the primary data node receives a write operation instruction sent by the client, the write operation instruction including data to be written;
  • the primary data node writes the data to be written into the first distributed storage subsystem; and
  • the primary data node sends an update request to the first backup data node, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.
  • Optionally, the attribute information of the data to be written includes the name and size of the data to be written.
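  • To make the shape of such a request concrete, the following is a minimal sketch; the class and field names are hypothetical illustrations, not identifiers defined by the patent:

```java
/**
 * Hypothetical index-only update request: it carries where the data already
 * sits in the shared subsystem plus its attributes, never the payload itself.
 */
final class UpdateRequest {
    final long storageLocation;  // location of the block in the first distributed storage subsystem
    final String name;           // attribute information: name of the data to be written
    final long size;             // attribute information: size of the data in bytes

    UpdateRequest(long storageLocation, String name, long size) {
        this.storageLocation = storageLocation;
        this.name = name;
        this.size = size;
    }
}
```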
  • Optionally, the method further includes: when a first data node fails, recovering the system files of the failed data node to obtain a restored data node, the first data node being any one of the data nodes; and mounting the first distributed storage subsystem to the restored data node.
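  • Because the blocks already live in the shared subsystem, recovery amounts to restoring the failed node's system files and remounting; no block data is copied. A minimal sketch under hypothetical interfaces:

```java
/** Hypothetical interfaces; the point is that recovery is a remount, not a data copy. */
interface DataNodeHost {
    DataNodeHost restoreSystemFiles(); // rebuild only the node's own system files
}

interface SharedSubsystem {
    void mountOn(DataNodeHost node);   // attach the shared storage as a virtual disk
}

final class NodeRecovery {
    static DataNodeHost recover(DataNodeHost failed, SharedSubsystem subsystem) {
        DataNodeHost restored = failed.restoreSystemFiles();
        subsystem.mountOn(restored); // blocks are already present in the shared subsystem
        return restored;
    }
}
```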
  • In a third aspect, an embodiment of the present invention further provides a data scheduling method, including:
  • a backup data node receives an update request, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written;
  • the backup data node searches the first distributed storage subsystem for the data to be written according to the storage location and the attribute information; and
  • when the data to be written is found, the backup data node saves the attribute information of the data to be written.
  • In a fourth aspect, an embodiment of the present invention provides a data node, including:
  • a receiving module, configured to receive a write operation instruction sent by the client, the write operation instruction including data to be written;
  • a write circuit, configured to write the data to be written into the first distributed storage subsystem; and
  • a sending module, configured to send an update request to the first backup data node, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.
  • Optionally, the attribute information of the data to be written includes the name and size of the data to be written.
  • In a fifth aspect, an embodiment of the present invention further provides a data node, including:
  • a receiving module, configured to receive an update request, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written;
  • a processing module, configured to search the first distributed storage subsystem for the data to be written according to the storage location and the attribute information; and
  • a storage module, configured to save the attribute information of the data to be written when the data to be written is found.
  • In the file storage system provided by the present invention, a plurality of data nodes share one distributed storage subsystem, and the distributed storage subsystem includes a primary storage device and at least one backup storage device, so that data can be mutually backed up between the storage devices.
  • When data is written through this file storage system, the primary data node writes the data to be written into the first distributed storage subsystem and then sends an update request to the first backup data node to notify it of the attribute information and storage location of the data to be written. Based on the update request, the first backup data node only needs to inspect the first distributed storage subsystem, confirm that the data to be written has already been written there, and save the attribute information carried in the update request; the write process for that node is then complete.
  • In the prior art, every data node must write the data to be written into its own corresponding distributed storage system. With the data scheduling method provided by the present invention, although multiple data nodes exist, only one data node actually performs the write into the first distributed storage subsystem; the remaining data nodes exploit data locality and simply inspect the shared distributed storage subsystem. This reduces the network transmission and storage time of data between the data nodes and thereby speeds up data writing.
  • FIG. 1 is a schematic structural diagram of an HDFS in the prior art
  • FIG. 2 is a schematic structural diagram of an HDFS according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a data scheduling method according to an embodiment of the present invention;
  • FIG. 4 is a schematic flowchart of another data scheduling method according to an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of a data node according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of another data node according to an embodiment of the present invention.
  • An embodiment of the invention provides a file storage system, shown in FIG. 2, comprising a client 21 and a server group 22.
  • The client includes a DistributedFileSystem module 211 and a file system data output stream (FSDataOutputStream) module 212.
  • The server group adopts a master-slave structure and includes a name node 221, a primary data node 222, and at least one backup data node 223. The primary data node and the at least one backup data node share a first distributed storage subsystem 224, which includes a primary storage device 2241 and at least one backup storage device 2242.
  • In physical implementation, each node in the server group (including the name node, the primary data node, and the backup data nodes) can correspond to one server.
  • The distributed storage subsystem is presented to each data node as a virtual device, appearing on each data node as a virtual disk; reading from and writing to the distributed subsystem are similar to reading from and writing to a local physical disk.
  • Physically, the distributed subsystem comprises multiple physical storage devices, for example multiple hard disks, whose data can be mutually backed up.
  • The first distributed storage subsystem can be shared among the data nodes.
  • Each data node's operation permission on the first distributed storage subsystem may be left unrestricted, or special restrictions may be imposed.
  • Optionally, the primary data node 222 is permitted to perform read and write operations on the first distributed storage subsystem 224, while the backup data nodes 223 are permitted to perform read operations only.
  • Since multiple data nodes share the first distributed storage subsystem, when a data node fails it suffices to recover the failed node's system files to obtain a restored data node and then mount the first distributed storage subsystem to the restored node; the data in the first distributed storage subsystem under that data node is thereby recovered without copying data, which improves recovery efficiency.
  • It should be noted that, in the file storage system provided by this embodiment, besides sharing the first distributed storage subsystem, each data node may additionally have its own independent distributed storage subsystem. Unlike the first distributed storage subsystem, which can be shared by all data nodes, a distributed storage subsystem corresponding to a single data node can be read and written only by that data node itself.
  • Based on the above file storage system, an embodiment of the present invention provides a data scheduling method, shown in FIG. 3, which includes the following steps:
  • 301: The primary data node receives a write operation instruction sent by the client. The write operation instruction includes the data to be written.
  • 302: The primary data node writes the data to be written into the first distributed storage subsystem.
  • In a specific implementation of this step, after receiving the data to be written, the primary data node forwards it to the distributed storage subsystem, triggering the subsystem's write procedure: the subsystem writes the data to the primary physical disk and sends a replication request to the backup disk, and the backup disk then copies and saves the primary disk's data, so that the backup disk and the primary physical disk back each other up.
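  • A minimal sketch of that primary-then-backup device path, using hypothetical types (the patent does not prescribe a programming interface):

```java
import java.util.List;

/** Hypothetical sketch of the subsystem's write path: primary disk first, then replicas. */
final class SubsystemWritePath {
    interface PhysicalDisk {
        long append(byte[] block);                       // returns the storage location
        void replicateFrom(PhysicalDisk source, long location, int length);
    }

    private final PhysicalDisk primary;
    private final List<PhysicalDisk> backups;            // at least one backup device

    SubsystemWritePath(PhysicalDisk primary, List<PhysicalDisk> backups) {
        this.primary = primary;
        this.backups = backups;
    }

    /** Write to the primary physical disk, then ask each backup disk to copy it. */
    long write(byte[] block) {
        long location = primary.append(block);
        for (PhysicalDisk backup : backups) {
            backup.replicateFrom(primary, location, block.length);
        }
        return location;
    }
}
```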
  • 303: The primary data node sends a notification message to the name node and a response message to the client.
  • Once the primary data node has successfully written the data into the distributed storage subsystem, its write is complete: it sends a notification message to the name node to report that the data to be written has been written, and a response message to the client to report that the data has been written as instructed.
  • 304: The primary data node sends an update request to the first backup data node.
  • This is one difference between the data scheduling method of this embodiment and the prior art. In the prior art, after writing the data into the distributed storage system, the primary data node sends a write request carrying the data to be written to the first backup data node; upon receiving it, the first backup data node must write that data into its own independent distributed storage subsystem before its write process is considered complete.
  • In the present invention, by contrast, the update request includes only the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written, such as its name and size. The information in the update request is effectively index information for the data to be written and does not include the data itself.
  • In this way, the first backup data node can locate the data to be written at the indicated location based on the update request alone, eliminating both the data transfer between the primary data node and the first backup data node and the first backup data node's entire process of writing the data.
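  • Putting steps 302 and 304 together, the primary node's role might be sketched as follows; all names are hypothetical:

```java
/** Hypothetical sketch of the primary data node's role in the proposed write path. */
final class PrimaryNodeFlow {
    interface SharedSubsystem { long write(byte[] data); }            // the shared first subsystem
    interface BackupNodeClient { void sendUpdate(long location, String name, long size); }

    private final SharedSubsystem shared;
    private final BackupNodeClient firstBackup;

    PrimaryNodeFlow(SharedSubsystem shared, BackupNodeClient firstBackup) {
        this.shared = shared;
        this.firstBackup = firstBackup;
    }

    void onWriteInstruction(String name, byte[] data) {
        // The only real block write in the system: into the shared subsystem
        long location = shared.write(data);
        // Only index information travels to the backup node, never the payload
        firstBackup.sendUpdate(location, name, data.length);
    }
}
```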
  • 305: The backup data node receives the update request.
  • The update request includes the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written.
  • The backup data node referred to in this step includes the first backup data node mentioned in the preceding steps. If the data to be written needs to be backed up across only two data nodes, the file storage system includes just the primary data node and the first backup data node; if backup across more than two data nodes is required (typically three), the file storage system may further include a second backup data node, a third backup data node, and so on.
  • In the latter case, pipelined processing is used: once one data node has finished writing, it sends an update request to the next. For example, after the primary data node writes the data to be written into the distributed storage system, it sends an update request to the first backup data node; after the first backup data node completes its write, it sends an update request to the second backup data node, and so on.
  • 306: The backup data node searches the first distributed storage subsystem for the data to be written, according to the storage location of the data to be written in the first distributed storage subsystem and its attribute information.
  • 307: When the data to be written is found, the backup data node saves the attribute information of the data to be written.
  • 308: The backup data node sends a notification message to the name node and a response message to the client.
  • The backup data node only needs to save the attribute information of the data to be written for its write process to be complete. Once it has done so, it sends a notification message to the name node and a response message to the client.
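  • Steps 305 to 308, including the pipelined forwarding to any further backup nodes, might be sketched as follows; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a backup data node handling an update request (steps 305-308). */
final class BackupNodeFlow {
    interface SharedSubsystem {
        /** Read-only lookup of a block by location and expected name/size. */
        Optional<byte[]> find(long location, String name, long size);
    }
    interface NextBackup { void sendUpdate(long location, String name, long size); }

    private final SharedSubsystem shared;
    private final NextBackup next;                       // null for the last node in the pipeline
    private final Map<String, Long> attributes = new HashMap<>();

    BackupNodeFlow(SharedSubsystem shared, NextBackup next) {
        this.shared = shared;
        this.next = next;
    }

    void onUpdateRequest(long location, String name, long size) {
        // 306: look the data up in the shared subsystem; nothing is copied
        if (shared.find(location, name, size).isPresent()) {
            attributes.put(name, size);                  // 307: saving attributes completes the write
            if (next != null) {
                next.sendUpdate(location, name, size);   // pipeline the update onward
            }
            // 308: ... notify the name node and answer the client ...
        }
    }
}
```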
  • In the file storage system provided by this embodiment of the present invention, a plurality of data nodes share one distributed storage subsystem, and the distributed storage subsystem includes a primary storage device and at least one backup storage device, so that data can be mutually backed up between the storage devices.
  • When data is written through this file storage system, the primary data node writes the data to be written into the first distributed storage subsystem and then sends an update request to the first backup data node to notify it of the attribute information and storage location of the data to be written. The first backup data node only needs to inspect the first distributed storage subsystem according to the update request, confirm that the data to be written has already been written there, and save the attribute information carried in the update request to complete its write process.
  • Whereas in the prior art every data node must write the data into its own corresponding distributed storage system, in the data scheduling method provided by the present invention only one data node actually writes the data into the first distributed storage subsystem, even though multiple data nodes exist; the remaining data nodes exploit data locality and inspect the shared distributed storage subsystem, reducing network transmission and storage time between data nodes and speeding up data writing.
  • Moreover, in the prior art, data to be written must travel both between the multiple data nodes and, for backup, between the multiple storage devices inside each distributed storage subsystem, which readily creates network hotspots and bottlenecks, and every data node must read and store the full data. In this embodiment, the remaining data nodes merely exploit data locality and access the data locally through the shared distributed storage system, which reduces the amount of data transmitted between data nodes and the transmission overhead; in addition, since only the first data node reads and writes the data to be written and the data is saved in the first distributed storage system, the remaining data nodes need not store it, reducing the occupied storage space and saving storage overhead for the server group.
  • In connection with practical applications, an embodiment of the present invention further provides a specific implementation of data scheduling, shown in FIG. 4, including the following steps:
  • 401: The client's DistributedFileSystem module initiates an RPC request to the name node. The Remote Procedure Call (RPC) request is used to create a new file in the file system namespace.
  • 402: After receiving the RPC request, the name node creates the new file.
  • It should be noted that, before performing this step, the name node first checks whether the file to be created already exists and whether the creator has permission to operate. This step and the subsequent steps are executed only if the file does not yet exist and the creator is authorized; otherwise the client throws an exception and the file read/write process ends.
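  • A minimal sketch of that pre-creation check, under hypothetical types:

```java
/** Hypothetical sketch of the name node's pre-creation check in step 402. */
final class NameNodeCheck {
    interface Namespace {
        boolean exists(String path);
        boolean mayCreate(String creator, String path);
    }

    /** The file must not exist yet and the creator must be authorized; otherwise
        the client throws an exception and the read/write process ends. */
    static boolean canCreate(Namespace ns, String creator, String path) {
        return !ns.exists(path) && ns.mayCreate(creator, path);
    }
}
```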
  • The following steps 403 to 407 constitute the data writing process.
  • 403: The client's DFSOutputStream module divides the data into blocks, writes them into a data queue, and asks the name node to allocate data nodes.
  • The data queue is read by the DataStreamer submodule of the DFSOutputStream module. The data nodes are used to store the data blocks, and the allocated data nodes are placed in a pipeline.
  • 404: The client's DataStreamer submodule writes the data block to the primary data node in the pipeline; the primary data node is the first data node in the pipeline.
  • The client's DFSOutputStream module keeps each sent data block in an ack queue, waiting for every data node in the pipeline to confirm that the data has been written successfully.
  • The primary data node then triggers the write process of the distributed storage subsystem: data is first written to the primary physical disk, and a replication request is sent to the standby disks, so that multiple backup copies (three by default) are written on the distributed storage device.
  • Unlike in the prior art, the primary data node no longer needs to send the data block to the backup data node; it sends only an update request, thereby entering the "update layer" process.
  • The backup data node referred to in this step is the first backup data node; FIG. 4 also shows a second backup data node, to which the first backup data node likewise sends an update request after it completes its own data write operation.
  • Besides the processing of the following steps, the update layer may include other processing procedures.
  • After receiving the update request, the backup data node refreshes its view of the shared distributed storage subsystem according to the message content in the update request; once the data block's information can be read, it saves the data block's attribute information, which completes the writing of that data block. When the writing of the data block is complete, the backup data node sends a notification message to the name node and returns a response message to the client.
  • The ack queue then removes the corresponding data packet.
  • The DataStreamer flushes the remaining data packets into the pipeline and waits for their acks; after receiving the last ack, it notifies the name node that the writing is complete.
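  • A minimal sketch of the ack-queue bookkeeping described above, with hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the client-side ack queue described above. */
final class AckQueue {
    private final Deque<byte[]> pending = new ArrayDeque<>();

    /** A packet was pushed into the pipeline; hold it until every node acks. */
    void onPacketSent(byte[] packet) {
        pending.addLast(packet);
    }

    /** All data nodes in the pipeline acknowledged the oldest packet: drop it. */
    void onAck() {
        pending.removeFirst();
    }

    /** After the last ack the DataStreamer notifies the name node that writing is done. */
    boolean allAcked() {
        return pending.isEmpty();
    }
}
```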
  • When the client has completed the write operations for all data blocks, it calls the stream's close method to close the write stream.
  • As shown in FIG. 5, an embodiment of the present invention provides a data node, including:
  • a receiving module 501, configured to receive a write operation instruction sent by the client, the write operation instruction including data to be written;
  • a write circuit 502, configured to write the data to be written into the first distributed storage subsystem; and
  • a sending module 503, configured to send an update request to the first backup data node, and further to send a notification message to the name node and a response message to the client.
  • Optionally, the attribute information of the data to be written includes the name and size of the data to be written.
  • An embodiment of the present invention further provides a data node, shown in FIG. 6, comprising:
  • a receiving module 601, configured to receive an update request, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written;
  • a processing module 602, configured to search the first distributed storage subsystem for the data to be written according to the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written;
  • a storage module 603, configured to save the attribute information of the data to be written when the data to be written is found; and
  • a sending module 604, configured to send a notification message to the name node and a response message to the client.
  • In the data nodes provided by these embodiments, a plurality of data nodes share one distributed storage subsystem, and the distributed storage subsystem includes a primary storage device and at least one backup storage device, so that data can be mutually backed up between the storage devices.
  • The primary data node writes the data to be written into the first distributed storage subsystem and then sends an update request to the first backup data node to notify it of the attribute information and storage location of the data to be written; based on the update request, the first backup data node only needs to inspect the first distributed storage subsystem, confirm that the data has already been written there, and save the attribute information carried in the update request to complete its write process.
  • Whereas in the prior art every data node must write the data into its own corresponding distributed storage system, here only one data node actually writes the data into the first distributed storage subsystem; the remaining data nodes use data locality to inspect the shared distributed storage subsystem, reducing network transmission and storage time between data nodes and speeding up data writing.
  • The present invention can be implemented by means of software plus the necessary general-purpose hardware, or of course entirely by hardware, but in many cases the former is the better implementation.
  • Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes over the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer's floppy disk, hard disk, or optical disk, the software product including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of file systems. Disclosed are a file storage system, a data scheduling method, and a data node. The data scheduling method comprises: a primary data node receives a write operation instruction sent by a client, the write operation instruction comprising data to be written; the primary data node writes the data to be written into a first distributed storage subsystem, sends a notification message to a name node, and sends a response message to the client; and the primary data node sends an update request to a first backup data node, the update request comprising a storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written. The present invention is applicable to the process of storing a file.

Description

File storage system, data scheduling method, and data node

Technical Field

The present invention relates to the field of file systems, and in particular, to a file storage system, a data scheduling method, and a data node.

Background

The Hadoop Distributed File System (HDFS) is a distributed file system suitable for running on commodity hardware. It offers high scalability, allowing capacity to be expanded dynamically without downtime, and high reliability, providing automatic data detection and replication as well as high-throughput access that eliminates access bottlenecks.

In the prior art, the system architecture of HDFS is shown in FIG. 1 and includes a client 11 and a server group 12. The client 11 includes a DistributedFileSystem module 111 and a file system data output stream (FSDataOutputStream) module 112. The server group adopts a master-slave structure and consists of a name node (NN) 121 and a plurality of data nodes (DN) 122. The name node 121 is a master server that manages the file system namespace and regulates client access to files; the data nodes 122 store data. Generally, one data node corresponds to one server, and each data node corresponds to its own distributed storage subsystem, using distributed storage.

Before data is written with the above HDFS system, the client first initiates an RPC request to the remote NN node through the DistributedFileSystem module; the NN node creates a new file in the file system namespace; and the DistributedFileSystem module returns a DFSOutputStream to the HDFS client, after which the client starts writing data. As the client writes, the DFSOutputStream divides the data into blocks and writes them into a data queue. The data queue is read by the DataStreamer, which asks the name node to allocate data nodes for storing the data blocks (each data block corresponds to three data nodes by default). The DataStreamer writes the data into the allocated data nodes sequentially through a pipeline, so that data blocks are mutually backed up among multiple data nodes. For example, a data block is written to the first data node; the first data node sends it to the second data node; and the second data node sends it to the third data node. In addition, each data node corresponds to a distributed storage device, which in practice consists of a plurality of physical disks. The data node forwards the written block data to the distributed storage device through I/O, triggering the device's write process: the distributed storage device writes the data to the primary physical disk and simultaneously sends replication requests to the standby physical disks, so that multiple backup copies (three by default) are written on the distributed storage device. When the client finishes writing data, the DataStreamer closes the write stream and notifies the name node that the data has been written.

With this existing file read/write method, the write operation for the next data block proceeds only after all data nodes have finished writing the current data, so the data writing speed is slow.
Summary of the Invention

The present invention provides a file storage system, a data scheduling method, and a data node that can speed up data writing.

To achieve the above object, the present invention adopts the following technical solutions.

In a first aspect, an embodiment of the present invention provides a file storage system whose server side includes:

a name node, a primary data node, and at least one backup data node;

the primary data node and the at least one backup data node sharing a first distributed storage subsystem, the first distributed storage subsystem including a primary storage device and at least one backup storage device;

where the primary data node is configured to receive a write operation instruction sent by the client, the write operation instruction including data to be written; to write the data to be written into the first distributed storage subsystem; and to send an update request to the first backup data node, the update request including the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written;

and the backup data node is configured to receive the update request; to search the first distributed storage subsystem for the data to be written according to the storage location and the attribute information carried in the update request; and, when the data to be written is found, to save the attribute information of the data to be written.

With reference to the first aspect, in a first implementation of the first aspect, the primary data node's operation permission on the first distributed storage subsystem allows read and write operations, while the backup data node's operation permission on the first distributed storage subsystem allows read operations only.

In a second aspect, an embodiment of the present invention further provides a data scheduling method applied to the file storage system of the first aspect, the method including:

receiving, by the primary data node, a write operation instruction sent by the client, the write operation instruction including data to be written;

writing, by the primary data node, the data to be written into the first distributed storage subsystem; and

sending, by the primary data node, an update request to the first backup data node, the update request including the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.

With reference to the second aspect, in a first implementation of the second aspect, the attribute information of the data to be written includes the name and size of the data to be written.

With reference to the second aspect or the first implementation of the second aspect, in a second implementation of the second aspect, the method further includes:

when a first data node fails, recovering the system files of the failed data node to obtain a restored data node, the first data node being any one of the data nodes; and

mounting the first distributed storage subsystem to the restored data node.

In a third aspect, an embodiment of the present invention further provides a data scheduling method, including:

receiving, by a backup data node, an update request, the update request including the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written;

searching, by the backup data node, the first distributed storage subsystem for the data to be written according to the storage location and the attribute information; and

saving, by the backup data node, the attribute information of the data to be written when the data to be written is found.

In a fourth aspect, an embodiment of the present invention provides a data node, including:

a receiving module, configured to receive a write operation instruction sent by the client, the write operation instruction including data to be written;

a write circuit, configured to write the data to be written into the first distributed storage subsystem; and

a sending module, configured to send an update request to the first backup data node, the update request including the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.

With reference to the fourth aspect, in a first implementation of the fourth aspect, the attribute information of the data to be written includes the name and size of the data to be written.

In a fifth aspect, an embodiment of the present invention further provides a data node, including:

a receiving module, configured to receive an update request, the update request including the storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written;

a processing module, configured to search the first distributed storage subsystem for the data to be written according to the storage location and the attribute information; and

a storage module, configured to save the attribute information of the data to be written when the data to be written is found.

In the file storage system provided by the present invention, a plurality of data nodes share one distributed storage subsystem, and the distributed storage subsystem includes a primary storage device and at least one backup storage device, so that data can be mutually backed up between the storage devices. When data is written through this file storage system, the primary data node writes the data to be written into the first distributed storage subsystem and then sends an update request to the first backup data node to notify it of the attribute information and storage location of the data to be written. Based on the update request, the first backup data node only needs to inspect the first distributed storage subsystem, confirm that the data to be written has already been written there, and save the attribute information carried in the update request; the write process for the data to be written is then complete. Whereas in the prior art every data node must write the data to be written into its own corresponding distributed storage system, in the data scheduling method provided by the present invention only one data node actually writes the data into the first distributed storage subsystem, even though multiple data nodes exist; the remaining data nodes exploit data locality and inspect the shared distributed storage subsystem, which reduces the network transmission and storage time of data between the data nodes and thereby speeds up data writing.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of an HDFS in the prior art;

FIG. 2 is a schematic structural diagram of an HDFS according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of a data scheduling method according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of another data scheduling method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a data node according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another data node according to an embodiment of the present invention.

Detailed Description
The technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a file storage system, shown in FIG. 2, comprising a client 21 and a server group 22. The client includes a DistributedFileSystem module 211 and a file system data output stream (FSDataOutputStream) module 212. The server group adopts a master-slave structure and includes a name node 221, a primary data node 222, and at least one backup data node 223. The primary data node and the at least one backup data node share a first distributed storage subsystem 224, which includes a primary storage device 2241 and at least one backup storage device 2242.

In physical implementation, each node in the server group (including the name node, the primary data node, and the backup data nodes) can correspond to one server. The distributed storage subsystem is presented to each data node as a virtual device, appearing on each data node as a virtual disk; reading from and writing to the distributed subsystem are similar to reading from and writing to a local physical disk. Physically, the distributed subsystem comprises multiple physical storage devices, for example multiple hard disks, whose data can be mutually backed up.

The first distributed storage subsystem can be shared among the data nodes. Each data node's operation permission on the first distributed storage subsystem may be left unrestricted, or special restrictions may be imposed. Optionally, the primary data node 222 is permitted to perform read and write operations on the first distributed storage subsystem 224, while the backup data nodes 223 are permitted to perform read operations only.

Since multiple data nodes share the first distributed storage subsystem, when a data node fails it suffices to recover the failed node's system files to obtain a restored data node and then mount the first distributed storage subsystem to the restored node; the data in the first distributed storage subsystem under that data node is thereby recovered without copying data, which improves recovery efficiency.

It should be noted that, in the file storage system provided by this embodiment, besides sharing the first distributed storage subsystem, each data node may additionally have its own independent distributed storage subsystem. Unlike the first distributed storage subsystem, which can be shared by all data nodes, a distributed storage subsystem corresponding to a single data node can be read and written only by that data node itself.
Based on the above file storage system, an embodiment of the present invention provides a data scheduling method, shown in FIG. 3, which includes the following steps.

301: The primary data node receives a write operation instruction sent by the client. The write operation instruction includes the data to be written.

302: The primary data node writes the data to be written into the first distributed storage subsystem.

In a specific implementation of this step, after receiving the data to be written, the primary data node forwards it to the distributed storage subsystem, triggering the subsystem's write procedure: the subsystem writes the data to the primary physical disk and sends a replication request to the backup disk, and the backup disk then copies and saves the primary disk's data, so that the backup disk and the primary physical disk back each other up.

303: The primary data node sends a notification message to the name node and a response message to the client.

Once the primary data node has successfully written the data into the distributed storage subsystem, its write is complete: it sends a notification message to the name node to report that the data to be written has been written, and a response message to the client to report that the data has been written as instructed.

304: The primary data node sends an update request to the first backup data node.

This is one difference between the data scheduling method of this embodiment and the prior art. In the prior art, after writing the data into the distributed storage system, the primary data node sends a write request carrying the data to be written to the first backup data node; upon receiving it, the first backup data node must write that data into its own independent distributed storage subsystem before its write process is considered complete.

In the present invention, by contrast, the update request includes only the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written, such as its name and size. The information in the update request is effectively index information for the data to be written and does not include the data itself. In this way, the first backup data node can locate the data to be written at the indicated location based on the update request alone, eliminating both the data transfer between the primary data node and the first backup data node and the first backup data node's entire process of writing the data.

305: The backup data node receives the update request. The update request includes the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written.

The backup data node referred to in this step includes the first backup data node mentioned in the preceding steps. If the data to be written needs to be backed up across only two data nodes, the file storage system includes just the primary data node and the first backup data node. If backup across more than two data nodes is required (typically three), the file storage system may further include a second backup data node, a third backup data node, and so on. In that case, pipelined processing is used: once one data node has finished writing, it sends an update request to the next. For example, after the primary data node writes the data to be written into the distributed storage system, it sends an update request to the first backup data node; after the first backup data node completes its write, it sends an update request to the second backup data node, and so on.

306: The backup data node searches the first distributed storage subsystem for the data to be written, according to the storage location of the data to be written in the first distributed storage subsystem and its attribute information.

307: When the data to be written is found, the backup data node saves the attribute information of the data to be written.

308: The backup data node sends a notification message to the name node and a response message to the client.

The backup data node only needs to save the attribute information of the data to be written for its write process to be complete. Once the backup data node has completed the data write, it sends a notification message to the name node and a response message to the client.

It should be noted that FIG. 3 shows the specific data scheduling procedure only for a file storage system comprising a primary data node and a first backup data node; when the file storage system further includes other backup data nodes, those nodes operate in the same way as the first backup data node, which is not shown in FIG. 3.
In the file storage system provided by this embodiment of the present invention, a plurality of data nodes share one distributed storage subsystem, and the distributed storage subsystem includes a primary storage device and at least one backup storage device, so that data can be mutually backed up between the storage devices. When data is written through this file storage system, the primary data node writes the data to be written into the first distributed storage subsystem and then sends an update request to the first backup data node to notify it of the attribute information and storage location of the data to be written. The first backup data node only needs to inspect the first distributed storage subsystem according to the update request, confirm that the data to be written has already been written there, and save the attribute information carried in the update request to complete its write process. Whereas in the prior art every data node must write the data to be written into its own corresponding distributed storage system, in the data scheduling method provided by the present invention only one data node actually writes the data into the first distributed storage subsystem, even though multiple data nodes exist; the remaining data nodes exploit data locality and inspect the shared distributed storage subsystem, reducing the network transmission and storage time of data between the data nodes and thereby speeding up data writing.

Furthermore, in the prior art, when data backup is implemented through multiple data nodes, the data to be written must be transmitted between the multiple data nodes and, because backup is also implemented inside each distributed storage subsystem, between the multiple storage devices within the subsystem as well; this readily creates network hotspots and bottlenecks. Every data node must also read and store all of the data to be written, occupying considerable storage space and incurring a large storage overhead.

In this embodiment of the present invention, by contrast, although multiple mutually backed-up data nodes exist, only one data node actually performs the write; the remaining data nodes merely exploit data locality and access the data locally through the shared distributed storage system. This reduces the amount of data transmitted between data nodes and the transmission overhead. In addition, since only the first data node reads and writes the data to be written and the data is saved in the first distributed storage system, the remaining data nodes need not store the data, so the occupied storage space is reduced, saving storage overhead for the server group.
With reference to a practical application, an embodiment of the present invention further provides a specific data scheduling procedure, shown in FIG. 4, which includes the following steps.
401: The client's DistributedFileSystem module initiates an RPC request to the name node.
The remote procedure call protocol (RPC) request is used to create a new file in the namespace of the file system.
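Steps 401 and 402 mirror the standard HDFS client write path, so the stock Hadoop client API can illustrate them. This is a sketch of the general mechanism only, not of the patented system itself; the namenode URI and file path are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // For an hdfs:// URI this returns the client's DistributedFileSystem.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // create() issues the RPC of step 401: the name node allocates the
        // new file in its namespace and hands back a write stream.
        FSDataOutputStream out = fs.create(new Path("/demo/blocks.bin"));
        out.close();
        fs.close();
    }
}
```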
402: After receiving the RPC request, the name node creates the new file.
It should be noted that before performing this step, the name node first checks whether the file to be created already exists and whether the creator has permission to perform the operation. Only if the file does not already exist and the creator is authorized are this step and the subsequent steps executed; otherwise, the client throws an exception and the file read/write process ends.
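A minimal sketch of those pre-creation checks, assuming a hypothetical Namespace abstraction for the name node's metadata (the text describes only the checks, not the data structures behind them):

```java
// Hypothetical view of the name node's metadata.
interface Namespace {
    boolean exists(String path);
    boolean mayCreate(String creator, String path);
    void create(String path, String creator);
}

public class NameNodeChecks {
    private final Namespace namespace;

    public NameNodeChecks(Namespace namespace) {
        this.namespace = namespace;
    }

    // Step 402 is executed only if both checks pass; otherwise the failure
    // is reported back and surfaces as an exception on the client.
    public void createFile(String path, String creator) {
        if (namespace.exists(path)) {
            throw new IllegalStateException("file already exists: " + path);
        }
        if (!namespace.mayCreate(creator, path)) {
            throw new SecurityException("creator lacks permission: " + creator);
        }
        namespace.create(path, creator);
    }
}
```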
Steps 403 to 407 below constitute the data writing process.
403: The client's DFSOutputStream module splits the data into blocks, writes them into a data queue, and notifies the name node to allocate data nodes.
The data queue is read by the Data Streamer submodule of the DFSOutputStream module. The data nodes are used to store the data blocks, and the allocated data nodes are placed in a pipeline.
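The block-splitting of step 403 can be pictured as follows. The block size and the queue wiring are illustrative assumptions, not the actual DFSOutputStream internals:

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DataQueueSketch {
    // Illustrative size only; 128 MiB is a common HDFS block-size default,
    // but the text does not fix a value.
    static final int BLOCK_SIZE = 128 * 1024 * 1024;

    final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();

    // Step 403: split the payload into blocks and enqueue them; a Data
    // Streamer thread would drain this queue into the pipeline.
    void enqueueBlocks(byte[] payload) {
        for (int off = 0; off < payload.length; off += BLOCK_SIZE) {
            int end = Math.min(off + BLOCK_SIZE, payload.length);
            dataQueue.add(Arrays.copyOfRange(payload, off, end));
        }
    }
}
```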
404: The client's Data Streamer submodule writes the data block to the primary data node in the pipeline.
The primary data node is the first data node in the pipeline.
At the same time, the client's DFSOutputStream module maintains an ack queue for the data blocks that have been sent out, waiting for every data node in the pipeline to report that the data has been written successfully.
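A sketch of that ack-queue bookkeeping, again as an illustrative assumption rather than the real DFSOutputStream code: a sent packet is parked on the ack queue and discarded only when the whole pipeline has confirmed it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AckQueueSketch {
    record Packet(long seq, byte[] body) {}

    final BlockingQueue<Packet> ackQueue = new LinkedBlockingQueue<>();

    // A packet moves onto the ack queue as soon as it has been sent.
    void onSent(Packet p) {
        ackQueue.add(p);
    }

    // It is removed only when every node in the pipeline has acknowledged it.
    void onPipelineAck(long seq, int acksReceived, int pipelineSize) {
        Packet head = ackQueue.peek();
        if (head != null && head.seq() == seq && acksReceived == pipelineSize) {
            ackQueue.poll();
        }
    }
}
```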
405: The primary data node triggers the write procedure of the distributed storage subsystem.
In the specific implementation of this step, the data is first written to the primary physical disk while replication requests are sent to the backup disks, so that multiple backup copies of the data (three copies by default) are written across the distributed storage devices.
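A sketch of step 405 under a hypothetical Disk abstraction; only the write-primary-then-replicate order and the default copy count of three come from the text.

```java
import java.util.List;

public class ReplicatedWriteSketch {
    // Hypothetical abstraction for the storage devices inside the
    // first distributed storage subsystem.
    interface Disk {
        void write(String location, byte[] block);
    }

    static final int DEFAULT_COPIES = 3; // default copy count named in the text

    // Step 405: write the primary physical disk first, then fan replication
    // requests out to the backup disks until the copy count is reached.
    void writeWithReplication(Disk primary, List<Disk> backups, String loc, byte[] block) {
        primary.write(loc, block);
        backups.stream()
               .limit(DEFAULT_COPIES - 1)
               .forEach(d -> d.write(loc, block));
    }
}
```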
406: After writing the data into the distributed storage subsystem, the primary data node sends a notification message to the name node, a response message to the client, and an update request to the backup data node.
In the specific implementation of this step, the primary data node no longer needs to send the data block to the backup data node; it sends only an update request, which enters the processing of the "update layer".
It should be noted that the backup data node referred to in this step is the first backup data node. FIG. 4 also shows a second backup data node; after completing its own write operation for the data, the first backup data node in turn sends an update request to the second backup data node.
It should also be noted that, besides the processing in the following step, the update layer may include other processing.
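From the primary data node's side, step 406 reduces to forwarding metadata instead of data. The sketch below reuses the hypothetical UpdateRequest/FileAttributes shapes from the sketch after step 308, with illustrative stub endpoints:

```java
public class PrimaryUpdateSender {
    record FileAttributes(String name, long size) {}
    record UpdateRequest(String storageLocation, FileAttributes attributes) {}

    // Step 406: once the block is durable in the distributed storage
    // subsystem, forward only its metadata, never the block itself.
    void onBlockWritten(String storageLocation, String name, long size) {
        notifyNameNode(name);
        ackClient(name);
        sendToFirstBackup(new UpdateRequest(storageLocation, new FileAttributes(name, size)));
    }

    private void notifyNameNode(String name)          { /* notification message */ }
    private void ackClient(String name)               { /* response message */ }
    private void sendToFirstBackup(UpdateRequest req) { /* enters the "update layer" */ }
}
```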
407: After receiving the update request, the backup data node refreshes its view of the shared distributed storage subsystem according to the message content in the update request. When it reads the information of the data block, it saves the data block's attribute information and the like, which completes the write process for the data block. Once the write is complete, it sends a notification message to the name node and returns a response message to the client.
When all the data nodes in the pipeline have completed writing the data, the ack queue removes the corresponding packet.
This process repeats: the Data Streamer flushes the remaining packets into the pipeline and waits for the ack messages; after receiving the last ack, it notifies the metadata node that writing is complete.
When the client has finished writing all the data blocks, it calls the stream's close method to close the write stream.
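Continuing the stock Hadoop API example from step 401, writing and closing the stream looks as follows; the path is again a placeholder:

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAndCloseExample {
    // fs is an open handle as obtained in the step-401 sketch.
    static void writeAll(FileSystem fs, byte[] payload) throws Exception {
        try (FSDataOutputStream out = fs.create(new Path("/demo/blocks.bin"))) {
            out.write(payload); // packets drain through the data queue and pipeline
        } // close() flushes the last packets, waits for the final acks, and completes the file
    }
}
```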
As a specific application of the data scheduling method, an embodiment of the present invention provides a data node, shown in FIG. 5, including:
a receiving module 501, configured to receive a write operation instruction sent by the client, where the write operation instruction includes the data to be written;
a write circuit 502, configured to write the data to be written into the first distributed storage subsystem; and
a sending module 503, configured to send a notification message to the name node, send a response message to the client, and send an update request to the first backup data node, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written.
The attribute information of the data to be written includes the name and the size of the data to be written.
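A skeleton mapping these modules onto hypothetical Java interfaces; the module numbers follow FIG. 5, and everything else is assumed for illustration:

```java
public class PrimaryNodeSkeleton {
    // Module 502: writes the payload and reports where it landed.
    interface WriteCircuit {
        String write(byte[] data); // returns the storage location
    }

    // Module 503: all outbound messaging.
    interface Sender {
        void notifyNameNode(String name);
        void ackClient(String name);
        void updateBackup(String location, String name, long size);
    }

    record WriteOp(String name, byte[] data) {}

    private final WriteCircuit writeCircuit;
    private final Sender sender;

    PrimaryNodeSkeleton(WriteCircuit writeCircuit, Sender sender) {
        this.writeCircuit = writeCircuit;
        this.sender = sender;
    }

    // Module 501's role: accept the write operation instruction and drive the rest.
    public void onWriteOp(WriteOp op) {
        String location = writeCircuit.write(op.data());
        sender.notifyNameNode(op.name());
        sender.ackClient(op.name());
        sender.updateBackup(location, op.name(), op.data().length);
    }
}
```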
An embodiment of the present invention further provides a data node, as shown in FIG. 6, including:
a receiving module 601, configured to receive an update request, where the update request includes the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written;
a processing module 602, configured to search the first distributed storage subsystem for the data to be written according to the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written;
a storage module 603, configured to save the attribute information of the data to be written when the data to be written is found; and
a sending module 604, configured to send a notification message to the name node and a response message to the client.
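The modules of FIG. 6 admit the same kind of skeleton; the handling logic itself was sketched after step 308 above, so only the module decomposition is shown here, with hypothetical names throughout:

```java
public class BackupNodeSkeleton {
    record Update(String storageLocation, String name, long size) {}

    interface Receiver  { Update receive(); }                        // module 601
    interface Processor { boolean find(Update req); }                // module 602
    interface Store     { void saveAttributes(String n, long s); }   // module 603
    interface Sender    { void notifyNameNode(); void ackClient(); } // module 604

    void handleOne(Receiver rx, Processor proc, Store store, Sender tx) {
        Update req = rx.receive();
        if (proc.find(req)) { // lookup in the shared subsystem
            store.saveAttributes(req.name(), req.size());
            tx.notifyNameNode();
            tx.ackClient();
        }
    }
}
```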
From the description of the foregoing implementations, a person skilled in the art will clearly understand that the present invention may be implemented by software plus the necessary general-purpose hardware, or certainly by hardware, although in many cases the former is the better implementation. Based on such an understanding, the technical solutions of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, hard disk, or optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention.

Claims (9)

  1. A file storage system, wherein the server side of the file storage system comprises:
    a name node, a primary data node, and at least one backup data node;
    wherein the primary data node and the at least one backup data node share a first distributed storage subsystem, and the first distributed storage subsystem comprises one primary storage device and at least one backup storage device;
    the primary data node is configured to: receive a write operation instruction sent by a client, where the write operation instruction includes data to be written; write the data to be written into the first distributed storage subsystem; and send an update request to a first backup data node, where the update request includes a storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written; and
    the backup data node is configured to: receive the update request; search the first distributed storage subsystem for the data to be written according to the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written that are carried in the update request; and, when the data to be written is found, save the attribute information of the data to be written.
  2. The file storage system according to claim 1, wherein the operation permission of the primary data node on the first distributed storage subsystem allows both read and write operations, and the operation permission of the backup data node on the first distributed storage subsystem allows read operations only.
  3. A data scheduling method, applied to the file storage system according to claim 1 or 2, the method comprising:
    receiving, by a primary data node, a write operation instruction sent by a client, where the write operation instruction includes data to be written;
    writing, by the primary data node, the data to be written into a first distributed storage subsystem; and
    sending, by the primary data node, an update request to a first backup data node, where the update request includes a storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.
  4. The method according to claim 3, wherein the attribute information of the data to be written includes a name and a size of the data to be written.
  5. The method according to claim 3 or 4, further comprising:
    when a first data node fails, restoring the system files of the failed data node to obtain a restored data node, where the first data node is any one of all the data nodes; and
    mounting the first distributed storage subsystem to the restored data node.
  6. A data scheduling method, applied to the file storage system according to claim 1 or 2, comprising:
    receiving, by a backup data node, an update request, where the update request includes a storage location of data to be written in a first distributed storage subsystem and attribute information of the data to be written;
    searching, by the backup data node, the first distributed storage subsystem for the data to be written according to the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written; and
    when the data to be written is found, saving, by the backup data node, the attribute information of the data to be written.
  7. A data node, comprising:
    a receiving module, configured to receive a write operation instruction sent by a client, where the write operation instruction includes data to be written;
    a write circuit, configured to write the data to be written into a first distributed storage subsystem; and
    a sending module, configured to send an update request to a first backup data node, where the update request includes a storage location of the data to be written in the first distributed storage subsystem and attribute information of the data to be written.
  8. The data node according to claim 7, wherein the attribute information of the data to be written includes a name and a size of the data to be written.
  9. A data node, comprising:
    a receiving module, configured to receive an update request, where the update request includes a storage location of data to be written in a first distributed storage subsystem and attribute information of the data to be written;
    a processing module, configured to search the first distributed storage subsystem for the data to be written according to the storage location of the data to be written in the first distributed storage subsystem and the attribute information of the data to be written; and
    a storage module, configured to save the attribute information of the data to be written when the data to be written is found.
PCT/CN2016/095532 2015-12-11 2016-08-16 File storage system, data scheduling method, and data node WO2017096942A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510922155.5 2015-12-11
CN201510922155.5A CN106873902B (en) 2015-12-11 2015-12-11 File storage system, data scheduling method and data node

Publications (1)

Publication Number Publication Date
WO2017096942A1 true WO2017096942A1 (en) 2017-06-15

Family

ID=59012648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095532 WO2017096942A1 (en) 2015-12-11 2016-08-16 File storage system, data scheduling method, and data node

Country Status (2)

Country Link
CN (1) CN106873902B (en)
WO (1) WO2017096942A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110337633A (en) * 2017-06-30 2019-10-15 华为技术有限公司 A kind of date storage method and equipment
CN109358813B (en) * 2018-10-10 2022-03-04 郑州云海信息技术有限公司 Capacity expansion method and device for distributed storage system
CN111881107B (en) * 2020-08-05 2022-09-06 北京计算机技术及应用研究所 Distributed storage method supporting mounting of multi-file system
CN114024979A (en) * 2021-10-25 2022-02-08 深圳市高德信通信股份有限公司 Distributed edge computing data storage system
CN115826879B (en) * 2023-02-14 2023-05-23 北京派网软件有限公司 Data updating method for storage nodes in distributed storage system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156964A1 (en) * 2005-12-30 2007-07-05 Sistla Krishnakanth V Home node aware replacement policy for caches in a multiprocessor system
CN101741911A (en) * 2009-12-18 2010-06-16 中兴通讯股份有限公司 Multi-copy collaboration-based write operation method, system and node
CN104598568A (en) * 2015-01-12 2015-05-06 浪潮电子信息产业股份有限公司 Efficient and low-power-consumption offline storage system and method
CN104917788A (en) * 2014-03-11 2015-09-16 中国移动通信集团公司 Data storage method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011157156A2 (en) * 2011-06-01 2011-12-22 华为技术有限公司 Operation method and device for data storage system
CN103853612A (en) * 2012-12-04 2014-06-11 中山大学深圳研究院 Method for reading data based on digital family content under distributed storage
CN103714014B (en) * 2013-11-18 2016-12-07 华为技术有限公司 Process data cached method and device

Also Published As

Publication number Publication date
CN106873902B (en) 2020-04-28
CN106873902A (en) 2017-06-20


Legal Events

Code — Title/Description
121 — Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16872138; Country of ref document: EP; Kind code of ref document: A1)
NENP — Non-entry into the national phase (Ref country code: DE)
122 — Ep: PCT application non-entry in European phase (Ref document number: 16872138; Country of ref document: EP; Kind code of ref document: A1)