CN110278222B - Method, system and related device for data management in distributed file storage system


Info

Publication number
CN110278222B
CN110278222B
Authority
CN
China
Prior art keywords
file, data, blocks, redundant, data units
Legal status
Active
Application number
CN201810213670.XA
Other languages
Chinese (zh)
Other versions
CN110278222A (en)
Inventor
金中良
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201810213670.XA
Publication of CN110278222A
Application granted
Publication of CN110278222B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

The embodiments of the present application provide a data management method, system and related device applied to a distributed file storage system. The method includes: reading file data from a file to be written to form n file data units; performing redundancy calculation on the n file data units to obtain m redundant data units; writing the n file data units respectively into n file blocks located on a plurality of data nodes of the distributed file storage system; and writing the m redundant data units respectively into m redundant blocks located on the plurality of data nodes, where n and m are positive integers. When a file block becomes abnormal, its contents can be restored by performing redundancy calculation on the normal file blocks and the redundant blocks. On the premise of ensuring the fault tolerance of the distributed file storage system, the method improves the utilization rate of the storage space.

Description

Method, system and related device for data management in distributed file storage system
Technical Field
The present application relates to the field of data storage, and more particularly, to a method, system and related apparatus for data management in a distributed file storage system.
Background
A traditional network storage system uses a centralized storage server to store all data. The storage server becomes the bottleneck of system performance and the focal point of reliability and security concerns, and cannot meet the needs of large-scale storage applications. A distributed storage system instead stores data across multiple independent devices. A distributed network storage system adopts a scalable architecture, uses multiple storage servers to share the storage load, and uses a location server to locate stored information; this improves the reliability, availability and access efficiency of the system and makes it easy to scale.
An existing distributed storage system divides a file into multiple "blocks" stored on multiple data nodes, stores copies of each block (as backups of the block) on the data nodes, and records in metadata the correspondence between the file and its blocks/copies and between the blocks/copies and the data nodes. Because each block has copies on multiple data nodes, when a data node is damaged, a file access device can read the contents of the blocks stored on the damaged node from other data nodes. File access at the service layer is therefore unaffected, which improves the fault tolerance of the storage system.
This approach suffers from low storage space utilization. For example, with 2 copies per file block, the storage space utilization is 1/(1+2) ≈ 33%; with more copies, the utilization is even lower.
Disclosure of Invention
In view of this, it is necessary to provide a data management method to improve the storage space utilization of the distributed file storage system.
In a first aspect, an embodiment of the present application provides a data management method applied to a distributed file storage system, where the distributed file storage system includes a plurality of data nodes, and the method includes: reading file data from a file to be written to form n file data units, wherein the n file data units contain the read file data; performing redundancy calculation on the n file data units to obtain m redundant data units; writing the n file data units into n file blocks on the plurality of data nodes, respectively, and writing the m redundant data units into m redundant blocks on the plurality of data nodes, respectively; wherein n and m are both positive integers.
The method can reduce redundant data and effectively improve the utilization rate of the storage space.
In a possible solution, the storage locations of the n file blocks are respectively located on n data nodes of the plurality of data nodes, and the writing the n file data units into the n file blocks on the plurality of data nodes respectively includes: and writing the n file data units into the n file blocks respectively in parallel.
In one possible scheme, the storage locations of the n file blocks are respectively located on n data nodes of the plurality of data nodes, the storage locations of the m redundant blocks are respectively located in another m data nodes of the plurality of data nodes, and the writing the n file data units into the n file blocks and the writing the m redundant data units into the m redundant blocks respectively includes: and writing the n file data units into the n file blocks respectively in parallel, and writing the m redundant data units into the m redundant blocks respectively in parallel.
In one possible solution, the reading file data from the file to be written to form n file data units includes: determining that the data volume of the file read from the file to be written is less than the data volume corresponding to the n file data units; and appending data to the end of the read file data to form the n file data units.
In one possible approach, the distributed file storage system further includes a management node that manages the plurality of data nodes, and the method further includes: sending the number of appended data to the management node.
In one possible solution, the distributed file storage system further includes a management node that manages the plurality of data nodes, and before writing the n file data units into the n file blocks respectively, the method further includes: acquiring storage location information of the n file blocks and the m redundant blocks in the plurality of data nodes from the management node.
In one possible solution, the obtaining, from the management node, storage location information of the n file blocks and the m redundant blocks in the plurality of data nodes includes: sending a storage location request to the management node, where the storage location request contains the unique identifier of the file to be written in the distributed file storage system; and receiving a storage location response from the management node, where the storage location response includes the storage locations of the n file blocks in the plurality of data nodes and the storage locations of the m redundant blocks in the plurality of data nodes.
In one possible solution, the storage location request further comprises a redundancy scheme identification or the n and m.
In one possible scheme, the storage location response carries the storage locations as metadata information of the file to be written, where the metadata information includes n correspondences between file block identifiers and the storage locations of the file blocks in the plurality of data nodes, and m correspondences between redundant block identifiers and the storage locations of the redundant blocks in the plurality of data nodes.
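For illustration only, the following is a minimal sketch of how the storage location request and response described above might be shaped; all class and field names (StorageLocationRequest, BlockLocation, etc.) are hypothetical and not part of this application.

```java
// Hypothetical message shapes for the storage-location exchange described above.
import java.util.List;

class StorageLocationRequest {
    String fileId;             // unique identifier of the file in the storage system
    String redundancySchemeId; // optional: e.g. a scheme identifier, or carry n and m directly
}

class BlockLocation {
    String blockId;   // file block or redundant block identifier
    String dataNode;  // identifier of the data node holding the block
    String path;      // storage path of the block on that data node
}

class StorageLocationResponse {
    List<BlockLocation> fileBlocks;      // n entries; the order is significant
    List<BlockLocation> redundantBlocks; // m entries; the order is significant
}
```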
In one possible solution, the distributed file storage system further includes a management node that manages the plurality of data nodes, and before performing the redundancy calculation on the n file data units, the method further includes: and acquiring a redundancy algorithm corresponding to the redundancy calculation from the management node.
In one possible approach, the distributed file storage system further includes a management node that manages the plurality of data nodes, and the method further includes: and sending the redundancy algorithm corresponding to the redundancy calculation to the management node.
In one possible scenario, when m is 1, the redundancy algorithm includes a parity check algorithm, or when m is 2, the redundancy algorithm includes a Galois-field-based Q-check algorithm.
In one possible scheme, the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
In a second aspect, an embodiment of the present application provides a data management method applied to a distributed file storage system, where the distributed file storage system includes a plurality of data nodes, where n file blocks and m redundant blocks of a file to be recovered are stored in the plurality of data nodes, where the n file blocks include f normal file blocks and n-f abnormal file blocks, and the method includes: respectively reading a file data unit from d normal file blocks to obtain d file data units, wherein d is less than or equal to f, and respectively reading a redundant data unit from m redundant blocks to obtain m redundant data units; performing redundancy calculation on the d file data units and the m redundant data units to obtain n-f file data units; restoring the n-f file data units to n-f abnormal file blocks in the n file blocks respectively; and n, m, f and d are positive integers.
The method ensures the fault tolerance of the distributed file storage system: when faulty file blocks occur, their data can be recovered from the normal file blocks and the redundant blocks.
In one possible solution, the restoring the n-f file data units to the n-f abnormal file blocks in the n file blocks respectively includes: creating new file blocks at the storage positions of the n-f abnormal file blocks respectively to obtain n-f new file blocks; and respectively writing the n-f file data units into the n-f new file blocks.
In one possible scenario, when m is 1, the redundancy algorithm includes a parity check algorithm, or when m is 2, the redundancy algorithm includes a Galois-field-based Q-check algorithm.
In a possible solution, before performing redundancy calculation on the d file data units and the m redundant data units according to a redundancy algorithm to obtain n-f file data units, the method further includes: and acquiring the redundancy algorithm from the metadata information corresponding to the file to be restored.
In a possible solution, before the reading of one file data unit from each of the d normal file blocks, the method further includes: acquiring the storage locations of the n file blocks and the storage locations of the m redundant blocks from the metadata information corresponding to the file to be restored.
In one possible scheme, the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
In a third aspect, an embodiment of the present application provides a data management method applied to a distributed file storage system, where the distributed file storage system includes a management node, a plurality of data nodes and a data writing device, and the method includes: the data writing device sends a storage location request to the management node, where the storage location request contains the unique identifier of the file to be written in the distributed file storage system; the management node determines the storage location information of n file blocks and m redundant blocks on the data nodes according to the unique identifier and returns a storage location response containing the storage location information to the data writing device, where n and m are positive integers; the data writing device reads file data from the file to be written to form a first group of n file data units, performs redundancy calculation on the first group of n file data units to obtain m redundant data units, writes the first group of n file data units into the n file blocks respectively, and writes the m redundant data units into the m redundant blocks respectively.
In one possible solution, when the n file blocks include f normal file blocks and n-f abnormal file blocks, the method further includes: the management node respectively reads a file data unit from d normal file blocks to obtain d file data units, wherein d is less than or equal to f, and respectively reads a redundant data unit from m redundant blocks to obtain m redundant data units; the management node performs redundancy calculation on the d file data units and the m redundancy data units to obtain n-f file data units; and the management node respectively restores the n-f file data units to n-f abnormal file blocks in the n file blocks.
In one possible solution, the distributed file storage system further includes a data reading device, and the method further includes: the data reading device reads one file data unit from each of the n file blocks to obtain a second group of n file data units; and the data reading device writes the second group of n file data units into a file created by the data reading device.
In one possible solution, the reading, by the data writing device, file data from the file to be written to form a first set of n file data units includes: the data writing equipment determines that the data volume of a file read from the file to be written is less than the data volume corresponding to n file data units, and data is added at the end of the read file data to form the first group of n file data units; sending the amount of the supplemental data to the management node; the writing, by the data reading device, the n file data units into the file created by the data reading device includes: and the data reading equipment acquires the number of the additional data from the management node, removes the additional data from the tail of the second group of n file data units according to the number of the additional data to obtain the residual file data, and writes the residual file data into the created file.
In one possible scheme, when m is 1, the redundancy algorithm corresponding to the redundancy calculation includes a parity check algorithm, or when m is 2, the redundancy algorithm corresponding to the redundancy calculation includes a Galois-field-based Q-check algorithm.
In one possible scheme, the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
In a fourth aspect, an embodiment of the present application provides a client device, including a processor and a memory, where: the memory to store program instructions; the processor is configured to call and execute the program instructions stored in the memory, so as to enable the client device to execute the data management method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the data management method of the first aspect.
In a sixth aspect, an embodiment of the present application provides a management device, including a processor and a memory, where: the memory to store program instructions; the processor is configured to call and execute the program instructions stored in the memory, so that the management device executes the data management method of the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the data management method of the second aspect.
In an eighth aspect, an embodiment of the present application provides a distributed file storage system, including: the client device of the fourth aspect; the management device according to the sixth aspect.
Drawings
FIG. 1 is an architecture diagram of a distributed file storage system provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for writing a file in a distributed file storage system according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for reading a file in a distributed file storage system according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for recovering data in a distributed file storage system according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for managing file metadata in a distributed file storage system according to an embodiment of the present application;
FIG. 6A is a flowchart of another method for writing a file in a distributed file storage system according to an embodiment of the present application;
FIG. 6B is a schematic diagram illustrating writing of file data and redundant data in a distributed file storage system according to an embodiment of the present application;
FIG. 7A is a flowchart of another method for reading a file in a distributed file storage system according to an embodiment of the present application;
FIG. 7B is a schematic diagram illustrating reading of file data in a distributed file storage system according to an embodiment of the present application;
FIG. 8A is a flowchart of another method for recovering data in a distributed file storage system according to an embodiment of the present application;
FIG. 8B is a schematic diagram of recovering data in a distributed file storage system according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an HDFS system architecture according to an embodiment of the present application;
FIG. 10 is a flowchart of a method for writing a file to an HDFS system by an HDFS client according to an embodiment of the present application;
FIG. 11 is a flowchart of a method for recovering data in an HDFS by a NameNode according to an embodiment of the present application;
FIG. 12 is a hardware structure diagram of a distributed file storage system device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a data writing device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a data reading device according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a management node according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Example one
Fig. 1 is an architecture diagram of a distributed file storage system according to an embodiment of the present application. The architecture includes a data writing device 101, a data reading device 102, a management node 103, and data nodes 104, 105 and 106. Their functions are described as follows:
the data writing device 101: the client device of the distributed file storage system is a data node for writing file data read from a file to be written 107 to a management node 103, and includes: reading or copying a file data unit from a file 107 to be written, writing the file data unit into the file block of the distributed data node, performing redundancy calculation on the file data unit to obtain a redundant data unit, and writing the redundant data unit into the redundant block of the distributed data node; the second is to trigger the management node 103 to update the metadata information of the file to be written 107. The specific writing method is detailed in the following embodiments of the present application. The data write device may be a data write device (HDFS Client) in a Hadoop Distributed File System (HDFS).
The data reading device 102: a client device of the distributed file storage system, used to read file data from multiple data nodes according to the metadata information of a file provided by the management node 103 and write it into a newly created file 108, thereby recovering or restoring the file to be written 107. If the data nodes storing the file blocks are all working normally, the file data is read directly from the file blocks and written into the newly created file 108. If a node among the data nodes storing the file blocks and redundant blocks has failed, the contents of the corresponding file block or redundant block on the failed node can first be computed from the file blocks and redundant blocks on the normal data nodes according to the redundancy algorithm (which is equivalent to restoring the data on the failed node), after which the file data is read from all the file blocks (including the computed ones) and written into the newly created file 108. The specific reading method is detailed in the following embodiments of the present application. The data reading device may also be an HDFS Client in HDFS.
The management node 103: the management device of the distributed file storage system, used to allocate the storage locations of file blocks and redundant blocks among multiple data nodes (data nodes 104-106, etc.); a storage location may include a data node identifier and the storage path of the block on the corresponding data node. It also manages the metadata information of a file, which may include the file name, size, location, attributes, creation time, modification time, the correspondence between the file and its file blocks, the correspondence between file blocks and storage locations, and so on. A distributed storage system may have two management nodes (in a master/slave relationship); only one is shown in fig. 1, and there may be two or more in an actual deployment. The management node may be a NameNode in HDFS.
The data node 104: for storing the above file blocks or redundant blocks. The data node may be a DataNode in HDFS.
The data node 105: similar to data node 104.
The data node 106: similar to data node 104.
The following describes the technical solutions for writing and reading files based on the system architecture shown in fig. 1 by way of example.
Example two
Fig. 2 is a flowchart of a method for writing a file in a distributed file storage system according to an embodiment of the present application, in which a data writing device writes a file to be written (which may be a local file managed by the data writing device itself or a file on a remote device) into multiple data nodes of the distributed file storage system. The method specifically includes the following steps:
step 201: the data writing device reads data from a file to be written.
Specifically, the data writing device reads out file data from a file to be written to form n file data units, where the n file data units contain the read-out file data.
The data writing device determines n before reading out the file data. The data writing device may determine n according to service requirements (e.g., the required file read/write speed): when the file read speed needs to be high, a larger value of n is chosen; otherwise, a smaller value. The data writing device may also obtain n from a management node of the distributed file storage system.
Alternatively, when the data writing device determines that the amount of file data read from the file to be written is less than the amount of data corresponding to n file data units, data may be appended to the end of the read file data to form the n file data units, and the data writing device may send the amount of appended data to the management node, so that the data reading device may remove the appended data when reading data from the n file blocks.
Step 202: the data writing device performs redundancy calculation on the read file data.
Specifically, the data writing device performs redundancy calculation on the n file data units to obtain m redundant data units.
The data writing device determines the corresponding redundancy algorithm and m before performing the redundancy calculation. The data writing device may determine or select the redundancy algorithm by itself, or may obtain the redundancy algorithm and m from a management node of the distributed file storage system. When the data writing device determines or selects the redundancy algorithm and the number of redundant blocks itself, it needs to send the redundancy algorithm and m to the management node for subsequent access and maintenance of the written file data.
Step 203: and the data writing equipment writes the read file data and the calculated redundant data into the data nodes.
Specifically, the data writing device writes the n file data units into n file blocks on the plurality of data nodes, respectively, and writes the m redundant data units into m redundant blocks on the plurality of data nodes, respectively. And the data writing equipment acquires the storage positions of the n file blocks and the m redundant blocks in the data nodes from a management node of the distributed file storage system.
Optionally, the n file blocks are respectively located on n data nodes of the plurality of data nodes, and the m redundant blocks are respectively located on the other m data nodes of the plurality of data nodes, so that the n file blocks and the m redundant blocks can read and write file data in parallel, and the reading performance of the file data can be improved.
Specifically, when m is 1, the redundancy algorithm corresponding to the redundancy calculation may include a parity check algorithm, or when m is 2, the redundancy algorithm may include a Galois-field-based Q-check algorithm.
Specifically, the number of bytes contained in each of the file data unit and the redundant data unit may be a multiple of 8.
The second embodiment enables the data writing device to write a file into the n file blocks and the m redundant blocks on the multiple data nodes of the distributed file storage system, so that the file storage utilization rate is improved, and the file reading performance can be improved.
Example three
Fig. 3 is a flowchart of a method for reading a file in a distributed file storage system according to an embodiment of the present application, in which a data reading device writes a file to be read that is already stored in the distributed file storage system into a newly created file (that is, a file newly created by the data reading device on a local or remote device). The n file blocks corresponding to the file to be read are stored in multiple data nodes of the distributed file storage system, where n is a positive integer. The method specifically includes the following steps:
step 301: the data reading device obtains the storage location of the file block.
Specifically, the data reading device obtains, from the management node, the storage locations in the plurality of data nodes of the n file blocks corresponding to the file to be read. The file to be read refers to a file whose file data has been written into n file blocks on a plurality of data nodes of the distributed file storage system.
Specifically, the data reading device may send a storage location request to the management node, where the storage location request includes an identifier of the file to be read; and the data reading equipment receives a storage position response from the management node, wherein the storage position response comprises the storage position information of the n file blocks corresponding to the file to be read.
Step 302: the data read-out device reads file data from a file block.
Specifically, the data reading device reads one file data unit from each of the n file blocks to obtain x file data units, where x is a positive integer and x is less than or equal to n.
Optionally, the data reading device obtains the number of appended data of the file to be read, and removes the appended data from the end of the read file data units according to that number, obtaining the remaining file data.
Step 303: the data readout device writes the read file data into the newly created file.
Specifically, the data reading apparatus writes the x file data units into the newly created file.
Specifically, the number of bytes included in the file data unit may be a multiple of 8.
The third embodiment enables the data reading device to read out the file data from the distributed file storage system and write the file data into the newly created file.
Example four
Fig. 4 is a flowchart of a method for recovering data in a distributed file storage system according to an embodiment of the present application, where the distributed file storage system includes a management node and a plurality of data nodes, where the plurality of data nodes store n file blocks and m redundant blocks of a file to be recovered, and the n file blocks include f normal file blocks and n-f abnormal file blocks, and the method includes the following steps:
step 401: and the management node reads the file data in the normal file block and the redundant data in the redundant block.
Specifically, the management node reads one file data unit from each of d normal file blocks in the f normal file blocks to obtain d file data units, where d is less than or equal to f, and reads one redundant data unit from each of the m redundant blocks to obtain m redundant data units; and n, m, f and d are positive integers.
Specifically, the management node may obtain the storage locations of the n file blocks and the storage locations of the m redundant blocks from the metadata information corresponding to the file to be restored, so as to read the file data and the redundant data.
Step 402: and the management node performs redundancy calculation on the read file data and the read redundant data.
Specifically, the management node performs redundancy calculation on the d file data units and the m redundancy data units to obtain n-f file data units.
Optionally, the management node may obtain a redundancy algorithm corresponding to the redundancy calculation from metadata information corresponding to the file to be restored.
Step 403: and the management node recovers the abnormal file block according to the calculated data.
Specifically, the management node restores the n-f file data units to n-f abnormal file blocks in the n file blocks respectively.
Specifically, the management node may create new file blocks at the storage locations of the n-f abnormal file blocks, respectively, to obtain n-f new file blocks, and then write the n-f file data units into the n-f new file blocks, respectively.
Specifically, when m is 1, the redundancy algorithm corresponding to the redundancy calculation may include a parity check algorithm, or when m is 2, the redundancy algorithm may include a Galois-field-based Q-check algorithm.
Specifically, the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
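As a concrete illustration of steps 401 to 403 for the simplest case, the following sketch rebuilds a single lost block under an n+1 parity scheme (m = 1) by XOR-ing, unit by unit, the corresponding units of all surviving blocks. It operates on in-memory arrays to stay self-contained; a real implementation would read the units from the data nodes instead.

```java
// Sketch: rebuild one lost block under an n+1 parity scheme (m = 1).
// blocks[b][u] is the u-th data unit of block b; blocks[n] is the parity block;
// blocks[lostIndex] is unreadable and is ignored here.
static byte[][] recoverLostBlock(byte[][][] blocks, int lostIndex) {
    int survivor = (lostIndex + 1) % blocks.length; // any readable block
    int unitsPerBlock = blocks[survivor].length;    // (x + 1) units per block
    int unitLen = blocks[survivor][0].length;
    byte[][] recovered = new byte[unitsPerBlock][unitLen];
    for (int u = 0; u < unitsPerBlock; u++) {       // one recovery per unit row
        for (int b = 0; b < blocks.length; b++) {
            if (b == lostIndex) continue;           // skip the failed block
            for (int k = 0; k < unitLen; k++)
                recovered[u][k] ^= blocks[b][u][k]; // XOR of the surviving units
        }
    }
    return recovered;                               // contents for the new block
}
```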
The fourth embodiment enables file blocks of file data stored in the distributed file storage system that are lost or damaged due to a data node failure to be recovered using the redundant blocks.
Example five
Fig. 5 is a flowchart of a method for managing file metadata in a distributed file storage system according to an embodiment of the present application, where a management node is used to manage multiple data nodes and metadata information, and the method includes the following steps:
step 501: the management node obtains a storage location request from the file writing device.
Specifically, the management node receives a storage location request from a data writing device, where the storage location request includes a unique identifier of a file to be written in the distributed file storage system.
Optionally, the storage location request may further include a redundancy scheme parameter, where the redundancy scheme parameter includes the number n of file blocks, the number m of redundant blocks, and a redundancy algorithm, so that the data writing device may determine the file writing related parameter by itself.
Step 502: the management node allocates storage locations for the file blocks and the redundant blocks.
Specifically, the management node allocates, for the file to be written, the storage locations of n file blocks in the multiple data nodes and the storage locations of m redundant blocks in the multiple data nodes.
Specifically, the management node may allocate the storage locations of the n file blocks to n data nodes of the plurality of data nodes, and allocate the storage locations of the m redundant blocks to m other data nodes of the plurality of data nodes. For example, assuming the plurality of data nodes comprises C data nodes, the management node determines n data nodes from among the C data nodes to store the n file blocks respectively, and determines m data nodes from among the remaining C-n data nodes to store the redundant blocks respectively.
Optionally, the management node may allocate an identifier to each file block, generate a correspondence between the identifiers of the n file blocks and the corresponding storage locations, allocate an identifier to each redundant block, and generate a correspondence between the identifiers of the m redundant blocks and the corresponding storage locations.
Optionally, the management node may create metadata information for the file, recording the redundancy algorithm identifier, the correspondences between the n file blocks and their storage locations, and the correspondences between the m redundant blocks and their storage locations.
Step 503: and the management node returns a storage position response to the file writing device.
Specifically, the management node returns the storage locations of the n file blocks and the storage locations of the m redundant blocks to the file writing device.
Specifically, the management node may return metadata information to the file writing device, where the metadata information includes the storage locations of the n file blocks and the storage locations of the m redundant blocks.
The management node may further receive the number of additional data from the data writing device and write the number of additional data in the metadata information.
In the fifth embodiment, the management node of the distributed file storage system can allocate the file blocks and redundant blocks for the file to be written, and maintain the storage location information of the file blocks and redundant blocks as well as the redundancy algorithm information.
Example six
Fig. 6A is a flowchart of another method for writing a file in a distributed file storage system according to an embodiment of the present application, including the following steps:
step 601: the data writing device determines redundancy scheme parameters for the file.
Specifically, after the data writing device determines a file to be written (i.e., a file on a local or remote device managed by the data writing device that is to be written into the distributed file storage system), it determines the redundancy scheme parameters to be adopted when writing it, according to service requirements (for example, the required file read/write speed: when multiple file blocks can be read and written in parallel, a larger number of file blocks gives a faster read/write speed), preset conditions (local configuration information, user preference settings, etc.), and other factors. The redundancy scheme parameters may include the number n of file blocks, the number m of redundant blocks, the length w of a file data unit, and information about the redundancy algorithm. Here w is an optional parameter, since a uniform default value for it, such as 8, may be agreed upon across the distributed file storage system.
n, m, w and the redundancy algorithm will be further described in subsequent steps.
Step 602: the data writing device sends a file writing request (file identification, redundancy scheme parameters) to the management node.
Specifically, the data writing device sends a file write request to a management node, where the file write request may include the file identifier and the redundancy scheme parameters. The file identifier is the unique identifier of the file to be written in the distributed file storage system, and may be a logical path (such as /home/foo) or a file name (such as foo).
Specifically, the write request may be a request for creating a file or a request for modifying a file.
Optionally, the data writing device may send a redundancy scheme identifier to the management node according to a preset correspondence between redundancy scheme identifiers and redundancy scheme parameters (as shown in Table 1, stored in a location accessible to both the data writing device and the management node), achieving the purpose of transmitting the redundancy scheme parameters to the management node.
Table 1 - Redundancy scheme numbering

Redundancy scheme identification | Redundancy scheme parameters
RdntSchm1 | n=4, m=1, w=8
RdntSchm2 | n=8, m=2, w=16
RdntSchm3 | n=4, m=1, w=8
RdntSchm4 | n=8, m=2, w=16
Alternatively, it may be agreed within the distributed file storage system that, when the file write request includes neither redundancy scheme parameters nor a redundancy scheme identifier, the management node adopts default redundancy scheme parameters.
Optionally, the distributed file storage system may set corresponding redundancy scheme parameters for hierarchical logical directories (as shown in Table 2), so it may be agreed that when the file write request contains neither a redundancy scheme parameter nor a redundancy scheme identifier, the redundancy scheme parameters of the file's parent directory are adopted. For example, when the data writing device sends the management node a file write request that contains the file identifier "/home/A/asdf.dat" but no redundancy scheme parameter or identifier, the management node takes the redundancy scheme of "/home/A" as the redundancy scheme of "/home/A/asdf.dat" (RdntSchm2: n=8, m=2, w=16).
TABLE 2 - Redundancy scheme parameters mapped to logical directories

Logical directory | Redundancy scheme identification | Redundancy scheme parameters
/home | RdntSchm1 | n=4, m=1, w=8
/home/A | RdntSchm2 | n=8, m=2, w=16
/home/B | RdntSchm3 | n=4, m=1, w=8
/home/B/BB | RdntSchm4 | n=8, m=2, w=16
Alternatively, it may be agreed within the distributed file storage system that when the redundancy scheme parameter does not include the file data unit length w, its unified default value (e.g. 8) is adopted.
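To make the parameter-resolution rules of this step concrete, the following sketch resolves the redundancy scheme parameters in the order described above: an explicit scheme identifier in the request, then the nearest configured parent directory, then the system-wide default. The table contents mirror Tables 1 and 2; the class and method names are illustrative assumptions, not part of this application.

```java
import java.util.Map;

record RedundancyParams(int n, int m, int w) {}

class SchemeResolver {
    static final Map<String, RedundancyParams> SCHEMES = Map.of(
        "RdntSchm1", new RedundancyParams(4, 1, 8),
        "RdntSchm2", new RedundancyParams(8, 2, 16));
    static final Map<String, String> DIR_SCHEMES = Map.of(
        "/home", "RdntSchm1",
        "/home/A", "RdntSchm2");
    static final RedundancyParams DEFAULT = new RedundancyParams(4, 1, 8);

    static RedundancyParams resolve(String filePath, String schemeId) {
        if (schemeId != null) return SCHEMES.get(schemeId); // explicit identifier wins
        // otherwise walk up the directory hierarchy looking for a configured scheme
        for (String dir = parent(filePath); dir != null; dir = parent(dir)) {
            String id = DIR_SCHEMES.get(dir);
            if (id != null) return SCHEMES.get(id);
        }
        return DEFAULT; // agreed system-wide default
    }

    static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i > 0 ? path.substring(0, i) : null;
    }
}
```

For the example above, resolve("/home/A/asdf.dat", null) falls through to the "/home/A" entry and yields RdntSchm2 (n=8, m=2, w=16).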
Step 603: and the management node allocates data nodes for the files.
Specifically, the management node allocates multiple data nodes to the file to be written according to the redundancy scheme parameters. For example, n + m data nodes may be allocated: n data nodes to store the n file blocks (one file block per data node) and m data nodes to store the m redundant blocks (one redundant block per data node). In this way, when one data node fails, only one file block is affected, and its content can be obtained through redundancy calculation from the other normal file blocks and the redundant blocks. The embodiments of the present application mainly take the allocation of n + m data nodes as an example, since this allows parallel operation when file data is written or read and so improves the file read/write speed; however, the possibility that multiple file blocks are allocated to the same data node, or multiple redundant blocks to the same data node, is not excluded.
Optionally, the management node may select data nodes for the file according to the number of data nodes configured in the system, the storage space usage of each data node, and the like. For example, the management node may select the data nodes with the most remaining storage space as the storage nodes for the file blocks and redundant blocks of the file.
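A minimal sketch of one such allocation policy, assuming only that each data node reports its free space: the management node picks the n + m distinct nodes with the most remaining storage, the first n for file blocks and the next m for redundant blocks. The types and names are illustrative, not part of this application.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record DataNodeInfo(String id, long freeBytes) {}

class Allocator {
    // Returns n + m distinct nodes; fails if the cluster has too few nodes.
    static List<DataNodeInfo> allocate(List<DataNodeInfo> cluster, int n, int m) {
        if (cluster.size() < n + m)
            throw new IllegalStateException("need at least n + m data nodes");
        List<DataNodeInfo> byFreeSpace = new ArrayList<>(cluster);
        byFreeSpace.sort(Comparator.comparingLong(DataNodeInfo::freeBytes).reversed());
        return byFreeSpace.subList(0, n + m); // [0, n) file blocks, [n, n + m) redundant
    }
}
```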
Optionally, the management node may further create the n file blocks and m redundant blocks on the allocated data nodes.
Step 604: the management node returns a file write response (metadata information) to the data writing device.
Specifically, the management node creates metadata information for the file to be written and sends it to the data writing device in a file write response message. The metadata information may include the file identifier of the file to be written, the redundancy scheme parameters, a file block information list (containing n entries, each consisting of a correspondence between a file block identifier and the storage location of that file block), and a redundant block information list (containing m entries, each consisting of a correspondence between a redundant block identifier and the storage location of that redundant block); see Table 3. It should be noted that the file block information list and the redundant block information list are ordered in the metadata information, and read and write operations must write or read the file data in the listed order to ensure that the file data read out is consistent with the file data written.
Optionally, the management node may also create the n file blocks and the m redundant blocks in or before this step, in which case the file write response needs to further include an indication that "the file blocks/redundant blocks have been created", so that the data writing device does not create the corresponding file blocks/redundant blocks again.
In addition, in order to facilitate subsequent access (reading, rewriting, etc.), the management node needs to store the metadata information, and may specifically be stored locally in the management node or on a remote device accessible to the management node.
TABLE 3 Metadata information (rendered as images in the original publication; as described above, it contains the file identifier, the redundancy scheme parameters, the ordered file block information list and the ordered redundant block information list)
Step 605: the data writing device makes writing preparation: n file blocks and m redundant blocks are created.
Specifically, after receiving the metadata information, the data writing device starts preparation before writing file data, including establishing a session with a data node specified in the metadata information according to the metadata information, and notifying a corresponding data node to create a corresponding file block or a redundant block. The creating of the file block refers to creating a physical file corresponding to the file block according to the metadata information, and the creating of the redundant block also refers to creating a physical file corresponding to the redundant block according to the metadata information. And subsequently writing data into the file block/the redundant block also refers to writing data into a physical file corresponding to the file block/the redundant block, which is not described in detail later.
Taking the sample in Table 3 as an example, the data writing device will establish sessions with data node 3, data node 7, data node 10, data node 12 and data node 15, respectively, and trigger data node 3 to create file block BlockD1 (creating a physical file with path /root/blkd_001), data node 7 to create file block BlockD2 (path /root/blkd_002), data node 10 to create file block BlockD3 (path /root/blkd_003), data node 12 to create file block BlockD4 (path /root/blkd_004), and data node 15 to create redundant block BlockR (path /root/blkr_001).
Alternatively, the n file blocks and the m redundant blocks may be created by the management device during step 603 or step 604 above, and the data writing device notifies the data nodes to create the file blocks/redundant blocks only when it determines that the file write response does not include the indication that "the file blocks/redundant blocks have been created".
At this time, the newly created file blocks and redundant blocks have just been allocated storage locations on the corresponding data nodes and do not yet contain file data or redundant data. The data writing device will then write the file to be written into the distributed file storage system by executing the following steps 606 to 608 in a loop.
Step 606: the data writing device reads out a file data packet from a file.
Specifically, the data writing device reads a file data packet Di (e.g., D1, D2, etc. in fig. 6B) from the file to be written, where Di consists of n file data units Ui1, Ui2, …, Uin (e.g., U11, U12, …, U1n in fig. 6B). Each file data unit may contain an integer multiple of 8 bytes of data (8 × y bytes, y a positive integer), as shown in fig. 6B, or any other number of bytes. The embodiments of the present application do not limit the length of a file data unit; the following description takes 8 × y bytes as an example.
After this step has been performed several times (the index i of the data packet Di incrementing each time), the number of bytes remaining when the end of the file is reached may not be enough to form a complete file data packet, i.e. the number of remaining bytes is less than n × 8 × y (if the unit length is set to 8 × y bytes). Data then needs to be added or appended at the end of the read file data to complete the packet (i.e. to reach n × 8 × y bytes). Assuming the number of remaining bytes is RemainByteCount (RemainByteCount < n × 8 × y), then AppendByteCount = n × 8 × y - RemainByteCount additional bytes are appended to complete the file data packet.
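The padding rule above can be sketched as follows. The filler value is not fixed by this application, so zero bytes are assumed here, and the computed AppendByteCount must be reported to the management node (see step 609) so that readers can strip the filler later.

```java
// Sketch: pad the final, short read up to a full packet of n * 8 * y bytes.
static byte[] padToPacket(byte[] tail, int n, int y) {
    int packetBytes = n * 8 * y;                  // full packet size
    int remain = tail.length;                     // RemainByteCount
    int append = packetBytes - remain;            // AppendByteCount, to be reported
    byte[] packet = new byte[packetBytes];        // zero filler assumed (not fixed by the application)
    System.arraycopy(tail, 0, packet, 0, remain); // original bytes first
    return packet;                                // 'append' goes to the management node
}
```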
When data has been added or appended, the data writing device may send the number of appended data to the management node so that the management node updates the metadata information with it; see step 609. The number of appended data may be counted in bytes or in bits; the embodiment is not limited in this respect.
Step 607: the data writing device calculates redundant data packets.
Specifically, the data writing device performs redundancy calculation on Di according to the redundancy scheme parameters to obtain a redundant data packet Ri (e.g., R1, R2, etc. in fig. 6B), where Ri consists of m redundant data units Uri1, Uri2, …, Urim.
For example, when an "n+1 redundancy scheme" is employed (i.e., m = 1, there is only one redundant data unit, and n > 1), a parity check algorithm may be used to calculate the redundant data unit from the n file data units: Uri1 = Ui1 XOR Ui2 XOR … XOR Uin, where XOR denotes the exclusive-or operation. If any one of the n+1 data nodes corresponding to this redundancy scheme is damaged, one of the n+1 units Ui1, Ui2, …, Uin, Uri1 is lost or damaged; the value of the missing unit can then be calculated from the other n units according to the parity check algorithm. For example, if the first node is damaged so that Ui1 is missing, then Ui1 = Ui2 XOR Ui3 XOR … XOR Uin XOR Uri1.
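The parity calculation and single-unit recovery just described reduce to byte-wise XOR, as in the following sketch; recovery simply re-applies the same XOR over the surviving n units, parity unit included.

```java
// Sketch of the n+1 parity scheme: Uri1 = Ui1 XOR Ui2 XOR ... XOR Uin.
static byte[] xorParity(byte[][] units) {          // units: the n file data units
    byte[] p = new byte[units[0].length];
    for (byte[] u : units)
        for (int k = 0; k < u.length; k++) p[k] ^= u[k];
    return p;
}

// Any single missing unit is the XOR of the other n units (parity included),
// e.g. Ui1 = Ui2 XOR ... XOR Uin XOR Uri1.
static byte[] recoverMissing(byte[][] survivingUnits) {
    return xorParity(survivingUnits);
}
```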
As another example, when an "n+2 redundancy scheme" is used (i.e., m = 2), the first redundant data unit Uri1 and the second redundant data unit Uri2 may be generated using a parity check algorithm and a Galois-field-based Q-check algorithm, respectively.
Step 608: the data writing device writes the file data packets and the redundant data packets into the file blocks and the redundant blocks.
Specifically, the data writing device writes the n file data units (Ui1, Ui2, …, Uin) of the data packet Di in parallel into the n newly created file blocks (BlockD1, BlockD2, …, BlockDn). Here "write in parallel" means that it is not necessary to wait for one file data unit to finish writing before writing another; the n file data units are written simultaneously into the n file blocks located on n data nodes without waiting for each other, which is equivalent to increasing the file writing speed by a factor of n. At the same time, the m redundant data units (Uri1, Uri2, …, Urim) of the redundant packet Ri are written in parallel into the m redundant blocks (BlockR1, BlockR2, …, BlockRm), "in parallel" having the same meaning as above.
To ensure that data reading devices can subsequently read the file contents correctly from the n file blocks and m redundant blocks, the correspondence rules between file data units and file blocks, and between redundant data units and redundant blocks, must be agreed upon. For example: the file data unit read out first from the file (from low byte to high byte) is written into the file block/data node that appears first in the file block information list of the metadata information. Assuming the file data units are read in the order Ui1, Ui2, …, Uin, and the file blocks appear in the file block information list in the order BlockD1, BlockD2, …, BlockDn, then Ui1 is written into BlockD1, Ui2 into BlockD2, …, and Uin into BlockDn. When the file data is later read, the same correspondence rule is applied. The correspondence between redundant data units and redundant blocks is similar and is not described again.
Taking fig. 6B as an example, for D1 and the R1 calculated from D1, the n file data units in D1 may be written in parallel into BlockD1 on DataNode1, BlockD2 on DataNode2, …, BlockDn on DataNoden; the m redundant data units in R1 may be written in parallel into BlockR1 on DataNode(n+1), BlockR2 on DataNode(n+2), …, BlockRm on DataNode(n+m).
Taking the sample shown in Table 3 as an example, the data writing device may write the file data units read from the file in parallel into BlockD1 (at /root/blkd_001 on data node 3), BlockD2 (at /root/blkd_002 on data node 7), BlockD3 (at /root/blkd_003 on data node 10) and BlockD4 (at /root/blkd_004 on data node 12), and write the calculated redundant data units into BlockR (at /root/blkr_001 on data node 15).
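The parallel write of step 608 might be sketched as follows, with one task per data-node session so that the n units of a packet are written without waiting for each other; BlockSession and writeToBlock are hypothetical stand-ins for the per-node sessions established in step 605.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

interface BlockSession { void writeToBlock(byte[] unit); }

class PacketWriter {
    // Writes unit j of the packet to block j, in metadata-list order, all at once.
    static void writePacketParallel(ExecutorService pool, byte[][] units,
                                    List<BlockSession> sessionsInListOrder) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (int j = 0; j < units.length; j++) {
            final int idx = j; // unit j goes to the j-th block in the list
            pending.add(pool.submit(() -> sessionsInListOrder.get(idx).writeToBlock(units[idx])));
        }
        for (Future<?> f : pending) f.get(); // wait for all n writes to finish
    }
}
```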
According to step 607, each redundant data unit is calculated from multiple file data units, and when any one of the file data units and redundant data units is missing (because its node has failed), it can be calculated from the other file data units/redundant data units according to the redundancy algorithm (i.e., the missing data unit is recovered). Because the number of data units contained in each file block and redundant block is aligned, as shown in fig. 6B, at (x+1) units each, when a node failure causes the loss of a file block, the data unit recovery operation can be performed (x+1) times to recover the whole file block, ensuring the fault tolerance of the storage system.
On this basis, the utilization rate of the storage space is further improved: the utilization rate is n/(n+m), so when m is 1 and n is 4 the space utilization reaches 80%, far greater than the storage space utilization of the existing distributed storage system (≤ 33%).
Optionally, to improve the efficiency of writing file data, the data writing device may use a caching mechanism when writing data into the file blocks or redundant blocks: it may write multiple file data units into a buffer and, when the buffer is full, write them into the corresponding file block together. For example, in fig. 6B, file data units U11, U21, …, U(x+1)1 are to be written into BlockD1 on DataNode1. The data writing device may write one file data unit into BlockD1 at a time, (x+1) times; or it may put U11, U21, …, U(x+1)1 into a buffer (assuming the buffer is large enough) and then write the (x+1) data units into BlockD1 at once through the session established with DataNode1. Redundant data units are written in a similar way and are not described again.
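A minimal sketch of this buffering mechanism, reusing the hypothetical BlockSession from the previous sketch: units destined for one block are staged locally and flushed to the data node in a single write once the buffer fills.

```java
import java.io.ByteArrayOutputStream;

class BufferedBlockWriter {
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private final int flushThreshold;
    private final BlockSession session; // hypothetical per-node session, as above

    BufferedBlockWriter(BlockSession session, int flushThreshold) {
        this.session = session;
        this.flushThreshold = flushThreshold;
    }

    void write(byte[] unit) {
        buf.writeBytes(unit);           // stage the unit locally
        if (buf.size() >= flushThreshold) flush();
    }

    void flush() {                      // one network write for many staged units
        if (buf.size() > 0) { session.writeToBlock(buf.toByteArray()); buf.reset(); }
    }
}
```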
Step 609: the data writing device notifies the management node of updating metadata information (file identification, number of appended data).
This step is optional, depending on whether there is additional data during the loop execution of steps 606-608.
Assuming there is appended data: specifically, the data writing device may send a metadata update request to the management node, containing the file identifier and the number of appended data. The management node writes the number of appended data into the metadata information corresponding to the file identifier.
In the sixth embodiment, the data writing device can read file data from a file to be written, calculate redundant data, and write the file data and the redundant data into the multiple file blocks and the multiple redundant blocks located on the multiple data nodes of the distributed file storage system, so that the space utilization rate of the distributed file storage system and the writing speed of the file are improved on the basis of ensuring the fault tolerance.
Example seven
Fig. 7A is a flowchart of another method for reading a file in a distributed file storage system according to an embodiment of the present application, including the following steps:
Step 701: the data reading device sends a file read request (file identifier) to the management node.
Specifically, the data reading device sends a file reading request to the management node, where the reading request includes a file identifier, and the file identifier is a unique identifier of the file to be read in the distributed file storage system.
Step 702: the management node returns a file read response (metadata information) to the data reading device.
Specifically, the management node queries the corresponding metadata information from a metadata storage space (located on a local or remote device) according to the file identifier, and sends a file read response to the data reading device. The metadata information is as described in step 604, and may include the amount of appended data, which is used in the subsequent steps.
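For orientation, the metadata fields used throughout this embodiment could be grouped as in the following sketch; the class and field names are illustrative assumptions, not the patent's actual record format:

```java
import java.util.List;

// Illustrative grouping of the metadata fields referenced in steps 702-707:
// the block list drives session setup and read order, and appendedBytes
// drives the trailing-data removal in step 707.
class FileMetadata {
    String fileId;              // unique identifier, e.g. "/home/foo"
    int n;                      // number of file blocks
    int m;                      // number of redundant blocks
    int unitLength;             // data unit length in bytes, e.g. 8
    String redundancyAlgorithm; // e.g. "parity"
    List<BlockLocation> fileBlocks;      // n entries
    List<BlockLocation> redundantBlocks; // m entries
    long appendedBytes;         // padding added at write time, removed on read
}

class BlockLocation {
    String blockId;  // e.g. "BlockD1"
    String dataNode; // e.g. "DataNode1"
    String path;     // e.g. "/root/blkd_001"
}
```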
Step 703: the data reading apparatus makes a reading preparation: a session is established.
Specifically, the data reading device establishes a session with a data node where each file block is located according to a file block information list in the metadata information.
Step 704: the data reading device creates a file.
Specifically, the data reading device creates a new file in its file storage space (located on a local or remote device). At this point the newly created file contains no file data; the data reading device will then loop through the following steps 705 to 706, reading one file data unit from each file block in the file block information list and writing the units into the newly created file.
Step 705: the data reading device reads n file data units.
Specifically, the data reading device reads one file data unit from n file blocks located on n data nodes in parallel according to the file block information list, and obtains n file data units.
Taking fig. 7B as an example, the data reading device reads the file data unit Ui1 from BlockD1 on DataNode1, reads Ui2 from BlockD2 on DataNode2, and so on, up to Uin from BlockDn on DataNoden.
Optionally, to improve the efficiency of reading file data, the data reading device may read data from a file block through a caching mechanism. For example, in fig. 7B, the file data units U11, U21, ..., U(x+1)1 are to be read out and eventually written into the newly created file; the data reading device may either read one file data unit from BlockD1 at a time, (x + 1) times in total, or read U11, U21, ..., U(x+1)1 from BlockD1 into a local buffer at once (assuming the buffer is large enough), and then read the units from the buffer and write them into the newly created file.
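As an illustration of the parallel, unit-by-unit reading in step 705, a minimal sketch using one task per established data-node session; the BlockSession abstraction and all names here are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of step 705: in each round, read the i-th data unit from each of
// the n file blocks in parallel, one task per data-node session.
class ParallelUnitReader {
    // Hypothetical session abstraction: reads one fixed-length unit
    // from the file block behind this session.
    interface BlockSession {
        byte[] readUnit(int unitLength) throws Exception;
    }

    private final ExecutorService pool;

    ParallelUnitReader(int n) {
        this.pool = Executors.newFixedThreadPool(n); // one thread per block
    }

    // Returns units[j] = Uij read from the j-th file block of this round.
    byte[][] readRound(List<BlockSession> sessions, int unitLength)
            throws InterruptedException, ExecutionException {
        List<Future<byte[]>> futures = new ArrayList<>();
        for (BlockSession s : sessions) {
            futures.add(pool.submit(() -> s.readUnit(unitLength)));
        }
        byte[][] units = new byte[sessions.size()][];
        for (int j = 0; j < units.length; j++) {
            units[j] = futures.get(j).get(); // blocks until unit j arrives
        }
        return units;
    }
}
```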
Step 706: the data reading device writes n file data units into the file.
Specifically, the data reading device writes the n file data units into the newly created file. The writing order follows the correspondence rule between file data units and file blocks/nodes that the data writing device used when writing the file data. For example, assume the rule adopted by the data writing device is: a file data unit read earlier from the original file is written into a file block that appears earlier in the file block information list of the metadata information. Then, when reading, a file data unit read from a file block that appears earlier in the file block information list is written into the newly created file earlier. Taking fig. 7B as an example, assume that the file blocks appear in the file block information list in the order BlockD1 on DataNode1, BlockD2 on DataNode2, ..., BlockDn on DataNoden; according to the correspondence rule, Ui1 read from BlockD1 is written into the newly created file first, then Ui2 read from BlockD2, and so on, with Uin read from the last block BlockDn written last.
When all the file data units of all the file blocks have been read out and written into the newly created file, the newly created file is the file read out from the distributed file storage system.
Step 707: the data reading device removes the appended data.
This step is optional. The data reading device may determine whether to remove appended data from the end of the file based on the amount of appended data in the metadata information read in step 702. If that amount is greater than zero, the corresponding number of bytes is removed from the end of the file. For example, if the amount of appended data is 15 bytes, 15 bytes are removed from the end of the file to restore the original file content as written.
Alternatively, in step 706, the appended data may be identified in advance according to the amount of appended data in the metadata information, and only the genuine file data written into the newly created file, so that no appended data needs to be removed from the end of the newly created file by this step.
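Step 707 amounts to a simple truncation. A minimal sketch using the standard Java RandomAccessFile API (the surrounding class and method name are illustrative):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class AppendedDataRemover {
    // Sketch of step 707: drop the bytes that were appended at write time
    // by truncating the reconstructed file by `appendedBytes`.
    static void removeAppendedData(String filePath, long appendedBytes)
            throws IOException {
        if (appendedBytes <= 0) {
            return; // nothing was appended at write time
        }
        try (RandomAccessFile f = new RandomAccessFile(filePath, "rw")) {
            f.setLength(f.length() - appendedBytes);
        }
    }
}
```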
In the seventh embodiment, the data reading device reads the file from the distributed file storage system; because the data can be read in parallel, the reading speed of the file is greatly improved.
Example eight
Fig. 8A is a flowchart of another method for recovering data in a distributed file storage system according to an embodiment of the present application, including the following steps:
Step 801: the management node determines that the failed data node has restarted.
In particular, the management node determines that one or more data nodes in the distributed file storage system have recovered from the failure (e.g., by detecting heartbeat messages of the failed device), but that file blocks on the data node or nodes need to be recovered (because they have been corrupted or lost).
Step 802: and the management node acquires the metadata information of the file.
Specifically, the management node determines a file related to the failed data node and obtains the metadata information of the file. The redundancy scheme parameters include n, m, and w, where n is the number of file blocks of the file (which is also the number of data nodes storing the file blocks), m is the number of redundant blocks of the file (also the number of data nodes storing the redundant blocks), and w is the length of a data unit (if this value is missing from the metadata, a default length may be adopted). Let d be the number of data nodes that failed and have resumed operation but whose data blocks need to be recovered; if d is at most m, recovery is possible, and n + m - d data nodes are normal.
Step 803: the management node performs recovery preparation: a session is established with the normal data node.
Specifically, the management node performs recovery preparation, including determining the n + m-d normal data nodes, and establishing sessions with the data nodes.
Step 804: the management node performs recovery preparation: and establishing a session with the data node restarted after the fault, and triggering to create a file block.
Specifically, the management node performs recovery preparation, including determining the d data nodes restarted after the failure, establishing sessions with these data nodes, and triggering each of them to create a file block corresponding to its damaged or lost file block. At this point, each newly created file block has only been allocated storage space and contains no valid file data yet. The management node then loops through steps 805 to 807 to recover the newly created file blocks.
Step 805: and the management node reads the file data unit/redundant data unit on the normal data node.
Specifically, the management node reads n + m-d file data units/redundant data units from the n + m-d normal data nodes.
Taking fig. 8B as an example, assume d is 1 and the DataNode2 is the data node restarted after a failure. Then n + m - 1 file data units/redundant data units are read from the n + m - 1 normal data nodes DataNode1, DataNode3, ..., DataNoden and DataNode(n+1), ..., DataNode(n+m), namely: Ui1, Ui3, ..., Uin and Uri1, ..., Urim.
Step 806: and the management node calculates d data units according to a redundancy algorithm.
Specifically, the management node calculates d data units based on the n + m-d file data units/redundant data units according to a redundancy algorithm included in the metadata information of the file.
Continuing the example in step 805, assume m is 1 and the redundancy algorithm is a parity algorithm; then Ui2 = Ui1 XOR Ui3 XOR ... XOR Uin XOR Uri1 can be calculated. If m is 2 and d is 2, the two missing data units may be recovered by using the parity check algorithm and the Galois-field-based Q-check algorithm.
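For the m = 1 parity case, the recovery computation of step 806 reduces to XOR-ing the surviving units. A minimal generic sketch (not the patent's implementation); the caller passes in the n - 1 surviving file data units plus the redundant unit:

```java
// Sketch of step 806 for m = 1 (parity): because
// Uri = Ui1 XOR Ui2 XOR ... XOR Uin, any single missing unit equals the
// XOR of all surviving file data units and the redundant unit.
class ParityRecovery {
    static byte[] recoverMissingUnit(byte[][] survivingUnits, int unitLength) {
        byte[] recovered = new byte[unitLength]; // starts as all zeros
        for (byte[] unit : survivingUnits) {
            for (int i = 0; i < unitLength; i++) {
                recovered[i] ^= unit[i]; // byte-wise XOR accumulation
            }
        }
        return recovered;
    }
}
```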
Step 807: and the management node writes the d data units into the d data nodes.
Specifically, the management node writes the d data units into the d newly created file blocks; the correspondence between each data unit and each file block is consistent with that used when the file was written.
The eighth embodiment enables the management node in the distributed file storage system to recover the damaged file blocks of a file from the normal file blocks and redundant blocks of the file according to the metadata information of the file, thereby ensuring the fault tolerance of the distributed file storage system.
The scheme described in the above embodiments of the present application is further described below by taking the HDFS architecture as an example.
Example nine
Fig. 9 is a schematic structural diagram of an HDFS system according to an embodiment of the present disclosure. The HDFS system includes an HDFS Client 901, a NameNode1 902, a NameNode2 903, a DataNode1 904, a DataNode2 905, a DataNode3 906, a DataNode4 907, and a DataNode5 908, whose functions are respectively described as follows:
HDFS Client 901: the client of the HDFS, which includes the functions of the data writing device 101 and the data reading device 102 shown in fig. 1; that is, it can write file content into DataNode1-DataNode5 with the assistance of the NameNode1 902 or the NameNode2 903, and read file content out of DataNode1-DataNode5.
NameNode1 902: the master management node of the HDFS, which includes the function of the management node 103 shown in fig. 1 and manages the metadata information of files.
NameNode2 903: the standby management node of the HDFS. When the NameNode1 fails, the NameNode2 takes over its work, improving the availability of the system.
DataNode1 904: a data node of the HDFS, which includes the function of the data node 104 shown in fig. 1 and stores file blocks or redundant blocks.
DataNode2 905: similar to DataNode1 904.
DataNode3 906: similar to DataNode1 904.
DataNode4 907: similar to DataNode1 904.
DataNode5 908: similar to DataNode1 904.
Fig. 9 shows only 5 DataNodes; there may be more.
The following describes a technical solution for writing a file based on the system architecture shown in fig. 9 by way of example.
Example ten
Fig. 10 is a flowchart of a method for the HDFS Client to write a file into the HDFS system according to an embodiment of the present application. Here "foo" is the file to be written, stored locally on the HDFS Client; its identifier (path) in the HDFS system is "/home/foo"; and the HDFS Client determines to adopt a 4+1 redundancy scheme (that is, 4 file blocks and 1 redundant block are required). The process by which the HDFS Client writes the file into /home/foo specifically includes the following steps:
Step 1001: the HDFS Client sends a file creation request to the NameNode1, where the request carries the file identifier /home/foo and the redundancy scheme parameter: 4+1 redundancy.
Specifically, the HDFS Client sends a file creation request to the NameNode1, where the request carries the file identifier /home/foo and the redundancy scheme parameters: 4 file blocks and 1 redundant block. Because no redundancy algorithm is carried, a default algorithm is adopted, such as a parity algorithm; and since the length of the file data unit to be used when reading the file is not indicated, a default value is adopted, such as 8 bytes.
Regarding the length of the file data unit, optionally, the HDFS system may also set it to be smaller than the size (e.g. 64 KB) of the file read/write buffer of the devices in the HDFS (HDFS Client, DataNode, etc.), so that the data units read from each packet can be written in parallel into the file blocks located on the data nodes, improving file read/write performance.
Step 1002: the NameNode1 selects a data node.
Specifically, the NameNode1 selects a DataNode1-DataNode5 for the file to be written from the multiple data nodes in the HDFS, where the DataNode1-DataNode4 is used to store the file block, and the DataNode5 is used to store the redundant block.
Step 1003: the NameNode1 returns a file creating response to the HDFS Client, and the response carries the metadata information of/home/foo.
Specifically, the NameNode1 creates metadata information in its storage space, as described in Table 4.
TABLE 4 metadata information examples
(Table 4 is rendered as an image in the source. Based on the surrounding description, it records for /home/foo: the redundancy scheme parameters n = 4, m = 1; the redundancy algorithm, parity by default; the data unit length, 8 bytes by default; the file blocks BlockD1-BlockD4 stored on DataNode1-DataNode4, e.g. BlockD1 at /root/blkd_001 on DataNode1; and the redundant block BlockR at /root/blkr_001 on DataNode5.)
Alternatively, the NameNode1 may itself create BlockD1-BlockD4 and BlockR on the DataNode1-DataNode5, respectively, in which case the NameNode1 may further include indication information of "file block/redundant block created" in the file creation response.
Step 1004: the HDFS Client establishes a session with the DataNode1, creating a file block BlockD 1.
Specifically, after receiving the file creation response, the HDFS Client establishes a session with the DataNode1 according to the metadata information contained therein, and, according to the metadata information in Table 4, notifies the DataNode1 to create a physical file "blkd_001" in its local "/root" directory as the storage location of the file block identified as BlockD1.
Optionally, the HDFS Client may notify the DataNode1 to create the physical file corresponding to BlockD1 only when it determines that the file creation response does not include the indication information of "file block/redundant block created".
Step 1005: the HDFS Client establishes a session with the DataNode2, creating a file block BlockD 2.
Similar to step 1004.
Step 1006: the HDFS Client establishes a session with the DataNode3, creating a file block BlockD 3.
Similar to step 1004.
Step 1007: the HDFS Client establishes a session with the DataNode4, creating a file block BlockD 4.
Similar to step 1004.
Step 1008: the HDFS Client establishes a session with the DataNode5, creating a redundant block BlockR.
Specifically, the HDFS Client establishes a session with the DataNode5, and, according to the metadata information described in Table 4, notifies the DataNode5 to create a file "blkr_001" in its local "/root" directory as the storage space of the redundant block identified as BlockR.
After the HDFS Client confirms that the 4 file blocks and 1 redundant block have been created, the following steps 1009-1014 are executed in a loop until the data of the local foo file has been completely read, thereby writing foo into the HDFS (into the 4 file blocks and 1 redundant block).
Step 1009: the HDFS Client reads the next group of data from the local file foo and calculates the redundant data unit. If the remaining file data is insufficient, bytes may be appended to complete the packet.
Specifically, the HDFS Client sequentially reads the next packet (assumed to be the i-th packet) from the head (low address) of the file toward the tail (high address), and calculates the redundant data from it according to the parity algorithm (there is only 1 redundant data unit, referred to directly as the calculated redundant data unit). According to the conditions of step 1001 and the metadata information of Table 4, the packet length is 4 × 8 = 32 bytes, containing 4 file data units Ui1, Ui2, Ui3, Ui4; the redundant data unit is then Uri = Ui1 XOR Ui2 XOR Ui3 XOR Ui4. According to the metadata information of Table 4, the storage locations of the 4+1 data units are shown in Table 5.
TABLE 5 Storage location correspondence

Data unit:        Ui1      Ui2      Ui3      Ui4      Uri
Storage location: BlockD1  BlockD2  BlockD3  BlockD4  BlockR
If, after reading to the end of the local file foo, fewer than 32 bytes remain, the HDFS Client should append data to complete the packet (32 bytes) and record the amount of appended data, so that the NameNode1 can later be notified to update the appended byte count into the metadata information corresponding to /home/foo; see step 1015.
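Putting the packet assembly of step 1009 together, the following minimal sketch reads the next 32-byte packet (padding a short final packet), splits it into four 8-byte units, and computes the parity unit. The class is hypothetical, and zero-byte padding is one possible choice of appended data; the sizes match the 4+1 scheme assumed in this embodiment:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of step 1009 under the 4+1 scheme: read the next 32-byte packet
// from foo, pad a short final packet with zero bytes (recording how many),
// split it into four 8-byte units Ui1..Ui4, and compute Uri by parity.
class PacketReader {
    static final int UNIT_LEN = 8, UNITS = 4, PACKET_LEN = UNIT_LEN * UNITS;
    long appendedBytes = 0; // nonzero only for the final short packet

    // Returns {Ui1, Ui2, Ui3, Ui4, Uri}, or null at end of file.
    byte[][] nextPacket(InputStream foo) throws IOException {
        byte[] packet = new byte[PACKET_LEN];
        int read = 0, r;
        while (read < PACKET_LEN
                && (r = foo.read(packet, read, PACKET_LEN - read)) != -1) {
            read += r;
        }
        if (read == 0) {
            return null; // nothing left to write
        }
        appendedBytes = PACKET_LEN - read; // zeros already pad the array

        byte[][] units = new byte[UNITS + 1][UNIT_LEN]; // Ui1..Ui4 plus Uri
        for (int u = 0; u < UNITS; u++) {
            System.arraycopy(packet, u * UNIT_LEN, units[u], 0, UNIT_LEN);
            for (int i = 0; i < UNIT_LEN; i++) {
                units[UNITS][i] ^= units[u][i]; // Uri = Ui1 XOR ... XOR Ui4
            }
        }
        return units;
    }
}
```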
Step 1010: the HDFS Client writes the file data unit Ui1 to BlockD1.
After the write, 8 bytes of content, i.e. the content contained in Ui1, are appended to the DataNode1 local file "/root/blkd_001".
Step 1011: the HDFS Client writes the file data unit Ui2 to BlockD2.
Similar to step 1010.
Step 1012: the HDFS Client writes the file data unit Ui3 to BlockD3.
Similar to step 1010.
Step 1013: the HDFS Client writes the file data unit Ui4 to BlockD4.
Similar to step 1010.
Step 1014: the HDFS Client writes the redundant data unit Uri to BlockR.
Similar to step 1010.
It should be noted that the writes to the file blocks and redundant block in steps 1010 to 1014 may be executed by the HDFS Client in parallel, which greatly increases the writing speed of the file foo (about 4 times that of sequential writing).
Step 1015: and the HDFS Client sends a metadata updating request to the NameNode, wherein the request comprises the file identification and the quantity of the additional data.
Specifically, the HDFS Client sends a metadata update request to the NameNode1, where the request includes the file identifier /home/foo and the amount of appended data (assumed to be 15). The NameNode1 writes the amount of appended data into the metadata information shown in Table 4, so that when another HDFS Client subsequently reads /home/foo, the appended bytes can be removed accordingly and the original file restored.
Optionally, in the above steps, the HDFS Client writes Ui1, Ui2, Ui3, Ui4, and Uri into the sessions corresponding to BlockD1, BlockD2, BlockD3, BlockD4, and BlockR, respectively, in each loop iteration. Because the sessions are already open, data can be written directly into the network socket connection corresponding to each session. To further improve write performance, a buffer may be used for each session, accumulating several bytes before performing a single write. After data is written into a session, it is transmitted to the corresponding DataNode host, where the corresponding service writes the data into the host's storage in time; a buffer may likewise be used on the DataNode side before writing, to improve performance.
The above embodiment enables the HDFS Client to raise the storage space utilization for a single file to 4/(4+1) = 80%, with the writing speed increased by a factor of 4 (the 4 file blocks can be written in parallel). Moreover, different files can adopt different redundancy scheme parameters, so that files with different service characteristics can adopt different storage strategies.
Example eleven
Fig. 11 is a flowchart of a method for the NameNode to recover data in the HDFS according to an embodiment of the present application. Here the DataNode4 is a data node restarted after a failure in which all of its file blocks were lost, including the file block BlockD4 of "/home/foo" described in the tenth embodiment. The process by which the NameNode1 recovers BlockD4 may specifically include the following steps:
step 1101: the NameNode1 determines that the failed node has restarted.
Specifically, the NameNode1 determines, through heartbeat detection or the like, that the DataNode4 has restarted from the failure, but all file blocks therein need to be recovered, including the file block BlockD4 of "/home/foo".
Step 1102: the NameNode1 obtains the metadata information of the file/home/foo.
The NameNode1 obtains the metadata information for the file "/home/foo", as shown in Table 4.
Step 1103: the NameNode1 establishes a session with the DataNode 1.
Step 1104: the NameNode1 establishes a session with the DataNode 2.
Step 1105: the NameNode1 establishes a session with the DataNode 3.
Step 1106: the NameNode1 establishes a session with the DataNode4 and recreates the file block BlockD 4.
The specific process is similar to step 1007.
Step 1107: the NameNode1 establishes a session with the DataNode 5.
After the NameNode1 confirms that the sessions with all relevant nodes are normal and that BlockD4 has been created, the following steps 1108-1113 are executed in a loop until the data in BlockD1 and the other blocks has been completely read, thereby recovering BlockD4.
Step 1108: the NameNode1 reads the next file data unit from Block D1.
Specifically, according to the metadata information of "/home/foo", the data unit length is 8 bytes, so the NameNode1 reads the next 8-byte file data unit, Ui1, from BlockD1. The reading order may be sequential, from the head (low address) of BlockD1 to the tail (high address).
Step 1109: the NameNode1 reads the next file data unit from Block D2.
Similarly, the NameNode1 reads an 8-byte file data unit, Ui2, from BlockD2.
Step 1110: the NameNode1 reads the next file data unit from Block D3.
Similarly, the NameNode1 reads an 8-byte file data unit, Ui3, from BlockD3.
Step 1111: the NameNode1 reads the next redundant data unit from BlockR.
Similarly, the NameNode1 reads an 8-byte redundant data unit, Uri, from BlockR.
Step 1112: the NameNode1 calculates a data unit.
Specifically, from Ui1, Ui2, Ui3 and Uri, the NameNode1 calculates, according to the redundancy algorithm (the parity algorithm), the data unit Ui4 = Ui1 XOR Ui2 XOR Ui3 XOR Uri.
Step 1113: the NameNode1 writes the calculated file data unit to Block D4.
Specifically, the NameNode1 writes the calculated file data unit Ui4 to BlockD4 on the DataNode4, after which BlockD4 grows by one data unit length (8 bytes).
Example twelve
Fig. 12 is a hardware structure diagram of a device of the distributed file storage system according to an embodiment of the present application. In all embodiments of the present application, the data writing device, the data reading device, and the management node (where the data writing device and the data reading device may be the HDFS Client 901 in fig. 9, and the management node may be 902 in fig. 9) may all employ the general-purpose computer hardware shown in fig. 12, which includes a processor 1201, a memory 1202, a bus 1203, an input device 1204, an output device 1205, and a network interface 1206.
In particular, the memory 1202 may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory and/or random access memory. Memory 1202 may store an operating system, application programs, other program modules, executable code, and program data.
The input device 1204 may be used to input information to facilitate operation and management of the device by a system administrator, such as setting a default redundancy algorithm. The input device 1204 may be, for example, a keyboard or a pointing device such as a mouse, trackball, touch pad, microphone, joystick, game pad, satellite dish, or scanner. These input devices may be connected to the processor 1201 through the bus 1203.
The output device 1205 may be used to output information to facilitate operation and management of the device by a system administrator. In addition to a monitor, the output device 1205 may drive other peripheral devices such as speakers and/or printing devices, which may likewise be connected to the processor 1201 via the bus 1203.
The devices (data writing device, data reading device, management node) can be connected to a Network, for example, a Local Area Network (LAN), via Network interface 1206. In a networked environment, computer-executable instructions stored in the devices may be stored in remote memory storage devices, and are not limited to local storage.
When the processor 1201 in the device executes the executable code or application stored in the memory 1202, and the device is a data writing device, the method steps corresponding to the data writing device in all the above embodiments may be performed; when the device is a data reading device, the method steps corresponding to the data reading device in all the above embodiments can be executed; when the device is a management node, the method steps corresponding to the management node in all the above embodiments can be executed; for specific execution, reference is made to the above embodiments, which are not described herein again.
Example thirteen
Fig. 13 is a schematic structural diagram of a data writing device according to an embodiment of the present application, where the data writing device includes:
a data reading module 1301, configured to read file data from a file to be written, where the specific implementation process is described in the description of the steps on the data writing device side in the first to eleventh embodiments, such as steps 201 and 606;
a redundancy calculation module 1302, configured to perform redundancy calculation on the file data read by the data reading module 1301 to obtain redundant data, where the specific implementation process refers to the description of the steps on the data writing device side in the first to eleventh embodiments, such as steps 202 and 606;
a data writing module 1303, configured to write the file data read by the data reading module 1301 and the redundant data calculated by the redundant calculating module 1302 into data nodes of the distributed file storage system, where the specific implementation process refers to the description of the steps on the data writing device side in the first to eleventh embodiments, such as steps 203 and 608.
In the present embodiment, the data writing device is presented in the form of a functional module. A "module" as used herein may refer to an application-specific integrated circuit (ASIC), an electronic circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that provide the described functionality. In a simple embodiment, those skilled in the art may think that the data writing device may also adopt the form shown in fig. 12, and the data reading module 1301, the redundancy calculation module 1302 and the data writing module 1303 may all be implemented by the processor 1201 and the memory 1202 in fig. 12. For example, the function of the redundancy calculation module 1302 to perform redundancy calculation on file data may be implemented by the processor 1201 executing code stored in the memory 1202.
Example fourteen
Fig. 14 is a schematic structural diagram of a data reading apparatus provided in an embodiment of the present application, where the data reading apparatus includes:
a location obtaining module 1401, configured to obtain a storage location of a file block of a file to be read out, where the specific implementation process refers to the description of the steps on the data reading device side in the first to eleventh embodiments, such as steps 701 and 702;
a data reading module 1402, configured to read file data from the file block of the file to be read, where the specific implementation process is described in the description of the steps on the data reading device side in the first to eleventh embodiments, as step 705 and the like; .
A data writing module 1403, configured to write the read file data into a file created by the data reading device, specifically perform the process described in the steps on the data reading device side in the first to eleventh embodiments, such as step 706;
in the present embodiment, the data readout device is presented in the form of a functional module. A "module" as used herein may refer to an application-specific integrated circuit (ASIC), an electronic circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that provide the described functionality. In a simple embodiment, those skilled in the art will appreciate that the client may also take the form shown in fig. 12. The position acquisition module 1401, the data reading module 1402 and the data writing module 1403 can be implemented by the processor 1201 and the memory 1202 in fig. 12. For example, the function of the data writing module 1403 to write the read file data to a newly created file can be realized by the processor 1201 executing code stored in the memory 1202.
Example fifteen
Fig. 15 is a schematic structural diagram of a management node according to an embodiment of the present application, where the management node includes:
a data reading module 1501, configured to read the file data and the redundant data from the normal file block and the redundant block, respectively, and refer to the descriptions of the steps on the management node side in the first to eleventh embodiments, as step 805 and the like;
a redundancy calculation module 1502 for performing redundancy calculation on the file data and the redundant data read by the data reading module 1501, the specific implementation process is described in the description of the steps on the management node side in the first to eleventh embodiments, such as step 806;
a data recovery module 1503, configured to perform data recovery according to the redundant data calculated by the redundancy calculation module 1502, where the specific implementation process is described in the steps on the management node side in the first to eleventh embodiments above, such as step 807.
Optionally, the management node further includes a metadata management module 1504, configured to manage the metadata information of the file (e.g., storage locations of redundant blocks, the redundancy algorithm, etc.); the specific implementation process is described in the steps on the management node side in the first to eleventh embodiments, such as steps 701, 702, 801, and 802.
In this embodiment, the management node is presented in the form of a functional module. A "module" as used herein may refer to an application-specific integrated circuit (ASIC), an electronic circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that provide the described functionality. In a simple embodiment, those skilled in the art will appreciate that the management node may also adopt the form shown in fig. 12, with the data reading module 1501, the redundancy calculation module 1502 and the data recovery module 1503 implemented by the processor 1201 and the memory 1202 in fig. 12. For example, the function of the data recovery module 1503, performing data recovery according to the data calculated by the redundancy calculation module 1502, can be realized by the processor 1201 executing code stored in the memory 1202.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general functional terms. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division of the units is only one logical division, and the actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes beyond the prior art, may be embodied in whole or in part as a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, or a magnetic or optical disk.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (28)

1. A data management method applied to a distributed file storage system, the distributed file storage system including a plurality of data nodes, the method comprising:
reading file data from a file to be written to form n file data units, wherein the n file data units contain the read file data;
performing redundancy calculation on the n file data units to obtain m redundant data units;
writing the n file data units into n file blocks on the plurality of data nodes, respectively, and writing the m redundant data units into m redundant blocks on the plurality of data nodes, respectively;
wherein n and m are both positive integers;
the storage locations of the n file blocks are respectively located on n data nodes of the plurality of data nodes, the storage locations of the m redundant blocks are respectively located in the other m data nodes of the plurality of data nodes, and the writing the n file data units into the n file blocks and the writing the m redundant data units into the m redundant blocks respectively includes:
and writing the n file data units into the n file blocks respectively in parallel, and writing the m redundant data units into the m redundant blocks respectively in parallel.
2. The method of claim 1, wherein the storage locations of the n file blocks are located on n data nodes of the plurality of data nodes, respectively, and wherein writing the n file data units into the n file blocks on the plurality of data nodes, respectively, comprises:
and writing the n file data units into the n file blocks respectively in parallel.
3. The method of claim 1, wherein reading file data from the file to be written to form n file data units comprises:
determining that the data volume of the file read from the file to be written is less than the data volume corresponding to the n file data units;
and appending data to the end of the read file data to form the n file data units.
4. The method of claim 3, wherein the distributed file storage system further comprises a management node that manages the plurality of data nodes, the method further comprising:
sending the amount of the supplemental data to the management node.
5. The method of claim 1, wherein the distributed file storage system further comprises a management node that manages the plurality of data nodes, and wherein before writing the n units of file data to the n file blocks, respectively, the method further comprises:
and acquiring the storage position information of the n file blocks and the m redundant blocks in the plurality of data nodes from the management node.
6. The method of claim 5, wherein the acquiring the storage position information of the n file blocks and the m redundant blocks from the management node comprises:
sending a storage location request to the management node, wherein the storage location request contains the unique identifier of the file to be written in the distributed file storage system;
and receiving a storage location response from the management node, wherein the storage location response comprises the storage locations of the n file blocks in the plurality of data nodes and the storage locations of the m redundant blocks in the plurality of data nodes.
7. The method of claim 6, wherein the storage location request further comprises a redundancy scheme identifier or the n and m.
8. The method of claim 6, wherein the storage location response comprises storage locations of the n file blocks in the plurality of data nodes and storage locations of the m redundant blocks in the plurality of data nodes, and wherein:
the storage location response comprises metadata information of the file to be written, and the metadata information comprises a corresponding relation between n pairs of file block identifications and storage locations of the file blocks in the data nodes and a corresponding relation between m pairs of redundant block identifications and storage locations of the redundant blocks in the data nodes.
9. The method of claim 1, wherein the distributed file storage system further comprises a management node that manages the plurality of data nodes, and wherein prior to performing the redundancy calculation on the n units of file data, the method further comprises:
and acquiring a redundancy algorithm corresponding to the redundancy calculation from the management node.
10. The method of claim 1, wherein the distributed file storage system further comprises a management node that manages the plurality of data nodes, the method further comprising:
and sending the redundancy algorithm corresponding to the redundancy calculation to the management node.
11. The method of any one of claims 1-10, wherein the redundancy algorithm comprises a parity check algorithm when m is 1, or a Galois-field-based Q-check algorithm when m is 2.
12. The method according to any one of claims 1 to 10, wherein the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
13. A data management method applied to a distributed file storage system, wherein the distributed file storage system comprises a plurality of data nodes, n file blocks and m redundant blocks of a file to be recovered are stored in the plurality of data nodes, and the n file blocks comprise f normal file blocks and n-f abnormal file blocks, and the method comprises the following steps:
respectively reading a file data unit from d normal file blocks to obtain d file data units, wherein d is less than or equal to f, and respectively reading a redundant data unit from m redundant blocks to obtain m redundant data units;
performing redundancy calculation on the d file data units and the m redundant data units to obtain n-f file data units;
restoring the n-f file data units to n-f abnormal file blocks in the n file blocks respectively;
wherein n, m, f and d are positive integers.
14. The method of claim 13, wherein the restoring the n-f file data units to the n-f abnormal file blocks of the n file blocks comprises:
creating new file blocks at the storage positions of the n-f abnormal file blocks respectively to obtain n-f new file blocks;
and respectively writing the n-f file data units into the n-f new file blocks.
15. The method of claim 13, wherein the redundancy algorithm comprises a parity check algorithm when m is 1, or a Galois-field-based Q-check algorithm when m is 2.
16. The method of claim 13, wherein before performing the redundancy calculation on the d file data units and the m redundant data units according to a redundancy algorithm to obtain n-f file data units, the method further comprises:
and acquiring the redundancy algorithm from the metadata information corresponding to the file to be restored.
17. The method of claim 13, wherein before reading one file data unit from each of the d normal file blocks, the method further comprises:
and acquiring the storage positions of the n file blocks and the storage positions of the m redundant blocks from the metadata information corresponding to the file to be restored.
18. The method of claim 13, wherein the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
19. A data management method applied to a distributed file storage system including a management node, a plurality of data nodes, and a data writing device, characterized by comprising:
the data writing equipment sends a storage position request to the management node, wherein the storage position request contains the unique identifier of the file to be written in the distributed file storage system;
the management node determines storage position information of n file blocks and m redundant blocks on the data nodes according to the unique identifier of the file to be written in the distributed file storage system, and returns a storage position response to the data writing device, wherein the storage position response comprises the storage position information, and n and m are positive integers;
the data writing equipment reads file data from the file to be written to form a first group of n file data units, performs redundancy calculation on the first group of n file data units to obtain m redundant data units, writes the first group of n file data units into the n file blocks respectively, and writes the m redundant data units into the m redundant blocks respectively;
wherein n and m are both positive integers;
when the n file blocks contain f normal file blocks and n-f abnormal file blocks, the method further comprises:
the management node respectively reads a file data unit from the d normal file blocks to obtain d file data units, and respectively reads a redundant data unit from the m redundant blocks to obtain m redundant data units;
the management node performs redundancy calculation on the d file data units and the m redundancy data units to obtain n-f file data units;
the management node respectively restores the n-f file data units to n-f abnormal file blocks in the n file blocks;
wherein f and d are positive integers, f is less than n, and d is less than or equal to f.
20. The method of claim 19, wherein the distributed file storage system further comprises a data readout device, the method further comprising:
the data reading equipment reads a file data unit from the n data file blocks respectively to obtain a second group of n file data units;
the data reading device writes the n file data units into a file created by the data reading device.
21. The method of claim 20, wherein:
the data writing device reading file data from the file to be written to form a first group of n file data units comprises:
the data writing equipment determines that the data volume of a file read from the file to be written is less than the data volume corresponding to n file data units, and data is added at the end of the read file data to form the first group of n file data units; sending the amount of the supplemental data to the management node;
the writing, by the data reading device, the n file data units into the file created by the data reading device includes:
and the data reading equipment acquires the number of the additional data from the management node, removes the additional data from the tail of the second group of n file data units according to the number of the additional data to obtain the residual file data, and writes the residual file data into the created file.
22. The method according to any one of claims 19 to 21, wherein the redundancy algorithm corresponding to the redundancy calculation comprises a parity check algorithm when m is 1, or comprises a Galois-field-based Q-check algorithm when m is 2.
23. The method according to any of claims 19-21, wherein the number of bytes contained in the file data unit and the redundant data unit are both multiples of 8.
24. A client device comprising a processor and a memory, wherein:
the memory to store program instructions;
the processor to invoke and execute program instructions stored in the memory to cause the client device to perform the data management method of any of claims 1 to 12.
25. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the data management method of any of claims 1 to 12.
26. A management device comprising a processor and a memory, wherein:
the memory to store program instructions;
the processor for invoking and executing program instructions stored in the memory to cause the management device to perform the data management method of any of claims 13 to 18.
27. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the data management method of any of claims 13 to 18.
28. A distributed file storage system comprising a plurality of data nodes, comprising:
the client device of claim 24;
the management device of claim 26.
CN201810213670.XA 2018-03-15 2018-03-15 Method, system and related device for data management in distributed file storage system Active CN110278222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810213670.XA CN110278222B (en) 2018-03-15 2018-03-15 Method, system and related device for data management in distributed file storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810213670.XA CN110278222B (en) 2018-03-15 2018-03-15 Method, system and related device for data management in distributed file storage system

Publications (2)

Publication Number Publication Date
CN110278222A CN110278222A (en) 2019-09-24
CN110278222B true CN110278222B (en) 2021-09-14

Family

ID=67957623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810213670.XA Active CN110278222B (en) 2018-03-15 2018-03-15 Method, system and related device for data management in distributed file storage system

Country Status (1)

Country Link
CN (1) CN110278222B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874285B (en) * 2019-11-19 2022-06-21 厦门市美亚柏科信息股份有限公司 Method for realizing reducible write operation of EXT file system
CN112148797A (en) * 2020-09-29 2020-12-29 中国银行股份有限公司 Block chain-based distributed data access method and device and storage node
CN112131229A (en) * 2020-09-29 2020-12-25 中国银行股份有限公司 Block chain-based distributed data access method and device and storage node
CN112256472A (en) * 2020-10-20 2021-01-22 平安科技(深圳)有限公司 Distributed data calling method and device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193746A (en) * 2010-03-11 2011-09-21 Lsi公司 System and method for optimizing redundancy restoration in distributed data layout environments
CN102279777A (en) * 2011-08-18 2011-12-14 成都市华为赛门铁克科技有限公司 Method and device for processing data redundancy and distributed storage system
CN102937967A (en) * 2012-10-11 2013-02-20 南京中兴新软件有限责任公司 Data redundancy realization method and device
EP2478438A4 (en) * 2009-09-16 2013-03-13 Varaani Works Oy Method and a storage server for data redundancy
CN103699494A (en) * 2013-12-06 2014-04-02 北京奇虎科技有限公司 Data storage method, data storage equipment and distributed storage system
CN103984607A (en) * 2013-02-08 2014-08-13 华为技术有限公司 Distributed storage method, device and system
CN104424052A (en) * 2013-09-11 2015-03-18 杭州信核数据科技有限公司 Automatic redundant distributed storage system and method
CN104935481A (en) * 2015-06-24 2015-09-23 华中科技大学 Data recovery method based on redundancy mechanism in distributed storage
CN105933386A (en) * 2016-04-06 2016-09-07 中科院成都信息技术股份有限公司 Construction method and device of storage system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10749921B2 (en) * 2016-06-01 2020-08-18 Netflix, Inc. Techniques for warming up a node in a distributed data store


Also Published As

Publication number Publication date
CN110278222A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
US11500852B2 (en) Database system with database engine and separate distributed storage service
CN110278222B (en) Method, system and related device for data management in distributed file storage system
US10387255B2 (en) Data reconstruction method in distributed storage system, apparatus, and system
US11003533B2 (en) Data processing method, system, and apparatus
WO2020010503A1 (en) Multi-layer consistent hashing-based distributed data storage method and system
US9785510B1 (en) Variable data replication for storage implementing data backup
JP6264666B2 (en) Data storage method, data storage device, and storage device
CN110535680B (en) Byzantine fault-tolerant method
CN105814544B (en) System and method for supporting persistent partition recovery in a distributed data grid
US20110258488A1 (en) Server Failure Recovery
WO2015100627A1 (en) Data processing method and device in distributed file storage system
US20140298078A1 (en) SYNCHRONOUS MIRRORING OF NVLog TO MULTIPLE DESTINATIONS (ARCHITECTURE LEVEL)
WO2016004120A2 (en) Storage system with virtual disks
WO2013163864A1 (en) Data persistence processing method and device and database system
CN104794119A (en) Middleware message storage and transmission method and system
US10223184B1 (en) Individual write quorums for a log-structured distributed storage system
CN112015591A (en) Log management method, server and database system
JP2021086289A (en) Distributed storage system and parity update method of distributed storage system
US11079960B2 (en) Object storage system with priority meta object replication
JP6671708B2 (en) Backup restore system and backup restore method
JP6376626B2 (en) Data storage method, data storage device, and storage device
US10740189B2 (en) Distributed storage system
US11093465B2 (en) Object storage system with versioned meta objects
US11645333B1 (en) Garbage collection integrated with physical file verification
CN112965859A (en) Data disaster recovery method and equipment based on IPFS cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant