WO2016095149A1 - Data compression storage method and apparatus, and distributed file system - Google Patents

Data compression storage method and apparatus, and distributed file system

Info

Publication number
WO2016095149A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
data
file
data compression
data block
Prior art date
Application number
PCT/CN2014/094179
Other languages
English (en)
French (fr)
Inventor
李雪斌
张创
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201480037404.6A priority Critical patent/CN106170968B/zh
Priority to PCT/CN2014/094179 priority patent/WO2016095149A1/zh
Publication of WO2016095149A1 publication Critical patent/WO2016095149A1/zh

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications

Definitions

  • The present invention relates to the field of storage technologies, and in particular to a data compression storage method and apparatus, and a distributed file system.
  • The Hadoop Distributed File System (HDFS) is a commonly used distributed file system that is highly fault tolerant and suitable for deployment on inexpensive machines.
  • HDFS can achieve high-throughput data access, so it is suitable for large-scale data applications.
  • In HDFS there are at least three types of functional nodes: the data node (DN), the name node (NN), and the HDFS client.
  • The data node stores the actual content of the files in the HDFS file system.
  • A file to be stored is divided into multiple data blocks (usually 64 MB or 128 MB per block), and multiple copies of the same data block are stored on different DNs to improve data storage reliability.
  • The name node, considered the core of the HDFS file system, stores the directory tree structure of all files in the distributed file system and the exact location of the file data on the data nodes.
  • The name node does not store the file content itself.
  • The HDFS client node is the device responsible for dividing a file to be stored into multiple data blocks and storing the data blocks as required by the name node.
  • The HDFS client node obtains the file to be stored and compresses it to obtain a compressed file; the HDFS client node then sends a file creation request to the name node to notify it that there is a file to be stored.
  • After receiving the file creation request, the name node sends the parameter information of the compressed file to the HDFS client node.
  • The HDFS client node compresses the file to be stored and divides it into a plurality of data blocks as indicated by the parameter information, then obtains from the name node the data nodes on which the copies of each data block are to be stored, and finally stores the resulting data blocks on those data nodes.
  • Because the HDFS client node alone compresses the file to be stored, compression is slow.
  • In addition, the next data block is saved only after the previous data block and its copies have been saved successfully, so saving the file is slow.
  • Embodiments of the present invention provide a data compression storage method and apparatus, and a distributed file system, which are used to improve data compression storage efficiency of a distributed system and improve the speed of the distributed system.
  • An embodiment of the present invention provides a data compression storage method applied to a distributed file system, where the distributed file system includes a client node, a name node, and a data node, and the method includes:
  • after receiving the file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
  • the name node sends the data compression node set to the client node;
  • after receiving the node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources;
  • the name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request.
  • the determining the data compression node set includes:
  • The name node determining the data storage node after receiving the node acquisition request includes:
  • after receiving the node acquisition request, the name node determines whether the data compression node belongs to the data compression node set, and if so, determines the data storage node.
  • The method further includes: the name node recording the data compression node set and information about the file to be stored that corresponds to the data compression node set;
  • the node acquisition request carries information about the file to be stored to which the data block belongs, and an identifier of the data compression node;
  • determining whether the data compression node belongs to the data compression node set includes:
  • the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determines whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  • The method further includes: recording the file name, specified in the file creation request, of the file to be stored;
  • the method further includes:
  • recording a data block number of the data block and an identifier of the data storage node storing the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
  • The method further includes:
  • determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment, both contained in the data block number.
  • The method further includes: recording the file name, specified in the file creation request, of the file to be stored;
  • the method further includes:
  • if the number of file fragments of the file to be stored is the same as the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes, recording the data block number of the data block and the identifier of the data storage node storing the data block;
  • the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
  • The method further includes:
  • determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
  • A second aspect of the present invention provides a distributed file system, including a client node, a name node, and a data node, where:
  • the client node obtains the file to be stored and sends a file creation request to the name node;
  • after receiving the file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node having data compression processing resources; the name node sends the data compression node set to the client node;
  • the client node receives the data compression node set returned by the name node in response to the file creation request, divides the file to be stored to obtain at least two file fragments, and then sends each file fragment to a data compression node in the data compression node set;
  • after receiving a file fragment sent by the client node, the data compression node compresses the received file fragment and divides it into data blocks; the data compression node sends a node acquisition request to the name node;
  • after receiving the node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources; the name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request;
  • the data compression node receives the information about the data storage node sent by the name node, and sends the data block to the data storage node for storage.
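  • The flow described above can be sketched end to end in a single process. This is a minimal illustration, not the patented implementation: threads stand in for data compression nodes, a dictionary stands in for the data storage nodes, `zlib` is an assumed compression rule, and the names `split_into_fragments`, `compress_and_divide`, `store_file`, `restore_file` and the tiny `BLOCK_SIZE` are all hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # bytes; deliberately tiny (assumption for the demo only)


def split_into_fragments(data: bytes, count: int) -> list[bytes]:
    """Client node: divide the file into `count` contiguous fragments."""
    size = -(-len(data) // count)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(count)]


def compress_and_divide(fragment: bytes) -> list[bytes]:
    """Data compression node: compress one fragment, then cut it into blocks."""
    packed = zlib.compress(fragment)
    return [packed[i:i + BLOCK_SIZE] for i in range(0, len(packed), BLOCK_SIZE)]


def store_file(data: bytes, compression_nodes: list[str]) -> dict[tuple[int, int], bytes]:
    """Compress fragments in parallel; key blocks by (fragment seq, block seq)."""
    fragments = split_into_fragments(data, len(compression_nodes))
    with ThreadPoolExecutor(max_workers=len(compression_nodes)) as pool:
        per_fragment = list(pool.map(compress_and_divide, fragments))
    return {
        (f, b): block
        for f, blocks in enumerate(per_fragment)
        for b, block in enumerate(blocks)
    }


def restore_file(blocks: dict[tuple[int, int], bytes], fragment_count: int) -> bytes:
    """Join each fragment's blocks in order, decompress, then concatenate."""
    parts = []
    for f in range(fragment_count):
        keys = sorted(k for k in blocks if k[0] == f)
        parts.append(zlib.decompress(b"".join(blocks[k] for k in keys)))
    return b"".join(parts)
```

  • In this sketch each fragment is compressed independently, which is what allows the compression work to run in parallel instead of being bounded by the client node.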
  • Determining the data compression node set includes:
  • the name node selects at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and uses the set of the selected at least two data compression nodes as the data compression node set.
  • The name node determining the data storage node after receiving the node acquisition request sent by the data compression node includes:
  • after receiving the node acquisition request, the name node determines whether the data compression node belongs to the data compression node set, and if so, determines the data storage node.
  • The system further includes:
  • the name node records the data compression node set and information about the file to be stored that corresponds to the data compression node set;
  • the node acquisition request carries information about the file to be stored to which the data block belongs and an identifier of the data compression node; and determining whether the data compression node belongs to the data compression node set includes:
  • the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determines whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  • The system further includes:
  • after receiving the file creation request sent by the client node, the name node records the file name, specified in the file creation request, of the file to be stored;
  • after determining the data storage node, the name node records a data block number of the data block and an identifier of the data storage node storing the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
  • The system further includes:
  • the name node determines, according to the data block number, the file to be stored to which the data block belongs, and determines the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment, both contained in the data block number.
  • The system further includes:
  • after receiving the file creation request sent by the client node, the name node records the file name, specified in the file creation request, of the file to be stored;
  • the number of file fragments obtained by the client node dividing the file to be stored is the same as the number of data compression nodes in the data compression node set, and the file fragments obtained by the client node are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes;
  • after determining the data storage node, the name node records a data block number of the data block and an identifier of the data storage node storing the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
  • The system further includes:
  • the name node determines, according to the data block number, the file to be stored to which the data block belongs, and determines the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
  • The client node dividing the file to be stored to obtain at least two file fragments includes: dividing the file to be stored into file fragments whose sizes correspond to the amount of compression processing resources currently available to each data compression node; the number of file fragments is equal to the number of data compression nodes in the data compression node set;
  • the client node sending each file fragment to a data compression node in the data compression node set includes: sending larger file fragments to the data compression nodes in the set that currently have more available compression processing resources, and sending smaller file fragments to the data compression nodes in the set that currently have fewer available compression processing resources.
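  • One way to size fragments in proportion to available resources is shown below. This is a sketch under assumptions: the text does not define how resources are measured, so `capacity` is a hypothetical integer score per node, and the function name `split_by_capacity` is illustrative only.

```python
def split_by_capacity(file_size: int, capacity: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return node -> (start, end) byte range, sized in proportion to capacity.

    `capacity` maps a node identifier to a hypothetical score for its
    currently available compression processing resources.
    """
    total = sum(capacity.values())
    if total <= 0:
        raise ValueError("no compression capacity available")
    ranges = {}
    start = 0
    accumulated = 0
    for node, cap in capacity.items():
        accumulated += cap
        # Cumulative rounding guarantees the last range ends at file_size.
        end = file_size * accumulated // total
        ranges[node] = (start, end)
        start = end
    return ranges
```

  • With a 1000-byte file and scores of 3 and 1, the first node receives bytes 0 to 750 and the second receives bytes 750 to 1000, so the node with more available resources gets the larger fragment.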
  • the number of file fragments is greater than or equal to the number of data compression nodes in the data compression node set;
  • the client node sending each file fragment to a data compression node in the data compression node set includes: the client node sends the file fragments one by one to data compression nodes that currently have idle data compression resources.
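  • Sending fragments one by one to whichever node is idle can be modelled as a simple simulation. The sketch below assumes a hypothetical per-node compression speed (bytes per time unit) and always hands the next fragment to the node that becomes idle earliest; the function name `dispatch` and the tie-breaking by node name are assumptions made for this illustration.

```python
import heapq


def dispatch(fragment_sizes: list[int], speed: dict[str, float]) -> list[str]:
    """Return, for each fragment in order, the node that receives it."""
    # Heap of (time at which the node becomes idle, node id).
    idle_at = [(0.0, node) for node in sorted(speed)]
    heapq.heapify(idle_at)
    assignment = []
    for size in fragment_sizes:
        t, node = heapq.heappop(idle_at)  # node that becomes idle first
        assignment.append(node)
        heapq.heappush(idle_at, (t + size / speed[node], node))
    return assignment
```

  • With more fragments than nodes, faster nodes naturally receive more fragments, which matches the intent of distributing work according to idle resources.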
  • The system further includes:
  • before compressing the file fragments, the data compression node negotiates a data compression rule with the other data compression nodes;
  • the data compression node compressing the file fragment includes: the data compression node compresses the file fragment according to the negotiated data compression rule.
  • The system further includes:
  • before sending the data block to the data storage node, the data compression node generates a file compression header that carries indication information of the data compression rule, determines according to the currently used data compression rule whether to incorporate the file compression header into the data block, and if so, incorporates the file compression header into the data block.
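  • A file compression header that identifies the compression rule could look like the sketch below. Everything specific here is an assumption made for illustration: the `CMPR` marker, the one-byte rule identifier, the choice of `zlib` and `bz2` as rules, and the per-rule flag that decides whether the header is embedded in the block.

```python
import bz2
import struct
import zlib

MAGIC = b"CMPR"                 # hypothetical marker for an embedded header
HEADER = struct.Struct(">4sB")  # marker + one-byte rule identifier

# rule id -> (compress, decompress, embed the header in the block?)
RULES = {
    1: (zlib.compress, zlib.decompress, True),
    2: (bz2.compress, bz2.decompress, False),
}


def make_block(payload: bytes, rule_id: int) -> bytes:
    """Compress the payload and, if the rule says so, prepend the header."""
    compress, _, embed = RULES[rule_id]
    body = compress(payload)
    return HEADER.pack(MAGIC, rule_id) + body if embed else body


def read_block(block: bytes, default_rule_id: int) -> bytes:
    """Decompress a block, using the embedded header when one is present."""
    if block[:4] == MAGIC:
        _, rule_id = HEADER.unpack_from(block)
        return RULES[rule_id][1](block[HEADER.size:])
    return RULES[default_rule_id][1](block)
```

  • When the header is not embedded, the reader has to learn the rule some other way; in this sketch it is passed in as `default_rule_id`.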
  • A third aspect of the present invention further provides a name node applied to a distributed file system, where the distributed file system includes a client node, the name node, and a data node, and the name node includes:
  • a first receiving unit, configured to receive a file creation request sent by the client node;
  • a first determining unit, configured to determine a data compression node set after the first receiving unit receives the file creation request sent by the client node, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
  • a first sending unit, configured to send the data compression node set determined by the first determining unit to the client node;
  • a second receiving unit, configured to receive a node acquisition request sent by a data compression node in the data compression node set;
  • a second determining unit, configured to determine a data storage node after the second receiving unit receives the node acquisition request sent by the data compression node in the data compression node set, where a data storage node is a data node having data storage resources;
  • a second sending unit, configured to send information about the data storage node determined by the second determining unit to the data compression node corresponding to the node acquisition request.
  • The first determining unit is specifically configured to select at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and to use the set of the selected at least two data compression nodes as the data compression node set.
  • The second determining unit is specifically configured to: after the second receiving unit receives the node acquisition request, determine whether the data compression node belongs to the data compression node set, and if so, determine the data storage node.
  • The name node further includes:
  • a first recording unit, configured to record, after the first determining unit determines the data compression node set, the data compression node set and information about the file to be stored that corresponds to the data compression node set;
  • the node acquisition request carries information about the file to be stored to which the data block belongs, and an identifier of the data compression node;
  • the second determining unit is configured to determine the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and to determine whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  • The name node further includes:
  • a second recording unit, configured to record, after the first receiving unit receives the file creation request sent by the client node, the file name, specified in the file creation request, of the file to be stored;
  • the second recording unit is further configured to record, after the second determining unit determines the data storage node, a data block number of the data block and an identifier of the data storage node that stores the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
  • The name node further includes:
  • a first recovery unit, configured to determine, according to the data block number recorded by the second recording unit, the file to be stored to which the data block belongs, and to determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment, both contained in the data block number.
  • The name node further includes:
  • a third recording unit, configured to record, after the first receiving unit receives the file creation request sent by the client node, the file name, specified in the file creation request, of the file to be stored;
  • the third recording unit is further configured to: after the data storage node is determined, if the number of file fragments of the file to be stored is the same as the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes, record the data block number of the data block and the identifier of the data storage node storing the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
  • The name node further includes:
  • a second recovery unit, configured to: in the process of restoring the file to be stored, determine, according to the data block number recorded by the third recording unit, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
  • The data compression node set determined by the name node includes at least two data compression nodes, and the data compression nodes in the set take part in compressing the file to be stored. Because a data compression node is a data node, little change is required to the name node's node-management functions; more importantly, the data compression and storage processes of the individual data compression nodes run in parallel. Therefore, in the embodiments of the present invention, the compression and storage of the file to be stored are no longer limited by the processing capability of the client node, so the data compression storage efficiency of the distributed system can be improved, and the speed of the distributed system can be improved.
  • FIG. 2 is a schematic flow chart of a method for combining a system according to an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a method for combining a system according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a name node according to an embodiment of the present invention.
  • An embodiment of the present invention provides a data compression storage method applied to a distributed file system.
  • The distributed file system includes a client node, a name node, and a data node. As shown in FIG. 1, the method includes:
  • The distributed file system may be any distributed file system, and the method is particularly applicable to HDFS.
  • After receiving the file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
  • The name node has the function of managing the data compression nodes and the data storage nodes.
  • The name node needs to determine the data compression nodes that can take part in the data compression storage procedure.
  • This embodiment also provides a strategy for determining the data compression nodes, as follows. Determining the data compression node set includes: selecting at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and using the set of the selected at least two data compression nodes as the data compression node set.
  • The compression processing resources currently available to the data compression nodes are used as the selection criterion. The available compression processing resources may include the resources most directly involved in data compression, such as idle compression computing resources, and may also include other resources needed to process compression, such as resources for transmitting compressed data. Compression processing resources should therefore be understood broadly, and not as comprising computing resources only.
  • the name node sends the foregoing data compression node set to the client node
  • After receiving the node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources;
  • The name node manages the data compression storage procedure, so an authentication scheme can also be added to ensure that the client node distributes file fragments according to the data compression node set determined by the name node, as follows. The name node determining the data storage node after receiving the node acquisition request sent by the data compression node includes:
  • after receiving the node acquisition request, the name node determines whether the data compression node belongs to the data compression node set, and if so, determines the data storage node.
  • The method further includes: the name node recording the data compression node set and information about the file to be stored that corresponds to the data compression node set;
  • determining whether the data compression node belongs to the data compression node set includes: the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determines whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
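  • The membership check described above can be sketched as a small in-memory model of the name node. This is an illustration under assumptions, not an HDFS API: the class and method names are hypothetical, the file name is used as the "information about the file to be stored", and storage nodes are handed out round-robin only to keep the example short.

```python
import itertools


class NameNode:
    """Toy name node that authenticates node acquisition requests."""

    def __init__(self, storage_nodes):
        self._storage_cycle = itertools.cycle(list(storage_nodes))
        self._compression_sets = {}  # file name -> set of compression node ids

    def create_file(self, file_name, compression_nodes):
        """Record the data compression node set chosen for this file."""
        if len(compression_nodes) < 2:
            raise ValueError("a data compression node set needs at least two nodes")
        self._compression_sets[file_name] = set(compression_nodes)
        return sorted(compression_nodes)

    def acquire_storage_node(self, file_name, requester_id):
        """Return a storage node only if the requester is in the file's set."""
        allowed = self._compression_sets.get(file_name, set())
        if requester_id not in allowed:
            return None  # request rejected: requester was not selected for this file
        return next(self._storage_cycle)
```

  • A request from a node outside the recorded set, or for a file the name node has no record of, yields no storage node in this sketch.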
  • The name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request.
  • The data compression node set determined by the name node includes at least two data compression nodes, and the data compression nodes in the set take part in compressing the file to be stored. Because a data compression node is a data node, little change is required to the name node's node-management functions; more importantly, the data compression and storage processes of the individual data compression nodes run in parallel. Therefore, in the embodiments of the present invention, the compression and storage of the file to be stored are no longer limited by the processing capability of the client node, so the data compression storage efficiency of the distributed system can be improved, and the speed of the distributed system can be improved.
  • The above embodiment implements data compression storage. Building on the data compression storage process, this embodiment also prepares the data needed for later data recovery; some data needs to be recorded on the name node side.
  • The method is as follows: after receiving the file creation request sent by the client node, the method further includes: recording the file name, specified in the file creation request, of the file to be stored;
  • the method further includes: recording a data block number of the data block and an identifier of the data storage node storing the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
  • The sequence number of a file fragment is assigned sequentially, according to the order of the file fragments in the file to be stored, after the file to be stored has been divided into file fragments. A data block is obtained by compressing a file fragment, so every data block belongs to a file fragment.
  • Compressing a file fragment typically yields many data blocks.
  • The sequence number of a data block within its file fragment is likewise assigned sequentially.
  • This embodiment further provides a solution for data recovery, as follows: after recording the data block number of the data block and the identifier of the data storage node storing the data block, the method further includes:
  • determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment.
  • This recording scheme, which records the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, can be applied to all scenarios.
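  • The recovery ordering amounts to sorting block records by fragment sequence number and then by block sequence number. The record layout below is an assumption made for this sketch: the text says the data block number identifies the file and carries the two sequence numbers, but it does not fix a format, so `BlockRecord` and its field names are hypothetical.

```python
from typing import NamedTuple


class BlockRecord(NamedTuple):
    file_name: str      # file to be stored that the block belongs to
    fragment_seq: int   # sequence number of the file fragment
    block_seq: int      # sequence number of the block within its fragment
    storage_node: str   # identifier of the data storage node holding the block


def blocks_in_file_order(records, file_name):
    """Return the records of one file in the order needed to rebuild it."""
    mine = [r for r in records if r.file_name == file_name]
    return sorted(mine, key=lambda r: (r.fragment_seq, r.block_seq))
```

  • Because blocks from different fragments are written in parallel, the name node may record them in any order; the two sequence numbers are what restore the original sequence.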
  • The specific content of the recorded data may be changed.
  • This embodiment further provides the following solution: after receiving the file creation request sent by the client node, the method further includes: recording the file name, specified in the file creation request, of the file to be stored;
  • the method further includes: if the number of file fragments of the file to be stored is the same as the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes, recording a data block number of the data block and an identifier of the data storage node storing the data block, where the data block number includes the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
  • This embodiment of the present invention further provides a processing solution for the data recovery process. Specifically, after recording the data block number of the data block and the identifier of the data storage node storing the data block, the method further includes:
  • determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node.
  • This embodiment further provides a comprehensive implementation example. Referring to FIG. 2, it includes the following steps:
  • After obtaining the file to be stored, the client node sends a file creation request to the name node.
  • The file to be stored is the data that needs to be stored; the amount of data is usually large, so compressed storage is required.
  • The file to be stored may be a file local to the client, or a file from another device. This embodiment does not limit this.
  • After receiving the file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node having data compression processing resources.
  • The name node sends the data compression node set to the client node;
  • The name node can record the data compression node set.
  • The set can be recorded in the form of a data compression node table, with the data compression node identifiers as entries, for example as shown in Table 1:
    Data compression node number    Data compression node identifier
    1                               DN1
    2                               DN5
    ...                             ...
    N                               DNn
  • the data compression node and the data storage node are nodes that are divided by functions, and the functions of the data compression node and the data storage node are placed in the management needs of the name node.
  • Data node implementation is more appropriate.
  • The strategy that the name node uses to determine the data compression node set can be set according to actual needs. Specific examples follow:
  • Before determining the data compression node set, the name node obtains the compression processing resources currently available to each data compression node that it manages, selects at least two data compression nodes whose currently available compression processing resources reach a predetermined criterion, and uses the selected at least two data compression nodes as the elements of the data compression node set.
  • The information describing the available compression processing resources can be set as needed, and the predetermined criterion is set to correspond to that information.
  • For example, the predetermined criterion may be that the idle compression computing capability exceeds a predetermined threshold;
  • alternatively, the predetermined criterion may be that the idle compression computing capability exceeds a predetermined threshold and the data transmission capability also exceeds another predetermined threshold.
  • The above criteria for compression processing resources determine which data nodes meet the requirements of a data compression node.
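  • As a minimal sketch of the threshold-based selection described above, the following Python fragment filters candidate nodes by idle compression capability and, optionally, by data transmission capability. The `NodeStatus` type, its field names, and the units of the thresholds are hypothetical and are not defined by this embodiment.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    # Hypothetical status record that the name node keeps for each data node.
    node_id: str
    idle_compression: float  # idle compression computing capability (assumed unit)
    bandwidth: float         # data transmission capability (assumed unit)


def select_compression_nodes(nodes, min_idle, min_bandwidth=None):
    """Return the nodes whose available resources reach the predetermined criterion."""
    selected = [
        n for n in nodes
        if n.idle_compression > min_idle
        and (min_bandwidth is None or n.bandwidth > min_bandwidth)
    ]
    # The data compression node set must contain at least two nodes.
    if len(selected) < 2:
        raise ValueError("a data compression node set needs at least two nodes")
    return selected
```

  • Passing only `min_idle` corresponds to the first criterion; passing both thresholds corresponds to the second.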
  • This embodiment also shows how to determine the number of data compression nodes and, once the number is determined, how to select from the data compression nodes that meet the requirements the nodes that finally perform data compression, as follows:
  • For example, data compression nodes on the same rack as the client node are selected first; if the number of data compression nodes on the same rack is insufficient, data compression nodes on an adjacent rack are selected; if the number is still insufficient, data compression nodes on other racks in the same data center can be selected, until the required number of nodes has been selected.
  • the data compression node may be selected according to the requirements of the load balancing.
  • the above examples are not to be construed as exhaustive.
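  • The rack-proximity example above can be sketched as follows. This is only an illustration under stated assumptions: the function name `pick_by_locality`, the representation of a candidate as a `(node_id, rack)` pair, and the rack labels are all hypothetical.

```python
def pick_by_locality(candidates, client_rack, adjacent_racks, needed):
    """Pick `needed` node ids, preferring the client's rack, then adjacent racks.

    candidates: list of (node_id, rack) pairs for nodes that already meet
    the resource criterion (hypothetical representation).
    """
    def tier(rack):
        if rack == client_rack:
            return 0          # same rack as the client node
        if rack in adjacent_racks:
            return 1          # adjacent rack
        return 2              # other racks in the same data center

    # sorted() is stable, so nodes within one tier keep their original order.
    ordered = sorted(candidates, key=lambda c: tier(c[1]))
    return [node_id for node_id, _ in ordered[:needed]]
```

  • A load-balancing policy could be combined with this by ordering the candidates within each tier by load before the call.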
  • The client node divides the file to be stored to obtain at least two file fragments, and then sends each file fragment to a data compression node in the data compression node set;
  • The policy by which the client node divides the file to be stored can be set as required.
  • Examples follow:
  • The file to be stored is divided equally into a number of file fragments equal to the number of elements in the data compression node set.
  • the compression processing resource currently available to the data compression node in the data compression node set is obtained.
  • the compression processing resource currently available to each data compression node may be the statistics of the client node itself, or may be notified by the name node.
  • The sending policy corresponding to this splitting strategy is then executed: larger file fragments are sent to data compression nodes in the set that currently have more available compression processing resources, and smaller file fragments are sent to data compression nodes in the set that currently have fewer available compression processing resources.
  • Fragmenting on demand in this way makes full use of the data compression performance of each data compression node.
  • the file to be stored is equally divided, and the number of file fragments obtained by the segmentation is larger than the number of elements of the data compression node set.
  • The corresponding sending policy may be as follows: the file fragments are sent one by one to a node that currently has idle data compression processing resources.
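  • The resource-proportional splitting strategy described above might be sketched as follows. The function name `split_by_capacity` and the use of integer capacity weights are assumptions made for illustration; the embodiment does not prescribe how available compression processing resources are quantified.

```python
def split_by_capacity(file_size, capacities):
    """Return one (offset, length) byte range per data compression node.

    capacities: positive integer weights, one per node, standing in for the
    compression processing resources currently available (assumed measure).
    Nodes with more available resources receive larger fragments.
    """
    total = sum(capacities)
    ranges, offset = [], 0
    for i, cap in enumerate(capacities):
        if i == len(capacities) - 1:
            length = file_size - offset        # last fragment absorbs rounding
        else:
            length = file_size * cap // total  # proportional share
        ranges.append((offset, length))
        offset += length
    return ranges
```

  • With equal weights this reduces to the equal-division strategy, in which the number of fragments equals the number of elements in the data compression node set.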
  • After receiving the file fragment sent by the client node, the data compression node compresses the file fragment and divides it into data blocks; the data compression rule used by the data compression node is the same as the data compression rule used by the other data compression nodes; the data compression node sends a node acquisition request to the name node;
  • The file fragment is one of the fragments obtained by dividing the file to be stored, and the file fragments other than this file fragment are sent to other data compression nodes.
  • the data block is a unit for storing data by the storage node, and may generally be a fixed size data block.
  • the above data storage node is a node having a data storage resource.
  • the compression rules used by the data compression nodes are the same.
  • the manner in which the compression rules are kept the same can be determined as needed. For example, a fixed compression rule can be used.
  • This embodiment also provides a more flexible way of determining the compression rule, as follows:
  • the method further includes: the data compression node negotiating a data compression rule with the other data compression node;
  • Compressing the file fragment includes: compressing the file fragment according to the data compression rule obtained through negotiation.
  • The data compression node and the other data compression nodes negotiate the data compression rule over a remote direct memory access (RDMA) connection, or over a User Datagram Protocol (UDP) communication connection.
  • The data compression node generates a file compression header, carries indication information of the data compression rule in the file compression header, and determines, according to the currently used data compression rule, whether to merge the file compression header into the data block; if so, it merges the file compression header into the data block.
  • The information carried by the file compression header, the specific location of the file compression header, and the number of file compression headers all depend on the specific data compression algorithm used. This embodiment does not limit the specific form of the file compression header.
  • the data compression node compresses the data by using a soft compression method or a hard compression method.
  • The following scheme may preferably be adopted: compressing the file fragment by using a hardware compression card of the data storage node.
  • After receiving the node acquisition request sent by the data compression node, the name node determines the data storage node.
  • the sender of the node acquisition request may also be authenticated before determining the data storage node, as follows:
  • After receiving the node acquisition request, sent by the data compression node, for requesting storage of a data block, the name node determines whether the data compression node belongs to the data compression node set, and if so, determines the data storage node.
  • Since the original file to be stored is divided into at least two file fragments, and the purpose of the node acquisition request is to determine the node in which the data block is to be stored, the data compression node sends information about the data block, for example the sequence number of the data block within its compressed file fragment.
  • the name node may not consider the impact of file fragmentation when determining the data storage node, but the embodiment of the present invention also provides a specific implementation scheme for how to record the exact location of the file data for the subsequent management of the data block:
  • The method further includes: recording the file name, specified in the file creation request, of the file to be stored;
  • the method further includes: recording a data block number of the data block and an identifier of a data storage node storing the data block, where the data block number includes a sequence number of the data block in a file fragment in which the data block is located And the sequence number of the file fragment to which the above data block belongs.
  • For example, the original file to be stored is 1 GB and is divided into 10 file fragments.
  • The file fragments are numbered 1 to 10.
  • Each data compression node numbers the data blocks of its own file fragment sequentially and independently.
  • The NN may then record the first data block of the first file fragment as 1-001, the third data block of the second file fragment as 2-003, the first data block of the third file fragment as 3-001, and so on.
  • the order of the data blocks in the original file to be stored can be determined by the above data block number.
  • The embodiment further provides a recovery scheme for the file to be stored, as follows: in the process of restoring the file to be stored, the data blocks belonging to the file to be stored are determined according to the recorded data block numbers, and the order of each data block in the file to be stored is determined according to the sequence number, in the data block number, of the data block in the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
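  • The numbering and ordering scheme can be illustrated with a short sketch. The helper names `block_number` and `restore_order`, the three-digit zero padding, and the dictionary used to stand in for the name node's records are assumptions for illustration only.

```python
def block_number(fragment_seq, block_seq):
    """Build a data block number such as 1-001 (fragment number + block sequence)."""
    return f"{fragment_seq}-{block_seq:03d}"


def restore_order(records):
    """Order recorded blocks as they appear in the original file.

    records: dict mapping data block number -> identifier of the data storage
    node holding that block (a stand-in for what the name node records).
    """
    def key(item):
        frag, blk = item[0].split("-")
        # Compare numerically so that fragment 10 sorts after fragment 2.
        return (int(frag), int(blk))

    return sorted(records.items(), key=key)
```

  • The numbers need not be contiguous; it is enough that they can be compared to recover the order.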
  • This embodiment provides a scheme for recording the exact location of the file data in a specific application scenario.
  • The specific application scenario is as follows: the number of file fragments of the file to be stored is the same as the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes. Then the following can be done:
  • The method further includes: recording the file name, specified in the file creation request, of the file to be stored;
  • the method further includes: recording a data block number of the data block and an identifier of a data storage node storing the data block, where the data block number includes a sequence number of the data block in a file fragment in which the data block is located And the sequence number of the above data compression node.
  • The following shows a recording scheme for this specific application scenario. Assume that the name node records the list of data compression nodes participating in compression as DN1, DN2, ..., DNn; the first file fragment is processed by DN1, the second file fragment by DN2, and the third file fragment by DN3. Then, when a data compression node obtains a data block and assigns its number, it can add a prefix before the sequence number of the data block: the first data block submitted by DN1 is numbered 1-001 and the second 1-002, the first data block submitted by DN2 is numbered 2-001, and so on.
  • the order of the data blocks obtained by each data compression node can be determined by the prefix, for example, 2-001 must be after 1-100.
  • The server can return the data blocks in the order of the data block numbers saved by the name node. It does not matter whether the data block numbers are contiguous, as long as they can be used to distinguish the order.
  • The identifier of the data storage node storing each data block can also be recorded, so that the data block can be located.
  • the sequence number determines the order of the above data blocks in the file to be stored.
  • The data compression node receives the information of the data storage node sent by the name node, and sends the data block to the data storage node for storage.
  • the embodiment of the present invention further provides an example of another embodiment.
  • In the system structure comprising the name node, the client, the data storage node, and the data compression node, the function of the data compression node is integrated into the data storage node, and the data compression function is implemented by a compression card integrated on the data node; this is given as a preferred implementation of the embodiment of the present invention.
  • The functions of the data compression node and the data storage node are both located in a data node (Data Node, DN).
  • the present embodiment uses the high-speed compression capability of the high-speed compression module to implement parallel compression and parallel storage mechanisms of multiple data nodes, thereby providing the capability of high-speed file compression and storage in the HDFS system.
  • the high-speed compression module may be a hardware device such as a hardware compression card, or may be a software module.
  • a hardware compression card is a hardware device that implements a compression algorithm using hardware logic to compress data and output compressed data. The operation of the hardware compression card does not consume the CPU resources of the host.
  • the software compression module can be implemented by using the data compression capability of self-developed software or common software.
  • nodes participating in data compression are DN1 and DN2, and DN3 to DN5 are DNs for saving copies of data blocks.
  • The HDFS client (HDFS Client) runs on the client node (Client Node, CN).
  • The elliptical areas indicate library functions and are not part of the hardware architecture.
  • the direction of the arrow shown in Figure 3 is the flow of data or messages, as follows:
  • the Client Node sends a file creation request message to the NN by using the DistributedFileSystem to notify the NN that the file to be stored needs to be stored, and requests the NN to return information that can compress the DN of the file to be stored.
  • the above DistributedFileSystem is a function in the HDFS system development class library, which is used to request the NN to create a file.
  • DistributedFileSystem returns an FSDataOutputStream object, which is responsible for communication between the NN and the DN.
  • The FSDataOutputStream object is a library function. If both the DN and the CN have a function library containing this library function, there are at least two ways to implement communication between the DN and the NN: 1. the CN informs the DN of the parameters used to run FSDataOutputStream; 2. the DN itself calls FSDataOutputStream and communicates with the NN to obtain the parameters used to run FSDataOutputStream.
  • Otherwise, the CN can first send the function library to the DN, after which the implementation follows either of the two ways above.
  • When both the DN and the CN have the above library function, having the CN inform the DN of the parameters used to run FSDataOutputStream can be used as a preferred implementation.
  • The information for the above two purposes of the file creation request may be sent together or separately.
  • various information for determining the DN for the NN may be carried, and other information may be carried, for example:
  • Configuration information such as the available hardware compression card (or DN), the path of the rack-aware location script when HDFS is stored.
  • the rack-aware location script is used to determine the rack-distribution information, CPU and memory usage of the DN's hardware compression card.
  • the embodiment can also be compatible with the centralized compression mode.
  • the HDFS client can specify the compression mode in the file creation request.
  • The specific solution is as follows: the file creation request message carries a compression identifier, where 0 indicates centralized compression and 1 indicates parallel compression. If the compression identifier is 0, the HDFS independently completes the data compression and storage, and the NN does not need to return the information of the DN.
  • the NN After receiving the file creation request message, the NN creates information about the file to be stored, selects the DN, and returns it to the Client Node.
  • the created information of the file to be stored includes: a save path of the file to be stored, and a file creation time stamp. It is also possible to save information about all DNs returned.
  • the save path is, for example, hdfs://namenode:9000/user/hadoop/study/helloworld.dat; a location for indicating that the information of the file to be stored is saved.
  • the file name and the DN corresponding to the file name can be saved.
  • the NN needs to comprehensively evaluate according to the DN status, and select the appropriate DN to return to the Client Node.
  • the message returned to the Client Node needs to carry the necessary information that allows the Client Node to find the DN, such as the host name of the DN, the Internet Protocol (IP) address, or the port number.
  • The NN may select DNs as follows: the NN maintains the status information of all DNs.
  • The selection can be implemented flexibly according to a predetermined selection rule. For example, the NN first queries the DNs that have been configured with a hardware compression card, then finds among them the DNs nearest to the HDFS client (for example, in the same rack or the same subnet segment), and then selects the more lightly loaded DNs according to the load information of the DNs (for example, lower CPU and memory usage).
  • the size of the file to be stored can also be taken into consideration to determine the number of DNs required. In Figure 5, it is assumed that the selected DNs are DN1 and DN2.
  • the HDFS client After receiving the DN returned by the NN, the HDFS client reads the file to be stored from the client node, and shards the file to be stored to obtain a file fragment.
  • the number of file fragments is the same as the number of DNs.
  • When file fragments are sent, each DN receives exactly one file fragment, which avoids allocating file fragments multiple times.
  • the policy for the HDFS client to split the file to be stored can be as follows:
  • Strategy 1: Split according to the number of DNs returned by the NN. For example, if the NN returns the information of 2 DNs, the Client Node divides the original file to be stored into 2 equal parts.
  • Strategy 2: Query the computing power and load of each DN returned by the NN, determine a file fragment size for each DN according to its computing power and load, split the file into fragments of the determined sizes, and then send each fragment to the corresponding DN.
  • the number of file fragments after file splitting is still equal to the number of DNs returned by NN.
  • the HDFS client sends the file fragment to the DN returned by the NN.
  • The embodiment of the present invention adopts a scheme in which the DNs negotiate a compression rule among themselves. Therefore, the HDFS client needs to notify each DN of the information of the DNs participating in compression of the file to be stored, which may include information such as the IP address and host name of each DN.
  • The file fragment may be sent by the HDFS client, or may be fetched by the DN after the DN is notified.
  • In the latter case, the HDFS client needs to inform the DN of the file fragment information, for example the path information of the file fragment within the file to be stored, and the DN obtains the file fragment according to the path information.
  • the client node After the client node sends the file fragment, it can record the status information of the transmission.
  • step 504 the function of the client node in the process can be ended, and the subsequent process is completed by the DN and the NN.
  • The following description corresponds to FIG. 5. DN1 and DN2 perform the same processing; DN2 is described in detail below, and the description of DN1 can refer to that of DN2.
  • The compressed storage agent module (Compress storage agent) of DN2 first receives the file fragment and saves it locally on DN2.
  • The compressed storage agent module is responsible for communicating with the client node, and therefore also receives the information of the DNs participating in compression of the file to be stored.
  • the compressed storage agent module on DN2 notifies the hardware compression card that compression can begin.
  • the information involved in compressing the DN of the file to be stored needs to be notified to the hardware compression card.
  • the hardware compression card on DN2 negotiates with the hardware compression card on DN1 to obtain data compression rules.
  • Data compression rules are usually embodied in the form of compression algorithms. Different compression algorithms have different file compression headers and distribution characteristics. So this step can determine the location of the file compression header and the file compression header.
  • Taking dictionary compression as an example: after receiving the file fragments, each DN scans its file fragment and calculates a dictionary corresponding to that fragment according to a certain strategy (such as Huffman coding). After each DN generates its own dictionary, the DNs communicate with each other and broadcast their own load and resource status (such as CPU load, memory usage, and bandwidth occupancy), and the most lightly loaded DN is selected as the summary node. Each DN then sends the dictionary it calculated to the summary node.
  • The summary node combines the dictionaries into a unified dictionary and broadcasts it to every DN, after which each DN starts its own compression process.
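  • The negotiation flow above can be sketched roughly as follows. The embodiment does not specify how the summary node combines the dictionaries, so this sketch assumes, purely for illustration, that each local dictionary is a byte-frequency table (the usual input to Huffman coding) and that the unified dictionary is the sum of those tables. All function names and the single-number load score are hypothetical.

```python
from collections import Counter


def local_dictionary(fragment: bytes) -> Counter:
    """Each DN scans its own file fragment and counts byte frequencies."""
    return Counter(fragment)


def choose_summary_node(loads):
    """Pick the most lightly loaded DN.

    loads: dict mapping DN id -> load score, where lower means lighter
    (an assumed single-number summary of CPU, memory, and bandwidth use).
    """
    return min(loads, key=loads.get)


def unify(dictionaries):
    """Summary node: merge the per-DN tables into one unified table (assumed rule)."""
    merged = Counter()
    for d in dictionaries:
        merged.update(d)  # Counter.update adds counts rather than replacing them
    return merged
```

  • Every DN would then build its code table from the same unified frequencies, which is what keeps the compression rule identical across nodes.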
  • the hardware compression card performs data compression and segmentation on the local file fragment according to the compression rule obtained by negotiation, to obtain a data block.
  • The position of the file compression header is determined according to the compression algorithm used. Taking dictionary compression as an example, the file compression header is located in the first data block obtained by compressing the original file to be stored; in this embodiment, that is the first data block obtained by compressing the first file fragment. The file compression header is therefore merged with the first data block generated by compressing the first file fragment, and is placed before that data block.
  • Where the compression algorithm places the file compression header at the end, the file compression header is merged with the last data block generated by compressing the last file fragment, and is placed after that data block.
  • Other merge modes are determined by the particular compression algorithm and are not described one by one in this embodiment. Compressing the data blocks with the same dictionary ensures that the compressed block structure is the same as that produced by single-node compression.
  • the HDFS system usually specifies the size of the data block (Block), that is, the granularity of data compression and storage. Therefore, in this step, the size of the data block obtained by the hardware compression card is a fixed size.
  • the NN returns a list of DNs for storing the above data blocks to the compressed storage agent module.
  • DN2 can send to the NN the identifier of DN2 and the name of the file to which the new data block belongs. After receiving the request, the NN can use the file name to determine the DNs used for authentication, namely DN1 and DN2, and then determine that the identifier DN2 belongs to those DNs, so the authentication passes. After the authentication passes, the NN can return the DN list to DN2.
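  • The authentication step can be sketched as a membership check on the name node side. The class `NameNodeRegistry`, its method names, and the way storage candidates are passed in are hypothetical; they only illustrate checking the requester against the data compression node set recorded for the file.

```python
class NameNodeRegistry:
    """Hypothetical stand-in for the records the NN keeps per file."""

    def __init__(self):
        self._compression_sets = {}  # file name -> set of compression DN ids

    def record_compression_set(self, file_name, dn_ids):
        # Recorded when the NN answers the file creation request.
        self._compression_sets[file_name] = set(dn_ids)

    def handle_node_acquisition(self, file_name, requester_id,
                                storage_candidates, replicas=3):
        """Authenticate the requesting DN, then return the DN list for one block."""
        allowed = self._compression_sets.get(file_name, set())
        if requester_id not in allowed:
            raise PermissionError(
                f"{requester_id} is not a compression node for {file_name}")
        # One storage DN per replica of the data block.
        return storage_candidates[:replicas]
```

  • For the example in the text, the set recorded for the file would be DN1 and DN2, so a request from DN2 passes and the returned list contains three storage DNs.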
  • the request sent by the compressed storage agent module to the NN carries the above saved path, such as:
  • This step may be performed by the compressed storage agent module; alternatively, it may be performed by the hardware compression card, or a new module may be added to perform it.
  • the number of DNs included in the DN list is the same as the number of copies of the data block backup. In the DN list, you need to carry the necessary information to determine the DN, such as the host name, IP address, or port number of the DN. In FIG. 5, the number of DNs in the DN list is 3, which are DN3 to DN5, respectively.
  • The embodiment of the present invention further provides a scheme for recording data block related information on the NN side, as follows: DN2 needs to send the data block number of the data block to the NN, and this number is used to determine the order of the data block in the entire file to be stored.
  • the numbering mode of the data block number can be different depending on the specific application scenario.
  • the common solution is as follows:
  • the numbering mode of the data block number can be performed in the following manner: fragment number + data block number.
  • the fragment number is the sequence number of the file fragment in all the fragments of the file to be stored
  • the data block number is the serial number of the data block in the file fragment in which it is located. For example, 1-001 must be before 2-001, so the order of each data block can still be determined.
  • In the specific application scenario described above, the data block number can be formed as follows: DN number + data block number.
  • the first data block number obtained by DN1 is: 1-001
  • the second data block number obtained by DN2 is 2-002.
  • After the NN receives an instruction to restore the original file to be stored, it can first look up the data block numbers corresponding to the file and the DNs in which the data blocks are stored. The data blocks are read from those DNs, and the order of the data blocks in the original file to be stored is determined according to the recorded data block numbers, thereby restoring the original file to be stored.
  • the DN2 compressed storage agent module calls FSDataOutputStream to store the data blocks in DN3 to DN5.
  • the process of depositing in sequence is: the compressed storage agent module sends the data block to the first DN (DN3) in the DN list.
  • The message carries the data block.
  • After DN3 saves the data block, it sends the data block to the next DN (DN4) in the DN list, and so on until the last DN (DN5) in the list saves the data block.
  • DN5 to DN3 sequentially return a write confirmation to the FSDataOutputStream called by the compressed storage agent module, and is used to confirm that the data block is stored.
  • the compressed storage agent module calls FSDataOutputStream to perform the storage operation of the next data block after receiving the write confirmation, and the execution process is the same as the previous data block.
  • After all data blocks have been stored, the client node and the NN are notified that storage is complete, and the connections with the NN and the client node are closed.
  • the return path of the message for writing confirmation is as follows: the last DN (DN5) in the DN list saves the data block and sends a write confirmation to the second-to-last DN of the DN list (DN4), and DN4 forwards the write confirmation to the previous DN. Until the first DN (DN3) of the DN list, DN3 forwards the write acknowledgment to the FSDataOutputStream called by the compressed storage agent module. Finally, the compressed storage agent module determines that a data block storage is completed.
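  • The sequential storage and the return path of the write confirmation can be modelled with a small in-memory sketch. The function name `pipeline_write` and the dictionary standing in for per-DN storage are assumptions; a real deployment would forward the block over the network between DNs.

```python
def pipeline_write(block, dn_list, stores):
    """Store one data block on every DN in dn_list and return the ack path.

    stores: dict mapping DN id -> list of blocks held by that DN
    (an in-memory stand-in for real storage).
    """
    # Forward path: the block travels DN3 -> DN4 -> DN5, each DN saving a copy.
    for dn in dn_list:
        stores.setdefault(dn, []).append(block)
    # Return path: the write confirmation travels DN5 -> DN4 -> DN3 and is
    # finally handed back to the caller (the compressed storage agent module).
    return list(reversed(dn_list))
```

  • The caller would treat receipt of the confirmation from the first DN in the list as completion of one data block before storing the next.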
  • If the client node maintains status information for the file fragments, the status of the file fragment corresponding to a DN that reports storage completion may be set to Finished. After the client node determines that the status of all file fragments is Finished, it can determine that the file to be stored has been stored. At this time, a storage completion message can be returned to the NN, and the distributed compressed storage process can be recorded.
  • the hardware compression card on multiple DNs performs data compression, which improves the parallelism of compression and can shorten the file compression time.
  • the file fragment can be directly compressed into the HDFS Block size.
  • The DN can store a data block to the HDFS. The storage operations of multiple DNs run in parallel, without waiting for all of the data to be compressed and then divided and saved by the data node.
  • the compression is performed by using a hardware compression card, which does not occupy the DN or the CPU resources of the client node, and can save CPU resources.
  • the embodiment of the present invention provides a name node, which is applied to a distributed file system.
  • the distributed file system includes a client node, the name node, and a data node.
  • the name node includes:
  • the first receiving unit 401 is configured to receive a file creation request sent by the client node
  • the first determining unit 402 is configured to determine, after the first receiving unit 401 receives the file creation request sent by the client node, the data compression node set, where the data compression node set includes at least two data compression nodes, and the data compression node a data node having data compression processing resources;
  • a first sending unit 403, configured to send the foregoing data compression node set determined by the first determining unit 402 to a client node;
  • the second receiving unit 404 is configured to receive a node acquisition request sent by the data compression node in the data compression node set;
  • a second determining unit 405, configured to determine a data storage node after the second receiving unit 404 receives the node acquisition request sent by the data compression node in the data compression node set, where the data storage node is a data node having a data storage resource;
  • the second sending unit 406 is configured to send the information of the data storage node determined by the second determining unit 405 to the data compression node corresponding to the node obtaining request.
  • the first determining unit 402 is configured to select at least two data compression nodes that the currently available compression processing resources reach a predetermined criterion; and use the selected set of the at least two data compression nodes as the data compression node set.
  • the second determining unit 405 is specifically configured to: after the first receiving unit 401 receives the node obtaining request, determine whether the data compression node belongs to the data compression node set, and if yes, determine a data storage node.
  • the name node further includes:
  • the first recording unit 501 is configured to: after the first determining unit 402 determines the data compression node set, record the data compression node set and the information of the file to be stored corresponding to the data compression node set;
  • The node acquisition request carries information about the file to be stored to which the data block belongs, and the identifier of the data compression node;
  • the second determining unit 405 is specifically configured to determine, according to information about the file to be stored in the data block, the corresponding data compression node set, and determine whether the data compression node that sends the node acquisition request belongs to the determined data compression node set.
  • the name node further includes:
  • the second recording unit 601 is configured to record, after the first determining unit 402 receives the file creation request sent by the client node, the file name, specified in the file creation request, of the file to be stored;
  • the second recording unit 601 is further configured to: after the second determining unit 405 determines the data storage node, record a data block number of the data block and an identifier of a data storage node that stores the data block, where the data block number includes the foregoing data The serial number of the block in the file fragment in which it resides and the sequence number of the file fragment to which the above data block belongs.
  • the above name node further includes:
  • the first restoring unit 701 is configured to, in the process of restoring the file to be stored, determine, according to the data block numbers recorded by the second recording unit 601, the data blocks belonging to the file to be stored, and determine the order of each data block in the file to be stored according to the sequence number, in the data block number, of the data block in the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
  • the name node further includes:
  • the third recording unit 801 is configured to: after the first determining unit 402 receives the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
  • the third recording unit 801 is further configured to: after the data storage node is determined, if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
  • the name node further includes:
  • the second recovery unit 901 is configured to: in the process of restoring the file to be stored, determine, according to the data block number recorded by the third recording unit 801, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
  • the embodiment of the present invention further provides another name node, as shown in FIG. 10, including: a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004.
  • the memory 1004 may be used for purposes such as caching data for the processor 1003 during data processing, and may also be used for storing data.
  • the above-mentioned name node is applied to the distributed file system, and the distributed file system includes a client node, the above-mentioned name node, and a data node.
  • the distributed file system may be any distributed file system, and the embodiment is particularly applicable to HDFS.
  • the receiver 1001 is configured to receive a file creation request sent by a client node.
  • the processor 1003 is configured to: after receiving a file creation request sent by the client node, determine a data compression node set, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node having data compression processing resources.
  • the transmitter 1002 is configured to send the foregoing data compression node set to the client node;
  • the receiver 1001 is further configured to receive a node acquisition request sent by a data compression node in the data compression node set;
  • the processor 1003 is configured to: after receiving the node acquisition request sent by the data compression node in the data compression node set, determine a data storage node, where the data storage node is a data node having a data storage resource;
  • the transmitter 1002 is configured to send the determined information about the data storage node to the data compression node corresponding to the node acquisition request.
  • the data compression node set determined by the name node includes at least two data compression nodes, and the data compression nodes in the set take part in compressing the file to be stored. Because the data compression nodes are data nodes, only minor changes are needed to the name node's node-management function; more importantly, the data compression and storage procedures of the individual data compression nodes run in parallel. With the embodiments of the present invention, compression and storage of the file to be stored are therefore no longer limited by the processing capability of the client node, which improves the data compression storage efficiency and the speed of the distributed system.
  • the name node has the function of managing the data compression node and the data storage node.
  • the name node needs to determine which nodes can act as data compression nodes in a given data compression storage procedure.
  • This embodiment also provides a strategy for determining the data compression nodes, as follows:
  • the processor 1003 being configured to determine the data compression node set includes: selecting at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion; and using the set of the selected at least two data compression nodes as the data compression node set.
  • selection is based on the compression processing resources currently available on all data compression nodes; the available compression processing resources may include the resources most directly used for data compression, such as idle compression computing resources, and may also include resources needed to support compression processing, such as resources for transmitting compressed data. Compression processing resources should therefore be understood broadly, and not simply as computing resources only.
  • the name node manages the data compression storage procedure, so an authentication scheme can also be added to ensure that the client node distributes file fragments according to the compression node set determined by the name node, as follows:
  • the processor 1003 being configured to determine the data storage node after receiving the node acquisition request sent by the data compression node includes: after receiving the node acquisition request, determining whether the data compression node belongs to the data compression node set, and if so, determining a data storage node.
  • the processor 1003 is further configured to record the data compression node set and the information about the file to be stored that corresponds to the data compression node set; the node acquisition request carries the information about the file to be stored to which the data block belongs and the identifier of the data compression node; the processor 1003 being configured to determine whether the data compression node belongs to the data compression node set includes: determining the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determining whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  • This embodiment can implement data compression storage. Building on the data compression storage procedure, this embodiment also prepares the data needed to restore the file if the user later requires data recovery; for this purpose some data needs to be recorded on the name node side.
  • the processor 1003 is further configured to: after receiving the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
  • after determining the data storage node, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
  • the sequence number of a file fragment is assigned in the order of the file fragments within the file to be stored, after the file to be stored has been split into file fragments; a data block is obtained by compressing a file fragment, so each data block belongs to a file fragment.
  • compressing a file fragment yields many data blocks, and the sequence number of a data block within its file fragment is likewise assigned by sequential numbering.
  • the embodiment further provides a solution for performing data recovery, as follows:
  • the processor 1003 is further configured to: after recording the data block number of the data block and the identifier of the data storage node that stores the data block, in the process of restoring the file to be stored, determine, according to the data block number, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs.
  • recording the sequence number of the data block within its file fragment together with the sequence number of the file fragment to which the data block belongs is a recording scheme that can be applied in all scenarios.
  • for specific scenarios, the content of the recorded data may be changed.
  • the processor 1003 is further configured to: after receiving the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
  • after determining the data storage node, if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
  • the embodiment of the present invention further provides a processing solution in the data recovery process.
  • the processor 1003 is further configured to: after recording the data block number of the data block and the identifier of the data storage node that stores the data block, in the process of restoring the file to be stored, determine, according to the data block number, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

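A data compression storage method and apparatus, and a distributed file system, where the distributed file system includes a client node, a name node, and data nodes. The method includes: after receiving a file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources; the name node sends the data compression node set to the client node; after receiving a node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources; and the name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request. The method is used to improve the efficiency and speed of data compression storage.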

Description

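Data compression storage method and apparatus, and distributed file system
Technical Field
The present invention relates to the field of storage technologies, and in particular to a data compression storage method and apparatus, and a distributed file system.
Background
In a distributed file system, some of the physical storage resources managed by the file system reside on the local node and others on remote nodes. The Hadoop Distributed File System (HDFS) is a commonly used distributed file system that is highly fault tolerant and suitable for deployment on inexpensive machines. HDFS also provides high-throughput data access, which makes it well suited to applications with large data sets.
HDFS contains at least the following three types of functional nodes: the data node (DataNode, DN), the name node (NameNode, NN), and the HDFS client node (HDFS client). These three types of functional nodes may be combined in any way and deployed on physical devices.
A data node stores the actual content of files in HDFS. In HDFS, a file to be stored is split into multiple data blocks (the default block size is typically 64 MB or 128 MB), and multiple replicas of each data block are stored on different DNs to improve the reliability of data storage.
The name node is regarded as the core of HDFS. It stores the directory tree of all files in the distributed file system and the exact locations of file data on the data nodes. The name node does not store the file content itself.
The HDFS client node is the device responsible for splitting a file to be stored into multiple data blocks and storing the data blocks as required by the name node.
In HDFS, data compression storage is implemented as follows:
The HDFS client node obtains the file to be stored and compresses it to obtain a compressed file; the HDFS client node sends a file creation request to the name node to indicate that a file needs to be stored;
After receiving the file creation request, the name node sends parameter information on how to split the compressed file to the HDFS client node;
As instructed by the parameter information, the HDFS client node compresses the file to be stored and splits it into several data blocks (Blocks), then obtains from the name node the data nodes on which the replicas of each data block are to be placed, and finally stores the resulting Blocks on the data nodes.
With this data compression storage scheme, on the one hand, the HDFS client node alone compresses the file to be stored, so compression is slow. On the other hand, the next data block can be saved only after a data block and its replicas have been saved successfully, so the file is saved slowly.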
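Summary
Embodiments of the present invention provide a data compression storage method and apparatus, and a distributed file system, to improve the data compression storage efficiency and the speed of a distributed system.
A first aspect of the embodiments of the present invention provides a data compression storage method applied to a distributed file system, where the distributed file system includes a client node, a name node, and data nodes, the method including:
after receiving a file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
the name node sends the data compression node set to the client node;
after receiving a node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources;
the name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request.
With reference to the first aspect, in a first possible implementation, determining the data compression node set includes:
selecting at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and using the set of the selected at least two data compression nodes as the data compression node set.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation, the name node determining a data storage node after receiving the node acquisition request sent by the data compression node includes:
after receiving the node acquisition request, the name node determines whether the data compression node belongs to the data compression node set and, if so, determines a data storage node.
With reference to the second implementation of the first aspect, in a third possible implementation, after the data compression node set is determined, the method further includes: the name node records the data compression node set and the information about the file to be stored that corresponds to the data compression node set;
the node acquisition request carries the information about the file to be stored to which the data block belongs, and the identifier of the data compression node;
determining whether the data compression node belongs to the data compression node set includes:
the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determines whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
With reference to the first aspect, in a fourth possible implementation, after the file creation request sent by the client node is received, the method further includes: recording the file name of the file to be stored that the file creation request specifies is to be saved;
after the data storage node is determined, the method further includes:
recording the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, after the data block number of the data block and the identifier of the data storage node that stores the data block are recorded, the method further includes:
in the process of restoring the file to be stored, determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
With reference to the first aspect, in a sixth possible implementation, after the file creation request sent by the client node is received, the method further includes: recording the file name of the file to be stored that the file creation request specifies is to be saved;
after the data storage node is determined, the method further includes:
if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers, recording the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation, after the data block number of the data block and the identifier of the data storage node that stores the data block are recorded, the method further includes:
in the process of restoring the file to be stored, determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.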
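A second aspect of the embodiments of the present invention provides a distributed file system including a client node, a name node, and data nodes, characterized in that:
the client node obtains a file to be stored and sends a file creation request to the name node;
after receiving the file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources; the name node sends the data compression node set to the client node;
the client node receives the data compression node set returned by the name node in response to the file creation request, splits the file to be stored into at least two file fragments, and then sends each file fragment to a data compression node in the data compression node set;
after receiving a file fragment sent by the client node, a data compression node compresses the received file fragment and splits the result into data blocks; the data compression node sends a node acquisition request to the name node;
after receiving the node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources; the name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request;
the data compression node receives the information about the data storage node sent by the name node; the data compression node sends the data block to the data storage node for storage.
With reference to the second aspect, in a first possible implementation, determining the data compression node set includes:
the name node selects at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and uses the set of the selected at least two data compression nodes as the data compression node set.
With reference to the second aspect, in a second possible implementation, the name node determining a data storage node after receiving the node acquisition request sent by the data compression node includes:
after receiving the node acquisition request, the name node determines whether the data compression node belongs to the data compression node set and, if so, determines a data storage node.
With reference to the second possible implementation of the second aspect, in a third possible implementation, after the name node determines the data compression node set, the system further includes:
the name node records the data compression node set and the information about the file to be stored that corresponds to the data compression node set;
the node acquisition request carries the information about the file to be stored to which the data block belongs, and the identifier of the data compression node; determining whether the data compression node belongs to the data compression node set includes:
the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determines whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
With reference to the second aspect, in a fourth possible implementation, the system further includes:
after receiving the file creation request sent by the client node, the name node records the file name of the file to be stored that the file creation request specifies is to be saved;
after determining the data storage node, the name node records the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, the system further includes:
in the process of restoring the file to be stored, the name node determines, according to the data block number, the file to be stored to which the data block belongs, and determines the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
With reference to the second aspect, in a sixth possible implementation, the system further includes:
after receiving the file creation request sent by the client node, the name node records the file name of the file to be stored that the file creation request specifies is to be saved;
the number of file fragments obtained by the client node splitting the file to be stored equals the number of data compression nodes in the data compression node set, and the client node distributes the file fragments to the data compression nodes in the order of the data compression nodes' sequence numbers;
after determining the data storage node, the name node records the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation, the system further includes:
in the process of restoring the file to be stored, the name node determines, according to the data block number, the file to be stored to which the data block belongs, and determines the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
With reference to the second aspect, in an eighth possible implementation, the client node splitting the file to be stored into at least two file fragments includes: splitting the file to be stored into file fragments whose sizes correspond to the amount of compression processing resources currently available on each data compression node, where the number of file fragments equals the number of data compression nodes in the data compression node set;
the client node sending each file fragment to a data compression node in the data compression node set includes: sending larger file fragments to the data compression nodes in the set that currently have more available compression processing resources, and sending smaller file fragments to the data compression nodes in the set that currently have fewer available compression processing resources.
With reference to the eighth implementation of the second aspect, in a ninth possible implementation, the number of file fragments is greater than or equal to the number of data compression nodes in the data compression node set;
the client node sending each file fragment to a data compression node in the data compression node set includes: the client node sends the file fragments one by one to data compression nodes that currently have idle data compression processing resources.
With reference to the second aspect, in a tenth possible implementation, the system further includes:
before compressing the file fragment, the data compression node negotiates a data compression rule with the other data compression nodes;
the data compression node compressing the file fragment into a compressed file includes: the data compression node compresses the file fragment according to the negotiated data compression rule.
With reference to the second aspect or the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, or tenth possible implementation of the second aspect, in an eleventh possible implementation, the system further includes:
before sending the data block to the data storage node for storage, the data compression node generates a file compression header that carries indication information of the data compression rule, determines according to the data compression rule currently in use whether to merge the file compression header into the data block, and if so merges the file compression header into the data block.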
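A third aspect of the embodiments of the present invention further provides a name node applied to a distributed file system, where the distributed file system includes a client node, the name node, and data nodes, the name node including:
a first receiving unit, configured to receive a file creation request sent by the client node;
a first determining unit, configured to: after the first receiving unit receives the file creation request sent by the client node, determine a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
a first sending unit, configured to send the data compression node set determined by the first determining unit to the client node;
a second receiving unit, configured to receive a node acquisition request sent by a data compression node in the data compression node set;
a second determining unit, configured to: after the second receiving unit receives the node acquisition request sent by a data compression node in the data compression node set, determine a data storage node, where a data storage node is a data node having data storage resources;
a second sending unit, configured to send information about the data storage node determined by the second determining unit to the data compression node corresponding to the node acquisition request.
With reference to the third aspect, in a first possible implementation, the first determining unit is configured to select at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and use the set of the selected at least two data compression nodes as the data compression node set.
With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation, the second determining unit is specifically configured to: after the first receiving unit receives the node acquisition request, determine whether the data compression node belongs to the data compression node set and, if so, determine a data storage node.
With reference to the second implementation of the third aspect, in a third possible implementation, the name node further includes:
a first recording unit, configured to: after the first determining unit determines the data compression node set, record the data compression node set and the information about the file to be stored that corresponds to the data compression node set;
the node acquisition request carries the information about the file to be stored to which the data block belongs, and the identifier of the data compression node;
the second determining unit is specifically configured to determine the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determine whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
With reference to the third aspect, in a fourth possible implementation, the name node further includes:
a second recording unit, configured to: after the first determining unit receives the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
the second recording unit is further configured to: after the second determining unit determines the data storage node, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
With reference to the fourth possible implementation of the third aspect, in a fifth possible implementation, the name node further includes:
a first restoring unit, configured to: in the process of restoring the file to be stored, determine, according to the data block number recorded by the second recording unit, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
With reference to the third aspect, in a sixth possible implementation, the name node further includes:
a third recording unit, configured to: after the first determining unit receives the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
the third recording unit is further configured to: after the data storage node is determined, if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
With reference to the sixth possible implementation of the third aspect, in a seventh possible implementation, the name node further includes:
a second restoring unit, configured to: in the process of restoring the file to be stored, determine, according to the data block number recorded by the third recording unit, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
In the embodiments of the present invention, the data compression node set determined by the name node contains at least two data compression nodes, and the data compression nodes in the set take part in compressing the file to be stored. Because the data compression nodes are data nodes, only minor changes are needed to the name node's node-management function; more importantly, the data compression and storage procedures of the individual data compression nodes run in parallel. With the embodiments of the present invention, compression and storage of the file to be stored are therefore no longer limited by the processing capability of the client node, which improves the data compression storage efficiency and the speed of the distributed system.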
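Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings needed for describing the embodiments or the prior art. The drawings described below are merely some embodiments of the present invention, and a person skilled in the art may derive other drawings from them without creative effort.
Figure 1 is a schematic flowchart of a method according to an embodiment of the present invention;
Figure 2 is a schematic flowchart of a method combined with a system according to an embodiment of the present invention;
Figure 3 is a schematic flowchart of a method combined with a system according to an embodiment of the present invention;
Figure 4 is a schematic structural diagram of a name node according to an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a name node according to an embodiment of the present invention;
Figure 6 is a schematic structural diagram of a name node according to an embodiment of the present invention;
Figure 7 is a schematic structural diagram of a name node according to an embodiment of the present invention;
Figure 8 is a schematic structural diagram of a name node according to an embodiment of the present invention;
Figure 9 is a schematic structural diagram of a name node according to an embodiment of the present invention;
Figure 10 is a schematic structural diagram of a name node according to an embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.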
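An embodiment of the present invention provides a data compression storage method applied to a distributed file system that includes a client node, a name node, and data nodes. As shown in Figure 1, the method includes:
In this embodiment, the distributed file system may be any distributed file system, and the method is particularly applicable to HDFS.
101: After receiving a file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
The name node manages the data compression nodes and the data storage nodes, and needs to determine which nodes can act as data compression nodes in a given data compression storage procedure. This embodiment also provides a strategy for determining the data compression nodes, as follows: determining the data compression node set includes selecting at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and using the set of the selected at least two data compression nodes as the data compression node set.
In this embodiment, selection is based on the compression processing resources currently available on all data compression nodes. The available compression processing resources may include the resources most directly used for data compression, such as idle compression computing resources, and may also include resources needed to support compression processing, such as resources for transmitting compressed data. Compression processing resources should therefore be understood broadly, and not simply as computing resources only.
102: The name node sends the data compression node set to the client node;
103: After receiving a node acquisition request sent by a data compression node in the data compression node set, the name node determines a data storage node, where a data storage node is a data node having data storage resources;
In this embodiment, the name node manages the data compression storage procedure, so an authentication scheme can also be added to ensure that the client node distributes file fragments according to the compression node set determined by the name node, as follows: the name node determining a data storage node after receiving the node acquisition request sent by the data compression node includes:
after receiving the node acquisition request, the name node determines whether the data compression node belongs to the data compression node set and, if so, determines a data storage node.
With the name node of this embodiment, after the data compression node set is determined, the method further includes: the name node records the data compression node set and the information about the file to be stored that corresponds to the data compression node set;
the node acquisition request carries the information about the file to be stored to which the data block belongs, and the identifier of the data compression node; determining whether the data compression node belongs to the data compression node set includes: the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determines whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
104: The name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request.
In this embodiment, the data compression node set determined by the name node contains at least two data compression nodes, and the data compression nodes in the set take part in compressing the file to be stored. Because the data compression nodes are data nodes, only minor changes are needed to the name node's node-management function; more importantly, the data compression and storage procedures of the individual data compression nodes run in parallel. With the embodiments of the present invention, compression and storage of the file to be stored are therefore no longer limited by the processing capability of the client node, which improves the data compression storage efficiency and the speed of the distributed system.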
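This embodiment can implement data compression storage. Building on the data compression storage procedure, this embodiment also prepares the data needed to restore the file if the user later requires data recovery; some data needs to be recorded on the name node side, as follows: after the file creation request sent by the client node is received, the method further includes: recording the file name of the file to be stored that the file creation request specifies is to be saved;
after the data storage node is determined, the method further includes: recording the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
In this embodiment, the sequence number of a file fragment is assigned in the order of the file fragments within the file to be stored, after the file to be stored has been split into file fragments. A data block is obtained by compressing a file fragment, so each data block belongs to a file fragment; compressing a file fragment yields many data blocks, and the sequence number of a data block within its file fragment is likewise assigned by sequential numbering.
Based on the data recorded in this embodiment, this embodiment also provides a data recovery scheme, as follows: after the data block number of the data block and the identifier of the data storage node that stores the data block are recorded, the method further includes:
in the process of restoring the file to be stored, determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
The embodiment above records the sequence number of the data block within its file fragment together with the sequence number of the file fragment to which the data block belongs; this recording scheme can be applied in all scenarios. For specific scenarios, the content of the recorded data may be changed, and this embodiment also provides the following scheme: after the file creation request sent by the client node is received, the method further includes: recording the file name of the file to be stored that the file creation request specifies is to be saved;
after the data storage node is determined, the method further includes: if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers, recording the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
Based on the data recorded in the embodiment above (a data block number containing the sequence number of the data block within its file fragment and the sequence number of the data compression node), the embodiments of the present invention also provide a processing scheme for data recovery, as follows: after the data block number of the data block and the identifier of the data storage node that stores the data block are recorded, the method further includes:
in the process of restoring the file to be stored, determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.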
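Building on the separate descriptions of the client node, the name node, and data compression in the embodiments above, this embodiment also provides a combined example, described in detail below. Referring to Figure 2, it includes the following steps:
201: After obtaining the file to be stored, the client node sends a file creation request to the name node;
In this step, the file to be stored is the data that needs to be stored; its volume is usually large, so it needs to be compressed for storage. The file to be stored may be a file local to the client or a file originating from another device; this embodiment imposes no limitation on this.
202: After receiving the file creation request sent by the client node, the name node determines a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources; the name node sends the data compression node set to the client node;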
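After determining the data compression node set, the name node may record it. The set may be recorded in the form of a data compression node table, with the data compression node identifiers as the table entries, for example as shown in Table 1:
Table 1
Data compression node sequence number    Data compression node identifier
1                                        DN1
2                                        DN5
N                                        DNn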
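In this embodiment, data compression nodes and data storage nodes are nodes distinguished by function. From the standpoint of the name node's management needs, it is appropriate to implement the functions of both on data nodes.
It should also be noted that the strategy the name node uses to determine the data compression node set can be set according to actual needs; a specific example follows:
Before determining the data compression node set, the name node obtains the compression processing resources currently available on each data compression node it manages; selects at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion; and uses the selected at least two data compression nodes as the elements of the data compression node set.
In this embodiment, the information about available compression processing resources can be set as needed, and the predetermined criterion can be set correspondingly. Some examples follow:
Assumption 1: the resource is the idle compression computing capacity currently available; the predetermined criterion may then be that the idle compression computing capacity exceeds a predetermined threshold;
Assumption 2: the resource combines the current idle compression computing capacity and the current data transmission capacity (even if much compression computing capacity is idle, the overall storage capability is still low when the data transmission capacity is weak); the predetermined criterion may then be that the idle compression computing capacity exceeds a predetermined threshold and the data transmission capacity also exceeds another predetermined threshold.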
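The two assumptions above can be sketched as a simple threshold filter. This is only an illustration: the `NodeStatus` record, its field names, the units, and the function name are hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    node_id: str
    idle_compute: float   # idle compression computing capacity (hypothetical unit)
    bandwidth: float      # current data transmission capacity (hypothetical unit)


def select_compression_nodes(nodes, min_compute, min_bandwidth=None):
    """Return the IDs of nodes that meet the predetermined criterion.

    With min_bandwidth=None only Assumption 1 (compute threshold) applies;
    otherwise both thresholds of Assumption 2 must be exceeded.
    """
    chosen = [
        n.node_id
        for n in nodes
        if n.idle_compute > min_compute
        and (min_bandwidth is None or n.bandwidth > min_bandwidth)
    ]
    # The node set described in the text contains at least two nodes.
    if len(chosen) < 2:
        raise ValueError("fewer than two data compression nodes meet the criterion")
    return chosen
```

Raising an error when fewer than two nodes qualify is a design choice made for this sketch; the source does not say what the name node does in that case.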
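The assumptions above are examples only and should not be understood as exhaustive. The compression-processing-resource criterion determines which nodes qualify as data compression nodes; this embodiment also describes how to determine the number of data compression nodes and, once the number is determined, how to select qualifying data compression nodes as the nodes that finally perform compression, as follows:
The number of data compression nodes in the node set can be determined in several ways, for example from the amount of original data and a preset fragment size: assuming a 10 GB file to be stored and a preset fragment size of 2 GB, 10/2 = 5 data compression nodes are needed.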
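The node-count calculation above can be written as a one-line helper. The function name is hypothetical; rounding up for sizes that do not divide evenly is an assumption, while the lower bound of two reflects the requirement that the set contain at least two data compression nodes.

```python
import math


def compression_node_count(file_size_gb, fragment_size_gb):
    """Number of data compression nodes for a file and a preset fragment size."""
    # Round up so a partial last fragment still gets a node (assumption),
    # and never return fewer than the two nodes the node set requires.
    return max(2, math.ceil(file_size_gb / fragment_size_gb))
```

For the example in the text, `compression_node_count(10, 2)` gives 5.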
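The data compression nodes can also be selected in several ways. For example: prefer data compression nodes on the same rack as the client node; if there are not enough on the same rack, select data compression nodes on adjacent racks; if there are still not enough, select data compression nodes on other racks in the same data center, until the required number of nodes has been selected.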
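A minimal sketch of this locality-ordered selection follows. The tuple layout `(node_id, rack, datacenter)`, the notion of an explicit set of adjacent racks, and the function name are all assumptions made for illustration.

```python
def pick_by_locality(candidates, client_rack, adjacent_racks, client_dc, needed):
    """Pick up to `needed` node IDs, nearest tier first.

    candidates: list of (node_id, rack, datacenter) tuples (hypothetical layout).
    Tier 0 = same rack as the client, 1 = adjacent rack,
    2 = other rack in the same data center; anything else is skipped.
    """
    def tier(node):
        _, rack, dc = node
        if rack == client_rack:
            return 0
        if rack in adjacent_racks:
            return 1
        if dc == client_dc:
            return 2
        return 3

    eligible = [n for n in candidates if tier(n) < 3]
    eligible.sort(key=tier)  # stable sort keeps the input order within a tier
    return [n[0] for n in eligible[:needed]]
```

If fewer eligible nodes exist than requested, this sketch simply returns what it found; the source does not specify the fallback.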
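When many data compression nodes are available, load-balancing requirements can guide the selection; the examples above should not be understood as an exhaustive list of the options available to the embodiments of the present invention.
203: The client node splits the file to be stored into at least two file fragments, and then sends each file fragment to a data compression node in the data compression node set;
The client node's strategy for splitting the file to be stored can be set as required; this embodiment gives the following examples:
1. According to the number of elements in the data compression node set, split the file to be stored into equal file fragments, as many as there are elements.
Equal splitting is the simplest to control.
2. Determine the size of each file fragment according to the resources of each data compression node, as follows:
Before splitting the file to be stored, obtain the compression processing resources currently available on the data compression nodes in the data compression node set. In this embodiment, these may be collected by the client node itself, or collected by the name node and reported to the client node.
Then split: split the file to be stored into file fragments whose sizes correspond to the amount of compression processing resources currently available on each data compression node; the number of file fragments equals the number of data compression nodes in the set.
Finally, apply the sending strategy that matches the splitting strategy: send larger file fragments to the data compression nodes in the set that currently have more available compression processing resources, and smaller file fragments to those that currently have fewer.
Sizing the file fragments by available resources allows fragments to be matched to demand, making full use of each data compression node's compression performance.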
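Strategy 2 can be sketched as a proportional split. The source only says that fragment sizes "correspond to" the available resources; strict proportionality, integer resource scores, and the function name are assumptions of this sketch.

```python
def fragment_sizes(file_size, resources):
    """Split file_size bytes across nodes in proportion to their resources.

    resources: dict of node_id -> positive integer resource score
    (a hypothetical measure of available compression processing resources).
    Returns dict of node_id -> fragment size in bytes; sizes sum to file_size.
    """
    total = sum(resources.values())
    # Best-resourced node first, so it can absorb the rounding remainder.
    nodes = sorted(resources, key=resources.get, reverse=True)
    sizes = {n: file_size * resources[n] // total for n in nodes}
    sizes[nodes[0]] += file_size - sum(sizes.values())
    return sizes
```

Because each node's fragment is computed from its own score, the larger fragments automatically go to the nodes with more available resources, which matches the sending strategy described above.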
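3. Split the file to be stored into equal parts, with more file fragments than there are elements in the data compression node set; the corresponding sending strategy may then be: send the file fragments one by one to nodes that currently have idle data compression processing resources.
With this scheme the splitting strategy is simple to control, and the compression performance of each data compression node can still be exploited.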
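Strategy 3 behaves like a work queue: each fragment goes to whichever node becomes idle first. The simulation below is a sketch under the assumption that each fragment has a known, estimated processing cost; the cost values and the function name are hypothetical.

```python
import heapq


def dispatch(fragment_costs, node_ids):
    """Assign fragments, in order, to the node that becomes idle earliest.

    fragment_costs: estimated processing time of each fragment (hypothetical).
    Returns a list of (fragment_index, node_id) assignments.
    """
    # Heap entries are (time at which the node becomes idle, node_id).
    free_at = [(0, n) for n in node_ids]
    heapq.heapify(free_at)
    plan = []
    for index, cost in enumerate(fragment_costs):
        idle_time, node = heapq.heappop(free_at)
        plan.append((index, node))
        heapq.heappush(free_at, (idle_time + cost, node))
    return plan
```

A real client would react to completion notifications rather than estimated costs; the heap only stands in for "the next node to become idle".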
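204: After receiving a file fragment sent by the client node, the data compression node compresses the file fragment and splits the result into data blocks; the data compression rule used by this data compression node is the same as that used by the other data compression nodes; the data compression node sends a node acquisition request to the name node;
Because the data compression node set contains at least two data compression nodes, the file fragment is one of the fragments obtained by splitting the file to be stored, and the other file fragments are sent to other data compression nodes.
In this embodiment, a data block is the unit in which a storage node stores data and is generally of fixed size. The data storage node is a node having data storage resources.
In this embodiment, all data compression nodes use the same compression rule. How the rule is kept the same can be decided as needed; for example, a fixed compression rule can simply be used. This embodiment also provides a more flexible way to determine the compression rule, as follows:
Before the file fragment is compressed, the method further includes: the data compression node negotiates a data compression rule with the other data compression nodes;
Compressing the file fragment into a compressed file includes: compressing the file fragment according to the negotiated data compression rule.
The specific data compression rule obtained through negotiation may follow any of various data compression algorithms; this embodiment imposes no limitation on it.
Because the data compression nodes need to negotiate the data compression rule, they need to communicate with one another. This communication may be assisted by the client node or the name node; this embodiment also provides a preferred implementation, as follows:
The data compression node and the other data compression nodes negotiate the data compression rule over a connection established using Remote Direct Memory Access (RDMA), or over a communication connection established using the User Datagram Protocol (UDP).
In addition, because at least two data compression nodes take part in compression, in order to keep the saved data blocks consistent with what compression on a single node would produce, and thereby reduce changes to the overall system architecture, the embodiments of the present invention may perform the following operation before a data block is stored:
The data compression node generates a file compression header that carries indication information of the data compression rule, determines according to the data compression rule currently in use whether to merge the file compression header into the data block, and if so merges the file compression header into the data block.
The information carried in the file compression header, its specific position, and the number of headers required all depend on the specific data compression algorithm used; this embodiment does not limit the specific form of the file compression header.
In addition, in this embodiment the data compression node may compress data in software or in hardware. To improve compression efficiency and reduce the impact on the node that integrates the data compression function, the following scheme is preferred: the file fragment is compressed using a hardware compression card of the data storage node.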
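205: After receiving the node acquisition request sent by the data compression node, the name node determines a data storage node;
In this embodiment, if the data compression node set was recorded after the data compression nodes were determined, the sender of the node acquisition request may also be authenticated before the data storage node is determined, as follows:
After receiving the node acquisition request, sent by the data compression node, that requests storage of a data block, the name node determines whether the data compression node belongs to the data compression node set and, if so, determines a data storage node.
Because the original file to be stored has been split into at least two file fragments, and the purpose of the node acquisition request is to determine the node on which a data block is stored, the data compression node passes along information about the data block, for example the sequence number of the data block within the file fragment from which it was compressed. Although the name node need not take the effect of file fragmentation into account when determining the data storage node, for the purposes of subsequently managing the data blocks the embodiments of the present invention also provide a specific scheme for recording the exact location of file data:
After the file creation request sent by the client node is received, the method further includes: recording the file name of the file to be stored that the file creation request specifies is to be saved;
After the data storage node is determined, the method further includes: recording the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
As an example: assume the original file to be stored is 1 GB and is split into 10 file fragments with sequence numbers 1 to 10, and that the data compression nodes number the blocks independently and in order while compressing each file fragment. The NN then records, for example, 1-001 for the first data block of the first file fragment, 2-003 for the third data block of the second file fragment, 3-001 for the first data block of the third file fragment, and so on. The order of a data block in the original file to be stored can be determined from these data block numbers.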
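The numbering in this example can be produced and ordered with two small helpers. The "fragment-block" string format with a three-digit block part follows the example values in the text (1-001, 2-003); the function names are hypothetical.

```python
def make_block_number(fragment_seq, block_seq):
    """Format a data block number such as '2-003'."""
    return f"{fragment_seq}-{block_seq:03d}"


def block_sort_key(block_number):
    """Order blocks by fragment sequence number, then block sequence number."""
    fragment_seq, block_seq = block_number.split("-")
    # Compare as integers: plain string comparison would place
    # '10-001' before '2-003'.
    return int(fragment_seq), int(block_seq)
```

Sorting the recorded block numbers with `sorted(numbers, key=block_sort_key)` then yields the order of the blocks in the original file, including for the tenth fragment in the 10-fragment example.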
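This embodiment also provides a scheme for restoring the file to be stored, as follows: in the process of restoring the file to be stored, determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
This embodiment also gives a scheme for recording the exact location of file data in one specific application scenario, namely: the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers. The scheme may then be as follows:
After the file creation request sent by the client node is received, the method further includes: recording the file name of the file to be stored that the file creation request specifies is to be saved;
After the data storage node is determined, the method further includes: recording the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
The following is a recording scheme for the specific application scenario above. Assume the name node records the list of data compression nodes taking part in compression, say DN1, DN2 ... DNn, and that the first file fragment is processed by DN1, the second by DN2, and the third by DN3. Then, when a data compression node obtains a data block and assigns its number, it can add a prefix before the sequence number of the data block: the first block submitted by DN1 is numbered 1-001 and its second block 1-002, the first block submitted by DN2 is numbered 2-001, and so on. The prefix and the sequence number together determine the order of the data blocks obtained by the data compression nodes; for example, 2-001 always comes after 1-100. When a client reads the original file, the server can simply return the data blocks in the order of the data block numbers saved by the name node. Whether the data block numbers are consecutive does not matter, as long as the order can be distinguished from the numbers. To determine where the data block corresponding to a data block number is stored, the identifier of the data storage node that stores the data block can also be recorded, so that the data block can be found.
This embodiment also provides a scheme for restoring the file to be stored in the specific application scenario above, as follows:
In the process of restoring the file to be stored, determining, according to the data block number, the file to be stored to which the data block belongs, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
206: The data compression node receives the data storage node sent by the name node; the data compression node sends the data block to the data storage node for storage.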
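The embodiments of the present invention also provide another example embodiment. Combining the system structure of the name node, client, data storage nodes, and data compression nodes, this embodiment integrates the data compression node function into the data storage node and implements the compression function with a compression card integrated on the data node; it is described as a preferred embodiment of the present invention. In this embodiment, the functions of the data compression node and the data storage node both reside on the data node (DataNode, DN).
It should also be noted that this embodiment uses the high-speed compression capability of a high-speed compression module to implement parallel compression and parallel storage across multiple data nodes, thereby providing high-speed file compression and storage in an HDFS system. The high-speed compression module may be a hardware device such as a hardware compression card, or a software module. A hardware compression card is a hardware device that implements a compression algorithm in hardware logic, compresses data, and outputs the compressed data; it runs without consuming the host's CPU resources. A software compression module may be implemented using the data compression capability of in-house or off-the-shelf software.
Refer to Figure 3. In Figure 3, the nodes taking part in data compression are DN1 and DN2, and DN3 to DN5 are the DNs used to save replicas of data blocks. An HDFS client runs on the client node (Client Node, CN); the elliptical regions illustrate library functions and are not part of the hardware architecture. The arrows in Figure 3 show the direction of data or message flow, as follows:
301: The Client Node calls DistributedFileSystem to send a file creation request message to the NN, to inform the NN that there is a file to be stored and to request that the NN return information about the DNs that can compress the file to be stored.
The DistributedFileSystem above is a function in the HDFS development class library and is used to request the NN to create a file. In addition, DistributedFileSystem returns an FSDataOutputStream object, which is responsible for communication between the NN and the DN. The FSDataOutputStream object is a library function. If both the DN and the CN have a function library containing it, there are at least two ways to implement communication between the DN and the NN: 1. through FSDataOutputStream, the CN informs the DN's FSDataOutputStream of the parameters used to run FSDataOutputStream; 2. the DN itself calls FSDataOutputStream and communicates with the NN to obtain the parameters used to run FSDataOutputStream. Alternatively, if the DN does not have a function library containing this library function, the CN may first send the function library to the DN, after which either of the two ways above applies. The approach in which the DN and the CN both have the library function and the CN informs the DN of the parameters used to run FSDataOutputStream may be taken as a preferred implementation.
The information for the two functions of the file creation request may be sent separately or together. The file creation request may carry various information used by the NN to determine the DNs, and may also carry other information, for example:
Configuration information such as the available hardware compression cards (or DNs) and the path of the rack-awareness location script used for HDFS storage. The rack-awareness location script is used to determine how the DNs' hardware compression cards are distributed across racks, CPU and memory utilization, and so on.
In addition, this embodiment remains compatible with centralized compression: the HDFS client may specify the compression mode in the file creation request, as follows. The file creation request message carries a compression flag: 0 for centralized compression, 1 for parallel compression. If the compression flag is 0, the HDFS client completes data compression and storage on its own, and the NN does not need to return DN information.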
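The compression flag can be modelled as a small enumeration carried in the request. Only the flag values 0 and 1 come from the text; the request class, its fields, the handler, and the `pick_nodes` callback are hypothetical names used for illustration and are not part of any HDFS API.

```python
from dataclasses import dataclass
from enum import IntEnum


class CompressMode(IntEnum):
    CENTRALIZED = 0   # the client compresses on its own
    PARALLEL = 1      # compression is spread across data nodes


@dataclass
class CreateFileRequest:
    path: str
    compress_mode: CompressMode


def handle_create(request, pick_nodes):
    """Return compression DNs for parallel mode, or an empty list otherwise.

    pick_nodes: callable returning the DN list the name node would select.
    """
    if request.compress_mode == CompressMode.PARALLEL:
        return pick_nodes()
    return []
```

In this sketch an empty list stands for "the NN returns no DN information", which is how the centralized mode is described above.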
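302: After receiving the file creation request message, the NN creates the information of the file to be stored, selects DNs, and returns them to the Client Node.
In this step, the created information of the file to be stored includes the save path of the file to be stored and the file creation timestamp. Information about all the returned DNs may also be saved.
An example save path is hdfs://namenode:9000/user/hadoop/study/helloworld.dat; it indicates where the information of the file to be stored is saved.
In this step, after the information of the file to be stored is created, its file name and the DNs corresponding to that file name may be saved.
In this step the NN needs to evaluate the DN states comprehensively and select suitable DNs to return to the Client Node. The message returned to the Client Node must carry the information the Client Node needs to find the DNs, for example the DN's host name, Internet Protocol (IP) address, or port number.
The NN may select DNs as follows: the NN maintains state information for all DNs and can select DNs flexibly according to predetermined selection rules. For example, it first looks up the DNs that are equipped with a hardware compression card, then finds the DNs closest to the HDFS client (for example on the same rack or in the same subnet), and then, according to the DNs' load information, selects the lightly loaded DNs (for example those with low CPU and memory usage). The size of the file to be stored may also be taken into account to determine the number of DNs needed. In Figure 3, the selected DNs are assumed to be DN1 and DN2.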
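303: After receiving the DNs returned by the NN, the HDFS client reads the file to be stored from the client node and splits it into file fragments.
In this step, the number of file fragments equals the number of DNs, and each DN is sent one file fragment, which avoids having to distribute file fragments multiple times.
The HDFS client may split the file to be stored according to the following strategies:
Strategy 1: split evenly according to the number of DNs returned by the NN. For example, if the NN returns information about 2 DNs, the Client Node splits the original file to be stored into 2 equal parts.
Strategy 2: query the computing capability and load of each DN returned by the NN, determine a file fragment size for each DN accordingly, split the file into fragments of those sizes, and send each fragment to the corresponding DN. The number of file fragments after splitting still equals the number of DNs returned by the NN.
Other splitting strategies are also possible; the embodiments of the present invention are not limited to these.
304: The HDFS client sends the file fragments to the DNs returned by the NN.
Because the embodiments of the present invention use a scheme in which the DNs negotiate the compression rule among themselves, the HDFS client also needs to inform each DN of the DNs taking part in compressing the file to be stored; this may include the DNs' IP addresses, host names, and similar information.
In this step the file fragments may be sent actively by the HDFS client, or fetched by the DNs after they are notified. In the latter case, the HDFS client needs to inform each DN of the file fragment information, for example the path at which the file to be stored corresponding to the file fragment is saved, and the DN fetches the file fragment according to that path information. After sending the file fragments, the Client Node may record a status indicating that sending is complete.
After step 304 above is completed, the client node's role in this procedure can end, and the rest of the procedure is carried out by the DNs and the NN. The following description refers to Figure 3. DN1 and DN2 perform the same operations; the following describes DN2 in detail, and DN1 is not described separately because the description of DN2 applies to it.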
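305: The compression storage agent module (Compress storage agent) of DN2 first receives the file fragment and saves it locally on DN2.
In this embodiment the compression storage agent module is responsible for communicating with the client node, so it receives the information about the DNs taking part in compressing the file to be stored.
306: The compression storage agent module on DN2 notifies the hardware compression card that compression can start.
In this step, the information about the DNs taking part in compressing the file to be stored needs to be passed to the hardware compression card.
307: The hardware compression card on DN2 and the hardware compression card on DN1 negotiate to obtain the data compression rule.
A data compression rule is usually embodied as a compression algorithm, and different compression algorithms have different file compression headers and distribution characteristics. This step can therefore determine the file compression header and its position. Taking dictionary compression as an example: after receiving their data fragments, the DNs each scan their own file fragment and compute the dictionary corresponding to the fragment according to some strategy (such as Huffman coding). After each DN has produced its own dictionary, the DNs communicate with one another and broadcast their load and resource status (such as CPU load, memory utilization, and bandwidth utilization). The DN with the lightest load is selected as the aggregation node; each DN sends its computed dictionary to the aggregation node, which combines the dictionaries into a single unified dictionary and broadcasts it to all DNs, after which each DN starts its own compression process.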
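The negotiation just described can be sketched in a few lines. This is a deliberately simplified stand-in: a byte-frequency table plays the role of each DN's "dictionary" (a Huffman code could be built from such a table), a single number stands for each DN's load, and the network exchange is replaced by a plain function call. None of these choices come from the source.

```python
from collections import Counter


def local_dictionary(fragment: bytes) -> Counter:
    """Each DN scans its own fragment; here the result is a byte-frequency table."""
    return Counter(fragment)


def negotiate(fragments, loads):
    """Pick the least-loaded DN as aggregation node and merge all dictionaries.

    fragments: dict of node_id -> fragment bytes held by that DN.
    loads: dict of node_id -> load score (lower means lighter; hypothetical).
    Returns (aggregation_node_id, unified_frequency_table).
    """
    aggregator = min(loads, key=loads.get)
    unified = Counter()
    for node_id in fragments:
        # In a real system each DN would send its table to the aggregator.
        unified += local_dictionary(fragments[node_id])
    return aggregator, unified
```

After this step every DN would receive the same unified table, which is what lets independently compressed fragments share one compression rule.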
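308: The hardware compression card compresses the local file fragment according to the negotiated compression rule and splits the result into data blocks.
The position of the file compression header is determined by the compression algorithm used. Taking dictionary compression as an example, the file compression header is located in the first data block obtained by compressing the original file to be stored; in this embodiment, it should therefore be in the first data block produced by compressing the first file fragment. The file compression header is merged with the first data block produced by compressing the first file fragment and placed at the front of that first data block.
In addition, if the file compression header is located at the end of the compressed file, it is merged with the last data block produced by compressing the last file fragment and placed at the end of that last data block. Other merging methods are determined by the particular compression algorithm and are not described one by one in this embodiment. Compressing the data blocks with the same dictionary ensures that the structure of the compressed blocks is the same as with single-node compression.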
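The two header placements described above can be illustrated as follows. The function name and the `'head'`/`'tail'` position values are hypothetical, and the sketch ignores the fixed block size: it only shows where the header bytes end up relative to the first or last block.

```python
def attach_header(blocks, header, position):
    """Merge a file compression header into the first or last data block.

    blocks: non-empty list of bytes objects, in original file order across
    all fragments. position: 'head' (header before the first block) or
    'tail' (header after the last block). Returns a new list.
    """
    if not blocks:
        raise ValueError("no data blocks to attach the header to")
    merged = list(blocks)
    if position == "head":
        merged[0] = header + merged[0]
    elif position == "tail":
        merged[-1] = merged[-1] + header
    else:
        raise ValueError("position must be 'head' or 'tail'")
    return merged
```

In the multi-node setting only the DN holding the first (or last) fragment would perform this merge; the other DNs store their blocks unchanged.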
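An HDFS system usually specifies the size of a data block (Block), that is, the granularity of data compression and storage; in this step, therefore, the data blocks produced by the hardware compression card are all of a fixed size.
309: Each time the compression storage agent module of DN2 detects that a new data block of Block size has been produced, it calls FSDataOutputStream to send a request to the NN for information about the DNs on which to save the data block. The NN returns to the compression storage agent module a list of the DNs to be used for storing the data block.
In this step, DN2 may send the NN its identifier and the name of the file to which the new Block belongs. After receiving the request, the NN can use the file name to determine the DNs against which to authenticate, namely DN1 and DN2, and then determine that the identifier DN2 belongs to those DNs, so authentication passes. After authentication passes, the NN can return the DN list to DN2.
The request sent by the compression storage agent module to the NN carries the save path above, for example:
hdfs://namenode:9000/user/hadoop/study/helloworld.dat; it is used to map the data block to the information of the file to be stored that the NN created.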
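The authentication check in step 309 amounts to a lookup keyed by the file's save path. The class below is a minimal sketch with hypothetical names; it is not part of the NameNode API, and it keeps the recorded sets in memory only.

```python
class NameNodeAuth:
    """Records which compression DNs were assigned to each file (sketch)."""

    def __init__(self):
        self._node_sets = {}  # file save path -> set of compression node IDs

    def record(self, file_path, node_ids):
        """Called after the compression node set for a file is determined."""
        self._node_sets[file_path] = set(node_ids)

    def is_authorized(self, file_path, node_id):
        """True if node_id belongs to the set recorded for file_path."""
        return node_id in self._node_sets.get(file_path, set())
```

With DN1 and DN2 recorded for the example path, a request from DN2 passes, while a request from DN3, or a request naming an unknown file, does not.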
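This step need not be performed by the compression storage agent module; for example, it may be performed by the hardware compression card, or a new module may be added to implement it.
The number of DNs in the DN list equals the number of replicas of the data block. The DN list must carry the information needed to identify each DN, for example the DN's host name, IP address, or port number. In Figure 3, the DN list contains 3 DNs, namely DN3 to DN5.
In this step, after the data blocks have been stored on the DNs, the file to be stored must still be restorable once the user issues an instruction to restore the original file. For this purpose, the embodiments of the present invention also provide a scheme for recording data block information on the NN side, as follows: DN2 sends the NN the data block number of the data block, which is used to determine the position of this data block within the whole file to be stored.
The numbering of data blocks may differ by application scenario. A generally applicable scheme is: fragment number + data block sequence number, where the fragment number is the sequence number of the file fragment among all fragments of the file to be stored, and the data block sequence number is the sequence number of the data block within its file fragment. For example, 1-001 necessarily precedes 2-001, so the order of the data blocks can still be determined.
In the following specific application scenario, where the number of file fragments equals the number of data compression nodes and the file fragments are sent to the data compression nodes one after another in the order of the DN sequence numbers, the data block number may instead take the form: DN number + data block sequence number. For example, the first data block obtained by DN1 is numbered 1-001, and the second data block obtained by DN2 is numbered 2-002.
After receiving an instruction to restore the original file to be stored, the NN can first find the data block numbers corresponding to the file to be stored and the DNs on which the blocks reside, read the data blocks from those DNs, and determine the order of the data blocks in the original file to be stored from the recorded data block numbers, thereby restoring the original file to be stored.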
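310: The compression storage agent module of DN2 calls FSDataOutputStream to store the data block on DN3 to DN5 in turn.
The blocks are stored in turn as follows: the compression storage agent module sends the data block to the first DN in the DN list (DN3), with the data block carried in the message. After saving the data block, DN3 sends it to the next DN in the DN list (DN4), and so on until the last DN in the list (DN5) has saved the data block.
311: DN5 to DN3 return write acknowledgements in turn to the FSDataOutputStream called by the compression storage agent module, confirming that the data block has been stored. After receiving the write acknowledgement, the FSDataOutputStream called by the compression storage agent module can store the next data block in the same way as the previous one. When all data blocks have been stored, the client node and the NN are notified that storage is complete, and the connections to the NN and the client node are closed.
The write acknowledgement travels back as follows: after saving the data block, the last DN in the DN list (DN5) sends a write acknowledgement to the second-to-last DN (DN4); DN4 forwards the write acknowledgement to the preceding DN, and so on until the first DN in the DN list (DN3), which forwards the write acknowledgement to the FSDataOutputStream called by the compression storage agent module. The compression storage agent module thereby determines that storage of one data block is complete.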
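The forward write and the backward acknowledgement chain of steps 310 and 311 can be simulated in memory. The sketch below replaces network messages with recursive calls and per-node lists; the function name and the `stores` structure are hypothetical.

```python
def pipeline_write(block, dn_list, stores):
    """Store a block on each DN in turn and return the acknowledgement path.

    dn_list: non-empty list of DN IDs in pipeline order (e.g. DN3, DN4, DN5).
    stores: dict of node_id -> list acting as that node's local block store.
    Returns the DN IDs in the order their write acknowledgements travel back.
    """
    def write_at(index):
        node = dn_list[index]
        stores[node].append(block)          # save locally first
        if index + 1 < len(dn_list):
            acks = write_at(index + 1)      # then forward to the next DN
        else:
            acks = []                       # last DN starts the ack chain
        acks.append(node)                   # pass the ack back upstream
        return acks

    return write_at(0)
```

For the list DN3, DN4, DN5 the returned acknowledgement path is DN5, DN4, DN3, matching the return route described above; only after it completes would the agent module move on to the next block.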
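If the client node maintains status information for the file fragments, it can also set the status of the file fragment corresponding to a DN that has reported storage completion to Finished. Once the client node determines that all file fragments are Finished, it can conclude that the file to be stored has been stored completely; it may then return a storage completion message to the NN and record that this distributed compression storage procedure is complete.
In this embodiment, the hardware compression cards on multiple DNs perform data compression, which increases the parallelism of compression and can shorten file compression time. On the hardware compression card, a file fragment can be compressed directly into HDFS Block-sized pieces; each time a data block is produced, the DN can store it to HDFS. The storage operations of the multiple DNs are parallel, with no need to wait until all data has been compressed and then have the node holding the data split and save it. Performing compression on hardware compression cards does not occupy the CPU resources of the DNs or the client node, which saves CPU resources.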
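An embodiment of the present invention provides a name node applied to a distributed file system, where the distributed file system includes a client node, the name node, and data nodes. As shown in Figure 4, the name node includes:
a first receiving unit 401, configured to receive a file creation request sent by the client node;
a first determining unit 402, configured to: after the first receiving unit 401 receives the file creation request sent by the client node, determine a data compression node set, where the data compression node set contains at least two data compression nodes, and a data compression node is a data node having data compression processing resources;
a first sending unit 403, configured to send the data compression node set determined by the first determining unit 402 to the client node;
a second receiving unit 404, configured to receive a node acquisition request sent by a data compression node in the data compression node set;
a second determining unit 405, configured to: after the second receiving unit 404 receives the node acquisition request sent by a data compression node in the data compression node set, determine a data storage node, where a data storage node is a data node having data storage resources;
a second sending unit 406, configured to send information about the data storage node determined by the second determining unit 405 to the data compression node corresponding to the node acquisition request.
Optionally, the first determining unit 402 is configured to select at least two data compression nodes whose currently available compression processing resources meet a predetermined criterion, and use the set of the selected at least two data compression nodes as the data compression node set.
Optionally, the second determining unit 405 is specifically configured to: after the first receiving unit 401 receives the node acquisition request, determine whether the data compression node belongs to the data compression node set and, if so, determine a data storage node.
Further, as shown in Figure 5, the name node further includes:
a first recording unit 501, configured to: after the first determining unit 402 determines the data compression node set, record the data compression node set and the information about the file to be stored that corresponds to the data compression node set;
the node acquisition request carries the information about the file to be stored to which the data block belongs, and the identifier of the data compression node;
the second determining unit 405 is specifically configured to determine the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and determine whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
Further, as shown in Figure 6, the name node further includes:
a second recording unit 601, configured to: after the first determining unit 402 receives the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
the second recording unit 601 is further configured to: after the second determining unit 405 determines the data storage node, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the file fragment to which the data block belongs.
Further, as shown in Figure 7, the name node further includes:
a first restoring unit 701, configured to: in the process of restoring the file to be stored, determine, according to the data block number recorded by the second recording unit 601, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
Further, as shown in Figure 8, the name node further includes:
a third recording unit 801, configured to: after the first determining unit 402 receives the file creation request sent by the client node, record the file name of the file to be stored that the file creation request specifies is to be saved;
the third recording unit 801 is further configured to: after the data storage node is determined, if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the data compression nodes' sequence numbers, record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number contains the sequence number of the data block within the file fragment in which it is located and the sequence number of the data compression node.
Further, as shown in Figure 9, the name node further includes:
a second restoring unit 901, configured to: in the process of restoring the file to be stored, determine, according to the data block number recorded by the third recording unit 801, the file to be stored to which the data block belongs, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.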
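An embodiment of the present invention further provides another name node. As shown in FIG. 10, it includes a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004, where the memory 1004 may be used for purposes such as caching data during data processing by the processor 1003, and may also be used for storing data.
The name node is applied to a distributed file system that includes a client node, the name node, and data nodes. In this embodiment, the distributed file system may be any distributed file system, and the embodiment is applicable to HDFS in particular.
The receiver 1001 is configured to receive a file creation request sent by the client node;
the processor 1003 is configured to determine a data compression node set after the file creation request sent by the client node is received, where the data compression node set includes at least two data compression nodes, and a data compression node is a data node that has data compression processing resources;
the transmitter 1002 is configured to send the data compression node set to the client node;
the receiver 1001 is further configured to receive a node acquisition request sent by a data compression node in the data compression node set;
the processor 1003 is configured to determine a data storage node after the node acquisition request sent by the data compression node in the data compression node set is received, where a data storage node is a data node that has data storage resources; and
the transmitter 1002 is configured to send information about the determined data storage node to the data compression node corresponding to the node acquisition request.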
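In this embodiment, the data compression node set determined by the name node includes at least two data compression nodes, and the data compression nodes in the set take part in compressing the file to be stored. Because the data compression nodes are data nodes, the node management function of the name node needs little modification; more importantly, the data compression and storage processes of the data compression nodes run in parallel. With this embodiment of the present invention, compression and storage of the file to be stored are therefore no longer limited by the processing capability of the client node, which can improve the data compression and storage efficiency of the distributed system and increase its speed.
The name node manages the data compression nodes and the data storage nodes, and needs to determine which nodes can serve as data compression nodes in a given data compression storage procedure. This embodiment also provides a strategy for determining the data compression nodes: the processor 1003 being configured to determine the data compression node set includes: selecting at least two data compression nodes whose currently available compression processing resources meet a predetermined standard; and using the set of the selected at least two data compression nodes as the data compression node set.
In this embodiment, the selection uses the currently available compression processing resources of all data compression nodes as the criterion. The available compression processing resources may include the resources most directly used for data compression, such as idle compression computing resources, and may also include resources needed to support compression processing, such as resources for transmitting compressed data. Compression processing resources should therefore be understood broadly, and not simply as computing resources only.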
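One possible reading of this selection strategy is sketched below. The text does not define the "predetermined standard"; here it is assumed to be a single numeric threshold on a per-node score of available compression resources, and the function name and score scale are hypothetical.

```python
from typing import Dict, List


def select_compression_nodes(
    available_resources: Dict[str, float], threshold: float
) -> List[str]:
    """Return the data nodes whose available compression resources reach the
    threshold; the set must contain at least two nodes."""
    selected = sorted(
        node for node, free in available_resources.items() if free >= threshold
    )
    if len(selected) < 2:
        raise ValueError("fewer than two data compression nodes meet the standard")
    return selected
```

For example, with scores {"DN1": 0.8, "DN2": 0.6, "DN3": 0.1} and a threshold of 0.5, DN1 and DN2 would form the data compression node set.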
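In this embodiment, the name node manages the data compression storage procedure, so an authentication scheme can also be added to ensure that the client node distributes file fragments according to the compression node set determined by the name node: the processor 1003 being configured to determine a data storage node after receiving the node acquisition request sent by the data compression node includes: after receiving the node acquisition request, determining whether the data compression node belongs to the data compression node set, and if so, determining a data storage node.
After the name node in this embodiment determines the data compression node set, the processor 1003 is further configured to record the data compression node set and information about the file to be stored that corresponds to the data compression node set; the node acquisition request carries information about the file to be stored to which the data block belongs, and the identifier of the data compression node; and the processor 1003 being configured to determine whether the data compression node belongs to the data compression node set includes: determining the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and judging whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.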
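The membership check can be sketched as a small registry on the name node side. The class and method names are hypothetical, and the file name is used here as the "information about the file to be stored", which is an assumption.

```python
from typing import Dict, Iterable, Set


class CompressionSetRegistry:
    """Remembers which data compression node set was assigned to which file."""

    def __init__(self) -> None:
        self._sets: Dict[str, Set[str]] = {}

    def record(self, file_name: str, compression_nodes: Iterable[str]) -> None:
        """Record the node set chosen when the file creation request arrived."""
        self._sets[file_name] = set(compression_nodes)

    def is_authorized(self, file_name: str, node_id: str) -> bool:
        """Check whether the requesting node belongs to the set for this file."""
        return node_id in self._sets.get(file_name, set())
```

A node acquisition request would be served only when `is_authorized` returns True for the file and node identifier it carries.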
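This embodiment can implement compressed data storage. Based on the data compression storage procedure, this embodiment also provides for the data preparation needed if the user later requires data restoration; some data needs to be recorded on the name node side: the processor 1003 is further configured to record, after the file creation request sent by the client node is received, the file name of the file to be stored that the file creation request specifies is to be saved;
and, after the data storage node is determined, to record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number includes the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs.
In this embodiment, the sequence number of a file fragment is assigned, after the file to be stored is split into file fragments, in the order in which the fragments appear in the file. Because data blocks are obtained by compressing a file fragment, each data block belongs to a file fragment; compressing a file fragment may yield many data blocks, and the sequence number of a data block within its file fragment is likewise assigned sequentially.
Based on the data recorded in this embodiment, this embodiment also provides a data restoration scheme: the processor 1003 is further configured to, after recording the data block number of the data block and the identifier of the data storage node that stores the data block, in the process of restoring the file to be stored, determine the file to be stored to which the data block belongs according to the data block number, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
The above embodiment records the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs; this recording scheme is applicable in all scenarios. For specific scenarios, the content of the recorded data can be changed, and this embodiment also provides the following scheme: the processor 1003 is further configured to record, after the file creation request sent by the client node is received, the file name of the file to be stored that the file creation request specifies is to be saved;
and, after the data storage node is determined, if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes, to record the data block number of the data block and the identifier of the data storage node that stores the data block, where the data block number includes the sequence number of the data block within its file fragment and the sequence number of the data compression node.
Based on the data recorded in the above embodiment (the data block number includes the sequence number of the data block within its file fragment and the sequence number of the data compression node), this embodiment of the present invention also provides a processing scheme for the data restoration process: the processor 1003 is further configured to, after recording the data block number of the data block and the identifier of the data storage node that stores the data block, in the process of restoring the file to be stored, determine the file to be stored to which the data block belongs according to the data block number, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.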
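It is worth noting that the units of the above name node are merely divided according to functional logic, and the division is not limited to the one described, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
In addition, a person of ordinary skill in the art will understand that all or some of the steps in the above method embodiments can be completed by a program instructing related hardware. The corresponding program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed in the embodiments of the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.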

Claims (28)

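  1. A data compression storage method, applied to a distributed file system, the distributed file system including a client node, a name node, and data nodes, characterized in that the method comprises:
    determining, by the name node after receiving a file creation request sent by the client node, a data compression node set, wherein the data compression node set includes at least two data compression nodes, and a data compression node is a data node that has data compression processing resources;
    sending, by the name node, the data compression node set to the client node;
    determining, by the name node after receiving a node acquisition request sent by a data compression node in the data compression node set, a data storage node, wherein the data storage node is a data node that has data storage resources; and
    sending, by the name node, information about the determined data storage node to the data compression node corresponding to the node acquisition request.
  2. The method according to claim 1, characterized in that the determining a data compression node set comprises:
    selecting at least two data compression nodes whose currently available compression processing resources meet a predetermined standard; and using the set of the selected at least two data compression nodes as the data compression node set.
  3. The method according to claim 1 or 2, characterized in that the determining, by the name node after receiving the node acquisition request sent by the data compression node, a data storage node comprises:
    determining, by the name node after receiving the node acquisition request, whether the data compression node belongs to the data compression node set, and if so, determining a data storage node.
  4. The method according to claim 3, characterized in that:
    after the data compression node set is determined, the method further comprises: recording, by the name node, the data compression node set and information about the file to be stored that corresponds to the data compression node set;
    the node acquisition request carries information about the file to be stored to which the data block belongs, and the identifier of the data compression node; and
    the determining whether the data compression node belongs to the data compression node set comprises:
    determining, by the name node, the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and judging whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  5. The method according to claim 1, characterized in that:
    after the file creation request sent by the client node is received, the method further comprises: recording the file name of the file to be stored that the file creation request specifies is to be saved; and
    after the data storage node is determined, the method further comprises:
    recording the data block number of the data block and the identifier of the data storage node that stores the data block, wherein the data block number includes the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs.
  6. The method according to claim 5, characterized in that, after the recording of the data block number of the data block and the identifier of the data storage node that stores the data block, the method further comprises:
    in the process of restoring the file to be stored, determining the file to be stored to which the data block belongs according to the data block number, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
  7. The method according to claim 1, characterized in that:
    after the file creation request sent by the client node is received, the method further comprises: recording the file name of the file to be stored that the file creation request specifies is to be saved; and
    after the data storage node is determined, the method further comprises:
    if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set, and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes, recording the data block number of the data block and the identifier of the data storage node that stores the data block, wherein the data block number includes the sequence number of the data block within its file fragment and the sequence number of the data compression node.
  8. The method according to claim 7, characterized in that, after the recording of the data block number of the data block and the identifier of the data storage node that stores the data block, the method further comprises:
    in the process of restoring the file to be stored, determining the file to be stored to which the data block belongs according to the data block number, and determining the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.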
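  9. A distributed file system, comprising a client node, a name node, and data nodes, characterized in that:
    the client node obtains a file to be stored and sends a file creation request to the name node;
    the name node, after receiving the file creation request sent by the client node, determines a data compression node set, wherein the data compression node set includes at least two data compression nodes, and a data compression node is a data node that has data compression processing resources; the name node sends the data compression node set to the client node;
    the client node receives the data compression node set returned by the name node in response to the file creation request, splits the file to be stored into at least two file fragments, and then sends the file fragments to the data compression nodes in the data compression node set;
    a data compression node, after receiving a file fragment sent by the client node, compresses the received file fragment and splits it into data blocks; the data compression node sends a node acquisition request to the name node;
    the name node, after receiving the node acquisition request sent by the data compression node in the data compression node set, determines a data storage node, wherein the data storage node is a data node that has data storage resources; the name node sends information about the determined data storage node to the data compression node corresponding to the node acquisition request; and
    the data compression node receives the information about the data storage node sent by the name node; the data compression node sends the data block to the data storage node for storage.
  10. The system according to claim 9, characterized in that the determining a data compression node set comprises:
    the name node selects at least two data compression nodes whose currently available compression processing resources meet a predetermined standard, and uses the set of the selected at least two data compression nodes as the data compression node set.
  11. The system according to claim 9, characterized in that:
    the name node determining a data storage node after receiving the node acquisition request sent by the data compression node comprises:
    the name node, after receiving the node acquisition request, determines whether the data compression node belongs to the data compression node set, and if so, determines a data storage node.
  12. The system according to claim 11, characterized in that, after the name node determines the data compression node set, the system further comprises:
    the name node records the data compression node set and information about the file to be stored that corresponds to the data compression node set;
    the node acquisition request carries information about the file to be stored to which the data block belongs, and the identifier of the data compression node; the determining whether the data compression node belongs to the data compression node set comprises:
    the name node determines the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and judges whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  13. The system according to claim 9, characterized in that the system further comprises:
    the name node, after receiving the file creation request sent by the client node, records the file name of the file to be stored that the file creation request specifies is to be saved; and
    the name node, after determining the data storage node, records the data block number of the data block and the identifier of the data storage node that stores the data block, wherein the data block number includes the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs.
  14. The system according to claim 13, characterized in that the system further comprises:
    the name node, in the process of restoring the file to be stored, determines the file to be stored to which the data block belongs according to the data block number, and determines the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
  15. The system according to claim 9, characterized in that the system further comprises:
    the name node, after receiving the file creation request sent by the client node, records the file name of the file to be stored that the file creation request specifies is to be saved;
    the number of file fragments obtained by the client node by splitting the file to be stored equals the number of data compression nodes in the data compression node set, and the client node distributes the obtained file fragments to the data compression nodes in the order of the sequence numbers of the data compression nodes; and
    the name node, after determining the data storage node, records the data block number of the data block and the identifier of the data storage node that stores the data block, wherein the data block number includes the sequence number of the data block within its file fragment and the sequence number of the data compression node.
  16. The system according to claim 15, characterized in that the system further comprises:
    the name node, in the process of restoring the file to be stored, determines the file to be stored to which the data block belongs according to the data block number, and determines the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.
  17. The system according to claim 9, characterized in that:
    the client node splitting the file to be stored into at least two file fragments comprises: splitting the file to be stored into file fragments whose sizes correspond to the amount of compression processing resources currently available on each data compression node, wherein the number of file fragments equals the number of data compression nodes in the data compression node set; and
    the client node sending the file fragments to the data compression nodes in the data compression node set comprises: sending larger file fragments to data compression nodes in the set that currently have more available compression processing resources, and sending smaller file fragments to data compression nodes in the set that currently have fewer available compression processing resources.
  18. The system according to claim 17, characterized in that the number of file fragments is greater than or equal to the number of data compression nodes in the data compression node set; and
    the client node sending the file fragments to the data compression nodes in the data compression node set comprises: the client node sends the file fragments one by one to data compression nodes that currently have idle data compression processing resources.
  19. The system according to claim 9, characterized in that the system further comprises:
    the data compression node negotiates a data compression rule with the other data compression nodes before compressing the file fragment; and
    the data compression node compressing the file fragment into a compressed file comprises: the data compression node compresses the file fragment according to the negotiated data compression rule.
  20. The system according to any one of claims 9 to 19, characterized in that the system further comprises:
    the data compression node, before sending the data block to the data storage node for storage, generates a file compression header that carries indication information of the data compression rule, determines according to the data compression rule currently in use whether to merge the file compression header into the data block, and if so, merges the file compression header into the data block.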
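  21. A name node, applied to a distributed file system, the distributed file system including a client node, the name node, and data nodes, characterized in that the name node comprises:
    a first receiving unit, configured to receive a file creation request sent by the client node;
    a first determining unit, configured to determine a data compression node set after the first receiving unit receives the file creation request sent by the client node, wherein the data compression node set includes at least two data compression nodes, and a data compression node is a data node that has data compression processing resources;
    a first sending unit, configured to send the data compression node set determined by the first determining unit to the client node;
    a second receiving unit, configured to receive a node acquisition request sent by a data compression node in the data compression node set;
    a second determining unit, configured to determine a data storage node after the second receiving unit receives the node acquisition request sent by the data compression node in the data compression node set, wherein the data storage node is a data node that has data storage resources; and
    a second sending unit, configured to send information about the data storage node determined by the second determining unit to the data compression node corresponding to the node acquisition request.
  22. The name node according to claim 21, characterized in that:
    the first determining unit is configured to select at least two data compression nodes whose currently available compression processing resources meet a predetermined standard, and to use the set of the selected at least two data compression nodes as the data compression node set.
  23. The name node according to claim 21 or 22, characterized in that:
    the second determining unit is specifically configured to determine, after the first receiving unit receives the node acquisition request, whether the data compression node belongs to the data compression node set, and if so, to determine a data storage node.
  24. The name node according to claim 23, characterized in that the name node further comprises:
    a first recording unit, configured to record, after the first determining unit determines the data compression node set, the data compression node set and information about the file to be stored that corresponds to the data compression node set;
    the node acquisition request carries information about the file to be stored to which the data block belongs, and the identifier of the data compression node; and
    the second determining unit is specifically configured to determine the corresponding data compression node set according to the information about the file to be stored to which the data block belongs, and to judge whether the data compression node that sent the node acquisition request belongs to the determined data compression node set.
  25. The name node according to claim 21, characterized in that the name node further comprises:
    a second recording unit, configured to record, after the first determining unit receives the file creation request sent by the client node, the file name of the file to be stored that the file creation request specifies is to be saved;
    the second recording unit is further configured to record, after the second determining unit determines the data storage node, the data block number of the data block and the identifier of the data storage node that stores the data block, wherein the data block number includes the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs.
  26. The name node according to claim 25, characterized in that the name node further comprises:
    a first restoring unit, configured to, in the process of restoring the file to be stored, determine the file to be stored to which the data block belongs according to the data block number recorded by the second recording unit, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the file fragment to which the data block belongs, both contained in the data block number.
  27. The name node according to claim 21, characterized in that the name node further comprises:
    a third recording unit, configured to record, after the first determining unit receives the file creation request sent by the client node, the file name of the file to be stored that the file creation request specifies is to be saved;
    the third recording unit is further configured to, after the data storage node is determined, if the number of file fragments of the file to be stored equals the number of data compression nodes in the data compression node set and the file fragments are distributed to the data compression nodes in the order of the sequence numbers of the data compression nodes, record the data block number of the data block and the identifier of the data storage node that stores the data block, wherein the data block number includes the sequence number of the data block within its file fragment and the sequence number of the data compression node.
  28. The name node according to claim 27, characterized in that the name node further comprises:
    a second restoring unit, configured to, in the process of restoring the file to be stored, determine the file to be stored to which the data block belongs according to the data block number recorded by the third recording unit, and determine the order of the data block in the file to be stored according to the sequence number of the data block within its file fragment and the sequence number of the data compression node, both contained in the data block number.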
PCT/CN2014/094179 2014-12-18 2014-12-18 Data compression storage method and apparatus, and distributed file system WO2016095149A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
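CN201480037404.6A CN106170968B (zh) 2014-12-18 2014-12-18 Data compression storage method and apparatus, and distributed file system
PCT/CN2014/094179 WO2016095149A1 (zh) 2014-12-18 2014-12-18 Data compression storage method and apparatus, and distributed file system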

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/094179 WO2016095149A1 (zh) 2014-12-18 2014-12-18 Data compression storage method and apparatus, and distributed file system

Publications (1)

Publication Number Publication Date
WO2016095149A1 true WO2016095149A1 (zh) 2016-06-23

Family

ID=56125612

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/094179 WO2016095149A1 (zh) 2014-12-18 2014-12-18 Data compression storage method and apparatus, and distributed file system

Country Status (2)

Country Link
CN (1) CN106170968B (zh)
WO (1) WO2016095149A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156359A (zh) * 2016-07-28 2016-11-23 四川新环佳科技发展有限公司 一种云计算平台下的数据同步更新方法
CN106682227A (zh) * 2017-01-06 2017-05-17 郑州云海信息技术有限公司 基于分布式文件系统的日志数据存储系统及读写方法
CN108242931A (zh) * 2016-12-23 2018-07-03 航天星图科技(北京)有限公司 一种数据压缩提供方法
CN109302449A (zh) * 2018-08-31 2019-02-01 阿里巴巴集团控股有限公司 数据写入方法、数据读取方法、装置和服务器
CN109766319A (zh) * 2018-12-27 2019-05-17 网易(杭州)网络有限公司 压缩任务处理方法、装置、存储介质及电子设备
CN109831540A (zh) * 2019-04-12 2019-05-31 成都四方伟业软件股份有限公司 分布式存储方法、装置、电子设备及存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977442B (zh) * 2017-12-08 2020-08-07 北京希嘉创智教育科技有限公司 日志文件压缩及解压缩方法、电子设备和可读存储介质
CN114040027B (zh) * 2021-10-29 2023-11-24 深圳智慧林网络科技有限公司 一种基于双模式的数据压缩方法、装置和数据解压方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080001791A1 (en) * 2006-06-30 2008-01-03 Omneon Video Networks Transcoding for a distributed file system
CN103020205A (zh) * 2012-12-05 2013-04-03 北京普泽天玑数据技术有限公司 一种分布式文件系统上基于硬件加速卡的压缩解压缩方法
US20140358996A1 (en) * 2013-05-30 2014-12-04 Hon Hai Precision Industry Co., Ltd. Distributed encoding and decoding system, method, and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100837410B1 (ko) * 2006-11-30 2008-06-12 삼성전자주식회사 주관적인 무손실 이미지 데이터 압축 방법 및 장치
CN101605148A (zh) * 2009-05-21 2009-12-16 何吴迪 云存储的并行系统的架构方法
US8510267B2 (en) * 2011-03-08 2013-08-13 Rackspace Us, Inc. Synchronization of structured information repositories

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156359A (zh) * 2016-07-28 2016-11-23 四川新环佳科技发展有限公司 一种云计算平台下的数据同步更新方法
CN106156359B (zh) * 2016-07-28 2019-05-21 广东奥飞数据科技股份有限公司 一种云计算平台下的数据同步更新方法
CN108242931A (zh) * 2016-12-23 2018-07-03 航天星图科技(北京)有限公司 一种数据压缩提供方法
CN108242931B (zh) * 2016-12-23 2023-04-28 中科星图股份有限公司 一种数据压缩提供方法
CN106682227A (zh) * 2017-01-06 2017-05-17 郑州云海信息技术有限公司 基于分布式文件系统的日志数据存储系统及读写方法
CN109302449A (zh) * 2018-08-31 2019-02-01 阿里巴巴集团控股有限公司 数据写入方法、数据读取方法、装置和服务器
CN109766319A (zh) * 2018-12-27 2019-05-17 网易(杭州)网络有限公司 压缩任务处理方法、装置、存储介质及电子设备
CN109831540A (zh) * 2019-04-12 2019-05-31 成都四方伟业软件股份有限公司 分布式存储方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN106170968B (zh) 2019-09-20
CN106170968A (zh) 2016-11-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14908184

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14908184

Country of ref document: EP

Kind code of ref document: A1