WO2016095760A1 - Data dynamic re-distribution method, data node, name node and system - Google Patents

Data dynamic re-distribution method, data node, name node and system

Info

Publication number
WO2016095760A1
WO2016095760A1 PCT/CN2015/097172 CN2015097172W WO2016095760A1 WO 2016095760 A1 WO2016095760 A1 WO 2016095760A1 CN 2015097172 W CN2015097172 W CN 2015097172W WO 2016095760 A1 WO2016095760 A1 WO 2016095760A1
Authority
WO
WIPO (PCT)
Prior art keywords
data block
node
data
copy
target
Prior art date
Application number
PCT/CN2015/097172
Other languages
French (fr)
Chinese (zh)
Inventor
李嘉
刘杰
党李飞
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2016095760A1 publication Critical patent/WO2016095760A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the invention relates to the field of Internet big data processing technology, in particular to a method, a data node, a name node and a system for dynamically redistributing data.
  • HDFS Hadoop Distributed File System
  • the HDFS file system uses a blocking mechanism to distribute files in a distributed manner and improves system reliability through a block redundancy strategy.
  • Each data block has multiple copies in the system at the same time. These copies are distributed across multiple nodes in multiple racks of the system, which prevents the loss of data blocks due to the failure of a single node.
  • In order to implement this data block redundancy strategy, the HDFS file system must ensure that multiple copies are written simultaneously when writing data. The number of copies written is called the data block replication factor, which is usually three by default.
  • HDFS is a master-slave structure, which is generally composed of a Name Node ("NN") and a plurality of Data Nodes (DNs).
  • The NN, also called the master node, is responsible for managing the HDFS namespace and the block mapping information, configuring replication policies, and handling client requests.
  • The DN, also called a slave node, stores the actual data, performs read and write operations on data blocks, and periodically reports information about the data blocks it stores to the NN.
  • the client can access or manage HDFS through the command line; interact with the Name Node to obtain file location information; interact with the Data Node to perform data read and write operations.
  • The task of accessing a data block is usually assigned preferentially to a DN in which the target data block is stored (for a DN storing the target data block, such a task can be referred to as a local task), so that the task can read the target data block directly from that data node.
  • When the task is instead assigned to a DN that does not store the target data block (for such a DN, the task can be called a non-local task), the DN assigned the non-local task needs to read the target data block over the network from the DN where the target data block is stored.
  • the hotspot data block needs to be accessed through the network, which reduces the running speed of the task and consumes network resources. At the same time, the data node storing the hotspot data block will always receive the task of accessing the data block, so that the task load in the HDFS is unbalanced.
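  • The distinction between local and non-local tasks described above can be illustrated with a minimal sketch. The function name `classify_task` and the node names are hypothetical and are not part of HDFS; the sketch only assumes that the set of data nodes storing the target data block is known.

```python
# Hypothetical helper for illustration only; not an HDFS API.
def classify_task(assigned_node, block_locations):
    """Return 'local' if the assigned data node already stores the target block."""
    return "local" if assigned_node in block_locations else "non-local"


# Data nodes that currently hold the target data block (assumed names).
locations = {"dn2", "dn3", "dn4"}

print(classify_task("dn2", locations))
print(classify_task("dn1", locations))
```

  • A non-local result means the assigned data node would have to fetch the block over the network, which is the case the rest of the document tries to make rarer.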
  • The present invention provides a method, a data node, a name node, and a system for dynamic data redistribution.
  • the technical solution is as follows:
  • a method for dynamically redistributing data comprising:
  • the first data node receives a data block read command, where the data block read command is used to instruct the first data node to read a target data block located on the second data node, and the second data node and the first data node are different data nodes in the same HDFS;
  • the first modification instruction is used to instruct the name node to increase a replication factor of the target data block by one.
  • the method further includes:
  • deleting the data block copy from the first data node includes:
  • deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy includes:
  • the method further includes:
  • the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
  • a method for dynamically redistributing data comprising:
  • the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent by the first data node after it reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node;
  • the method includes:
  • a data node comprising:
  • a receiving module configured to receive a data block read command, where the data block read command is used to instruct the data node to read a target data block located on the second data node, and the second data node and the data node are different data nodes in the same HDFS;
  • An access module configured to access the target data block according to the data block read command
  • a generating module configured to generate a data block copy of the target data block on the data node
  • the first sending module is configured to send a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.
  • the data node further includes:
  • a deleting module configured to delete the data block copy from the data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.
  • the deleting module includes:
  • a receiving unit configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy;
  • a first deleting unit configured to delete the data block copy from the data node according to the data block copy deletion instruction
  • the deleting module includes:
  • a monitoring unit configured to monitor an unvisited duration of the data block copy
  • a second deleting unit configured to delete the data block copy from the data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.
  • the data node further includes:
  • a second sending module configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
  • a name node comprising:
  • a first receiving module configured to receive a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent by the first data node after it reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node;
  • a first modifying module configured to modify a replication factor of the target data block according to the first modification instruction.
  • the name node further includes:
  • a second receiving module configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement a replication factor of the target data block by one;
  • a second modifying module configured to modify a replication factor of the target data block according to the second modification instruction.
  • a data node comprising:
  • a processor, a memory, a bus, and a communication interface; the memory is configured to store computer-executable instructions, the processor is coupled to the memory via the bus, and when the computer is running, the processor executes the computer-executable instructions stored in the memory to cause the computer to perform the method as previously described.
  • a name node comprising:
  • a processor, a memory, a bus, and a communication interface; the memory is configured to store computer-executable instructions, the processor is coupled to the memory via the bus, and when the computer is running, the processor executes the computer-executable instructions stored in the memory to cause the computer to perform the method as previously described.
  • a system for dynamic data redistribution comprising a data node as hereinbefore described, and a name node as hereinbefore described.
  • the speed of running HDFS is increased due to the fast execution of local tasks and the low consumption of HDFS network resources.
  • the newly added data node storing the target data block shares the task load of a part of the data node that originally stored the target data block, and implements load balancing in HDFS.
  • FIG. 1 is a schematic diagram of a network architecture of an HDFS provided by the present invention
  • FIG. 2 is a flowchart of a method for dynamically redistributing data according to Embodiment 1 of the present invention
  • FIG. 3 is a flowchart of a method for dynamically redistributing data according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic structural diagram of a data node according to Embodiment 4 of the present invention.
  • FIG. 6 is a schematic structural diagram of a name node according to Embodiment 5 of the present invention.
  • FIG. 7 is a schematic structural diagram of a data node according to Embodiment 6 of the present invention.
  • FIG. 8 is a schematic structural diagram of a name node according to Embodiment 7 of the present invention.
  • FIG. 9 is a schematic structural diagram of a system for dynamically redistributing data according to Embodiment 8 of the present invention.
  • The HDFS usually includes a name node 11 for managing all data nodes 12 in the HDFS, and a data node 12 for storing data block information of the file; the name node 11 and the data node 12 in the HDFS communicate with each other through a network.
  • An embodiment of the present invention provides a method for dynamically redistributing data.
  • the method may be performed by a first data node, where the method includes:
  • Step S11 The first data node receives a data block read command, where the data block read command is used to instruct the first data node to read the target data block located on the second data node, where the second data node and the first data node are Different data nodes in the same HDFS.
  • Step S12 accessing the target data block according to the data block read command.
  • Since the first data node and the second data node are different data nodes in the same HDFS, the first data node needs to read the target data block from the second data node through the network when accessing it according to the data block read command.
  • In the HDFS system, tasks should as far as possible avoid accessing the target data block through the network.
  • Step S13 generating a data block copy of the target data block on the first data node.
  • If the data node to which a task is assigned does not store a data block of the task's target file, the assigned task is referred to as a non-local task; if the data node to which the task is assigned stores a data block of the task's target file, the task is referred to as a local task. Since a local task reads faster than a non-local task and does not occupy HDFS network resources, the reading speed of HDFS can be improved by increasing the localization probability of tasks.
  • the first data node copies the target data block and generates a data block copy of the target data block, increasing the number of copies of the target data block in the HDFS.
  • The number of tasks allocated to data nodes storing the target data block therefore increases, which raises the localization probability of HDFS tasks; at the same time, more tasks can read the target data block from their own local data node, which improves the running speed of the HDFS and saves the network resources of the HDFS.
  • the data node is always assigned a local task (that is, the localization principle), and the non-local task is allocated only after the local task is allocated. Therefore, the data node storing the target data block is always assigned the task of reading the target data block, and the load of the data node is large.
  • The first data node copies the target data block, which increases the number of data nodes storing the target data block and shares part of the task load of the data nodes storing the target data block, thereby achieving load balancing in the HDFS.
  • Step S14 sending a first modification instruction to the name node, the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.
  • the number of copies of the target data block in the HDFS is increased, and the copy factor information of the target data block stored in the name node needs to be updated.
  • In the embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes storing the target data block in the HDFS, so that when tasks requiring access to the target data block are assigned again, the number of local tasks in the HDFS (i.e., tasks assigned to a data node in which the target data block is stored) increases, which raises the localization probability of HDFS tasks.
  • the speed of running HDFS is increased due to the fast execution of local tasks and the low consumption of HDFS network resources.
  • the newly added data node storing the target data block shares the task load of a part of the data node that originally stored the target data block, and implements load balancing in HDFS.
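  • Steps S11 to S14 can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the classes `NameNode` and `DataNode`, the method `handle_read`, and the block identifier `blk_1` are hypothetical, the network read is simulated by a dictionary lookup, and none of this is the actual HDFS implementation.

```python
class NameNode:
    """Hypothetical name node that only tracks replication factors."""

    def __init__(self):
        self.replication = {}  # block id -> replication factor

    def increment_replication(self, block_id):
        # Effect of the first modification instruction: factor + 1.
        self.replication[block_id] = self.replication.get(block_id, 0) + 1


class DataNode:
    """Hypothetical data node holding blocks in memory."""

    def __init__(self, name, name_node):
        self.name = name
        self.name_node = name_node
        self.blocks = {}  # block id -> bytes

    def handle_read(self, block_id, source):
        # S11/S12: read the target block from the second data node
        # (a real system would do this over the network).
        data = source.blocks[block_id]
        # S13: generate a data block copy on this (first) data node.
        self.blocks[block_id] = bytes(data)
        # S14: send the first modification instruction to the name node.
        self.name_node.increment_replication(block_id)
        return data


nn = NameNode()
nn.replication["blk_1"] = 3          # default replication factor from the text
dn1 = DataNode("dn1", nn)            # first data node (no copy yet)
dn2 = DataNode("dn2", nn)            # second data node (stores the block)
dn2.blocks["blk_1"] = b"payload"

data = dn1.handle_read("blk_1", dn2)
print(nn.replication["blk_1"], "blk_1" in dn1.blocks)
```

  • After the call, the first data node holds its own copy, so a later task on the same block assigned to it would be a local task.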
  • Embodiments of the present invention provide a method for dynamically redistributing data.
  • The method may be performed by the name node, and the method includes:
  • Step S21 receiving a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent by the first data node after it reads the target data block on the second data node and generates a data block copy of the target data block on the first data node.
  • Step S22 modifying the replication factor of the target data block according to the first modification instruction.
  • The name node also receives the periodic heartbeat information sent by the data node (which includes information on the data blocks in the data node), and uses it to check and correct, in a timely manner, the data block information recorded on the name node.
  • In the embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes storing the target data block in the HDFS, so that when tasks requiring access to the target data block are assigned again, the number of local tasks in the HDFS (i.e., tasks assigned to a data node in which the target data block is stored) increases, which raises the localization probability of HDFS tasks.
  • the speed of running HDFS is increased due to the fast execution of local tasks and the low consumption of HDFS network resources.
  • the newly added data node storing the target data block shares the task load of a part of the data node that originally stored the target data block, and implements load balancing in HDFS.
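  • The name node side (steps S21 and S22, plus the heartbeat-based correction) might look roughly like the following sketch. The class `NameNodeState` and its methods are hypothetical names; the heartbeat is assumed to carry the full list of block identifiers stored on the reporting data node, which the source does not spell out.

```python
class NameNodeState:
    """Hypothetical bookkeeping kept by the name node."""

    def __init__(self):
        self.replication = {}  # block id -> replication factor
        self.locations = {}    # block id -> set of data node names

    def apply_modification(self, block_id, delta):
        # S22: delta is +1 for the first modification instruction.
        self.replication[block_id] = self.replication.get(block_id, 0) + delta

    def on_heartbeat(self, node, reported_blocks):
        # Check and correct the recorded locations against the report.
        reported = set(reported_blocks)
        for block_id, nodes in self.locations.items():
            if block_id in reported:
                nodes.add(node)
            else:
                nodes.discard(node)
        for block_id in reported:
            self.locations.setdefault(block_id, set()).add(node)


state = NameNodeState()
state.replication["blk_1"] = 3
state.locations["blk_1"] = {"dn2"}

state.apply_modification("blk_1", +1)   # first modification instruction
state.on_heartbeat("dn1", ["blk_1"])    # dn1 now reports the new copy
print(state.replication["blk_1"], sorted(state.locations["blk_1"]))
```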
  • An embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 4, the method includes:
  • step S31 the first data node sends a file open request to the name node.
  • the file open request includes a file name of the target file, a file offset, and a file data size.
  • the application on the first data node sends the file open request to the name node through a distributed file system client ("DFSclient").
  • Step S32 the name node sends file feedback information to the first data node according to the file open request.
  • the file feedback information includes a data block of the target file and an IP of the data node where each data block is located.
  • the name node may send a data block list List<LocatedBlock> block() corresponding to the target file to the first data node according to the file open request.
  • Step S33 when the file feedback information indicates that the target data block is stored in the first data node, the first data node directly reads the target data block; when the file feedback information indicates that the target data block is stored in the second data node, step S34 is performed.
  • The application on the first data node reads the target data block directly from the first data node through the DFSclient; specifically, the DFSclient reads the target data block through an FSDataInputStream object.
  • Step S34 the first data node reads the target data block located on the second data node, and the second data node and the first data node are different data nodes in the same HDFS.
  • The DFSclient on the first data node sends a file read request to the second data node through the FSDataInputStream object, and receives the target data block returned by the second data node.
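  • The decision made in steps S33 and S34 can be sketched as follows. The function `plan_read`, the dictionary layout of the file feedback information, and the IP addresses are illustrative assumptions; picking the first listed holder for a remote read is also an assumption, since the source does not say how the second data node is chosen.

```python
# Hypothetical sketch of the S33/S34 decision; not the DFSclient API.
def plan_read(self_ip, feedback, block_id):
    """Decide whether a block can be read locally or must be fetched remotely.

    `feedback` maps each block id of the target file to the IPs of the
    data nodes where that block is located (the file feedback information).
    """
    holders = feedback[block_id]
    if self_ip in holders:
        return ("local", self_ip)      # S33: read directly
    return ("remote", holders[0])      # S34: read from another data node


feedback = {
    "blk_1": ["10.0.0.2"],
    "blk_2": ["10.0.0.1", "10.0.0.3"],
}

print(plan_read("10.0.0.1", feedback, "blk_1"))
print(plan_read("10.0.0.1", feedback, "blk_2"))
```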
  • step S35 the first data node generates a data block copy of the target data block.
  • the first data node copies the target data block, thereby generating a copy of the data block.
  • Step S36 The first data node sends a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.
  • the number of copies of the target data block in the HDFS is increased, and the copy factor information of the target data block stored in the name node needs to be updated.
  • the variation range of the replication factor can be limited.
  • For example, the variation range of the replication factor can be set to 1 to 512.
  • Alternatively, the variation range of the replication factor can be set to 3 to 512.
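  • One way to keep the replication factor inside such a range is to clamp it whenever a modification instruction is applied. The source only says the range can be limited and does not state what happens at the bounds, so the clamping behaviour, the function `adjust_replication`, and the constant names below are assumptions.

```python
MIN_REPLICATION = 3    # lower bound of the second range in the text
MAX_REPLICATION = 512  # upper bound given in the text


def adjust_replication(current, delta, low=MIN_REPLICATION, high=MAX_REPLICATION):
    """Apply a +1/-1 modification and clamp the result to [low, high]."""
    return max(low, min(high, current + delta))


print(adjust_replication(3, +1))
print(adjust_replication(512, +1))
print(adjust_replication(3, -1))
```

  • Passing `low=1` gives the wider 1 to 512 range mentioned first.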
  • Step S37 the name node receives the first modification instruction sent by the first data node, and modifies the replication factor of the target data block according to the first modification instruction.
  • Step S38 when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy, the first data node deletes the data block copy.
  • Setting a lifetime for the generated data block copy of the target data block prevents a data block copy that is not accessed for a long time from occupying the storage space of the data node.
  • The lifetime of the data block copy may be set as follows: an initial lifetime is set for the newly added data block copy, and when the frequency (or number of times) with which the data block copy is accessed within a set duration reaches a set value, the initial lifetime is extended, and the extended value becomes the new lifetime of the data block copy.
  • the lifetime of the data block replica may also be statically set.
  • A lifetime is set only for these newly added data block copies; the copies formed when the file is created are not given a lifetime.
  • the copy of the data block deleted here is only for the newly added copy of the data block, and the copy formed when the file is created will not be deleted.
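  • The dynamically extended lifetime described above can be sketched with explicit timestamps. The class `ReplicaLifetime`, its parameters, and the choice to extend by a fixed amount and reset the access counter after each extension are assumptions; the source only states that the lifetime is extended when the access count within a set duration reaches a set value.

```python
class ReplicaLifetime:
    """Hypothetical lifetime tracker for one newly added data block copy."""

    def __init__(self, created_at, initial_lifetime, window, threshold, extension):
        self.last_access = created_at
        self.lifetime = initial_lifetime
        self.window = window        # the "set duration"
        self.threshold = threshold  # the "set value" for the access count
        self.extension = extension  # how much to extend by (assumed)
        self.recent = []            # access times inside the current window

    def record_access(self, now):
        self.last_access = now
        self.recent = [t for t in self.recent if now - t < self.window]
        self.recent.append(now)
        if len(self.recent) >= self.threshold:
            self.lifetime += self.extension
            self.recent.clear()

    def expired(self, now):
        # The copy should be deleted once it has been idle past its lifetime.
        return now - self.last_access > self.lifetime


copy = ReplicaLifetime(created_at=0, initial_lifetime=60,
                       window=10, threshold=3, extension=60)
for t in (1, 2, 3):
    copy.record_access(t)

print(copy.lifetime, copy.expired(100), copy.expired(124))
```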
  • the length of time that the data block copy is not accessed may be monitored by the first data node or by the name node.
  • step S38 may include the following steps:
  • the first data node monitors an unvisited duration of the data block copy.
  • the length of time that the block copy is not accessed can be monitored by adding a new replication manager (ReplicaManager) on the first data node.
  • step S38 may include the following steps:
  • the name node monitors a length of time that the data block copy is not accessed.
  • when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy, the name node sends a data block copy deletion instruction to the first data node, where the data block copy deletion instruction includes an identifier of the data block copy to be deleted;
  • the first data node receives a data block copy deletion instruction sent by the name node.
  • the first data node deletes the data block copy according to the data block copy deletion instruction.
  • The name node may monitor the data block copy through the information on the data block copy included in the heartbeat information that the first data node sends to the name node.
  • Step S39 The first data node sends a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
  • Step S40 the name node receives the second modification instruction sent by the first data node, and modifies the replication factor of the target data block according to the second modification instruction.
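  • For the variant in which the name node does the monitoring, steps S38 to S40 can be sketched as below. The functions `expired_copies` and `apply_deletion`, the dictionary shapes, and the use of the block identifier as the copy identifier are assumptions made for illustration; the replication table stands in for the name node's record, and the direct decrement stands in for the second modification instruction.

```python
def expired_copies(last_access, lifetime, now):
    """Name node side: build deletion instructions for idle copies.

    `last_access` maps (data node, copy id) to the last access time.
    """
    return [
        {"node": node, "copy_id": copy_id}
        for (node, copy_id), accessed in sorted(last_access.items())
        if now - accessed > lifetime
    ]


def apply_deletion(stored_copies, replication, instruction):
    """Data node side: delete the copy, then report a decrement by one."""
    copy_id = instruction["copy_id"]
    if stored_copies.pop(copy_id, None) is not None:
        replication[copy_id] -= 1  # effect of the second modification instruction


last_access = {("dn1", "blk_1"): 0, ("dn1", "blk_2"): 90}
dn1_copies = {"blk_1": b"a", "blk_2": b"b"}
replication = {"blk_1": 4, "blk_2": 4}

instructions = expired_copies(last_access, lifetime=60, now=100)
for instruction in instructions:
    apply_deletion(dn1_copies, replication, instruction)

print(instructions)
print(replication)
```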
  • If the data node to which a task is assigned does not store a data block of the task's target file, the assigned task is referred to as a non-local task; if the data node to which the task is assigned stores a data block of the task's target file, the task is referred to as a local task.
  • the first data node reads the target data block located on the second data node, indicating that the HDFS allocates a non-local task to the first data node.
  • If the target data block is not copied to the first data node, the first data node still needs to read the target data block from the second data node the next time it is assigned the same task.
  • The first data node generates a data block copy of the target data block, which increases the number of copies of the target data block in the HDFS.
  • The number of tasks allocated to data nodes storing the target data block therefore increases, which raises the localization probability of HDFS tasks; at the same time, more tasks can read the target data block from their own local data node, which improves the running speed of the HDFS and saves the network resources of the HDFS.
  • HDFS always assigns local tasks to the data nodes (that is, the localization principle), and only assigns non-local tasks after the local tasks are allocated. Therefore, the data node storing the target data block is always assigned the task of reading the target data block, and the load of the data node is large.
  • The first data node copies the target data block, which increases the number of data nodes storing the target data block and shares part of the task load of the data nodes storing the target data block, thereby achieving load balancing in the HDFS.
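  • The load-sharing effect can be illustrated with a toy scheduler. This is not the HDFS or MapReduce scheduler: the function `assign_tasks` is hypothetical and simply assumes that every task reading the block goes to the least-loaded data node that stores it, which is a simplified reading of the localization principle.

```python
def assign_tasks(n_tasks, holders, all_nodes):
    """Assign each task to the least-loaded node that stores the block."""
    load = {node: 0 for node in all_nodes}
    for _ in range(n_tasks):
        target = min(sorted(holders), key=lambda node: load[node])
        load[target] += 1
    return load


nodes = ["dn1", "dn2"]
before = assign_tasks(6, {"dn2"}, nodes)          # only dn2 stores the block
after = assign_tasks(6, {"dn1", "dn2"}, nodes)    # dn1 now holds a copy too

print(before)
print(after)
```

  • With a single holder all six tasks land on one node; once the copy exists on the first data node the same tasks are split between the two holders.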
  • In the embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes storing the target data block in the HDFS, so that when tasks requiring access to the target data block are assigned again, the number of local tasks in the HDFS (i.e., tasks assigned to a data node in which the target data block is stored) increases, which raises the localization probability of HDFS tasks.
  • the speed of running HDFS is increased due to the fast execution of local tasks and the low consumption of HDFS network resources.
  • the newly added data node storing the target data block shares the task load of a part of the data node that originally stored the target data block, and implements load balancing in HDFS.
  • the embodiment of the present invention provides a data node.
  • the data node includes: a receiving module 501, an access module 502, a generating module 503, and a first sending module 504.
  • the receiving module 501 is configured to receive a data block read command, where the data block read command is used to instruct the data node to read the target data block located on the second data node, where the second data node and the data node are different in the same HDFS Data node.
  • the data node indicated by the data block read command may be the first data node in the first to third embodiments.
  • the access module 502 is configured to access the target data block according to the data block read command.
  • Since the data node (i.e., the first data node) and the second data node are different data nodes in the same HDFS, the data node needs to read the target data block from the second data node through the network when accessing it according to the data block read command.
  • In the HDFS system, tasks should as far as possible avoid accessing the target data block through the network.
  • the generating module 503 is configured to generate a data block copy of the target data block on the data node.
  • the first sending module 504 is configured to send a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.
  • the number of copies of the target data block in the HDFS is increased, and the copy factor information of the target data block stored in the name node needs to be updated.
  • the data node further includes: a deleting module 505.
  • the deleting module 505 is configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
  • Setting a lifetime for the generated data block copy of the target data block prevents a data block copy that is not accessed for a long time from occupying the storage space of the data node.
  • The lifetime of the data block copy may be set as follows: an initial lifetime is set for the newly added data block copy, and when the frequency (or number of times) with which the data block copy is accessed within a set duration reaches a set value, the initial lifetime is extended, and the extended value becomes the new lifetime of the data block copy.
  • the lifetime of the data block replica may also be statically set.
  • the length of time that the data block copy is not accessed may be monitored by the data node (ie, the first data node) or by the name node.
  • the deleting module 505 further includes: a receiving unit 515 and a first deleting unit 525;
  • the receiving unit 515 is configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy.
  • the first deleting unit 525 is configured to delete the data block copy from the data node according to the data block copy deletion instruction.
  • the deletion module 505 further includes a monitoring unit 535 and a second deleting unit 545 when the data node monitors the unvisited duration of the data block copy.
  • the monitoring unit 535 is configured to monitor the length of time that the data block copy is not accessed.
  • the second deleting unit 545 is configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
  • the data node further includes: a second sending module 506.
  • the second sending module 506 is configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
  • If the data node to which a task is assigned does not store a data block of the task's target file, the assigned task is referred to as a non-local task; if the data node to which the task is assigned stores a data block of the task's target file, the task is referred to as a local task.
  • the local task can be read faster than the non-local task and does not need to occupy the network resources of the HDFS. Therefore, in order to improve the reading speed of the HDFS, the localization probability of the task can be improved.
  • the data node (ie, the first data node) copies the target data block and generates a data block copy of the target data block, increasing the number of copies of the target data block in the HDFS.
  • The number of tasks allocated to data nodes storing the target data block therefore increases, which raises the localization probability of HDFS tasks; at the same time, more tasks can read the target data block from their own local data node, which improves the running speed of the HDFS and saves the network resources of the HDFS.
  • the data node is always assigned a local task (that is, the localization principle), and the non-local task is allocated only after the local task is allocated. Therefore, the data node storing the target data block is always assigned the task of reading the target data block, and the load of the data node is large.
  • The data node (i.e., the first data node) copies the target data block, which increases the number of data nodes storing the target data block and shares the task load of the data nodes storing the target data block, thereby achieving load balancing in the HDFS.
  • In the embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes storing the target data block in the HDFS, so that when tasks requiring access to the target data block are assigned again, the number of local tasks in the HDFS (i.e., tasks assigned to a data node in which the target data block is stored) increases, which raises the localization probability of HDFS tasks.
  • the speed of running HDFS is increased due to the fast execution of local tasks and the low consumption of HDFS network resources.
  • the newly added data node storing the target data block shares the task load of a part of the data node that originally stored the target data block, and implements load balancing in HDFS.
  • The name node includes a first receiving module 601 and a first modifying module 602.
  • The first receiving module 601 is configured to receive a first modification instruction sent by the data node, where the first modification instruction instructs the name node to increase by one the replication factor of the data block of the target file that was copied on the data node.
  • The first modifying module 602 is configured to modify the replication factor of the target data block according to the first modification instruction.
  • The name node also receives the periodic heartbeat information sent by the data node (the heartbeat includes information about the data blocks held by the data node) and uses it to check and promptly correct the data block information recorded on the name node.
  • the name node further includes: a second receiving module 603 and a second modifying module 604.
  • the second receiving module 603 is configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
  • The second modifying module 604 is configured to modify the replication factor of the target data block according to the second modification instruction.
  • In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes in HDFS that store the target data block. When tasks that need to access the target data block are assigned again, the number of local tasks in HDFS (i.e., tasks assigned to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks.
  • Because local tasks execute quickly and consume few HDFS network resources, HDFS runs faster.
  • The newly added data node storing the target data block takes over part of the task load of the data nodes that originally stored it, achieving load balancing in HDFS.
  • An embodiment of the present invention provides a data node.
  • the data node includes:
  • In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes in HDFS that store the target data block. When tasks that need to access the target data block are assigned again, the number of local tasks in HDFS (i.e., tasks assigned to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks.
  • Because local tasks execute quickly and consume few HDFS network resources, HDFS runs faster.
  • The newly added data node storing the target data block takes over part of the task load of the data nodes that originally stored it, achieving load balancing in HDFS.
  • An embodiment of the present invention provides a name node.
  • the name node includes:
  • In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes in HDFS that store the target data block. When tasks that need to access the target data block are assigned again, the number of local tasks in HDFS (i.e., tasks assigned to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks.
  • Because local tasks execute quickly and consume few HDFS network resources, HDFS runs faster.
  • The newly added data node storing the target data block takes over part of the task load of the data nodes that originally stored it, achieving load balancing in HDFS.
  • An embodiment of the present invention provides a system for dynamically redistributing data.
  • the system includes: a data node 50 as described in Embodiment 4, and a name node 60 as described in Embodiment 5.
  • In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one.
  • This increases the number of data nodes in HDFS that store the target data block. When tasks that need to access the target data block are assigned again, the number of local tasks in HDFS (i.e., tasks assigned to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks.
  • Because local tasks execute quickly and consume few HDFS network resources, HDFS runs faster.
  • The newly added data node storing the target data block takes over part of the task load of the data nodes that originally stored it, achieving load balancing in HDFS.
  • When the data node provided by the foregoing embodiment implements the method for dynamically redistributing data, the division into the functional modules described above is used only as an example. In practice, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • The data node embodiment and the embodiment of the method for dynamically redistributing data share the same concept. For implementation details, refer to the method embodiment; they are not repeated here.
  • A person skilled in the art will understand that all or part of the steps of the above embodiments may be carried out by hardware, or by a program that instructs the relevant hardware, and that the program may be stored in a computer-readable storage medium.
  • The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Abstract

The present invention belongs to the technical field of Internet big data processing and discloses a method, a data node, a name node, and a system for dynamically redistributing data. The method comprises: a first data node receiving a data block read command (S11); accessing a target data block according to the data block read command (S12); generating a data block copy of the target data block on the first data node (S13); and sending a first modification instruction to the name node (S14). Because the data block copy of the target data block is generated on the first data node, the number of data nodes in HDFS that store the target data block increases. When tasks that access the target data block are assigned again, the number of local tasks increases, which reduces consumption of HDFS network resources, improves the running speed of HDFS, and shares the task load of the data nodes storing the data block of the target file, thereby achieving load balancing.

Description

Method, data node, name node and system for dynamically redistributing data
Technical Field
The present invention relates to the field of Internet big data processing technology, and in particular to a method, a data node, a name node, and a system for dynamically redistributing data.
Background
The Hadoop Distributed File System (HDFS) is an excellent distributed file system that can be used to store massive amounts of data. HDFS is now widely used in large online services and large storage systems.
HDFS stores files in a distributed manner using a block mechanism and improves system reliability through a data block redundancy strategy. Each data block has multiple copies in the system at the same time, and these copies are distributed across multiple nodes in multiple racks, which prevents the loss of a data block when a single node fails. To implement this redundancy strategy, HDFS must ensure that multiple copies are written simultaneously when data is written. The number of copies written is called the replication factor of the data block, and it usually defaults to three.
HDFS has a master-slave structure and generally consists of one name node (Name Node, "NN") and multiple data nodes (Data Node, "DN"). The NN, also called the master node, manages the HDFS namespace and the data block mapping information, configures the replica policy, and handles client requests. A DN, also called a slave node, stores the actual data, performs read and write operations on data blocks, and periodically reports information about its stored data blocks to the NN. A client can access HDFS or manage it through the command line; it interacts with the Name Node to obtain file location information and with the Data Nodes to read and write data.
In HDFS, a task that accesses a data block is usually assigned preferentially to a DN that stores the target data block (for that DN, such a task is called a local task), so that the task can read the target data block directly from that data node. When the number of tasks that the data nodes storing the target data block can run reaches its maximum, the task is assigned to a DN that does not store the target data block (for that DN, such a task is called a non-local task). A DN that is assigned a non-local task must read the target data block over the network from a DN that stores it.
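The local-first assignment rule described above can be sketched as follows. This is only an illustration in Python, not HDFS scheduler code: the function `assign_task`, the per-node limit `max_tasks`, and the two dictionaries are hypothetical names introduced for this example, and every node is assumed to appear in `running`.

```python
def assign_task(block_id, block_locations, running, max_tasks):
    """Pick a data node for a task that reads block_id.

    block_locations: dict block_id -> set of node names storing the block
    running:         dict node name -> number of tasks currently running
    max_tasks:       hypothetical per-node limit on concurrent tasks
    Returns (node, is_local), or (None, False) if every node is full.
    """
    holders = block_locations.get(block_id, set())
    # Prefer a node that stores the block (local task), least loaded first.
    local = sorted((n for n in holders if running[n] < max_tasks),
                   key=lambda n: (running[n], n))
    if local:
        running[local[0]] += 1
        return local[0], True
    # All holders are full: fall back to another node (non-local task),
    # which will have to read the block over the network.
    others = sorted((n for n in running
                     if n not in holders and running[n] < max_tasks),
                    key=lambda n: (running[n], n))
    if others:
        running[others[0]] += 1
        return others[0], False
    return None, False


locations = {"blk_1": {"dn1"}}
running = {"dn1": 0, "dn2": 0}
first = assign_task("blk_1", locations, running, max_tasks=1)
second = assign_task("blk_1", locations, running, max_tasks=1)
```

With a single holder and one slot per node, the first task is local to `dn1` and the second becomes a non-local task on `dn2`, which is the situation that makes a frequently read block expensive.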
In the process of implementing the present invention, the inventors found that the prior art has at least the following problems:
Because user access to data is uneven and unpredictable, some data blocks are accessed excessively during a certain period and become hotspot data blocks. Because of the replication factor limit (the maximum number of copies of a block in the system does not exceed the replication factor), when multiple tasks need to access a hotspot data block at the same time, some tasks inevitably have to access it over the network, which both slows the tasks down and consumes network resources. Meanwhile, the data nodes that store the hotspot data block keep receiving tasks that access it, so the task load in HDFS is unbalanced.
Summary of the Invention
In existing HDFS, because of the replication factor limit, excessive access to hotspot data blocks slows operation, wastes network resources, and unbalances the load. To solve this problem, embodiments of the present invention provide a method, a data node, a name node, and a system for dynamically redistributing data. The technical solutions are as follows:
In a first aspect, a method for dynamically redistributing data is provided, the method comprising:
a first data node receiving a data block read command, where the data block read command instructs the first data node to read a target data block located on a second data node, and the second data node and the first data node are different data nodes in the same HDFS;
accessing the target data block according to the data block read command;
generating a data block copy of the target data block on the first data node; and
sending a first modification instruction to a name node, where the first modification instruction instructs the name node to increase the replication factor of the target data block by one.
Specifically, the method further includes:
deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the time to live of the data block copy.
Further, the deleting of the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the time to live of the data block copy includes:
receiving a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy; and
deleting the data block copy from the first data node according to the data block copy deletion instruction;
or,
the deleting of the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the time to live of the data block copy includes:
monitoring the length of time for which the data block copy has not been accessed; and
deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the time to live of the data block copy.
Further, the method further includes:
sending a second modification instruction to the name node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
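The second deletion variant, in which the data node itself monitors how long a copy has gone unaccessed, could look roughly like the sketch below. It is an assumption-laden illustration: the class `ReplicaCache`, the callback `send_to_name_node`, the message format, and the injectable `clock` are all hypothetical and do not come from the patent or from the Hadoop code base.

```python
import time


class ReplicaCache:
    """Tracks dynamically created block copies on one data node (illustrative)."""

    def __init__(self, ttl_seconds, send_to_name_node, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.send = send_to_name_node   # hypothetical callback to the name node
        self.clock = clock
        self.last_access = {}           # block_id -> time of last access

    def add(self, block_id):
        """Register a newly generated copy; its idle timer starts now."""
        self.last_access[block_id] = self.clock()

    def touch(self, block_id):
        """Record an access, which resets the idle timer of the copy."""
        if block_id in self.last_access:
            self.last_access[block_id] = self.clock()

    def evict_expired(self):
        """Delete copies idle for longer than the TTL and notify the name node."""
        now = self.clock()
        expired = [b for b, t in self.last_access.items() if now - t > self.ttl]
        for block_id in expired:
            del self.last_access[block_id]
            # Second modification instruction: replication factor minus one.
            self.send({"block": block_id, "delta": -1})
        return expired


# Usage with a fake clock so the example is deterministic.
now = [0.0]
sent = []
cache = ReplicaCache(ttl_seconds=60, send_to_name_node=sent.append,
                     clock=lambda: now[0])
cache.add("blk_1")
cache.add("blk_2")
now[0] = 50.0
cache.touch("blk_2")        # blk_2 is read again, blk_1 stays idle
now[0] = 70.0
evicted = cache.evict_expired()
```

Only `blk_1` has been idle for more than 60 seconds at that point, so it is the only copy removed and the only one for which a decrement message is sent.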
In a second aspect, a method for dynamically redistributing data is provided, the method comprising:
receiving a first modification instruction sent by a first data node, where the first modification instruction instructs a name node to increase the replication factor of a target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on a second data node and generates a data block copy of the target data block on the first data node; and
modifying the replication factor of the target data block according to the first modification instruction.
Specifically, the method includes:
receiving a second modification instruction sent by the first data node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one; and
modifying the replication factor of the target data block according to the second modification instruction.
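On the name node side, handling the two modification instructions amounts to per-block bookkeeping of the replication factor. The sketch below is a simplified model under stated assumptions: `NameNodeMetadata` and `apply_modification` are hypothetical names, and refusing to go below the default factor is a design choice made for this example, not something the text above specifies.

```python
class NameNodeMetadata:
    """Holds per-block replication factors (illustrative, not the HDFS API)."""

    def __init__(self, default_factor=3):
        self.default_factor = default_factor
        self.factors = {}   # block_id -> current replication factor

    def replication_factor(self, block_id):
        """Return the recorded factor, or the default for an unseen block."""
        return self.factors.get(block_id, self.default_factor)

    def apply_modification(self, block_id, delta):
        """Apply a first (+1) or second (-1) modification instruction."""
        if delta not in (1, -1):
            raise ValueError("delta must be +1 or -1")
        new_factor = self.replication_factor(block_id) + delta
        # Assumption for this sketch: never drop below the configured default.
        if new_factor < self.default_factor:
            raise ValueError("replication factor would drop below the default")
        self.factors[block_id] = new_factor
        return new_factor


meta = NameNodeMetadata()
after_copy = meta.apply_modification("blk_1", +1)    # copy created on a data node
after_delete = meta.apply_modification("blk_1", -1)  # copy expired and was deleted
```

Keeping the increment and decrement symmetric means that a block returns to its default factor once every dynamically created copy has expired.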
In a third aspect, a data node is provided, the data node comprising:
a receiving module, configured to receive a data block read command, where the data block read command instructs the data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same HDFS;
an access module, configured to access the target data block according to the data block read command;
a generating module, configured to generate a data block copy of the target data block on the data node; and
a first sending module, configured to send a first modification instruction to a name node, where the first modification instruction instructs the name node to increase the replication factor of the target data block by one.
Specifically, the data node further includes:
a deleting module, configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the time to live of the data block copy.
Further, the deleting module includes:
a receiving unit, configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy; and
a first deleting unit, configured to delete the data block copy from the data node according to the data block copy deletion instruction;
or, the deleting module includes:
a monitoring unit, configured to monitor the length of time for which the data block copy has not been accessed; and
a second deleting unit, configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the time to live of the data block copy.
Further, the data node further includes:
a second sending module, configured to send a second modification instruction to the name node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
In a fourth aspect, a name node is provided, the name node comprising:
a first receiving module, configured to receive a first modification instruction sent by a first data node, where the first modification instruction instructs the name node to increase the replication factor of a target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on a second data node and generates a data block copy of the target data block on the first data node; and
a first modifying module, configured to modify the replication factor of the target data block according to the first modification instruction.
Specifically, the name node further includes:
a second receiving module, configured to receive a second modification instruction sent by the first data node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one; and
a second modifying module, configured to modify the replication factor of the target data block according to the second modification instruction.
In a fifth aspect, a data node is provided, the data node comprising:
a processor, a memory, a bus, and a communication interface, where the memory is configured to store computer-executable instructions, the processor is connected to the memory through the bus, and when the computer runs, the processor executes the computer-executable instructions stored in the memory so that the computer performs the method described above.
In a sixth aspect, a name node is provided, the name node comprising:
a processor, a memory, a bus, and a communication interface, where the memory is configured to store computer-executable instructions, the processor is connected to the memory through the bus, and when the computer runs, the processor executes the computer-executable instructions stored in the memory so that the computer performs the method described above.
In a seventh aspect, a system for dynamically redistributing data is provided, the system comprising the data node described above and the name node described above.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects:
The first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in HDFS that store the target data block. When tasks that need to access the target data block are assigned again, the number of local tasks in HDFS (i.e., tasks assigned to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, HDFS also runs faster. In addition, the newly added data node storing the target data block takes over part of the task load of the data nodes that originally stored it, achieving load balancing in HDFS.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the network architecture of an HDFS provided by the present invention;
FIG. 2 is a flowchart of a method for dynamically redistributing data according to Embodiment 1 of the present invention;
FIG. 3 is a flowchart of a method for dynamically redistributing data according to Embodiment 2 of the present invention;
FIG. 4 is an information interaction diagram of dynamic data redistribution according to Embodiment 3 of the present invention;
FIG. 5 is a schematic structural diagram of a data node according to Embodiment 4 of the present invention;
FIG. 6 is a schematic structural diagram of a name node according to Embodiment 5 of the present invention;
FIG. 7 is a schematic structural diagram of a data node according to Embodiment 6 of the present invention;
FIG. 8 is a schematic structural diagram of a name node according to Embodiment 7 of the present invention;
FIG. 9 is a schematic structural diagram of a system for dynamically redistributing data according to Embodiment 8 of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The structure of HDFS is first briefly introduced with reference to FIG. 1. As shown in FIG. 1, HDFS generally includes one name node 11 and multiple data nodes 12. The name node 11 manages all the data nodes 12 in HDFS, the data nodes 12 store the data block information of files, and the name node 11 and the data nodes 12 communicate with each other over a network.
实施例一Embodiment 1
本发明实施例提供了一种数据动态重分布的方法,参见图2,该方法可以由第一数据节点来执行,该方法包括:An embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 2, the method may be performed by a first data node, where the method includes:
步骤S11,第一数据节点接收数据块读取命令,该数据块读取命令用于指示第一数据节点读取位于第二数据节点上的目标数据块,第二数据节点和第一数据节点为同一HDFS中的不同数据节点。Step S11: The first data node receives a data block read command, where the data block read command is used to instruct the first data node to read the target data block located on the second data node, where the second data node and the first data node are Different data nodes in the same HDFS.
步骤S12,根据数据块读取命令访问目标数据块。Step S12, accessing the target data block according to the data block read command.
在本实施例中,由于第一数据节点和第二数据节点为同一HDFS中的不同数据节点,第一数据节点在根据数据块读取命令访问目标数据块时,需要通过网络向第二数据节点读取。In this embodiment, since the first data node and the second data node are different data nodes in the same HDFS, the first data node needs to access the second data node through the network when accessing the target data block according to the data block read command. Read.
在实际应用中,由于HDSF系统的网络资源是有限的,且通过网络访问目标文件既会消耗有限的网络资源,又会减慢HDSF系统的运行速度,故在HDSF系统中会尽量避免任务通过网络来访问目标数据块。In practical applications, because the network resources of the HDSF system are limited, and accessing the target file through the network consumes limited network resources and slows down the operation speed of the HDSF system, the task is prevented from passing through the network in the HDSF system. To access the target data block.
步骤S13,在第一数据节点上生成目标数据块的数据块副本。Step S13, generating a data block copy of the target data block on the first data node.
需要说明的是,在HDFS中,如果被分配任务的数据节点中没有存储该任务的目标文件的数据块,则该被分配的任务被称为非本地任务;如果被分配任务的数据节点中存储有该任务的目标文件的数据块,则该任务被称为本地任务。由于本地任务的读取速度远于非本地任务,且不需要占用HDFS的网络资源, 故为了提高HDFS的读取速度,可以通过提高任务的本地化概率来实现。It should be noted that, in HDFS, if there is no data block of the target file storing the task in the data node to which the task is assigned, the assigned task is referred to as a non-local task; if the data node of the assigned task is stored in the data node If there is a data block for the target file of the task, the task is called a local task. Since the local task reads faster than the non-local task and does not need to occupy the HDFS network resources, Therefore, in order to improve the reading speed of HDFS, it can be realized by increasing the localization probability of the task.
在本实施例中,第一数据节点复制目标数据块并生成目标数据块的数据块副本,增加了HDFS中目标数据块的副本数量。当再次分配读取目标数据块的任务时,被分配到存储有目标数据块的数据节点中任务数量增加,提高了HDFS的任务的本地化概率,同时,更多的任务能在自己的本地数据节点中读取目标数据块,提高了HDFS的运行速度,还能节约HDFS的网络资源。In this embodiment, the first data node copies the target data block and generates a data block copy of the target data block, increasing the number of copies of the target data block in the HDFS. When the task of reading the target data block is allocated again, the number of tasks allocated to the data node storing the target data block increases, which increases the localization probability of the HDFS task, and at the same time, more tasks can be in their own local data. The target data block is read in the node, which improves the running speed of the HDFS and saves the network resources of the HDFS.
此外,在HDSF系统中,总是优先为数据节点分配本地任务(即本地化原则),只有当本地任务分配完之后才会分配非本地任务。故存储有目标数据块的数据节点上会一直被分配读取该目标数据块的任务,该数据节点的负载较大。在本实施例中,第一数据节点复制目标数据块,增加了存储有目标数据块的数据节点数量,同时也会分担一部分存储有目标数据块的数据节点的任务负载,实现了HDFS中的负载均衡。In addition, in the HDSF system, the data node is always assigned a local task (that is, the localization principle), and the non-local task is allocated only after the local task is allocated. Therefore, the data node storing the target data block is always assigned the task of reading the target data block, and the load of the data node is large. In this embodiment, the first data node copies the target data block, increases the number of data nodes storing the target data block, and also shares the task load of a part of the data node storing the target data block, thereby realizing the load in the HDFS. balanced.
步骤S14,向名字节点发送第一修改指令,该第一修改指令用于指示名字节点将目标数据块的复制因子加一。Step S14, sending a first modification instruction to the name node, the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.
在本实施例中,第一数据节点上生成目标数据块的数据块副本后,HDFS中目标数据块的副本数量增多,需要更新名字节点中存储的该目标数据块的复制因子信息。In this embodiment, after the data block copy of the target data block is generated on the first data node, the number of copies of the target data block in the HDFS is increased, and the copy factor information of the target data block stored in the name node needs to be updated.
In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (that is, tasks allocated to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the HDFS also runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
Embodiment 2
This embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 3, the method may be performed by a name node and includes the following steps:
Step S21: receive a first modification instruction sent by a first data node, where the first modification instruction instructs the name node to increase the replication factor of a target data block by one. The first data node sends the first modification instruction after it has read the target data block located on a second data node and generated a data block copy of the target data block on the first data node.
Specifically, after the data node has copied the target data block, the number of copies of the target data block in the HDFS has increased, so the replication factor information for the target data block stored in the name node needs to be updated.
Step S22: modify the replication factor of the target data block according to the first modification instruction.
In practical applications, when the name node receives the periodic heartbeat information sent by a data node (which includes information about the data blocks on that data node), it also checks the data block information recorded on the name node against the heartbeat, corrects it, and updates the recorded data block information in time.
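Steps S21 and S22, together with the heartbeat-based correction, can be illustrated with a minimal sketch. The class `NameNodeMetadata`, its fields, and the shape of the heartbeat (a node name plus a list of block ids) are assumptions made for this example; the text does not specify how the name node stores or reconciles this information.

```python
class NameNodeMetadata:
    """Hypothetical in-memory view of what the name node records per block."""

    def __init__(self):
        self.replication = {}  # block id -> replication factor
        self.locations = {}    # block id -> set of data node names

    def apply_modification(self, block_id, delta, node):
        """Steps S21/S22: apply a modification instruction sent by `node`."""
        self.replication[block_id] = self.replication.get(block_id, 0) + delta
        self.locations.setdefault(block_id, set()).add(node)

    def on_heartbeat(self, node, reported_blocks):
        """Correct the recorded locations using the blocks a node reports."""
        reported = set(reported_blocks)
        # Drop or confirm this node for every block already on record.
        for block_id, holders in self.locations.items():
            if block_id in reported:
                holders.add(node)
            else:
                holders.discard(node)
        # Record blocks the name node did not know about yet.
        for block_id in reported:
            self.locations.setdefault(block_id, set()).add(node)
```

The sketch reconciles only block locations; whether the replication factor itself is also corrected from heartbeats is left open here.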
In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (that is, tasks allocated to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the HDFS also runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
Embodiment 3
This embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 4, the method includes the following steps:
Step S31: the first data node sends a file open request to the name node.
Specifically, the file open request includes the file name of the target file, a file offset, and a file data size.
In this embodiment, an application on the first data node sends the file open request to the name node through a distributed file system client ("DFSclient" for short).
Step S32: the name node sends file feedback information to the first data node according to the file open request.
Specifically, the file feedback information includes the data blocks of the target file and the IP address of the data node on which each data block is located.
In this embodiment, the name node may, according to the file open request, send the first data node the data block list List<located Block>block() corresponding to the target file.
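The exchange in steps S31 and S32 can be sketched as two small record types and a lookup. The names `FileOpenRequest`, `LocatedBlock`, and `locate_blocks`, the field layout, and the rule that only blocks overlapping the requested byte range are returned are assumptions for illustration; they are not the Hadoop API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileOpenRequest:
    file_name: str  # file name of the target file
    offset: int     # file offset in bytes
    length: int     # file data size in bytes


@dataclass(frozen=True)
class LocatedBlock:
    block_id: str
    start: int        # byte offset of the block within the file
    size: int         # block size in bytes
    node_ips: tuple   # IP addresses of the data nodes holding the block


def locate_blocks(request, file_table):
    """Name node side: return the blocks overlapping the requested range."""
    end = request.offset + request.length
    return [
        block
        for block in file_table[request.file_name]
        if block.start < end and block.start + block.size > request.offset
    ]
```

A data node can then compare its own IP address with `node_ips` of each returned block to decide whether a read is local or remote.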
Step S33: when the file feedback information indicates that the target data block is stored in the first data node, the first data node reads the target data block directly; when the file feedback information indicates that the target data block is stored in the second data node, step S34 is performed.
In this embodiment, when the target data block is stored in the first data node, the application on the first data node reads the target data block directly from the first data node through the DFSclient. Specifically, when reading a data block, the DFSclient creates an FSDataInputStream (distributed file system data input stream) object and reads the target data block through that object.
Step S34: the first data node reads the target data block located on the second data node, where the second data node and the first data node are different data nodes in the same HDFS.
In this embodiment, when the target data block is stored in the second data node (that is, not in the first data node), the DFSclient on the first data node sends a file read request to the second data node through the FSDataInputStream object and receives the target data block returned by the second data node.
Step S35: the first data node generates a data block copy of the target data block.
In this embodiment, after the second data node sends the target data block to the first data node over the network, the first data node copies the target data block and thereby generates its data block copy.
Step S36: the first data node sends a first modification instruction to the name node, where the first modification instruction instructs the name node to increase the replication factor of the target data block by one.
In this embodiment, after the data block copy of the target data block is generated on the first data node, the number of copies of the target data block in the HDFS has increased, so the replication factor information for the target data block stored in the name node needs to be updated.
In addition, in this embodiment, the range within which the replication factor may vary can be limited in view of the storage space constraints of the HDFS. For example, the range can be set to 1 to 512. If the default replication factor is 3 when the HDFS is created, the range can be set to 3 to 512.
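The bounded range for the replication factor can be expressed as a simple clamp. The function name `adjust_replication` and the choice to clamp silently at the bounds are assumptions; the text only gives the example ranges (1 to 512, or 3 to 512 when the default factor is 3) and does not say how a request beyond the range is handled.

```python
MAX_REPLICATION = 512  # upper bound taken from the example range in the text


def adjust_replication(current, delta, lower_bound=3):
    """Apply +1 or -1 to a replication factor and keep it within the range.

    lower_bound is 3 when the HDFS default replication factor is 3,
    or 1 for the wider example range.
    """
    return max(lower_bound, min(MAX_REPLICATION, current + delta))
```

With the default lower bound, an increment from 3 gives 4, an increment at 512 stays at 512, and a decrement at 3 stays at 3.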
Step S37: the name node receives the first modification instruction sent by the first data node and modifies the replication factor of the target data block according to the first modification instruction.
Step S38: when the time for which the data block copy has not been accessed exceeds the lifetime of the data block copy, the first data node deletes the data block copy.
In this embodiment, a lifetime is set for each generated data block copy of a target data block. This prevents data block copies that are not accessed for a long time from occupying the storage space of the data node.
In one implementation of this embodiment, the lifetime of a data block copy can be set as follows: an initial lifetime is set for the newly added data block copy, and when the frequency (or number of times) with which the data block copy is accessed within a set duration reaches a set value, the initial lifetime is extended and the result is used as the new lifetime of the data block copy.
In another implementation of this embodiment, the lifetime of a data block copy may also be set statically.
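The dynamically extended lifetime can be sketched as follows. The class `ReplicaLifetime`, the default values for the observation window, access threshold, and extension amount, and the decision to reset the access count after each extension are all assumptions; the text only states that the lifetime is extended once the access count within a set duration reaches a set value.

```python
class ReplicaLifetime:
    """Hypothetical lifetime tracker for one dynamically created block copy."""

    def __init__(self, created_at, initial_ttl, window=60.0, threshold=3,
                 extension=300.0):
        self.ttl = initial_ttl        # current lifetime in seconds
        self.window = window          # the "set duration" for counting accesses
        self.threshold = threshold    # the "set value" of accesses
        self.extension = extension    # seconds added on each extension
        self.last_access = created_at
        self._recent = []             # access timestamps inside the window

    def record_access(self, now):
        self.last_access = now
        self._recent = [t for t in self._recent if now - t <= self.window]
        self._recent.append(now)
        if len(self._recent) >= self.threshold:
            # Frequently accessed copy: extend its lifetime.
            self.ttl += self.extension
            self._recent.clear()

    def expired(self, now):
        # Expired when the idle time exceeds the current lifetime.
        return now - self.last_access > self.ttl
```

The static variant corresponds to never extending `ttl`, for example by setting `threshold` higher than any access count that can occur.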
It should also be noted that a lifetime is set only for these newly added data block copies; no lifetime is set for the copies formed when the file was created. Likewise, only newly added data block copies are deleted here; the copies formed when the file was created are not deleted.
In this embodiment, the time for which a data block copy has not been accessed may be monitored by the first data node or by the name node.
When the first data node monitors the time for which the data block copy has not been accessed, step S38 may include the following steps:
S381: the first data node monitors the time for which the data block copy has not been accessed;
S382: when the time for which the data block copy has not been accessed exceeds the lifetime of the data block copy, the first data node deletes the data block copy.
Specifically, a replica manager (ReplicaManager) may be added to the first data node to monitor the time for which the data block copy has not been accessed.
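Steps S381 and S382 could be realized by a component along the following lines. Only the name ReplicaManager comes from the text; its methods (`track`, `touch`, `sweep`), the single shared lifetime, and the dictionary used as block storage are hypothetical choices for this sketch.

```python
class ReplicaManager:
    """Sketch of a data-node-side monitor for dynamically added copies only.

    Copies formed when the file was created are never registered here,
    so they are never deleted by sweep().
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_access = {}  # block id -> time of last access

    def track(self, block_id, now):
        # Register a newly generated data block copy.
        self.last_access[block_id] = now

    def touch(self, block_id, now):
        # Record an access; untracked (original) copies are ignored.
        if block_id in self.last_access:
            self.last_access[block_id] = now

    def sweep(self, now, storage):
        """S381/S382: delete copies idle for longer than the lifetime."""
        expired = [b for b, t in self.last_access.items() if now - t > self.ttl]
        for block_id in expired:
            del self.last_access[block_id]
            storage.pop(block_id, None)
        return expired
```

A data node would call `sweep` periodically and, for each returned block id, go on to notify the name node that the copy is gone.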
When the name node monitors the time for which the data block copy has not been accessed, step S38 may include the following steps:
S383: the name node monitors the time for which the data block copy has not been accessed;
S384: when the time for which the data block copy has not been accessed exceeds the lifetime of the data block copy, the name node sends a data block copy deletion instruction to the first data node, where the deletion instruction includes the identifier of the data block copy to be deleted;
S385: the first data node receives the data block copy deletion instruction sent by the name node;
S386: the first data node deletes the data block copy according to the data block copy deletion instruction.
Specifically, the name node can monitor the time for which the data block copy has not been accessed by using the information about the data block copy that is contained in the heartbeat information the first data node periodically sends to the name node.
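The name-node-driven variant (S383 to S386) can be sketched with plain dictionaries standing in for messages. The heartbeat layout (a node name plus the last access time of each dynamic copy), the instruction format, and the function names are assumptions; the text states only that the heartbeat carries information about the copy and that the deletion instruction carries its identifier.

```python
def expired_replicas(heartbeat, ttl_by_block, now):
    """Name node side (S383/S384): build deletion instructions.

    heartbeat: {"node": str, "replicas": {block_id: last_access_time}}
    ttl_by_block: {block_id: lifetime in seconds}
    """
    return [
        {"node": heartbeat["node"], "delete_block": block_id}
        for block_id, last_access in heartbeat["replicas"].items()
        if now - last_access > ttl_by_block[block_id]
    ]


def handle_delete(storage, instruction):
    """Data node side (S385/S386): delete the identified copy if present."""
    return storage.pop(instruction["delete_block"], None) is not None
```

Compared with monitoring on the data node, this keeps the lifetime policy in one place at the cost of acting only as often as heartbeats arrive.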
Step S39: the first data node sends a second modification instruction to the name node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
In practical applications, after a copy of a data block is deleted, the corresponding replication factor in the HDFS needs to be modified.
Step S40: the name node receives the second modification instruction sent by the first data node and modifies the replication factor of the target data block according to the second modification instruction.
It should be noted that, in the HDFS, if the data node to which a task is allocated does not store the data blocks of the task's target file, the task is called a non-local task; if the data node to which a task is allocated does store the data blocks of the task's target file, the task is called a local task. To improve the data reading efficiency of the HDFS, it is often necessary to raise the probability that a task is local.
In this embodiment, the fact that the first data node reads the target data block located on the second data node indicates that the HDFS has allocated a non-local task to the first data node. In an existing HDFS, after the first data node reads the target data block on the second data node, the target data block is not copied to the first data node. If the first data node is allocated the same task the next time, it still has to read the target data block from the second data node.
In this embodiment, by contrast, the first data node generates a data block copy of the target data block, which increases the number of copies of the target data block in the HDFS. When tasks that read the target data block are allocated again, more of them are allocated to data nodes that store the target data block, which raises the localization probability of HDFS tasks. At the same time, more tasks can read the target data block from their own local data node, which speeds up the HDFS and saves HDFS network resources.
It should further be noted that the HDFS always gives priority to allocating local tasks to data nodes (the localization principle) and allocates non-local tasks only after the local tasks have been allocated. A data node that stores the target data block is therefore continually allocated tasks that read that block, and its load is heavy. In this embodiment, the first data node copies the target data block, which increases the number of data nodes that store it and takes over part of the task load of the data nodes that already store it, achieving load balancing in the HDFS.
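The effect of an extra copy on task locality and load can be illustrated with a toy scheduler. This is a deliberately simplified model and not the actual HDFS or MapReduce scheduling algorithm: the function `assign_tasks`, the fixed number of task slots per node, and the greedy choice of the node with the most free slots are all assumptions.

```python
def assign_tasks(num_tasks, holders, nodes, slots_per_node):
    """Greedy locality-first assignment of tasks that read one block.

    holders: nodes storing the target data block.
    Returns (local_count, non_local_count, tasks_per_node).
    """
    free = {n: slots_per_node for n in nodes}
    load = {n: 0 for n in nodes}
    local = non_local = 0
    for _ in range(num_tasks):
        candidates = [n for n in holders if free[n] > 0]
        if candidates:
            # Localization principle: prefer a node that stores the block.
            node = max(candidates, key=lambda n: free[n])
            local += 1
        else:
            others = [n for n in nodes if free[n] > 0]
            if not others:
                break  # no free slot anywhere
            node = max(others, key=lambda n: free[n])
            non_local += 1
        free[node] -= 1
        load[node] += 1
    return local, non_local, load
```

With four nodes of two slots each and six tasks, one holder yields two local and four non-local tasks in this model, while two holders yield four local and two non-local tasks, with the local load split between the two holders.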
In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (that is, tasks allocated to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the HDFS also runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
Embodiment 4
This embodiment of the present invention provides a data node. Referring to FIG. 5, the data node includes a receiving module 501, an access module 502, a generating module 503, and a first sending module 504.
The receiving module 501 is configured to receive a data block read command, where the data block read command instructs the data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same HDFS.
In this embodiment, the data node indicated by the data block read command may be the first data node of Embodiments 1 to 3 above.
The access module 502 is configured to access the target data block according to the data block read command.
In this embodiment, because the data node (that is, the first data node) and the second data node are different data nodes in the same HDFS, the data node has to read the target data block from the second data node over the network when it accesses the block according to the data block read command.
In practical applications, the network resources of the HDFS are limited, and accessing a target file over the network both consumes those limited resources and slows down the HDFS. The HDFS therefore tries to avoid tasks that access the target data block over the network.
The generating module 503 is configured to generate a data block copy of the target data block on the data node.
The first sending module 504 is configured to send a first modification instruction to the name node, where the first modification instruction instructs the name node to increase the replication factor of the target data block by one.
In this embodiment, after the data block copy of the target data block is generated on the data node, the number of copies of the target data block in the HDFS has increased, so the replication factor information for the target data block stored in the name node needs to be updated.
Specifically, the data node further includes a deletion module 505.
The deletion module 505 is configured to delete the data block copy from the data node when the time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
In this embodiment, a lifetime is set for each generated data block copy of a target data block. This prevents data block copies that are not accessed for a long time from occupying the storage space of the data node.
In one implementation of this embodiment, the lifetime of a data block copy can be set as follows: an initial lifetime is set for the newly added data block copy, and when the frequency (or number of times) with which the data block copy is accessed within a set duration reaches a set value, the initial lifetime is extended and the result is used as the new lifetime of the data block copy.
In another implementation of this embodiment, the lifetime of a data block copy may also be set statically.
In this embodiment, the time for which a data block copy has not been accessed may be monitored by the data node (that is, the first data node) or by the name node.
Further, when the name node monitors the time for which the data block copy has not been accessed, the deletion module 505 includes a receiving unit 515 and a first deleting unit 525.
The receiving unit 515 is configured to receive a data block copy deletion instruction sent by the name node, where the deletion instruction includes the identifier of the data block copy.
The first deleting unit 525 is configured to delete the data block copy from the data node according to the data block copy deletion instruction.
When the data node monitors the time for which the data block copy has not been accessed, the deletion module 505 includes a monitoring unit 535 and a second deleting unit 545.
The monitoring unit 535 is configured to monitor the time for which the data block copy has not been accessed.
The second deleting unit 545 is configured to delete the data block copy from the data node when the time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
Further, the data node also includes a second sending module 506.
The second sending module 506 is configured to send a second modification instruction to the name node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
It should be noted that, in the HDFS, if the data node to which a task is allocated does not store the data blocks of the task's target file, the task is called a non-local task; if the data node to which a task is allocated does store the data blocks of the task's target file, the task is called a local task. Because a local task reads data much faster than a non-local task and does not occupy HDFS network resources, the reading speed of the HDFS can be improved by raising the localization probability of tasks.
In this embodiment, the data node (that is, the first data node) copies the target data block and generates a data block copy of it, which increases the number of copies of the target data block in the HDFS. When tasks that read the target data block are allocated again, more of them are allocated to data nodes that store the target data block, which raises the localization probability of HDFS tasks. At the same time, more tasks can read the target data block from their own local data node, which speeds up the HDFS and saves HDFS network resources.
In addition, the HDFS always gives priority to allocating local tasks to data nodes (the localization principle) and allocates non-local tasks only after the local tasks have been allocated. A data node that stores the target data block is therefore continually allocated tasks that read that block, and its load is heavy. In this embodiment, the data node (that is, the first data node) copies the target data block, which increases the number of data nodes that store it and takes over part of the task load of the data nodes that already store it, achieving load balancing in the HDFS.
In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (that is, tasks allocated to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the HDFS also runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
Embodiment 5
This embodiment of the present invention provides a name node. Referring to FIG. 6, the name node includes a first receiving module 601 and a first modifying module 602.
The first receiving module 601 is configured to receive a first modification instruction sent by a data node, where the first modification instruction instructs the name node to increase by one the replication factor of the data block of the target file that has been copied to the data node.
Specifically, after the data node has copied the target data block, the number of copies of the target data block in the HDFS has increased, so the replication factor information for the target data block stored in the name node needs to be updated.
The first modifying module 602 is configured to modify the replication factor of the target data block according to the first modification instruction.
In practical applications, when the name node receives the periodic heartbeat information sent by a data node (which includes information about the data blocks on that data node), it also checks the data block information recorded on the name node against the heartbeat, corrects it, and updates the recorded data block information in time.
Further, the name node also includes a second receiving module 603 and a second modifying module 604.
The second receiving module 603 is configured to receive a second modification instruction sent by the first data node, where the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
The second modifying module 604 is configured to modify the replication factor of the target data block according to the second modification instruction.
In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (that is, tasks allocated to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the HDFS also runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
Embodiment 6
This embodiment of the present invention provides a data node. Referring to FIG. 7, the data node includes:
a processor 701, a memory 702, a bus 703, and a communication interface 704. The memory 702 is configured to store computer-executable instructions, and the processor 701 is connected to the memory 702 through the bus 703. When the computer runs, the processor 701 executes the computer-executable instructions stored in the memory, so that the computer performs the method described in Embodiment 1 or Embodiment 3.
In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to that command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (that is, tasks allocated to data nodes that store the target data block) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the HDFS also runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
Embodiment 7
An embodiment of the present invention provides a name node. Referring to FIG. 8, the name node includes:
a processor 801, a memory 802, a bus 803, and a communication interface 804. The memory 802 is configured to store computer-executable instructions, and the processor 801 is connected to the memory 802 through the bus 803. When the computer runs, the processor 801 executes the computer-executable instructions stored in the memory 802, so that the computer performs the method described in Embodiment 2 or Embodiment 3.
In this embodiment of the present invention, a first data node receives a data block read command, accesses the target data block (stored on a second data node) according to the command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are subsequently assigned, more of them are local tasks (that is, tasks assigned to a data node that stores the target data block), which raises the probability that an HDFS task is local. Because local tasks execute quickly and consume few HDFS network resources, the HDFS runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
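On the name-node side, handling the modification instructions amounts to adjusting a per-block counter. The sketch below is illustrative only: `Modification`, `NameNodeMetadata`, and `apply` are hypothetical names, the default factor of three follows the usual HDFS default, and the lower bound on the factor is an added assumption that the text does not state.

```python
from enum import Enum


class Modification(Enum):
    FIRST = 1    # increase the replication factor by one
    SECOND = -1  # decrease the replication factor by one


class NameNodeMetadata:
    """Hypothetical in-memory table of per-block replication factors."""

    def __init__(self, default_factor=3):
        self.default_factor = default_factor
        self.factors = {}  # block_id -> current replication factor

    def apply(self, block_id, instruction):
        """Apply one modification instruction and return the new factor."""
        current = self.factors.get(block_id, self.default_factor)
        updated = current + instruction.value
        # Assumption (not in the source): never drop below the default, so
        # removing dynamic copies cannot erode the baseline redundancy.
        updated = max(updated, self.default_factor)
        self.factors[block_id] = updated
        return updated


meta = NameNodeMetadata()
print(meta.apply("blk_1", Modification.FIRST))   # 4
print(meta.apply("blk_1", Modification.SECOND))  # 3
```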
Embodiment 8
An embodiment of the present invention provides a system for dynamically redistributing data. Referring to FIG. 9, the system includes the data node 50 described in Embodiment 4 and the name node 60 described in Embodiment 5.
In this embodiment of the present invention, a first data node receives a data block read command, accesses the target data block (stored on a second data node) according to the command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, instructing the name node to increase the replication factor of the target data block by one. This increases the number of data nodes in the HDFS that store the target data block. When tasks that need to access the target data block are subsequently assigned, more of them are local tasks (that is, tasks assigned to a data node that stores the target data block), which raises the probability that an HDFS task is local. Because local tasks execute quickly and consume few HDFS network resources, the HDFS runs faster. In addition, the newly added data node that stores the target data block takes over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
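The effect on the localization probability can be illustrated with simple arithmetic. The sketch assumes, purely for illustration, that a task is placed on a data node chosen uniformly at random; real schedulers usually prefer nodes that hold the data, so the figures are not a prediction of actual cluster behaviour. The function name and the cluster size of 20 nodes are hypothetical.

```python
from fractions import Fraction


def local_task_probability(replicas, data_nodes):
    """Chance that a task placed on a uniformly random node finds the block locally."""
    if data_nodes <= 0 or not 0 <= replicas <= data_nodes:
        raise ValueError("need data_nodes > 0 and 0 <= replicas <= data_nodes")
    return Fraction(replicas, data_nodes)


before = local_task_probability(3, 20)  # default replication factor of three
after = local_task_probability(4, 20)   # one extra dynamically created copy
print(before, after)                    # 3/20 1/5
```

Under this assumption, each additional copy raises the probability by one over the number of data nodes.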
The serial numbers of the foregoing embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It should be noted that, when the data node provided in the foregoing embodiments implements the method for dynamically redistributing data, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data node provided in the foregoing embodiments and the embodiments of the method for dynamically redistributing data belong to the same concept. For the specific implementation process, refer to the method embodiments; details are not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (15)

  1. A method for dynamically redistributing data, characterized in that the method comprises:
    receiving, by a first data node, a data block read command, wherein the data block read command instructs the first data node to read a target data block located on a second data node, and the second data node and the first data node are different data nodes in the same Hadoop Distributed File System (HDFS);
    accessing the target data block according to the data block read command;
    generating a data block copy of the target data block on the first data node; and
    sending a first modification instruction to a name node, wherein the first modification instruction instructs the name node to increase the replication factor of the target data block by one.
  2. The method according to claim 1, characterized in that the method further comprises:
    deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
  3. The method according to claim 2, characterized in that the deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy comprises:
    receiving a data block copy deletion instruction sent by the name node, wherein the data block copy deletion instruction includes an identifier of the data block copy; and
    deleting the data block copy from the first data node according to the data block copy deletion instruction;
    or,
    the deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy comprises:
    monitoring the length of time for which the data block copy has not been accessed; and
    deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
  4. The method according to claim 2 or 3, characterized in that the method further comprises:
    sending a second modification instruction to the name node, wherein the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
  5. A method for dynamically redistributing data, characterized in that the method comprises:
    receiving a first modification instruction sent by a first data node, wherein the first modification instruction instructs a name node to increase the replication factor of a target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on a second data node and generates a data block copy of the target data block on the first data node; and
    modifying the replication factor of the target data block according to the first modification instruction.
  6. The method according to claim 5, characterized in that the method further comprises:
    receiving a second modification instruction sent by the first data node, wherein the second modification instruction instructs the name node to decrease the replication factor of the target data block by one; and
    modifying the replication factor of the target data block according to the second modification instruction.
  7. A data node, characterized in that the data node comprises:
    a receiving module, configured to receive a data block read command, wherein the data block read command instructs the data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same HDFS;
    an access module, configured to access the target data block according to the data block read command;
    a generating module, configured to generate a data block copy of the target data block on the data node; and
    a first sending module, configured to send a first modification instruction to a name node, wherein the first modification instruction instructs the name node to increase the replication factor of the target data block by one.
  8. The data node according to claim 7, characterized in that the data node further comprises:
    a deleting module, configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
  9. The data node according to claim 8, characterized in that the deleting module comprises:
    a receiving unit, configured to receive a data block copy deletion instruction sent by the name node, wherein the data block copy deletion instruction includes an identifier of the data block copy; and
    a first deleting unit, configured to delete the data block copy from the data node according to the data block copy deletion instruction;
    or, the deleting module comprises:
    a monitoring unit, configured to monitor the length of time for which the data block copy has not been accessed; and
    a second deleting unit, configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.
  10. The data node according to claim 8 or 9, characterized in that the data node further comprises:
    a second sending module, configured to send a second modification instruction to the name node, wherein the second modification instruction instructs the name node to decrease the replication factor of the target data block by one.
  11. A name node, characterized in that the name node comprises:
    a first receiving module, configured to receive a first modification instruction sent by a first data node, wherein the first modification instruction instructs the name node to increase the replication factor of a target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on a second data node and generates a data block copy of the target data block on the first data node; and
    a first modifying module, configured to modify the replication factor of the target data block according to the first modification instruction.
  12. The name node according to claim 11, characterized in that the name node further comprises:
    a second receiving module, configured to receive a second modification instruction sent by the first data node, wherein the second modification instruction instructs the name node to decrease the replication factor of the target data block by one; and
    a second modifying module, configured to modify the replication factor of the target data block according to the second modification instruction.
  13. A data node, characterized in that the data node comprises:
    a processor, a memory, a bus, and a communication interface, wherein the memory is configured to store computer-executable instructions, the processor is connected to the memory through the bus, and when the computer runs, the processor executes the computer-executable instructions stored in the memory, so that the computer performs the method according to any one of claims 1 to 4.
  14. A name node, characterized in that the name node comprises:
    a processor, a memory, a bus, and a communication interface, wherein the memory is configured to store computer-executable instructions, the processor is connected to the memory through the bus, and when the computer runs, the processor executes the computer-executable instructions stored in the memory, so that the computer performs the method according to claim 5 or 6.
  15. A system for dynamically redistributing data, characterized in that the system comprises the data node according to any one of claims 7 to 10 and the name node according to claim 11 or 12.
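
The replica lifetime rule recited in claims 2 to 4 (the variant in which the data node itself monitors the idle time) might be sketched in Python as follows. This is only an illustration under stated assumptions: `ReplicaCache`, `touch`, `evict_expired`, and the `send_second_modification` callback are hypothetical names, the clock is injectable purely so the sketch can be exercised without waiting, and actual block deletion on disk is not shown.

```python
import time


class ReplicaCache:
    """Hypothetical tracker for dynamically created block copies on one data node."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # lifetime of a data block copy
        self.clock = clock              # injectable time source
        self.last_access = {}           # block_id -> time of last access

    def touch(self, block_id):
        """Record the creation of, or an access to, a local copy."""
        self.last_access[block_id] = self.clock()

    def evict_expired(self, send_second_modification):
        """Delete copies idle longer than the lifetime and notify the name node."""
        now = self.clock()
        expired = [b for b, t in self.last_access.items()
                   if now - t > self.ttl_seconds]
        for block_id in expired:
            del self.last_access[block_id]      # delete the local copy
            send_second_modification(block_id)  # request "factor - 1"
        return expired
```

A periodic background task on the data node could call `evict_expired`; how often it runs and how the lifetime is chosen are left open here.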
PCT/CN2015/097172 2014-12-18 2015-12-11 Data dynamic re-distribution method, data node, name node and system WO2016095760A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410790066.5 2014-12-18
CN201410790066.5A CN105760391B (en) 2014-12-18 2014-12-18 Method, data node, name node and system for dynamically redistributing data

Publications (1)

Publication Number Publication Date
WO2016095760A1 true WO2016095760A1 (en) 2016-06-23

Family

ID=56125919

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/097172 WO2016095760A1 (en) 2014-12-18 2015-12-11 Data dynamic re-distribution method, data node, name node and system

Country Status (2)

Country Link
CN (1) CN105760391B (en)
WO (1) WO2016095760A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319618B (en) * 2017-01-17 2022-05-06 阿里巴巴集团控股有限公司 Data distribution control method, system and device of distributed storage system
CN106657411A (en) * 2017-02-28 2017-05-10 北京华云网际科技有限公司 Method and device for accessing volume in distributed system
CN110545450B (en) * 2019-09-09 2021-12-03 深圳市网心科技有限公司 Node distribution method, system, electronic equipment and storage medium
CN110825704B (en) * 2019-09-27 2023-09-01 华为云计算技术有限公司 Data reading method, data writing method and server
CN111290710B (en) * 2020-01-20 2024-04-05 北京信息科技大学 Cloud copy storage method and system based on dynamic adjustment of replication factors

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101187931A (en) * 2007-12-12 2008-05-28 浙江大学 Distribution type file system multi-file copy management method
CN101470733A (en) * 2007-12-27 2009-07-01 中国移动通信集团公司 Data block copy amount regulation method and distributed file system
CN101645920A (en) * 2009-04-07 2010-02-10 中国科学院声学研究所 Duplicate rating attenuation method based on time parameter
CN102546782A (en) * 2011-12-28 2012-07-04 北京奇虎科技有限公司 Distribution system and data operation method thereof
CN103207867A (en) * 2012-01-16 2013-07-17 联想(北京)有限公司 Method for processing data blocks, method for initiating recovery operation and nodes
CN103631894A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Dynamic copy management method based on HDFS
US20140122429A1 (en) * 2012-10-31 2014-05-01 International Business Machines Corporation Data processing method and apparatus for distributed systems

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080049254A1 (en) * 2006-08-24 2008-02-28 Thomas Phan Method and means for co-scheduling job assignments and data replication in wide-area distributed systems
CN102629941B (en) * 2012-03-20 2014-12-31 武汉邮电科学研究院 Caching method of a virtual machine mirror image in cloud computing system
CN103744799B (en) * 2013-12-26 2017-07-21 华为技术有限公司 A kind of internal storage data access method, device and system

Also Published As

Publication number Publication date
CN105760391A (en) 2016-07-13
CN105760391B (en) 2019-12-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15869267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15869267

Country of ref document: EP

Kind code of ref document: A1