WO2016095760A1 - Data dynamic re-distribution method, data node, name node and system

Info

Publication number: WO2016095760A1
Application number: PCT/CN2015/097172
Authority: WO
Inventors: 李嘉; 刘杰; 党李飞
Original assignee: 华为技术有限公司
Priority date: 2014-12-18
Filing date: 2015-12-11
Publication date: 2016-06-23
Also published as: CN105760391A; CN105760391B

Abstract

The present invention belongs to the technical field of Internet big data processing and discloses a method for dynamically redistributing data, a data node, a name node and a system. The method comprises: a first data node receives a data block read command (S11); accesses a target data block according to the data block read command (S12); generates a data block copy of the target data block on the first data node (S13); and sends a first modification instruction to the name node (S14). Generating the data block copy on the first data node increases the number of data nodes in the HDFS that store the target data block. When tasks that access the target data block are subsequently assigned, more of them can run as local tasks, which reduces the consumption of HDFS network resources, improves the running speed of the HDFS, and shares the task load of the data nodes that originally stored the target data block, thereby balancing the load.

Description

Data dynamic redistribution method, data node, name node and system

Technical field

The invention relates to the field of Internet big data processing technology, in particular to a method, a data node, a name node and a system for dynamically redistributing data.

Background technique

The Hadoop Distributed File System (HDFS) is an excellent distributed file system that can be used for massive data storage. Currently, HDFS has been widely used in various large online services and large storage systems.

HDFS splits files into blocks, stores the blocks in a distributed manner, and improves system reliability through a block redundancy strategy. Each data block has multiple copies in the system at the same time. These copies are distributed across multiple nodes in multiple racks, which prevents the loss of data blocks due to the failure of a single node. To implement this redundancy strategy, HDFS must ensure that multiple copies are written simultaneously when data is written. The number of copies written is called the data block replication factor, which is usually three by default.

HDFS has a master-slave structure, generally composed of one Name Node (NN) and a plurality of Data Nodes (DNs). The NN, also called the master node, is responsible for managing the HDFS namespace and block mapping information, configuring replication policies, and handling client requests. A DN, also called a slave node, stores the actual data, performs read and write operations on data blocks, and periodically reports information about its stored data blocks to the NN. A client can access or manage HDFS through the command line, interact with the Name Node to obtain file location information, and interact with the Data Nodes to read and write data.

In HDFS, a task that accesses a data block is usually assigned preferentially to a DN on which the target data block is stored (for a DN storing the target data block, such a task is referred to as a local task), so that the task can read the target data block directly from that data node. When the number of tasks running on the data nodes that store the target data block reaches the maximum, the task is assigned to a DN that does not store the target data block (for such a DN, the task is referred to as a non-local task). The DN assigned a non-local task must read the target data block over the network from a DN on which the target data block is stored.
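The local-first assignment rule described above can be illustrated with a minimal sketch. The class `DataNode`, the function `assign_task`, and the per-node slot limit `max_tasks` are hypothetical names chosen for illustration; they are not part of HDFS or of this application.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical model of a data node with a fixed number of task slots."""
    name: str
    max_tasks: int
    blocks: set = field(default_factory=set)
    running: int = 0


def assign_task(block_id, nodes):
    """Return (node, is_local) for a task that reads block_id."""
    # Prefer a node that stores the block and still has a free slot (local task).
    for node in nodes:
        if block_id in node.blocks and node.running < node.max_tasks:
            node.running += 1
            return node, True
    # Otherwise fall back to any node with a free slot (non-local task).
    for node in nodes:
        if node.running < node.max_tasks:
            node.running += 1
            return node, False
    raise RuntimeError("no free task slot in the cluster")


nodes = [DataNode("dn1", 1, {"blk_1"}), DataNode("dn2", 1)]
first = assign_task("blk_1", nodes)   # dn1 stores blk_1 and has a free slot
second = assign_task("blk_1", nodes)  # dn1 is full, so the task goes to dn2
```

In this sketch the second task becomes a non-local task on `dn2`, which would have to fetch `blk_1` from `dn1` over the network.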

In the process of implementing the present invention, the inventors have found that the prior art has at least the following problems:

Because users access data in an unbalanced and unpredictable way, some data blocks are accessed so frequently during certain periods that they become hotspot data blocks. Because of the replication factor limit (the number of copies in the system does not exceed the replication factor), when multiple tasks need to access a hotspot data block at the same time, some of those tasks must access it over the network, which slows the tasks down and consumes network resources. At the same time, the data nodes storing the hotspot data block continually receive tasks that access it, so the task load in the HDFS becomes unbalanced.
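As a rough illustration of this limit, the sketch below counts how many concurrent tasks are forced to read a hotspot data block over the network. It assumes, purely for illustration, that every data node holding a copy offers the same number of task slots; the function name and parameters are hypothetical.

```python
def non_local_tasks(concurrent_tasks, replication_factor, slots_per_node):
    """Number of tasks that cannot run on a node holding the block."""
    # Only nodes holding a copy can run local tasks for this block.
    local_capacity = replication_factor * slots_per_node
    return max(0, concurrent_tasks - local_capacity)


# Ten concurrent tasks, three copies, two slots per node.
forced_remote = non_local_tasks(10, 3, 2)   # 4 tasks must read over the network
# With one extra copy (replication factor 4), fewer tasks are non-local.
after_extra_copy = non_local_tasks(10, 4, 2)  # 2 tasks
```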

Summary of the invention

In order to solve the problems of slow operation, wasted network resources and unbalanced load caused by excessive access to hotspot data blocks, embodiments of the present invention provide a method for dynamically redistributing data, a data node, a name node and a system. The technical solution is as follows:

In a first aspect, a method for dynamically redistributing data is provided, the method comprising:

The first data node receives a data block read command, where the data block read command is used to instruct the first data node to read a target data block located on a second data node, and the second data node and the first data node are different data nodes in the same HDFS;

Accessing the target data block according to the data block read command;

Generating a data block copy of the target data block on the first data node;

Sending a first modification instruction to the name node, the first modification instruction is used to instruct the name node to increase a replication factor of the target data block by one.

Specifically, the method further includes:

Deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy.

Further, deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy includes:

Receiving a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy;

Deleting the data block copy from the first data node according to the data block copy deletion instruction;

or,

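Deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy includes:

Monitoring the length of time that the data block copy is not accessed;

Deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy.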

Further, the method further includes:

Sending a second modification instruction to the name node, the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.

In a second aspect, a method for dynamically redistributing data is provided, the method comprising:

Receiving a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node;

Modifying a replication factor of the target data block according to the first modification instruction.

Specifically, the method further includes:

Receiving a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one;

Modifying the replication factor of the target data block according to the second modification instruction.

In a third aspect, a data node is provided, the data node comprising:

a receiving module, configured to receive a data block read command, where the data block read command is used to instruct the data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same HDFS;

an access module, configured to access the target data block according to the data block read command;

a generating module, configured to generate a data block copy of the target data block on the data node;

a first sending module, configured to send a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.

Specifically, the data node further includes:

And a deleting module, configured to delete the data block copy from the data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.

Further, the deleting module includes:

a receiving unit, configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy;

a first deleting unit, configured to delete the data block copy from the data node according to the data block copy deletion instruction;

Alternatively, the deleting module includes:

a monitoring unit, configured to monitor an unvisited duration of the data block copy;

And a second deleting unit, configured to delete the data block copy from the data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.

Further, the data node further includes:

And a second sending module, configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.

In a fourth aspect, a name node is provided, the name node comprising:

a first receiving module, configured to receive a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node;

And a first modifying module, configured to modify a replication factor of the target data block according to the first modification instruction.

Specifically, the name node further includes:

a second receiving module, configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement a replication factor of the target data block by one;

And a second modifying module, configured to modify a replication factor of the target data block according to the second modification instruction.

In a fifth aspect, a data node is provided, the data node comprising:

a processor, a memory, a bus, and a communication interface, where the memory is configured to store computer-executable instructions, the processor is coupled to the memory via the bus, and when the computer is running, the processor executes the computer-executable instructions stored in the memory to cause the computer to perform the method as previously described.

In a sixth aspect, a name node is provided, the name node comprising:

In a seventh aspect, a system for dynamic data redistribution is provided, the system comprising a data node as hereinbefore described, and a name node as hereinbefore described.

The beneficial effects brought by the technical solutions provided by the embodiments of the present invention are:

The first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends to the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one. This increases the number of data nodes storing the target data block in the HDFS. When tasks that need to access the target data block are assigned again, the number of local tasks in the HDFS (that is, tasks assigned to a data node on which the target data block is stored) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the running speed of the HDFS also improves. In addition, the newly added data node storing the target data block shares part of the task load of the data nodes that originally stored the target data block, which achieves load balancing in the HDFS.

DRAWINGS

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly described below. Evidently, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a network architecture of an HDFS provided by the present invention;

FIG. 2 is a flowchart of a method for dynamically redistributing data according to Embodiment 1 of the present invention;

FIG. 3 is a flowchart of a method for dynamically redistributing data according to Embodiment 2 of the present invention;

FIG. 4 is an information interaction diagram of data dynamic redistribution according to Embodiment 3 of the present invention;

FIG. 5 is a schematic structural diagram of a data node according to Embodiment 4 of the present invention;

FIG. 6 is a schematic structural diagram of a name node according to Embodiment 5 of the present invention;

FIG. 7 is a schematic structural diagram of a data node according to Embodiment 6 of the present invention;

FIG. 8 is a schematic structural diagram of a name node according to Embodiment 7 of the present invention;

FIG. 9 is a schematic structural diagram of a system for dynamically redistributing data according to Embodiment 8 of the present invention.

Detailed description

The embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

The structure of the HDFS is briefly introduced below with reference to FIG. 1. As shown in FIG. 1, the HDFS usually includes a name node 11 for managing all data nodes 12 in the HDFS, and data nodes 12 for storing the data block information of files. The name node 11 and the data nodes 12 in the HDFS communicate with each other through a network.

Embodiment 1

An embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 2, the method may be performed by a first data node, where the method includes:

Step S11: The first data node receives a data block read command, where the data block read command is used to instruct the first data node to read the target data block located on the second data node, and the second data node and the first data node are different data nodes in the same HDFS.

Step S12, accessing the target data block according to the data block read command.

In this embodiment, since the first data node and the second data node are different data nodes in the same HDFS, the first data node needs to read the target data block from the second data node over the network when accessing the target data block according to the data block read command.

In practical applications, the network resources of the HDFS are limited, and accessing the target file over the network both consumes those limited resources and slows down the HDFS. Tasks in the HDFS should therefore avoid accessing the target data block over the network wherever possible.

Step S13, generating a data block copy of the target data block on the first data node.

It should be noted that, in HDFS, if the data node to which a task is assigned does not store the data block of the task's target file, the assigned task is referred to as a non-local task; if the data node to which the task is assigned does store the data block of the task's target file, the task is referred to as a local task. Because a local task reads faster than a non-local task and does not occupy HDFS network resources, the reading speed of HDFS can be improved by increasing the localization probability of tasks.

In this embodiment, the first data node copies the target data block and generates a data block copy of the target data block, which increases the number of copies of the target data block in the HDFS. When tasks that read the target data block are assigned again, more of them can be assigned to data nodes that store the target data block. This increases the localization probability of HDFS tasks, and more tasks can read the target data block from their own local data node, which improves the running speed of the HDFS and saves HDFS network resources.

In addition, HDFS always assigns local tasks to a data node first (the localization principle) and assigns non-local tasks only after the local tasks have been assigned. A data node storing the target data block is therefore always assigned tasks that read the target data block, and its load is heavy. In this embodiment, the first data node copies the target data block, which increases the number of data nodes storing the target data block and shares part of the task load of the data nodes that already store it, thereby balancing the load in the HDFS.

Step S14, sending a first modification instruction to the name node, the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.

In this embodiment, after the data block copy of the target data block is generated on the first data node, the number of copies of the target data block in the HDFS increases, so the replication factor information of the target data block stored in the name node needs to be updated.

In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends to the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one. This increases the number of data nodes storing the target data block in the HDFS. When tasks that need to access the target data block are assigned again, the number of local tasks in the HDFS (that is, tasks assigned to a data node on which the target data block is stored) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the running speed of the HDFS also improves. In addition, the newly added data node storing the target data block shares part of the task load of the data nodes that originally stored the target data block, which achieves load balancing in the HDFS.
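Steps S11 to S14 can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `SimpleNameNode`, `SimpleDataNode`, `serve_block`, `handle_read` and `increment_replication` are hypothetical names, and a direct method call stands in for the network transfer between data nodes.

```python
class SimpleNameNode:
    """Hypothetical name node that only tracks replication factors."""

    def __init__(self):
        self.replication = {}  # block_id -> replication factor

    def increment_replication(self, block_id):
        # Effect of the first modification instruction: factor + 1.
        self.replication[block_id] = self.replication.get(block_id, 0) + 1


class SimpleDataNode:
    """Hypothetical data node holding blocks in a dictionary."""

    def __init__(self, name, name_node):
        self.name = name
        self.name_node = name_node
        self.blocks = {}  # block_id -> bytes

    def serve_block(self, block_id):
        return self.blocks[block_id]

    def handle_read(self, block_id, source):
        # S12: access the target data block on the second data node.
        data = source.serve_block(block_id)
        # S13: generate a data block copy on this (first) data node.
        self.blocks[block_id] = bytes(data)
        # S14: send the first modification instruction to the name node.
        self.name_node.increment_replication(block_id)
        return data


nn = SimpleNameNode()
nn.replication["blk_1"] = 3
dn1 = SimpleDataNode("dn1", nn)
dn2 = SimpleDataNode("dn2", nn)
dn2.blocks["blk_1"] = b"payload"
result = dn1.handle_read("blk_1", dn2)  # S11: dn1 is told to read blk_1
```

After the call, `dn1` holds its own copy of `blk_1`, so a later task that reads `blk_1` can run on `dn1` as a local task.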

Embodiment 2

An embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 3, the method may be performed by the name node, and the method includes:

Step S21, receiving a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node.

Specifically, after the data node copies the target data block, since the number of copies of the target data block in the HDFS increases, it is necessary to update the replication factor information of the target data block stored in the name node.

Step S22, modifying the replication factor of the target data block according to the first modification instruction.

In practical applications, the name node also receives the periodic heartbeat information sent by the data nodes (which includes information about the data blocks on each data node), and uses it to correct the data block information recorded on the name node so that the records stay up to date.
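A minimal sketch of the name node side follows. It assumes, for illustration only, that the name node keeps both a replication factor and a set of holder nodes per block and that a heartbeat carries the set of block identifiers stored on the sending node; `NameNodeState`, `apply_modification` and `apply_heartbeat` are hypothetical names, not HDFS APIs.

```python
class NameNodeState:
    """Hypothetical bookkeeping kept by the name node."""

    def __init__(self):
        self.replication = {}  # block_id -> replication factor
        self.locations = {}    # block_id -> set of data node names

    def apply_modification(self, block_id, node, delta):
        """Apply a modification instruction from `node` (S21, S22).

        delta is +1 for the first modification instruction and -1 for
        the second modification instruction.
        """
        self.replication[block_id] = self.replication.get(block_id, 0) + delta
        holders = self.locations.setdefault(block_id, set())
        if delta > 0:
            holders.add(node)
        else:
            holders.discard(node)

    def apply_heartbeat(self, node, reported_blocks):
        """Correct recorded locations using the blocks `node` reports."""
        for block_id, holders in self.locations.items():
            if block_id in reported_blocks:
                holders.add(node)
            else:
                holders.discard(node)
        for block_id in reported_blocks:
            self.locations.setdefault(block_id, set()).add(node)


state = NameNodeState()
state.replication["blk_1"] = 3
state.locations["blk_1"] = {"dn2", "dn3", "dn4"}
state.apply_modification("blk_1", "dn1", +1)
factor_after_add = state.replication["blk_1"]
holders_after_add = set(state.locations["blk_1"])
# A later heartbeat from dn1 that no longer lists blk_1 corrects the record.
state.apply_heartbeat("dn1", set())
```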

Embodiment 3

An embodiment of the present invention provides a method for dynamically redistributing data. Referring to FIG. 4, the method includes:

In step S31, the first data node sends a file open request to the name node.

Specifically, the file open request includes a file name of the target file, a file offset, and a file data size.

In this embodiment, the application on the first data node sends the file open request to the name node through a distributed file system client ("DFSclient").

Step S32, the name node sends file feedback information to the first data node according to the file open request.

Specifically, the file feedback information includes a data block of the target file and an IP of the data node where each data block is located.

In this embodiment, the name node may send a data block list List<located Block>block() corresponding to the target file to the first data node according to the file open request.

Step S33, when the file feedback information indicates that the target data block is stored in the first data node, the first data node reads the target data block directly; when the file feedback information indicates that the target data block is stored in the second data node, step S34 is performed.

In this embodiment, when the target data block is stored in the first data node, the application on the first data node reads the target data block directly from the first data node through the DFSclient. Specifically, when reading the data block, the DFSclient creates an FSDataInputStream (distributed file system data input stream) object, and the target data block is read through the FSDataInputStream object.

Step S34, the first data node reads the target data block located on the second data node, and the second data node and the first data node are different data nodes in the same HDFS.

In this embodiment, when the target data block is stored in the second data node (that is, not stored in the first data node), the DFSclient on the first data node sends a file read request to the second data node through the FSDataInputStream object and receives the target data block returned by the second data node.
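The choice between steps S33 and S34 can be sketched as follows. The sketch assumes the file feedback information is available as a mapping from block identifier to a list of data node IP addresses; `read_target_block` and its callback parameters are hypothetical and do not reproduce the DFSclient or FSDataInputStream interfaces.

```python
def read_target_block(block_id, local_ip, file_feedback, fetch_remote, read_local):
    """Read block_id locally if possible (S33), otherwise remotely (S34)."""
    holders = file_feedback[block_id]
    if local_ip in holders:
        return "local", read_local(block_id)
    # Assumption: simply pick the first listed holder as the second data node.
    source_ip = holders[0]
    return "remote", fetch_remote(source_ip, block_id)


# Illustrative file feedback information and per-node block stores.
feedback = {"blk_1": ["10.0.0.2", "10.0.0.3"], "blk_2": ["10.0.0.1"]}
stores = {"10.0.0.1": {"blk_2": b"bbb"}, "10.0.0.2": {"blk_1": b"aaa"}}


def fetch(ip, blk):
    return stores[ip][blk]


def local(blk):
    return stores["10.0.0.1"][blk]


remote_result = read_target_block("blk_1", "10.0.0.1", feedback, fetch, local)
local_result = read_target_block("blk_2", "10.0.0.1", feedback, fetch, local)
```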

In step S35, the first data node generates a data block copy of the target data block.

In this embodiment, after the second data node sends the target data block to the first data node through the network, the first data node copies the target data block, thereby generating a copy of the data block.

Step S36: The first data node sends a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.

In addition, in this embodiment, because the storage space of the HDFS is limited, the range over which the replication factor may vary can be restricted. For example, the range of the replication factor can be set to 1 to 512. If the default replication factor is 3 when the HDFS is created, the range of the replication factor can instead be set to 3 to 512.
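One possible way to enforce such a range is to clamp the replication factor whenever it is modified. The text does not say how an out-of-range request is handled, so the clamping behaviour below is an assumption, and the names `MIN_REPLICATION`, `MAX_REPLICATION` and `adjust_replication` are hypothetical.

```python
MIN_REPLICATION = 3    # assumed lower bound: the default replication factor
MAX_REPLICATION = 512  # upper bound taken from the example range


def adjust_replication(current, delta, lo=MIN_REPLICATION, hi=MAX_REPLICATION):
    """Apply a +1 or -1 modification and keep the result within [lo, hi]."""
    return max(lo, min(hi, current + delta))


grown = adjust_replication(3, +1)     # 4
capped = adjust_replication(512, +1)  # stays at 512
floored = adjust_replication(3, -1)   # stays at 3
```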

Step S37, the name node receives the first modification instruction sent by the first data node, and modifies the replication factor of the target data block according to the first modification instruction.

Step S38, when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy, the first data node deletes the data block copy.

In this embodiment, a lifetime is set for the generated data block copy of the target data block, which prevents a data block copy that has not been accessed for a long time from occupying the storage space of the data node.

In one implementation of this embodiment, the lifetime of the data block copy may be set as follows: an initial lifetime is set for the newly added data block copy, and when the frequency (or number) of accesses to the data block copy within a set duration reaches a set value, the initial lifetime is extended and the extended value becomes the new lifetime of the data block copy.

In another implementation of this embodiment, the lifetime of the data block replica may also be statically set.

In addition, it should be noted that a lifetime is set only for these newly added data block copies; no lifetime is set for the copies formed when the file is created. Likewise, only newly added data block copies are deleted here; the copies formed when the file is created are not deleted.
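The dynamically extended lifetime described above can be sketched as follows. The text does not specify by how much the lifetime is extended, so the fixed `extension` amount, the sliding-window counting, and the name `ReplicaLifetime` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaLifetime:
    """Hypothetical lifetime record for one newly added data block copy."""
    lifetime: float   # current lifetime in seconds (starts at the initial value)
    window: float     # the set duration over which accesses are counted
    threshold: int    # number of accesses in the window that triggers extension
    extension: float  # assumed amount added to the lifetime on extension
    accesses: list = field(default_factory=list)

    def record_access(self, now):
        # Keep only accesses that fall inside the counting window.
        self.accesses = [t for t in self.accesses if now - t < self.window]
        self.accesses.append(now)
        if len(self.accesses) >= self.threshold:
            # Enough recent accesses: extend the lifetime and restart counting.
            self.lifetime += self.extension
            self.accesses.clear()


replica = ReplicaLifetime(lifetime=600, window=60, threshold=3, extension=300)
replica.record_access(0)
replica.record_access(10)
lifetime_before = replica.lifetime
replica.record_access(20)  # third access within 60 s triggers the extension
lifetime_after = replica.lifetime
```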

In this embodiment, the length of time that the data block copy is not accessed may be monitored by the first data node or by the name node.

In the case where the first data node monitors the length of time that the data block copy is not accessed, step S38 may include the following steps:

S381. The first data node monitors the length of time that the data block copy is not accessed.

S382. When the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy, the first data node deletes the data block copy.

Specifically, the length of time that the block copy is not accessed can be monitored by adding a new replication manager (ReplicaManager) on the first data node.
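Steps S381 and S382 might look like the sketch below. The ReplicaManager name comes from the text, but its interface here (`add_replica`, `touch`, `expire`) and the use of a single static lifetime are assumptions made for illustration.

```python
class ReplicaManager:
    """Hypothetical manager that tracks idle time of newly added copies."""

    def __init__(self, lifetime):
        self.lifetime = lifetime  # static lifetime in seconds
        self.last_access = {}     # block_id -> time of last access

    def add_replica(self, block_id, now):
        self.last_access[block_id] = now

    def touch(self, block_id, now):
        # Record an access; copies not managed here are ignored.
        if block_id in self.last_access:
            self.last_access[block_id] = now

    def expire(self, now):
        """S381/S382: return and forget copies idle longer than the lifetime."""
        expired = [b for b, t in self.last_access.items()
                   if now - t > self.lifetime]
        for block_id in expired:
            del self.last_access[block_id]
        return expired


manager = ReplicaManager(lifetime=100)
manager.add_replica("blk_1", now=0)
manager.add_replica("blk_2", now=0)
manager.touch("blk_2", now=80)
to_delete = manager.expire(now=150)  # blk_1 idle for 150 s, blk_2 for 70 s
```

Only copies registered through `add_replica` are tracked, which matches the note that copies formed when the file is created are never deleted.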

In the case where the name node monitors the unvisited duration of the data block copy, step S38 may include the following steps:

S383. The name node monitors a length of time that the data block copy is not accessed.

S384. When the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy, the name node sends a data block copy deletion instruction to the first data node, where the data block copy deletion instruction includes an identifier of the data block copy to be deleted.

S385. The first data node receives a data block copy deletion instruction sent by the name node.

S386. The first data node deletes the data block copy according to the data block copy deletion instruction.

Specifically, when the name node monitors the length of time that the data block copy is not accessed, it may do so using the information about the data block copy that is included in the heartbeat information sent by the first data node to the name node.
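Steps S383 to S386 can be sketched under the assumption that each heartbeat reports the last access time of every newly added copy on the sending node. The text does not define the heartbeat contents at this level of detail, so the payload format and the names `NameNodeMonitor`, `on_heartbeat`, `deletion_instructions` and `apply_deletion` are hypothetical.

```python
class NameNodeMonitor:
    """Hypothetical name-node-side monitor of idle data block copies."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.last_access = {}  # (node, block_id) -> last access time

    def on_heartbeat(self, node, replica_access_times):
        # S383: update idle-time records from the heartbeat payload.
        for block_id, accessed_at in replica_access_times.items():
            self.last_access[(node, block_id)] = accessed_at

    def deletion_instructions(self, now):
        """S384: list (node, block_id) pairs whose idle time exceeds the lifetime."""
        due = sorted(key for key, t in self.last_access.items()
                     if now - t > self.lifetime)
        for key in due:
            del self.last_access[key]
        return due


def apply_deletion(node_blocks, block_id):
    """S385/S386: the data node deletes the identified copy."""
    node_blocks.pop(block_id, None)


monitor = NameNodeMonitor(lifetime=100)
monitor.on_heartbeat("dn1", {"blk_1": 0, "blk_2": 90})
instructions = monitor.deletion_instructions(now=150)
dn1_blocks = {"blk_1": b"x", "blk_2": b"y"}
for _node, block_id in instructions:
    apply_deletion(dn1_blocks, block_id)
```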

Step S39: The first data node sends a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.

In practical applications, after the copy of the data block is deleted, the corresponding replication factor in HDFS needs to be modified.

Step S40, the name node receives the second modification instruction sent by the first data node, and modifies the replication factor of the target data block according to the second modification instruction.

It should be noted that, in HDFS, if the data node to which a task is assigned does not store the data block of the task's target file, the assigned task is referred to as a non-local task; if the data node to which the task is assigned does store the data block of the task's target file, the task is referred to as a local task. To improve the data reading efficiency of HDFS, it is often necessary to increase the probability that HDFS tasks are local tasks.

In this embodiment, the first data node reads the target data block located on the second data node, which indicates that the HDFS has assigned a non-local task to the first data node. In the existing HDFS, after the first data node reads the target data block from the second data node, the target data block is not copied to the first data node; if the first data node is assigned the same task again, it still has to read the target data block from the second data node.

In this embodiment, the first data node generates a data block copy of the target data block, which increases the number of copies of the target data block in the HDFS. When tasks that read the target data block are assigned again, more of them can be assigned to data nodes that store the target data block. This increases the localization probability of HDFS tasks, and more tasks can read the target data block from their own local data node, which improves the running speed of the HDFS and saves HDFS network resources.

It should also be noted that HDFS always assigns local tasks to a data node first (the localization principle) and assigns non-local tasks only after the local tasks have been assigned. A data node storing the target data block is therefore always assigned tasks that read the target data block, and its load is heavy. In this embodiment, the first data node copies the target data block, which increases the number of data nodes storing the target data block and shares part of the task load of the data nodes that already store it, thereby balancing the load in the HDFS.

In this embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends to the name node a first modification instruction that instructs the name node to increase the replication factor of the target data block by one. This increases the number of data nodes storing the target data block in the HDFS. When tasks that need to access the target data block are assigned again, the number of local tasks in the HDFS (that is, tasks assigned to a data node on which the target data block is stored) increases, which raises the localization probability of HDFS tasks. Because local tasks execute quickly and consume few HDFS network resources, the running speed of the HDFS also improves. In addition, the newly added data node storing the target data block shares part of the task load of the data nodes that originally stored the target data block, which achieves load balancing in the HDFS.

Embodiment 4

The embodiment of the present invention provides a data node. Referring to FIG. 5, the data node includes: a receiving module 501, an access module 502, a generating module 503, and a first sending module 504.

The receiving module 501 is configured to receive a data block read command, where the data block read command is used to instruct the data node to read the target data block located on the second data node, and the second data node and the data node are different data nodes in the same HDFS.

In this embodiment, the data node indicated by the data block read command may be the first data node in the first to third embodiments.

The access module 502 is configured to access the target data block according to the data block read command.

In this embodiment, since the data node (ie, the first data node) and the second data node are different data nodes in the same HDFS, the data node needs to read the target data block from the second data node over the network when accessing it according to the data block read command.

The generating module 503 is configured to generate a data block copy of the target data block on the data node.

The first sending module 504 is configured to send a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.

In this embodiment, after the data block copy of the target data block is generated on the data node, the number of copies of the target data block in the HDFS increases, so the replication factor information of the target data block stored in the name node needs to be updated.
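As a rough illustration of how modules 501 to 504 could cooperate, the following Python sketch models a data node that reads a block from a second data node, keeps a local copy, and then notifies the name node. The class and method names (`NameNode`, `DataNode`, `read_block`, `on_first_modification`) are hypothetical and are not part of any HDFS API; in-memory dictionaries stand in for block storage and network transfer.

```python
class NameNode:
    """Toy name node that tracks only a replication factor per block ID."""

    def __init__(self):
        self.replication = {}  # block_id -> replication factor

    def on_first_modification(self, block_id):
        # First modification instruction: increase the replication factor by one.
        self.replication[block_id] = self.replication.get(block_id, 0) + 1


class DataNode:
    """Toy data node that holds blocks in a dict (block_id -> bytes)."""

    def __init__(self, name, name_node):
        self.name = name
        self.name_node = name_node
        self.blocks = {}

    def read_block(self, block_id, second_node):
        # Receiving module (501): the read command names a block stored elsewhere.
        if block_id in self.blocks:
            return self.blocks[block_id]  # already local, nothing to copy
        # Access module (502): stands in for the network read from the second node.
        data = second_node.blocks[block_id]
        # Generating module (503): keep a data block copy on this node.
        self.blocks[block_id] = data
        # First sending module (504): tell the name node about the new copy.
        self.name_node.on_first_modification(block_id)
        return data
```

In this sketch, a second read of the same block on the same node is served locally and does not trigger another first modification instruction; that behavior is an assumption of the sketch rather than something the embodiment states.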

Specifically, the data node further includes: a deleting module 505.

The deleting module 505 is configured to delete the data block copy from the data node when the unused time of the data block copy exceeds the lifetime of the data block copy.

In this embodiment, a lifetime is set for the generated data block copy of the target data block, which prevents a data block copy that has not been accessed for a long time from occupying the storage space of the data node.

In this embodiment, the length of time that the data block copy is not accessed may be monitored by the data node (ie, the first data node) or by the name node.

Further, when the length of time that the data block copy is not accessed is monitored by the name node, the deleting module 505 further includes: a receiving unit 515 and a first deleting unit 525.

The receiving unit 515 is configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy.

The first deleting unit 525 is configured to delete the data block copy from the data node according to the data block copy deletion instruction.
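The name-node-monitored variant can be sketched as follows. This is a minimal Python illustration under stated assumptions: `NameNodeMonitor`, `DeleteCopyInstruction`, and `on_delete_instruction` are hypothetical names, timestamps are passed in explicitly instead of being read from a clock, and the only field carried by the deletion instruction is the identifier of the data block copy, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteCopyInstruction:
    copy_id: str  # identifier of the data block copy to delete


class NameNodeMonitor:
    """Name-node side: tracks when each data block copy was last accessed."""

    def __init__(self, lifetime_s):
        self.lifetime_s = lifetime_s
        self.last_access = {}  # copy_id -> timestamp of last access

    def record_access(self, copy_id, now):
        self.last_access[copy_id] = now

    def expired_instructions(self, now):
        # One deletion instruction per copy whose unaccessed time exceeds the lifetime.
        return [
            DeleteCopyInstruction(copy_id)
            for copy_id, last in sorted(self.last_access.items())
            if now - last > self.lifetime_s
        ]


class DataNode:
    """Data-node side: receiving unit (515) and first deleting unit (525)."""

    def __init__(self):
        self.copies = {}  # copy_id -> bytes

    def on_delete_instruction(self, instruction):
        # Delete the copy named in the instruction; report whether it existed.
        return self.copies.pop(instruction.copy_id, None) is not None
```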

When the length of time that the data block copy is not accessed is monitored by the data node, the deleting module 505 further includes: a monitoring unit 535 and a second deleting unit 545.

The monitoring unit 535 is configured to monitor the length of time that the data block copy is not accessed.

The second deleting unit 545 is configured to delete the data block copy from the data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy.

Further, the data node further includes: a second sending module 506.

The second sending module 506 is configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
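The data-node-monitored variant, together with the second modification instruction, might look like the sketch below. All names (`CopyStore`, `sweep`, `send_second_modification`) are hypothetical, and the timestamps are supplied by the caller so that the example stays deterministic; a real data node would read a clock and send the instruction to the name node over the network.

```python
class CopyStore:
    """Data block copies on one data node, with lifetime-based deletion."""

    def __init__(self, lifetime_s, send_second_modification):
        self.lifetime_s = lifetime_s
        # Callback standing in for the second sending module (506).
        self.send_second_modification = send_second_modification
        self.copies = {}       # block_id -> bytes
        self.last_access = {}  # block_id -> timestamp of last access

    def add_copy(self, block_id, data, now):
        self.copies[block_id] = data
        self.last_access[block_id] = now

    def read(self, block_id, now):
        data = self.copies[block_id]
        self.last_access[block_id] = now  # any access resets the unaccessed time
        return data

    def sweep(self, now):
        # Monitoring unit (535): find copies whose unaccessed time exceeds the lifetime.
        expired = [
            block_id
            for block_id, last in self.last_access.items()
            if now - last > self.lifetime_s
        ]
        for block_id in expired:
            # Second deleting unit (545): remove the copy from this data node.
            del self.copies[block_id]
            del self.last_access[block_id]
            # Ask the name node to decrease the replication factor by one.
            self.send_second_modification(block_id)
        return expired
```

Calling `sweep` periodically is one possible design; the embodiment does not specify how often the unaccessed time is checked.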

It should be noted that, in HDFS, if the data node to which a task is assigned does not store a data block of the task's target file, the assigned task is called a non-local task; if the data node to which the task is assigned stores a data block of the task's target file, the task is called a local task. A local task reads faster than a non-local task and does not occupy the network resources of the HDFS. Therefore, the reading speed of the HDFS can be improved by raising the localization probability of tasks.

In this embodiment, the data node (ie, the first data node) copies the target data block and generates a data block copy of the target data block, increasing the number of copies of the target data block in the HDFS. When tasks of reading the target data block are allocated again, the number of tasks allocated to data nodes storing the target data block increases, which raises the localization probability of HDFS tasks. At the same time, more tasks can read the target data block on their own local data node, which improves the running speed of the HDFS and saves the network resources of the HDFS.

In addition, in the HDFS, a data node is always assigned local tasks first (that is, the localization principle), and non-local tasks are allocated only after the local tasks have been allocated. Therefore, the data node storing the target data block is always assigned the task of reading the target data block, and its load is high. In this embodiment, the data node (ie, the first data node) copies the target data block, which increases the number of data nodes storing the target data block and shares the task load of the data node that originally stored the target data block, thereby achieving load balancing in the HDFS.
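The effect of an extra copy on task locality can be illustrated with a deliberately simplified model. The function below is hypothetical and is not how any real Hadoop scheduler is implemented: it assumes each idle node takes at most one task in a round, and that all tasks read the same target data block.

```python
def assign_read_tasks(holders, idle_nodes, num_tasks):
    """Assign tasks that read one block: local tasks first, then non-local ones.

    holders    -- set of node names that store the target data block
    idle_nodes -- list of node names able to take one task each
    num_tasks  -- number of tasks that need to read the block
    Returns (assigned_nodes, number_of_local_tasks).
    """
    local = [n for n in idle_nodes if n in holders]
    non_local = [n for n in idle_nodes if n not in holders]
    # Localization principle: nodes holding the block are served first.
    assigned = (local + non_local)[:num_tasks]
    n_local = sum(1 for n in assigned if n in holders)
    return assigned, n_local


nodes = ["dn1", "dn2", "dn3", "dn4"]
# Before redistribution only dn2 stores the block.
before = assign_read_tasks({"dn2"}, nodes, 3)
# After dn1 generated a data block copy, both dn1 and dn2 store it.
after = assign_read_tasks({"dn1", "dn2"}, nodes, 3)
```

In this toy model, one of the three tasks is local before the copy exists and two are local afterwards, which is the increase in localization probability described above.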

Embodiment 5

The embodiment of the present invention provides a name node. Referring to FIG. 6, the name node includes: a first receiving module 601 and a first modifying module 602.

The first receiving module 601 is configured to receive a first modification instruction sent by the data node, where the first modification instruction is used to instruct the name node to increase by one the replication factor of the target data block that has been copied to the data node.

The first modifying module 602 is configured to modify the replication factor of the target data block according to the first modification instruction.

In practical applications, the name node also receives periodic heartbeat information sent by the data node (the periodic heartbeat information includes information about the data blocks on the data node), and corrects the data block information recorded on the name node in a timely manner.

Further, the name node further includes: a second receiving module 603 and a second modifying module 604.

The second receiving module 603 is configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.

The second modifying module 604 is configured to modify the replication factor of the target data block according to the second modification instruction.
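A minimal sketch of the name node side, assuming an in-memory table of replication factors and block locations, is shown below. The names `NameNodeReplication`, `handle_first_modification`, `handle_second_modification`, and `handle_heartbeat` are hypothetical, and the guard against decrementing below zero is an assumption of the sketch, not something stated in the embodiment.

```python
class NameNodeReplication:
    """Toy name node state: replication factors and reported block locations."""

    def __init__(self):
        self.replication = {}  # block_id -> replication factor
        self.locations = {}    # block_id -> set of data node names

    def handle_first_modification(self, block_id):
        # Modules 601/602: increase the replication factor by one.
        self.replication[block_id] = self.replication.get(block_id, 0) + 1

    def handle_second_modification(self, block_id):
        # Modules 603/604: decrease the replication factor by one.
        current = self.replication.get(block_id, 0)
        if current == 0:
            # Assumption of this sketch: refuse to go below zero.
            raise ValueError(f"no recorded copies of {block_id}")
        self.replication[block_id] = current - 1

    def handle_heartbeat(self, node, reported_block_ids):
        # Correct recorded locations using the block list in the heartbeat.
        reported = set(reported_block_ids)
        for block_id, nodes in self.locations.items():
            if block_id not in reported:
                nodes.discard(node)  # node no longer reports this block
        for block_id in reported:
            self.locations.setdefault(block_id, set()).add(node)
```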

Embodiment 6

An embodiment of the present invention provides a data node. Referring to FIG. 7, the data node includes:

A processor 701, a memory 702, a bus 703, and a communication interface 704. The memory 702 is used to store computer-executable instructions, and the processor 701 is connected to the memory 702 via the bus 703. When the computer is running, the processor 701 executes the computer-executable instructions stored in the memory 702, so that the computer performs the method described in Embodiment 1 or Embodiment 3.

In the embodiment of the present invention, the first data node receives a data block read command, accesses the target data block (stored on the second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one. This increases the number of data nodes storing the target data block in the HDFS, so that when tasks requiring access to the target data block are assigned again, the number of local tasks in the HDFS (ie, tasks assigned to a data node on which the target data block is stored) increases, which raises the localization probability of HDFS tasks. At the same time, because local tasks execute quickly and consume few HDFS network resources, the running speed of the HDFS improves. In addition, the newly added data node storing the target data block shares part of the task load of the data node that originally stored the target data block, achieving load balancing in the HDFS.

Embodiment 7

An embodiment of the present invention provides a name node. Referring to FIG. 8, the name node includes:

A processor 801, a memory 802, a bus 803, and a communication interface 804. The memory 802 is used to store computer-executable instructions, and the processor 801 is connected to the memory 802 via the bus 803. When the computer is running, the processor 801 executes the computer-executable instructions stored in the memory 802, so that the computer performs the method described in Embodiment 2 or Embodiment 3.

Embodiment 8

An embodiment of the present invention provides a system for dynamically redistributing data. Referring to FIG. 9, the system includes: a data node 50 as described in Embodiment 4, and a name node 60 as described in Embodiment 5.

The serial numbers of the embodiments of the present invention are merely for description, and do not represent the relative merits of the embodiments.

It should be noted that, when the data node provided by the foregoing embodiments implements the method for dynamically redistributing data, the division into the functional modules described above is used only as an example. In actual applications, the foregoing functions may be assigned to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data node and the method for dynamically redistributing data provided by the foregoing embodiments belong to the same concept. For details of the specific implementation, refer to the method embodiments; details are not described herein again.

A person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or by a program instructing related hardware, and the program may be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above are only the preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalents, improvements, etc., which are within the spirit and scope of the present invention, should be included within the scope of protection of the present invention.

Claims

A method for dynamically redistributing data, characterized in that the method comprises:

The first data node receives a data block read command, where the data block read command is used to instruct the first data node to read a target data block located on the second data node, and the second data node and the first data node are different data nodes in the same Hadoop distributed file system (HDFS);

Accessing the target data block according to the data block read command;

Generating a data block copy of the target data block on the first data node;

Sending a first modification instruction to the name node, the first modification instruction is used to instruct the name node to increase a replication factor of the target data block by one.
The method of claim 1 further comprising:

Deleting the data block copy from the first data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.
The method according to claim 2, wherein said deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy comprises:

Receiving a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy;

Deleting the data block copy from the first data node according to the data block copy deletion instruction;

or,

said deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy comprises:

Monitoring the length of time that the copy of the data block is not accessed;

Deleting the data block copy from the first data node when the length of time that the data block copy is not accessed exceeds the lifetime of the data block copy.
The method according to claim 2 or 3, wherein the method further comprises:

Sending a second modification instruction to the name node, the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
A method for dynamically redistributing data, characterized in that the method comprises:

Receiving a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node;

Modifying a replication factor of the target data block according to the first modification instruction.
The method of claim 5, wherein the method further comprises:

Receiving a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one;

Modifying the replication factor of the target data block according to the second modification instruction.
A data node, wherein the data node comprises:

a receiving module, configured to receive a data block read command, where the data block read command is used to instruct the data node to read a target data block located on the second data node, and the second data node and the data node are different data nodes in the same HDFS;

An access module, configured to access the target data block according to the data block read command;

a generating module, configured to generate a data block copy of the target data block on the data node;

a first sending module, configured to send a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.
The data node according to claim 7, wherein the data node further comprises:

a deleting module, configured to delete the data block copy from the data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.
The data node according to claim 8, wherein the deleting module comprises:

a receiving unit, configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy;

a first deleting unit, configured to delete the data block copy from the data node according to the data block copy deletion instruction;

Alternatively, the deleting module includes:

a monitoring unit, configured to monitor an unvisited duration of the data block copy;

a second deleting unit, configured to delete the data block copy from the data node when a length of time that the data block copy is not accessed exceeds a lifetime of the data block copy.
The data node according to claim 8 or 9, wherein the data node further comprises:

a second sending module, configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to decrement the replication factor of the target data block by one.
A name node, characterized in that the name node comprises:

a first receiving module, configured to receive a first modification instruction sent by the first data node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one, and the first modification instruction is sent after the first data node reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node;

a first modifying module, configured to modify the replication factor of the target data block according to the first modification instruction.
The name node according to claim 11, wherein the name node further comprises:

a second receiving module, configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to decrement a replication factor of the target data block by one;

a second modifying module, configured to modify the replication factor of the target data block according to the second modification instruction.
A data node, wherein the data node comprises:

a processor, a memory, a bus, and a communication interface; the memory is configured to store computer-executable instructions, the processor is connected to the memory via the bus, and when the computer is running, the processor executes the computer-executable instructions stored in the memory, so that the computer performs the method of any of claims 1-4.
A name node, characterized in that the name node comprises:

a processor, a memory, a bus, and a communication interface; the memory is configured to store computer-executable instructions, the processor is connected to the memory via the bus, and when the computer is running, the processor executes the computer-executable instructions stored in the memory, so that the computer performs the method of claim 5 or 6.
A system for dynamic data redistribution, characterized in that the system comprises a data node according to any of claims 7 to 10, and a name node according to claim 11 or 12.