CN105760391B

CN105760391B - Method, data node, name node and system for dynamically redistributing data

Info

Publication number: CN105760391B
Application number: CN201410790066.5A
Authority: CN
Inventors: 李嘉; 刘杰; 党李飞
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2014-12-18
Filing date: 2014-12-18
Publication date: 2019-12-13
Anticipated expiration: 2034-12-18
Also published as: WO2016095760A1; CN105760391A

Abstract

The invention discloses a method, a data node, a name node and a system for dynamically redistributing data, and belongs to the technical field of internet big data processing. The method comprises the following steps: a first data node receives a data block read command; the first data node accesses a target data block according to the data block read command; a data block copy of the target data block is generated on the first data node; and a first modification instruction is sent to the name node. Because the data block copy of the target data block is generated on the first data node, the number of data nodes storing the target data block in the HDFS increases. When tasks that access the target data block are allocated again, more of them are local tasks, which reduces the consumption of HDFS network resources, improves the operation speed of the HDFS, shares the task load of the data nodes storing the data blocks of the target file, and achieves load balancing.

Description

Method, data node, name node and system for dynamically redistributing data

Technical Field

The invention relates to the technical field of internet big data processing, in particular to a method, a data node, a name node and a system for dynamically redistributing data.

Background

The Hadoop Distributed File System (HDFS) is an excellent Distributed File System and can be used for storing mass data. At present, HDFS has been widely used in various large-scale online services and large-scale storage systems.

The HDFS stores files in a distributed manner using a block mechanism and improves system reliability through a data block redundancy strategy: multiple copies of each data block exist in the system at the same time, distributed over multiple nodes in multiple racks, so that the failure of a single node does not cause the loss of a data block. To implement this redundancy strategy, the HDFS must ensure that multiple copies are written when data is written. The number of copies written is called the replication factor of the data block and is usually three by default.

The HDFS has a master-slave structure and generally includes a Name Node (NN) and a plurality of Data Nodes (DN). The NN, also called the master node, is responsible for managing the namespace and the data block mapping information of the HDFS, configuring the copy policy, and processing client requests. A DN, also called a slave node, stores the actual data, performs read and write operations on data blocks, and periodically reports information about the data blocks it stores to the NN. A client can access or manage the HDFS through a command line, interact with the Name Node to acquire the location information of a file, and interact with a Data Node to read and write data.

In the HDFS, a task that accesses a data block is usually allocated preferentially to a DN storing the target data block (for that DN, such a task may be referred to as a local task), so that the task can read the target data block directly from that data node. When the number of tasks executed by the data nodes storing the target data block reaches the maximum, further tasks are allocated to DNs that do not store the target data block (for those DNs, such tasks may be referred to as non-local tasks), and a DN that is allocated a non-local task needs to read the target data block through the network from a DN that stores it.
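The local-first allocation described above can be sketched as follows. This is a minimal illustration rather than the HDFS scheduler itself: the `DataNode` class, the `max_tasks` slot limit and the `assign_task` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    max_tasks: int                               # hypothetical per-node task slot limit
    blocks: set = field(default_factory=set)     # ids of blocks stored on this node
    running: int = 0                             # tasks currently assigned


def assign_task(block_id, nodes):
    """Return (node, is_local) for a task that reads block_id, or None if no slot is free."""
    # Prefer a node that stores the block and still has a free slot (local task).
    for node in nodes:
        if block_id in node.blocks and node.running < node.max_tasks:
            node.running += 1
            return node, True
    # Otherwise fall back to any node with a free slot (non-local task).
    for node in nodes:
        if node.running < node.max_tasks:
            node.running += 1
            return node, False
    return None


nodes = [DataNode("dn1", 1, {"blk_1"}), DataNode("dn2", 1)]
first = assign_task("blk_1", nodes)     # dn1 stores blk_1 and has a free slot
second = assign_task("blk_1", nodes)    # dn1 is full, so dn2 gets a non-local task
```

In this sketch the second task lands on `dn2`, which would have to fetch `blk_1` over the network; this is the situation the invention addresses.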

In the process of implementing the invention, the inventors found that the prior art has at least the following problems:

Because user access to data is unbalanced and unpredictable, some data blocks are accessed so often within a given period that they become hot data blocks. Because of the replication factor limit (the number of copies of a data block in the system does not exceed its replication factor), when multiple tasks need to access a hot data block at the same time, some of them must access it over the network, which slows those tasks down and consumes network resources. Meanwhile, the data nodes storing the hot data block keep receiving tasks that access it, so the task load in the HDFS becomes unbalanced.

Disclosure of Invention

In order to solve the problems that, due to the replication factor limit, excessive access to hot data blocks in the conventional HDFS causes low operation speed, wasted network resources and unbalanced load, embodiments of the present invention provide a method, a data node, a name node and a system for dynamically redistributing data. The technical scheme is as follows:

In a first aspect, a method for dynamically redistributing data is provided, the method comprising:

receiving, by a first data node, a data block read command, wherein the data block read command is used for instructing the first data node to read a target data block located on a second data node, and the second data node and the first data node are different data nodes in the same HDFS;

accessing the target data block according to the data block read command;

generating a data block copy of the target data block on the first data node; and

sending a first modification instruction to a name node, wherein the first modification instruction is used for instructing the name node to increase the replication factor of the target data block by one.

Specifically, the method further comprises:

when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy, deleting the data block copy from the first data node.

Further, the deleting the data block copy from the first data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy comprises:

receiving a data block copy deletion instruction sent by the name node, wherein the data block copy deletion instruction comprises an identifier of the data block copy; and

deleting the data block copy from the first data node according to the data block copy deletion instruction;

or, alternatively, comprises:

monitoring the length of time for which the data block copy has not been accessed; and

when that length of time exceeds the lifetime of the data block copy, deleting the data block copy from the first data node.

Further, the method further comprises:

sending a second modification instruction to the name node, the second modification instruction being used for instructing the name node to reduce the replication factor of the target data block by one.

In a second aspect, a method for dynamically redistributing data is provided, the method comprising:

receiving a first modification instruction sent by a first data node, wherein the first modification instruction is used for instructing the name node to add one to the replication factor of a target data block, and the first modification instruction is sent after the first data node reads the target data block located on a second data node and generates a data block copy of the target data block on the first data node; and

modifying the replication factor of the target data block according to the first modification instruction.

Specifically, the method further comprises:

receiving a second modification instruction sent by the first data node, wherein the second modification instruction is used for instructing the name node to reduce the replication factor of the target data block by one; and

modifying the replication factor of the target data block according to the second modification instruction.
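As an illustration of the name-node side of the method, the sketch below applies a first (+1) or second (-1) modification instruction to a per-block replication factor. The `NameNode` class and `handle_modification` method are hypothetical and model only the bookkeeping described above; they are not the HDFS NameNode API.

```python
class NameNode:
    """Toy name node that tracks only the replication factor of each block."""

    def __init__(self, default_factor=3):
        self.default_factor = default_factor
        self.replication = {}            # block id -> current replication factor

    def handle_modification(self, block_id, delta):
        """Apply a first (+1) or second (-1) modification instruction."""
        if delta not in (1, -1):
            raise ValueError("delta must be +1 or -1")
        current = self.replication.get(block_id, self.default_factor)
        self.replication[block_id] = current + delta
        return self.replication[block_id]


nn = NameNode()
after_copy = nn.handle_modification("blk_1", +1)     # first modification instruction
after_delete = nn.handle_modification("blk_1", -1)   # second modification instruction
```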

In a third aspect, a data node is provided, which includes:

a receiving module, configured to receive a data block read command, where the data block read command is used to instruct a data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same HDFS;

The access module is used for accessing the target data block according to the data block reading command;

a generation module for generating a data block copy of the target data block on the data node;

A first sending module, configured to send a first modification instruction to a name node, where the first modification instruction is used to instruct the name node to add one to the replication factor of the target data block.

Specifically, the data node further includes:

a deleting module, configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.

Further, the deletion module includes:

a receiving unit, configured to receive a data block copy deletion instruction sent by a name node, where the data block copy deletion instruction includes an identifier of the data block copy;

a first deleting unit, configured to delete the data block copy from the data node according to the data block copy deleting instruction;

Or, the deleting module includes:

the monitoring unit is used for monitoring the duration of the unaccessed data block copies;

And the second deleting unit is used for deleting the data block copy from the data node when the duration of the data block copy which is not accessed exceeds the lifetime of the data block copy.

Further, the data node further includes:

A second sending module, configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to reduce the replication factor of the target data block by one.

In a fourth aspect, a name node is provided, the name node comprising:

A first receiving module, configured to receive a first modification instruction sent by a first data node, where the first modification instruction is used to instruct the name node to add one to a replication factor of a target data block, and the first modification instruction is sent after the first data node reads the target data block located on a second data node and generates a data block copy of the target data block on the first data node;

and the first modification module is used for modifying the replication factor of the target data block according to the first modification instruction.

Specifically, the name node further includes:

A second receiving module, configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to reduce the replication factor of the target data block by one;

and the second modification module is used for modifying the replication factor of the target data block according to the second modification instruction.

In a fifth aspect, a data node is provided, which includes:

a processor, a memory, a bus, and a communication interface; the memory is used for storing computer-executable instructions, the processor is connected with the memory through the bus, and when the computer runs, the processor executes the computer-executable instructions stored by the memory so as to enable the computer to execute the method.

In a sixth aspect, there is provided a name node, comprising:

In a seventh aspect, a system for dynamic redistribution of data is provided, the system comprising a data node as described above, and a name node as described above.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

A first data node receives a data block read command, accesses a target data block (stored on a second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node to instruct the name node to increase the replication factor of the target data block by one. The number of data nodes storing the target data block in the HDFS therefore increases, so that when tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (namely tasks allocated to data nodes storing the target data block) increases and the task localization probability of the HDFS improves. Because local tasks execute quickly and consume few HDFS network resources, the operation speed of the HDFS improves. In addition, the newly added data nodes storing the target data block take over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.

Drawings

In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a network architecture of an HDFS provided in the present invention;

FIG. 2 is a flowchart of a method for dynamically redistributing data according to a first embodiment of the present invention;

FIG. 3 is a flowchart of a method for dynamically redistributing data according to a second embodiment of the present invention;

FIG. 4 is an information interaction diagram of dynamic redistribution of data according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a data node according to a fourth embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a name node according to a fifth embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a data node according to a sixth embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a name node according to a seventh embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a system for dynamically redistributing data according to an eighth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The structure of the HDFS will be briefly described with reference to fig. 1. As shown in fig. 1, the HDFS generally includes a name node 11 and a plurality of data nodes 12, the name node 11 is used for managing all the data nodes 12 in the HDFS, the data nodes 12 are used for storing data block information of files, and the name node 11 and the data nodes 12 in the HDFS communicate with each other through a network.

Example One

An embodiment of the present invention provides a method for dynamically redistributing data, which may be executed by a first data node. Referring to FIG. 2, the method includes:

In step S11, the first data node receives a data block read command, where the data block read command is used to instruct the first data node to read a target data block located on a second data node, and the second data node and the first data node are different data nodes in the same HDFS.

In step S12, the target data block is accessed according to the data block read command.

In this embodiment, since the first data node and the second data node are different data nodes in the same HDFS, when the first data node accesses the target data block according to the data block read command, the first data node needs to read from the second data node through the network.

In practical applications, the network resources of the HDFS are limited, and accessing the target file through the network consumes these limited resources and slows down the HDFS. Tasks in the HDFS should therefore avoid accessing the target data block through the network as far as possible.

In step S13, a data block copy of the target data block is generated on the first data node.

It should be noted that, in the HDFS, if the data node to which a task is allocated does not store a data block of the target file of the task, the task is called a non-local task; if the data node to which the task is allocated stores a data block of the target file of the task, the task is called a local task. Since a local task reads much faster than a non-local task and does not occupy the network resources of the HDFS, raising the task localization probability improves the reading speed of the HDFS.

In this embodiment, the first data node copies the target data block and generates a data block copy of the target data block, increasing the number of copies of the target data block in the HDFS. When the task of reading the target data block is distributed again, the number of the tasks distributed to the data nodes in which the target data block is stored is increased, so that the localization probability of the tasks of the HDFS is improved, meanwhile, more tasks can read the target data block from the local data nodes, the running speed of the HDFS is improved, and the network resources of the HDFS can be saved.

In addition, the HDFS always allocates local tasks to a data node first (the localization principle) and allocates non-local tasks only after the local tasks have been allocated. The data nodes storing the target data block are therefore always allocated tasks that read the target data block, and their load is heavy. In this embodiment, the first data node copies the target data block, so that the number of data nodes storing the target data block increases and part of the task load of the data nodes storing the target data block is shared, thereby achieving load balancing in the HDFS.

In step S14, a first modification instruction is sent to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.

In this embodiment, after the data block copy of the target data block is generated on the first data node, the number of copies of the target data block in the HDFS increases, and the replication factor information of the target data block stored in the name node needs to be updated.

In the embodiment of the invention, a first data node receives a data block read command, accesses a target data block (stored on a second data node) according to the data block read command, generates a data block copy of the target data block, and finally sends a first modification instruction to the name node to instruct the name node to add one to the replication factor of the target data block. The number of data nodes storing the target data block in the HDFS therefore increases, so that when tasks that need to access the target data block are allocated again, the number of local tasks in the HDFS (namely tasks allocated to data nodes storing the target data block) increases and the task localization probability of the HDFS improves. Because local tasks execute quickly and consume few HDFS network resources, the operation speed of the HDFS improves. In addition, the newly added data nodes storing the target data block take over part of the task load of the data nodes that originally stored it, which achieves load balancing in the HDFS.
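Steps S11 to S14 can be sketched from the first data node's point of view as follows. This is an illustrative model only: `NameNodeStub`, `FirstDataNode`, `handle_read` and the in-memory `peers` dictionary are hypothetical stand-ins for the name node, the first data node and the network read, and do not correspond to real HDFS classes.

```python
class NameNodeStub:
    """Hypothetical stand-in that records first modification instructions."""

    def __init__(self):
        self.increments = []

    def increase_replication(self, block_id):
        self.increments.append(block_id)


class FirstDataNode:
    def __init__(self, name_node, peers):
        self.name_node = name_node
        self.peers = peers               # peer name -> {block id: bytes}
        self.local_blocks = {}           # data block copies held on this node

    def handle_read(self, block_id, source):
        # S12: read the target block from the second data node
        # (a network read in a real HDFS; a dictionary lookup here).
        data = self.peers[source][block_id]
        # S13: keep a data block copy on this node.
        self.local_blocks[block_id] = bytes(data)
        # S14: instruct the name node to add one to the replication factor.
        self.name_node.increase_replication(block_id)
        return data


nn = NameNodeStub()
dn1 = FirstDataNode(nn, peers={"dn2": {"blk_7": b"payload"}})
result = dn1.handle_read("blk_7", source="dn2")   # S11: the read command arrives
```

After the call, `dn1` holds its own copy of `blk_7`, so a later task reading that block could be scheduled on `dn1` as a local task.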

Example Two

An embodiment of the present invention provides a method for dynamically redistributing data, which may be executed by a name node. Referring to FIG. 3, the method includes:

In step S21, a first modification instruction sent by the first data node is received, where the first modification instruction is used to instruct the name node to add one to the replication factor of the target data block, and the first modification instruction is sent after the first data node reads the target data block located on the second data node and generates a data block copy of the target data block on the first data node.

Specifically, after the data node copies the target data block, the number of copies of the target data block in the HDFS increases, so the replication factor information of the target data block stored in the name node needs to be updated.

In step S22, the replication factor of the target data block is modified according to the first modification instruction.

In practical applications, the name node also receives periodic heartbeat information sent by the data nodes (the heartbeat information includes information about the data blocks on the data node). The name node compares this information with the data block information it has recorded, corrects any discrepancy, and thus keeps the recorded data block information up to date.
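The compare-and-correct step on heartbeat information might look like the sketch below. The layout of the name node's record (a dictionary from block id to the set of node names holding it) and the `reconcile` function are assumptions made for illustration; the source does not specify the data structures.

```python
def reconcile(recorded, node, reported_blocks):
    """Update the block -> nodes map from one data node's heartbeat report.

    recorded: dict mapping block id -> set of node names (the name node's view)
    reported_blocks: set of block ids the node reports it currently stores
    """
    # Correct existing records: add or drop this node for each known block.
    for block_id, holders in recorded.items():
        if block_id in reported_blocks:
            holders.add(node)
        else:
            holders.discard(node)
    # Record blocks the name node did not know about yet.
    for block_id in reported_blocks:
        recorded.setdefault(block_id, set()).add(node)
    return recorded


view = {"blk_1": {"dn2"}, "blk_2": {"dn1", "dn2"}}
reconcile(view, "dn1", {"blk_1"})   # dn1 now holds blk_1 and no longer holds blk_2
```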

Example Three

An embodiment of the present invention provides a method for dynamically redistributing data, and referring to fig. 4, the method includes:

In step S31, the first data node sends a file open request to the name node.

Specifically, the file open request includes the file name, the file offset, and the file data size of the target file.

In this embodiment, the application on the first data node sends the file open request to the name node through a distributed file system client (DFSClient for short).

In step S32, the name node sends file feedback information to the first data node according to the file open request.

Specifically, the file feedback information includes the data blocks of the target file and the IP address of the data node where each data block is located.

In this embodiment, the name node may send a list of the data blocks corresponding to the target file, List<LocatedBlock> blocks, to the first data node according to the file open request.

In step S33, when the file feedback information indicates that the target data block is stored on the first data node, the first data node reads the target data block directly; when the file feedback information indicates that the target data block is stored on the second data node, step S34 is performed.

In this embodiment, when the target data block is stored on the first data node, the application on the first data node reads the target data block directly from the first data node through the DFSClient. Specifically, when the DFSClient reads the data block, an FSDataInputStream object is created, and the target data block is read through the FSDataInputStream object.

In step S34, the first data node reads the target data block located on the second data node, where the second data node and the first data node are different data nodes in the same HDFS.

In this embodiment, when the target data block is stored on the second data node (i.e. not stored on the first data node), the DFSClient on the first data node sends a file read request to the second data node through the FSDataInputStream object and receives the target data block returned by the second data node.
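The choice between steps S33 and S34 can be illustrated with a small sketch. The dictionary layout of the file feedback information and the `choose_source` function are hypothetical, and picking the first listed host for a remote read is an assumption of this example rather than something the source specifies.

```python
def choose_source(located_block, local_ip):
    """Decide where the first data node reads one block from.

    located_block is a hypothetical dict: {"block_id": str, "hosts": [ip, ...]}.
    """
    hosts = located_block["hosts"]
    if local_ip in hosts:
        return "local", local_ip       # S33: read directly from this node
    return "remote", hosts[0]          # S34: read from a second data node


# File feedback information as the name node might return it (illustrative values).
feedback = [
    {"block_id": "blk_1", "hosts": ["10.0.0.1", "10.0.0.2"]},
    {"block_id": "blk_2", "hosts": ["10.0.0.3"]},
]
plan = {b["block_id"]: choose_source(b, "10.0.0.1") for b in feedback}
```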

In step S35, the first data node generates a data block copy of the target data block.

In this embodiment, after the second data node sends the target data block to the first data node through the network, the first data node copies the target data block, thereby generating a data block copy thereof.

In step S36, the first data node sends a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increase the replication factor of the target data block by one.

In addition, in this embodiment, the range within which the replication factor may vary can be limited in view of the storage space limit of the HDFS. For example, the range of the replication factor may be set to 1 to 512. If the default replication factor is 3 when the HDFS is created, the range of the replication factor can be set to 3 to 512.
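A bounded replication factor can be sketched as a simple clamp. The constants below take the 3 to 512 range from the example above; the `adjust_factor` helper and the choice to clamp out-of-range requests (rather than, say, refuse to create the copy) are assumptions of this sketch.

```python
MIN_FACTOR = 3      # default replication factor used as the lower bound
MAX_FACTOR = 512    # upper bound from the example range


def adjust_factor(current, delta):
    """Apply +1 or -1 to a replication factor while keeping it within the allowed range."""
    return max(MIN_FACTOR, min(MAX_FACTOR, current + delta))


grown = adjust_factor(3, +1)        # within range, so the factor increases
capped = adjust_factor(512, +1)     # already at the upper bound
floored = adjust_factor(3, -1)      # never drops below the default
```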

In step S37, the name node receives the first modification instruction sent by the first data node, and modifies the replication factor of the target data block according to the first modification instruction.

In step S38, when the length of time that the data block copy has not been accessed exceeds the lifetime of the data block copy, the first data node deletes the data block copy.

In this embodiment, a lifetime is set for each generated data block copy of the target data block, which prevents data block copies that have not been accessed for a long time from occupying the storage space of the data node.

In one implementation of this embodiment, the lifetime of the data block copy may be set dynamically: an initial lifetime is set for the newly added data block copy, and when the number of times (or the frequency) the data block copy is accessed within a set time reaches a set value, the initial lifetime is extended and the result is used as the new lifetime of the data block copy.
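The dynamic lifetime described above can be sketched as follows. All parameter names (`window`, `threshold`, `extension`) and the rule of adding a fixed extension each time the threshold is reached are assumptions for illustration; the source states only that the lifetime is extended when the access count within a set time reaches a set value.

```python
class ReplicaLifetime:
    """Tracks the lifetime of one newly added data block copy."""

    def __init__(self, initial_lifetime, window, threshold, extension):
        self.lifetime = initial_lifetime   # current lifetime, in seconds
        self.window = window               # the "set time" for counting accesses
        self.threshold = threshold         # the "set value" of accesses
        self.extension = extension         # hypothetical amount added per extension
        self.accesses = []                 # timestamps of recent accesses

    def record_access(self, now):
        # Keep only accesses that fall inside the counting window.
        self.accesses = [t for t in self.accesses if now - t < self.window]
        self.accesses.append(now)
        if len(self.accesses) >= self.threshold:
            self.lifetime += self.extension
            self.accesses.clear()          # start counting again after an extension
        return self.lifetime


replica = ReplicaLifetime(initial_lifetime=60, window=10, threshold=3, extension=30)
history = [replica.record_access(t) for t in (0, 1, 2, 20)]
```

Here the third access within the 10-second window triggers one extension, and the isolated access at time 20 does not.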

In another implementation of this embodiment, the lifetime of the data block copy may also be set statically.

In addition, it should be noted that a lifetime is set only for the newly added data block copies; no lifetime is set for the copies formed during file creation. Likewise, only newly added data block copies are deleted; the copies formed during file creation are not deleted.

In this embodiment, the length of time that the data block copy is not accessed may be monitored by the first data node, or may be monitored by the name node.

In the scenario where the first data node monitors the length of time for which the data block copy has not been accessed, step S38 may include the following steps:

S381, the first data node monitors the duration of the unaccessed data block copies;

S382, when the duration of the unaccessed data block copy exceeds the lifetime of the data block copy, the first data node deletes the data block copy.

Specifically, the length of time for which the data block copy has not been accessed may be monitored by adding a copy manager (ReplicaManager) to the first data node.
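A copy manager of this kind could be sketched as below. The `ReplicaManager` class shown here is a hypothetical minimal version with a single static lifetime; it only tracks last-access times and reports which copies have expired, leaving the actual deletion and the second modification instruction to the caller.

```python
class ReplicaManager:
    """Tracks last-access times of newly added data block copies on one data node."""

    def __init__(self, lifetime):
        self.lifetime = lifetime         # static lifetime, in seconds
        self.last_access = {}            # block id -> time of last access

    def touch(self, block_id, now):
        """Record that the copy was created or accessed at time `now`."""
        self.last_access[block_id] = now

    def expire(self, now):
        """Return and forget the copies idle for longer than the lifetime (S381, S382)."""
        expired = [b for b, t in self.last_access.items() if now - t > self.lifetime]
        for block_id in expired:
            del self.last_access[block_id]
        return expired


mgr = ReplicaManager(lifetime=100)
mgr.touch("blk_1", now=0)
mgr.touch("blk_2", now=50)
first_sweep = mgr.expire(now=120)    # blk_1 has been idle for 120 s, blk_2 for 70 s
second_sweep = mgr.expire(now=151)   # blk_2 has now been idle for 101 s
```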

In the scenario where the name node monitors the length of time for which the data block copy has not been accessed, step S38 may include the following steps:

S383, the name node monitors the length of time for which the data block copy has not been accessed;

S384, when that length of time exceeds the lifetime of the data block copy, the name node sends a data block copy deletion instruction to the first data node, wherein the data block copy deletion instruction comprises the identifier of the data block copy to be deleted;

S385, the first data node receives the data block copy deletion instruction sent by the name node;

S386, the first data node deletes the data block copy according to the data block copy deletion instruction.

Specifically, when the name node monitors the length of time for which the data block copy has not been accessed, it may do so using the information about the data block copy contained in the heartbeat information that the first data node periodically sends to the name node.
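Steps S383 and S384 can be sketched on the name-node side as follows. The heartbeat layout (a node name plus a map from copy identifier to last access time) and the `deletion_instructions` function are assumptions made for this example; the source says only that the heartbeat carries related information about the data block copy.

```python
def deletion_instructions(heartbeat, lifetime, now):
    """Build deletion instructions for copies idle longer than `lifetime`.

    heartbeat: hypothetical dict {"node": str, "replicas": {copy id: last access time}}
    """
    return [
        {"node": heartbeat["node"], "replica_id": replica_id}
        for replica_id, last_access in sorted(heartbeat["replicas"].items())
        if now - last_access > lifetime
    ]


beat = {"node": "dn1", "replicas": {"blk_1_copy": 0, "blk_2_copy": 90}}
instructions = deletion_instructions(beat, lifetime=100, now=120)
```

Each returned entry carries the identifier of the copy to be deleted, matching the content of the deletion instruction in S384.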

In step S39, the first data node sends a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to reduce the replication factor of the target data block by one.

In practical applications, after the data block copy is deleted, the corresponding replication factor in the HDFS needs to be modified.

In step S40, the name node receives the second modification instruction sent by the first data node, and modifies the replication factor of the target data block according to the second modification instruction.

It should be noted that, in the HDFS, if the data node to which a task is allocated does not store a data block of the target file of the task, the task is called a non-local task; if the data node to which the task is allocated stores a data block of the target file of the task, the task is called a local task. To improve the data reading efficiency of the HDFS, it is often necessary to raise the proportion of local tasks in the HDFS.

In this embodiment, the fact that the first data node reads the target data block located on the second data node indicates that the HDFS has allocated a non-local task to the first data node. In the conventional HDFS, after the first data node reads the target data block on the second data node, the target data block is not copied to the first data node; if the first data node is allocated the same task again, it still needs to read the target data block from the second data node.

In this embodiment, the first data node generates a data block copy of the target data block, which increases the number of copies of the target data block in the HDFS. When tasks that read the target data block are allocated again, more of them are allocated to data nodes storing the target data block, so the task localization probability of the HDFS improves. At the same time, more tasks can read the target data block from the local data node, which improves the running speed of the HDFS and saves its network resources.

It should be noted that the HDFS always allocates local tasks to a data node first (the localization principle) and allocates non-local tasks only after the local tasks have been allocated. The data nodes storing the target data block are therefore always allocated tasks that read the target data block, and their load is heavy. In this embodiment, the first data node copies the target data block, so that the number of data nodes storing the target data block increases and part of the task load of the data nodes storing the target data block is shared, thereby achieving load balancing in the HDFS.

Example Four

An embodiment of the present invention provides a data node. Referring to FIG. 5, the data node includes: a receiving module 501, an accessing module 502, a generating module 503, and a first sending module 504.

A receiving module 501, configured to receive a data block read command, where the data block read command is used to instruct a data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same HDFS.

In this embodiment, the data node indicated by the data block read command may be the first data node in the first to third embodiments.

An accessing module 502, configured to access the target data block according to the data block read command.

In this embodiment, since the data node (i.e., the first data node) and the second data node are different data nodes in the same HDFS, when the data node accesses the target data block according to the data block read command, the data node needs to read from the second data node through the network.

A generating module 503, configured to generate a data block copy of the target data block on the data node.

A first sending module 504, configured to send a first modification instruction to the name node, where the first modification instruction is used to instruct the name node to increment the replication factor of the target data block by one.

In this embodiment, after the data node generates the data block copy of the target data block, the number of copies of the target data block in the HDFS has increased, so the replication factor information of the target data block stored in the name node needs to be updated.
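The cooperation of modules 501 to 504 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the classes `NameNode` and `DataNode`, the method names, and the use of a dictionary lookup in place of a network transfer are all assumptions made for the example.

```python
class NameNode:
    """Hypothetical stand-in that only tracks a replication factor per block."""
    def __init__(self):
        self.replication = {}

    def increment_replication(self, block_id):
        self.replication[block_id] = self.replication.get(block_id, 0) + 1


class DataNode:
    def __init__(self, name, name_node):
        self.name = name
        self.name_node = name_node
        self.blocks = {}  # block_id -> bytes stored locally

    def handle_read(self, block_id, second_node):
        # Receiving module: the read command names a block on another node.
        if block_id in self.blocks:
            return self.blocks[block_id]  # already local, nothing to copy
        # Accessing module: read the block from the second data node
        # (a dictionary lookup stands in for the network transfer).
        data = second_node.blocks[block_id]
        # Generating module: keep a data block copy on this node.
        self.blocks[block_id] = data
        # First sending module: send the first modification instruction.
        self.name_node.increment_replication(block_id)
        return data


nn = NameNode()
dn1, dn2 = DataNode("dn1", nn), DataNode("dn2", nn)
dn2.blocks["blk_1"] = b"payload"
nn.replication["blk_1"] = 1
result = dn1.handle_read("blk_1", dn2)
```

In this sketch a second read of the same block on `dn1` is served locally and does not send another modification instruction, so the replication factor is incremented once per generated copy.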

Specifically, the data node further includes a deleting module 505.

The deleting module 505 is configured to delete the data block copy from the data node when the length of time for which the data block copy has not been accessed exceeds the lifetime of the data block copy.

In this embodiment, a lifetime is set for the generated data block copy of the target data block, which prevents a data block copy that has not been accessed for a long time from occupying storage space on the data node.

In this embodiment, the length of time that the data block copy is not accessed may be monitored by the data node (i.e., the first data node), or may be monitored by the name node.

Further, when the name node monitors the length of time for which the data block copy has not been accessed, the deleting module 505 includes a receiving unit 515 and a first deleting unit 525.

The receiving unit 515 is configured to receive a data block copy deletion instruction sent by the name node, where the data block copy deletion instruction includes an identifier of the data block copy.

The first deleting unit 525 is configured to delete the data block copy from the data node according to the data block copy deletion instruction.

When the data node monitors the length of time for which the data block copy has not been accessed, the deleting module 505 includes a monitoring unit 535 and a second deleting unit 545.

The monitoring unit 535 is configured to monitor the length of time for which the data block copy has not been accessed.

The second deleting unit 545 is configured to delete the data block copy from the data node when the length of time during which the data block copy is not accessed exceeds the lifetime of the data block copy.
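The lifetime check performed by the monitoring unit and the second deleting unit can be sketched as below. The class `CopyTracker`, its method names and the use of explicit timestamps in seconds are assumptions made for illustration; the embodiment does not prescribe how the unaccessed duration is measured.

```python
class CopyTracker:
    """Tracks last-access times of data block copies (hypothetical helper)."""
    def __init__(self, lifetime_seconds):
        self.lifetime = lifetime_seconds
        self.last_access = {}  # copy_id -> time of last access

    def touch(self, copy_id, now):
        # Called when the copy is created or read.
        self.last_access[copy_id] = now

    def expired(self, now):
        # Copies whose unaccessed duration exceeds the lifetime.
        return [c for c, t in self.last_access.items()
                if now - t > self.lifetime]

    def delete_expired(self, storage, now):
        removed = self.expired(now)
        for copy_id in removed:
            storage.pop(copy_id, None)  # delete the copy from the data node
            del self.last_access[copy_id]
        return removed


storage = {"blk_1": b"a", "blk_2": b"b"}
tracker = CopyTracker(lifetime_seconds=60)
tracker.touch("blk_1", now=0)
tracker.touch("blk_2", now=0)
tracker.touch("blk_2", now=50)  # blk_2 is read again at t=50
# At t=70, blk_1 has been idle for 70 s and blk_2 for 20 s.
removed = tracker.delete_expired(storage, now=70)
```

The same comparison could equally run on the name node, which would then send the deletion instruction instead of deleting the copy itself.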

Further, the data node also includes a second sending module 506.

A second sending module 506, configured to send a second modification instruction to the name node, where the second modification instruction is used to instruct the name node to reduce the replication factor of the target data block by one.

In this embodiment, the data node (i.e., the first data node) copies the target data block and generates a data block copy of it, which increases the number of copies of the target data block in the HDFS. When tasks that read the target data block are allocated again, more of them can be allocated to data nodes that store the target data block. This raises the probability that HDFS tasks run locally: more tasks read the target data block from the local data node, which improves the running speed of the HDFS and saves its network resources.

In addition, in an HDFS system, local tasks are always allocated to a data node first (the localization principle), and non-local tasks are allocated only after the local tasks have been allocated. As a result, a data node that stores the target data block is always allocated tasks that read the target data block, and its load is high. In this embodiment, the data node (i.e., the first data node) copies the target data block, which increases the number of data nodes storing the target data block, so that part of the task load of the data nodes storing the target data block is shared, thereby achieving load balancing in the HDFS.

Example five

An embodiment of the present invention provides a name node. Referring to fig. 6, the name node includes: a first receiving module 601 and a first modifying module 602.

The first receiving module 601 is configured to receive a first modification instruction sent by a data node, where the first modification instruction is used to instruct the name node to increase by one the replication factor of the target data block that has been copied on the data node.

A first modification module 602, configured to modify the replication factor of the target data block according to a first modification instruction.

In practical applications, when the name node also receives periodic heartbeat information sent by the data node (the heartbeat information includes information about the data blocks on the data node), it compares this information with the data block information recorded on the name node and corrects the record, so that the recorded data block information is kept up to date.
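The comparison and correction step can be sketched as follows. The function `reconcile`, the shape of the record (block identifier mapped to a set of node names) and the assumption that a heartbeat carries the full list of block identifiers on the node are all hypothetical; they are chosen only to make the idea concrete.

```python
def reconcile(recorded, node, reported_blocks):
    """Correct the name node's record using one heartbeat.

    recorded: dict block_id -> set of node names believed to hold the block.
    reported_blocks: block ids listed in the heartbeat from `node`.
    """
    reported = set(reported_blocks)
    # Add the node to every block it reports.
    for block_id in reported:
        recorded.setdefault(block_id, set()).add(node)
    # Remove the node from blocks it no longer reports.
    for block_id, holders in recorded.items():
        if block_id not in reported:
            holders.discard(node)
    return recorded


recorded = {"blk_1": {"dn2"}, "blk_2": {"dn1", "dn2"}}
# dn1 now holds a copy of blk_1 and has deleted its copy of blk_2.
reconcile(recorded, "dn1", ["blk_1"])
```

After this call the record lists both `dn1` and `dn2` as holders of `blk_1` and only `dn2` as a holder of `blk_2`.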

Further, the name node further comprises: a second receiving module 603 and a second modifying module 604.

A second receiving module 603, configured to receive a second modification instruction sent by the first data node, where the second modification instruction is used to instruct the name node to reduce the replication factor of the target data block by one.

A second modification module 604, configured to modify the replication factor of the target data block according to a second modification instruction.
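How the name node might apply the two modification instructions can be sketched as below. The class `ReplicationTable`, the string tags `"first"` and `"second"`, and the lower bound of zero are assumptions for illustration; the embodiment only states that the replication factor is increased or reduced by one.

```python
class ReplicationTable:
    """Hypothetical name-node table of replication factors."""
    def __init__(self, factors=None):
        self.factors = dict(factors or {})

    def apply(self, instruction, block_id):
        current = self.factors.get(block_id, 0)
        if instruction == "first":      # a data node created a copy
            self.factors[block_id] = current + 1
        elif instruction == "second":   # a data node deleted its copy
            # Assumption: the factor never drops below zero.
            self.factors[block_id] = max(current - 1, 0)
        else:
            raise ValueError(f"unknown instruction: {instruction}")
        return self.factors[block_id]


table = ReplicationTable({"blk_1": 3})
after_copy = table.apply("first", "blk_1")
after_delete = table.apply("second", "blk_1")
```

A copy that is generated and later deleted therefore leaves the recorded replication factor at its original value.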

Example six

An embodiment of the present invention provides a data node. Referring to fig. 7, the data node includes:

A processor 701, a memory 702, a bus 703, and a communication interface 704; the memory 702 is used for storing computer-executable instructions, the processor 701 is connected to the memory 702 through the bus 703, and when the computer runs, the processor 701 executes the computer-executable instructions stored in the memory 702, so that the computer executes the method according to embodiment one or embodiment three.

Example seven

An embodiment of the present invention provides a name node, and referring to fig. 8, the name node includes:

A processor 801, a memory 802, a bus 803, and a communication interface 804; the memory 802 is used for storing computer-executable instructions, the processor 801 is connected to the memory 802 through the bus 803, and when the computer runs, the processor 801 executes the computer-executable instructions stored in the memory 802, so that the computer executes the method according to embodiment two or embodiment three.

Example eight

An embodiment of the present invention provides a system for dynamically redistributing data, and referring to fig. 9, the system includes: data node 50 as described in example four, and name node 60 as described in example five.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

It should be noted that, when the data node provided in the above embodiments dynamically redistributes data, the division into the functional modules described above is only an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data node and the method for dynamically redistributing data provided by the above embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not described here again.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for dynamic redistribution of data, the method comprising:

A first data node receives a data block reading command, wherein the data block reading command is used for indicating the first data node to read a target data block positioned on a second data node, and the second data node and the first data node are different data nodes in the same Hadoop Distributed File System (HDFS);

accessing the target data block according to the data block reading command;

Generating a data block copy of the target data block on the first data node;

sending a first modification instruction to a name node, wherein the first modification instruction is used for indicating the name node to increase the replication factor of the target data block by one;

Setting an initial survival time for the generated data block copy;

When the frequency or the times of accessing the data block copy in a set time length reaches a set value, prolonging the initial survival time as the new survival time of the data block copy;

periodically sending heartbeat information to the name node, wherein the heartbeat information is used for the name node to monitor the duration of the data block copy which is not accessed;

Receiving a data block copy deleting instruction sent by the name node, wherein the data block copy deleting instruction is sent when the duration of the data block copy which is not accessed exceeds the survival time of the data block copy, and the data block copy deleting instruction comprises an identifier of the data block copy;

2. The method of claim 1, further comprising:

Monitoring the duration of the unaccessed data block copies;

3. A method for dynamic redistribution of data, the method comprising:

modifying the replication factor of the target data block according to the first modification instruction;

Receiving heartbeat information from the first data node periodically, wherein the heartbeat information is used for the name node to monitor the duration of the data block copy which is not accessed;

monitoring the duration of the data block copy which is not accessed according to the heartbeat information;

when the duration of the data block copy which is not accessed exceeds the lifetime of the data block copy, sending a data block copy deleting instruction to the first data node, wherein the lifetime of the data block copy is obtained by prolonging the initial lifetime set for the data block copy, and the first data node is used for prolonging the initial lifetime as the new lifetime of the data block copy when the frequency or the number of times of accessing the data block copy in the set duration reaches a set value;

4. A data node, characterized in that the data node comprises:

the receiving module is used for receiving a data block reading command, the data block reading command is used for indicating the data node to read a target data block located on a second data node, and the second data node and the data node are different data nodes in the same Hadoop Distributed File System (HDFS);

A first sending module, configured to send a first modification instruction to a name node, where the first modification instruction is used to instruct the name node to increase a replication factor of the target data block by one;

The data node is further configured to set an initial survival time for the generated data block copy, and when the frequency or the number of times that the data block copy is accessed within a set time reaches a set value, the initial survival time is prolonged to be used as a new survival time of the data block copy;

The data node is further configured to periodically send heartbeat information to the name node, where the heartbeat information is used for the name node to monitor the length of time that the data block copy is not accessed, and to receive a data block copy deleting instruction sent by the name node, wherein the data block copy deleting instruction is sent when the length of time that the data block copy is not accessed exceeds the survival time of the data block copy, and the data block copy deleting instruction comprises an identifier of the data block copy;

A deleting module, configured to delete the data block copy from the data node according to the data block copy deleting instruction;

5. The data node of claim 4, wherein the deletion module comprises:

6. A name node, wherein said name node comprises:

a first modification module, configured to modify a replication factor of the target data block according to the first modification instruction;

The name node is further configured to periodically receive heartbeat information from the first data node, where the heartbeat information is used for the name node to monitor the unaccessed duration of the data block copy, and the unaccessed duration of the data block copy is monitored according to the heartbeat information; when the duration of the data block copy which is not accessed exceeds the lifetime of the data block copy, sending a data block copy deleting instruction to the first data node, wherein the lifetime of the data block copy is obtained by prolonging the initial lifetime set for the data block copy, and the first data node is used for prolonging the initial lifetime as the new lifetime of the data block copy when the frequency or the number of times of accessing the data block copy in the set duration reaches a set value;

7. A data node, characterized in that the data node comprises:

A processor, a memory, a bus, and a communication interface; the memory is used for storing computer execution instructions, the processor is connected with the memory through the bus, and when the computer runs, the processor executes the computer execution instructions stored by the memory so as to enable the computer to execute the method according to any one of claims 1-2.

8. A name node, wherein said name node comprises:

a processor, a memory, a bus, and a communication interface; the memory is used for storing computer-executable instructions, the processor is connected with the memory through the bus, and when the computer runs, the processor executes the computer-executable instructions stored by the memory so as to enable the computer to execute the method according to claim 3.

9. A system for dynamic redistribution of data, the system comprising a data node according to any of claims 4 to 5 and a name node according to claim 6.