CN115203177B - Distributed data storage system and storage method

Info

Publication number: CN115203177B
Application number: CN202211125471.6A
Authority: CN
Inventors: 王云
Original assignee: Beijing Zhiyue Network Technology Co ltd
Current assignee: Beijing Zhiyue Network Technology Co ltd
Priority date: 2022-09-16
Filing date: 2022-09-16
Publication date: 2022-12-06
Anticipated expiration: 2042-09-16
Also published as: CN115203177A

Abstract

The invention discloses a distributed data storage system and a storage method. The distributed data storage system comprises storage nodes, each provided with a processor and a memory and connected to one another through a network; a monitoring module that monitors and records the capacity occupancy rate and the resource utilization rate of each storage node; a calculation module that calculates, for each storage node, a migration time period based on the historical resource utilization rate; an evaluation module that screens the storage nodes according to the data provided by the calculation module and the capacity occupancy rate, and determines the storage nodes that data is to be migrated out of and into; and a migration module that migrates the stored data. By detecting the amount of data stored on each storage node and migrating data accordingly, the invention automatically adjusts the capacity occupancy rate and the resource utilization rate of each storage node, thereby balancing the load of the storage system.

Description

Distributed data storage system and storage method

Technical Field

The present invention relates to the field of distributed data storage technologies, and in particular, to a distributed data storage system and a storage method.

Background

A distributed data storage system comprises a plurality of storage nodes and management nodes. The storage nodes are responsible for storing, reading and writing files; the management nodes are responsible for distributing tasks to the storage nodes for execution so as to meet application requirements. Because data is distributed across different storage nodes, a storage node that holds more data is read correspondingly more often, and if the same storage node receives multiple access requests at the same time, its data read speed is liable to drop. To reduce the resource utilization rate of such a storage node, data on it needs to be migrated. However, data migration currently depends mainly on manual operation by operation and maintenance personnel, which is inefficient.

Disclosure of Invention

To solve the above problem, the present invention provides a distributed data storage system and a storage method, addressing the fact that in the prior art data migration in a distributed data storage system depends mainly on manual operation by operation and maintenance personnel and is therefore inefficient.

To achieve the above object, the present invention adopts the following technical solution. A distributed data storage method includes:

step S1: acquiring the capacity occupancy rate of each storage node, and defining the storage node of which the capacity occupancy rate exceeds a first threshold value as a high occupancy rate node, wherein the storage node of which the capacity occupancy rate is lower than a second threshold value is a low occupancy rate node, and the second threshold value is smaller than the first threshold value;

step S2: predicting, based on historical resource utilization rate data, the idle time period of each high-occupancy node, namely the period in which its resource utilization rate is lower than a preset resource utilization rate; if the idle time periods of a plurality of high-occupancy nodes fall within the same time period, executing step S3, and if not, executing step S4;

step S3: calculating the pressure value of each high-occupancy node by a first formula, and selecting the high-occupancy node with the largest pressure value for data migration, wherein the first formula combines the capacity occupancy rate, the read frequency of the high-occupancy node within 24 hours, and the amount of data that the high-occupancy node needs to migrate, each with a respective weighting coefficient;

step S4: determining the amount of data to be migrated from the high-occupancy node;

step S5: screening the low-occupancy nodes whose remaining storage capacity satisfies a second formula, wherein the second formula relates the second threshold, the amount of data currently stored on the low-occupancy node, the amount of data that the high-occupancy node needs to migrate, and the total capacity of the low-occupancy node;

step S6: selecting, from the low-occupancy nodes satisfying the second formula, the storage node most suitable for the high-occupancy node, and migrating the stored data of the high-occupancy node to that low-occupancy node;

step S7: repeating step S2 to step S6 until no high-occupancy node remains in the storage system or no low-occupancy node is suitable for receiving newly migrated data.

Further, in step S6, selecting a storage node most suitable for the high-occupancy node includes the following steps:

step S61: adding the historical resource utilization rates of the high-occupancy node and the low-occupancy node at corresponding time points to obtain the predicted resource utilization rate of the low-occupancy node at each time point after the stored data has been migrated to it, wherein the two summands respectively represent the resource utilization rates of the high-occupancy node and the low-occupancy node at the i-th time point of the past j-th day;

step S62: obtaining the average of the predicted resource utilization rates by a third formula, wherein m represents m days in total and n represents n time points per day;

step S63: setting a resource utilization rate threshold; establishing a rectangular coordinate system with time as the X axis and the resource utilization rate as the Y axis; plotting the resource utilization rate threshold and the predicted resource utilization rates in the coordinate system; fitting the coordinate points of the predicted resource utilization rates by a curve fitting method to obtain a curve function f(X); and calculating, by a fourth formula, the area S by which the region enclosed by the curve function and the X axis exceeds the region enclosed by the resource utilization rate threshold and the X axis, wherein the fourth formula is expressed in terms of the intersection points of the curve function with the resource utilization rate threshold, the resource utilization rate threshold itself, and a max function that returns the larger of its two parameters;

step S64: calculating the conflict score of each low-occupancy node by a fifth formula having respective weighting coefficients, wherein the low-occupancy node with the lowest conflict score is the most suitable storage node.

Further, before performing the step S61, the method further includes the following steps:

step S061: predicting the migration speed of the stored data based on the current network state, the size of the stored data, the hardware configuration of the storage node and the resource utilization rate of the storage node, and eliminating the low-occupancy node with the migration speed lower than the preset migration speed.

Further, after the step S61, the method further includes the following steps:

step S611: if the predicted resource utilization rate exceeds the upper limit of the resource utilization rate of the low-occupancy node, rejecting that low-occupancy node.

Further, in the storage data migration process, if the resource utilization rates of the high-occupancy-rate node and the low-occupancy-rate node are greater than the preset resource utilization rate threshold, the migration rate of the storage data is reduced.

Further, when data migration is not performed, the resource utilization rates of the high-occupancy nodes and the low-occupancy nodes are obtained at intervals of a first time, and when data migration is performed, the resource utilization rates of the high-occupancy nodes and the low-occupancy nodes are obtained at intervals of a second time, wherein the second time is less than the first time.

Further, a migration value upper limit is set, and data migration of the storage data with the data volume larger than the migration value upper limit is prohibited.

Further, the curve fitting method is the least squares method.

In another aspect, the invention further provides a distributed data storage system for implementing the distributed data storage method of the above technical solution, the system comprising:

storage nodes, each provided with a processor and a memory and connected to one another through a network;

a monitoring module, which monitors and records the capacity occupancy rate and the resource utilization rate of each storage node;

a calculation module, which calculates the idle time period of each storage node based on the historical resource utilization rate;

an evaluation module, which screens the storage nodes according to the data provided by the calculation module and the capacity occupancy rate, and determines the storage nodes that data is to be migrated out of and into; and

a migration module, which migrates the stored data.

Compared with the prior art, the invention has the following beneficial effects:

1. Each storage node is first classified by capacity occupancy rate into high-occupancy nodes and low-occupancy nodes, which identifies the nodes that need data migration. Historical resource utilization rate data is then used to predict the future resource utilization rate of each high-occupancy node, so that migration is not carried out while the node is busy serving reads, which would slow the migration down. By detecting the amount of data stored on each storage node and migrating data accordingly, the invention automatically adjusts the capacity occupancy rate and the resource utilization rate of each storage node, thereby balancing the load of the storage system.

2. If a plurality of high-occupancy nodes need data migration, the migration order must be determined. The nodes are ranked by their capacity occupancy rate, read frequency and amount of data to be migrated, so as to identify the high-occupancy node most in need of data migration.
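As a non-limiting illustration of this ranking, the following Python sketch assumes that the first formula is a weighted sum of the three factors, each normalised to the range 0 to 1 beforehand. The first formula itself is not reproduced in this text, so the weighted-sum form, the class and field names, and the weight values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HighOccupancyNode:
    name: str
    capacity_occupancy: float  # fraction of capacity in use, 0..1
    read_frequency_24h: float  # reads in the last 24 hours, normalised to 0..1 (assumption)
    data_to_migrate: float     # data volume to migrate, normalised to 0..1 (assumption)


def pressure_value(node, w_capacity=0.4, w_reads=0.4, w_volume=0.2):
    """Assumed form of the first formula: a weighted sum of the three factors."""
    return (w_capacity * node.capacity_occupancy
            + w_reads * node.read_frequency_24h
            + w_volume * node.data_to_migrate)


def pick_node_to_migrate(nodes):
    """Step S3: choose the high-occupancy node with the largest pressure value."""
    return max(nodes, key=pressure_value)


candidates = [
    HighOccupancyNode("node-a", 0.90, 0.20, 0.10),
    HighOccupancyNode("node-b", 0.85, 0.80, 0.30),
]
first_to_migrate = pick_node_to_migrate(candidates)
```

With these illustrative weights, node-b is ranked first even though node-a is fuller, because its read frequency dominates the score.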

Drawings

FIG. 1 is a flow chart of a distributed data storage method of the present invention;

FIG. 2 is a schematic diagram of the predicted resource utilization of the low-occupancy node of the present invention;

FIG. 3 is a curve-fitting graph of the predicted resource utilization of the low-occupancy node of the present invention.

In the figures: 1. high-occupancy node; 2. low-occupancy node.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

It will be understood that the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element, without departing from the scope of the present application.

As shown in FIG. 1, a distributed data storage method includes:

step S1: acquiring the capacity occupancy rate of each storage node, defining the storage nodes with the capacity occupancy rates exceeding a first threshold as high-occupancy rate nodes, defining the storage nodes with the capacity occupancy rates lower than a second threshold as low-occupancy rate nodes, and setting the second threshold smaller than the first threshold;

step S2: predicting, based on historical resource utilization rate data, the idle time period of each high-occupancy node, namely the period in which its resource utilization rate is lower than a preset resource utilization rate; if the idle time periods of a plurality of high-occupancy nodes fall within the same time period, executing step S3, and if not, executing step S4;

step S3: calculating the pressure value of each high-occupancy node by a first formula, and selecting the high-occupancy node with the largest pressure value for data migration, wherein the first formula combines the capacity occupancy rate, the read frequency of the high-occupancy node within 24 hours, and the amount of data that the high-occupancy node needs to migrate, each with a respective weighting coefficient;

step S4: determining the amount of data to be migrated from the high-occupancy node;

step S5: screening the low-occupancy nodes whose remaining storage capacity satisfies a second formula, wherein the second formula relates the second threshold, the amount of data currently stored on the low-occupancy node, the amount of data that the high-occupancy node needs to migrate, and the total capacity of the low-occupancy node;

step S6: selecting, from the low-occupancy nodes satisfying the second formula, the storage node most suitable for the high-occupancy node, and migrating the stored data of the high-occupancy node to that low-occupancy node;

step S7: repeating step S2 to step S6 until no high-occupancy node remains in the storage system or no low-occupancy node is suitable for receiving newly migrated data.

The method first classifies the storage nodes by capacity occupancy rate into high-occupancy nodes and low-occupancy nodes, which identifies the nodes that need data migration. Historical resource utilization rate data is then used to predict the future resource utilization rate of each high-occupancy node, so that migration is not carried out while the node is busy serving reads, which would slow the migration down. If a plurality of high-occupancy nodes need data migration, the migration order must be determined: if the system migrated data from several storage nodes at the same time, the CPU would be overloaded and the whole data storage system would stall.
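Steps S1 and S2 can be sketched as follows. The embodiment does not state how the idle time period is predicted, so this sketch assumes the simplest option: a time point is predicted to be idle when its mean historical utilization across the recorded days is below the preset value. The function names and threshold values are hypothetical.

```python
def classify_nodes(occupancy, first_threshold=0.8, second_threshold=0.5):
    """Step S1: split nodes into high-occupancy and low-occupancy groups.

    occupancy maps a node name to its capacity occupancy rate (0..1).
    """
    if second_threshold >= first_threshold:
        raise ValueError("the second threshold must be smaller than the first")
    high = [name for name, rate in occupancy.items() if rate > first_threshold]
    low = [name for name, rate in occupancy.items() if rate < second_threshold]
    return high, low


def predict_idle_points(history, preset_utilization=0.3):
    """Step S2 (assumed predictor): time points whose mean utilization is below the preset.

    history is a list of days, each a list of utilization values per time point.
    """
    points = len(history[0])
    means = [sum(day[i] for day in history) / len(history) for i in range(points)]
    return [i for i, mean in enumerate(means) if mean < preset_utilization]


occupancy = {"n1": 0.92, "n2": 0.35, "n3": 0.65}
high_nodes, low_nodes = classify_nodes(occupancy)

history_n1 = [[0.1, 0.5, 0.2], [0.2, 0.7, 0.5]]  # two days, three time points
idle_points = predict_idle_points(history_n1)
```

Node n3 falls between the two thresholds and is therefore neither a migration source nor a migration target.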

The ranking evaluates each node's capacity occupancy rate, read frequency and amount of data to be migrated. Regarding capacity occupancy rate: the data on a storage node is stored on disk, and in practice the read speed of a disk drops as the disk approaches saturation. Regarding read frequency: a high read frequency means the data is accessed often, so migrating it noticeably reduces the resource utilization rate of the original node. Regarding data volume: giving priority to stored data with a larger volume reduces the capacity occupancy rate of the storage node more quickly. Step S5 ensures that a low-occupancy node does not become a high-occupancy node after the stored data has been migrated to it. By detecting the amount of data stored on each storage node and migrating data accordingly, the invention automatically adjusts the capacity occupancy rate and the resource utilization rate of each storage node, thereby balancing the load of the storage system.
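The screening in step S5 can be sketched as below. The second formula is not reproduced in this text; the sketch assumes it requires the occupancy rate after migration, (stored + incoming) / total capacity, to stay strictly below the second threshold. The strict inequality, the function names and the sample figures are assumptions.

```python
def satisfies_second_formula(stored, incoming, total_capacity, second_threshold):
    """Assumed second formula: the node remains low-occupancy after receiving the data."""
    return (stored + incoming) / total_capacity < second_threshold


def screen_low_occupancy_nodes(low_nodes, incoming, second_threshold=0.5):
    """Step S5: keep the low-occupancy nodes that can absorb the incoming data.

    low_nodes maps a node name to a (stored, total_capacity) pair in the same unit.
    """
    return [
        name
        for name, (stored, total) in low_nodes.items()
        if satisfies_second_formula(stored, incoming, total, second_threshold)
    ]


low_nodes_capacity = {"n2": (300, 1000), "n4": (450, 1000)}
eligible = screen_low_occupancy_nodes(low_nodes_capacity, incoming=150)
```

Here n4 is rejected because receiving 150 units would raise its occupancy rate to 0.6, above the assumed second threshold of 0.5.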

In step S6, selecting a storage node most suitable for the high-occupancy node includes the following steps:

step S61: adding the historical resource utilization rates of the high-occupancy node and the low-occupancy node at corresponding time points to obtain the predicted resource utilization rate of the low-occupancy node at each time point after the stored data has been migrated to it, wherein the two summands respectively represent the resource utilization rates of the high-occupancy node and the low-occupancy node at the i-th time point of the past j-th day;

step S62: obtaining the average of the predicted resource utilization rates by a third formula, wherein m represents m days in total and n represents n time points per day;

step S63: setting a resource utilization rate threshold; establishing a rectangular coordinate system with time as the X axis and the resource utilization rate as the Y axis; plotting the resource utilization rate threshold and the predicted resource utilization rates in the coordinate system; fitting the coordinate points of the predicted resource utilization rates by a curve fitting method, specifically the least squares method, to obtain a curve function f(X); and calculating, by a fourth formula, the area S by which the region enclosed by the curve function and the X axis exceeds the region enclosed by the resource utilization rate threshold and the X axis, wherein the fourth formula is expressed in terms of the intersection points of the curve function with the resource utilization rate threshold, the resource utilization rate threshold itself, and a max function that returns the larger of its two parameters;

step S64: calculating the conflict score of each low-occupancy node by a fifth formula having respective weighting coefficients, wherein the low-occupancy node with the lowest conflict score is the most suitable storage node.
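The curve fitting in step S63 can be illustrated with a small least squares polynomial fit. The embodiment names the least squares method but not the family of curves, so the polynomial model and its degree are assumptions, as are the function names. The sketch solves the normal equations directly and uses only the standard library.

```python
def least_squares_polyfit(xs, ys, degree=2):
    """Return coefficients c[0..degree] of sum(c[k] * x**k) minimising the squared error."""
    size = degree + 1
    # Normal equations A c = b with A[r][k] = sum(x**(r+k)) and b[r] = sum(y * x**r).
    a = [[sum(x ** (r + k) for x in xs) for k in range(size)] for r in range(size)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial, i.e. the curve function f(X), at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


sample_x = [0, 1, 2, 3, 4]
sample_y = [1 + 2 * x + 3 * x ** 2 for x in sample_x]  # points lying on a known parabola
fitted = least_squares_polyfit(sample_x, sample_y, degree=2)
```

The fit needs at least degree + 1 distinct X values; with fewer, the normal equations are singular and the division fails.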

As shown in FIG. 2, step S61 first adds the acquired historical resource utilization rates of the high-occupancy node 1 and the low-occupancy node 2 at corresponding time points, which gives the predicted resource utilization rate of the low-occupancy node 2 in each corresponding time period after the stored data of the high-occupancy node 1 has been migrated to it. For example, in the present embodiment, historical resource utilization rate data is obtained for 24 time points per day over the last 3 days, and the data for corresponding times is added to obtain the predicted resource utilization rate data shown in FIG. 2.
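The addition in step S61, together with the rejection check of step S611, can be sketched as follows. The data layout (a list of days, each a list of utilization values per time point) and the function names are assumptions, and the sample uses 2 days of 3 time points rather than the 3 days of 24 time points of the embodiment.

```python
def predicted_utilization(high_history, low_history):
    """Step S61: add the two nodes' historical utilization at matching days and time points."""
    if len(high_history) != len(low_history):
        raise ValueError("both histories must cover the same number of days")
    return [
        [high + low for high, low in zip(high_day, low_day)]
        for high_day, low_day in zip(high_history, low_history)
    ]


def exceeds_upper_limit(predicted, upper_limit):
    """Step S611: True when any predicted value is above the node's upper limit."""
    return any(value > upper_limit for day in predicted for value in day)


high_history = [[0.2, 0.5, 0.1], [0.3, 0.4, 0.1]]
low_history = [[0.1, 0.1, 0.3], [0.1, 0.2, 0.2]]
predicted = predicted_utilization(high_history, low_history)
```

A low-occupancy node for which `exceeds_upper_limit` returns True would be dropped from the candidates before the later scoring steps.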

Step S62 adds all the predicted resource utilization rate data and divides the sum by 72 (3 days of 24 time points) to obtain the average predicted resource utilization rate. Step S63 plots the predicted resource utilization rates in the plane coordinate system and uses a curve fitting method to obtain the curve function closest to the trend of the coordinate points, as shown in FIG. 3; the resource utilization rate threshold is set and drawn in the same coordinate system, which gives the intersection points of the curve function with the resource utilization rate threshold.

For each interval between adjacent intersection points, the area enclosed by the curve function and the X axis is obtained by definite integration, and the area enclosed by the resource utilization rate threshold and the X axis over the same interval is subtracted from it.

Intervals in which this difference is less than 0, that is, intervals in which the area enclosed by the curve function and the X axis is smaller than that enclosed by the resource utilization rate threshold, are discarded. Finally, step S64 calculates the conflict score of each low-occupancy node 2, which identifies the low-occupancy node 2 best matched to the high-occupancy node 1, namely one whose average predicted resource utilization rate is low and whose read peaks overlap little with those of the migrated data, thereby improving the stability of the whole storage system.
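Steps S62 to S64 can be sketched under stated assumptions. Between adjacent intersection points the difference f(X) minus the threshold keeps one sign, so discarding the negative intervals amounts to integrating max(f(X) - threshold, 0); the sketch does this numerically with the trapezoidal rule over the whole plotted range rather than by closed-form definite integrals. The fifth formula is not reproduced in this text and is assumed here to be a weighted sum of the average and the area; all names and weights are hypothetical.

```python
def average_utilization(predicted):
    """Step S62: sum of all m*n predicted values divided by m*n."""
    values = [value for day in predicted for value in day]
    return sum(values) / len(values)


def area_above_threshold(curve, threshold, x_start, x_end, steps=10_000):
    """Step S63: approximate the integral of max(curve(x) - threshold, 0)."""
    width = (x_end - x_start) / steps
    total = 0.0
    previous = max(curve(x_start) - threshold, 0.0)
    for k in range(1, steps + 1):
        current = max(curve(x_start + k * width) - threshold, 0.0)
        total += 0.5 * (previous + current) * width  # trapezoid between grid points
        previous = current
    return total


def conflict_score(average, area, w_average=0.5, w_area=0.5):
    """Step S64 (assumed fifth formula): weighted sum; a lower score is a better fit."""
    return w_average * average + w_area * area


mean_value = average_utilization([[0.2, 0.4], [0.6, 0.8]])
excess_area = area_above_threshold(lambda x: x, threshold=1.0, x_start=0.0, x_end=2.0)
score = conflict_score(mean_value, excess_area)
```

The low-occupancy node for which this score is smallest would be selected as the migration target.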

Before step S61, the method further includes the following steps:

step S061: predicting the migration speed of the stored data based on the current network state, the size of the stored data, the hardware configuration of the storage node and the resource utilization rate of the storage node, and eliminating the low-occupancy-rate nodes with the migration speed lower than the preset migration speed.

This step avoids an excessively low migration speed, which would prolong the data migration so that the migration task occupies node resources for a long time and slows the system down.

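After step S61, the method further includes the following step:

step S611: if the predicted resource utilization rate exceeds the upper limit of the resource utilization rate of the low-occupancy node, rejecting that low-occupancy node.

During migration of the stored data, if the resource utilization rates of the high-occupancy node and the low-occupancy node are greater than a preset resource utilization rate threshold, the migration rate of the stored data is reduced.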

Reducing the migration rate of the stored data lowers the resource utilization rate of the nodes, which prevents the migration task from consuming too many resources and unduly slowing reads of other data on the nodes.

When data migration is not being performed, the resource utilization rates of the high-occupancy nodes and the low-occupancy nodes are obtained at intervals of a first time; when data migration is being performed, they are obtained at intervals of a second time, the second time being less than the first time.

Shortening the monitoring interval for the high-occupancy and low-occupancy nodes during data migration makes it possible to monitor more accurately whether the resource utilization rate has exceeded the preset resource utilization rate threshold.
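The rate reduction and the two monitoring intervals can be sketched together. The embodiment does not say by how much the migration rate is reduced or whether one overloaded node is enough to trigger the reduction; this sketch assumes a halving of the rate, a minimum rate, and a trigger when either node is above the threshold. The interval values and function names are hypothetical.

```python
def polling_interval(migrating, first_interval_s=60.0, second_interval_s=5.0):
    """Return the monitoring interval; the shorter second interval applies during migration."""
    if second_interval_s >= first_interval_s:
        raise ValueError("the second interval must be shorter than the first")
    return second_interval_s if migrating else first_interval_s


def next_migration_rate(current_rate, source_util, target_util,
                        threshold=0.8, backoff=0.5, min_rate=1.0):
    """Reduce the rate when either node is above the threshold (assumed policy)."""
    if source_util > threshold or target_util > threshold:
        return max(current_rate * backoff, min_rate)
    return current_rate


interval_idle = polling_interval(migrating=False)
interval_busy = polling_interval(migrating=True)
throttled = next_migration_rate(100.0, source_util=0.9, target_util=0.4)
unchanged = next_migration_rate(100.0, source_util=0.5, target_util=0.4)
```

A monitoring loop would call `next_migration_rate` once per `polling_interval(True)` seconds while a migration is in progress.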

A migration value upper limit is set, and migration of stored data whose volume exceeds the upper limit is prohibited. In general, the resource utilization rate rises in proportion to the capacity occupied by the stored data. Migrating excessively large stored data to another node would therefore tend to leave the destination node with a high resource utilization rate, and the migration itself would be costly, so excessively large stored data is not suitable for migration.
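A minimal sketch of this upper limit follows. It assumes stored data is tracked as named items with a known size; the function name, item names and sizes are hypothetical.

```python
def split_by_migration_limit(items, upper_limit):
    """Separate items that may be migrated from those above the migration value upper limit.

    items maps an item name to its data volume, in the same unit as upper_limit.
    """
    allowed = {name: size for name, size in items.items() if size <= upper_limit}
    blocked = {name: size for name, size in items.items() if size > upper_limit}
    return allowed, blocked


stored_items = {"logs": 40, "images": 120, "archive": 900}
allowed, blocked = split_by_migration_limit(stored_items, upper_limit=500)
```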

A distributed data storage system for implementing the above distributed data storage method comprises: storage nodes, each provided with a processor and a memory and connected to one another through a network; a monitoring module, which monitors and records the capacity occupancy rate and the resource utilization rate of each storage node; a calculation module, which calculates the idle time period of each storage node based on the historical resource utilization rate; an evaluation module, which screens the storage nodes according to the data provided by the calculation module and the capacity occupancy rate, and determines the storage nodes that data is to be migrated out of and into; and a migration module, which migrates the stored data.
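How these modules might cooperate over the repeated steps S2 to S6 can be sketched as a simplified loop. This is not the embodiment's implementation: the pressure value is replaced by the occupancy rate, the conflict score by the destination's occupancy rate, and the data volume of step S4 by the amount needed to bring the source down to a hypothetical target occupancy rate. All names and thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    stored: float
    capacity: float

    @property
    def occupancy(self):
        return self.stored / self.capacity


def rebalance(nodes, first_threshold=0.8, second_threshold=0.5, target=0.7):
    """Migrate until no high-occupancy node remains or no low-occupancy node fits (step S7)."""
    moves = []
    while True:
        high = [n for n in nodes if n.occupancy > first_threshold]
        low = [n for n in nodes if n.occupancy < second_threshold]
        if not high or not low:
            return moves
        source = max(high, key=lambda n: n.occupancy)      # stand-in for the pressure value
        volume = source.stored - target * source.capacity  # stand-in for step S4
        fits = [n for n in low
                if (n.stored + volume) / n.capacity < second_threshold]  # step S5 (assumed form)
        if not fits:
            return moves
        destination = min(fits, key=lambda n: n.occupancy)  # stand-in for the conflict score
        source.stored -= volume
        destination.stored += volume
        moves.append((source.name, destination.name, volume))


cluster = [Node("a", 900, 1000), Node("b", 200, 1000),
           Node("c", 850, 1000), Node("d", 100, 1000)]
migrations = rebalance(cluster)
```

The loop terminates because each pass brings one source below the first threshold while the destination stays below the second threshold, so the number of high-occupancy nodes strictly decreases.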

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and these changes and modifications are all within the scope of the invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A distributed data storage method, comprising:

step S1: acquiring the capacity occupancy rate of each storage node, defining the storage node with the capacity occupancy rate exceeding a first threshold as a high-occupancy rate node, defining the storage node with the capacity occupancy rate lower than a second threshold as a low-occupancy rate node, wherein the second threshold is smaller than the first threshold;

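step S2: predicting, based on historical resource utilization rate data, the idle time period in which the resource utilization rate of each high-occupancy node is lower than a preset resource utilization rate; if the idle time periods of a plurality of high-occupancy nodes fall within the same time period, executing step S3, and if not, executing step S4;

step S3: calculating the pressure value of each high-occupancy node by a first formula, and selecting the high-occupancy node with the largest pressure value for data migration, wherein the first formula combines the capacity occupancy rate, the read frequency of the high-occupancy node over the past 24 hours, and the amount of data that the high-occupancy node needs to migrate, each with a respective weighting coefficient;

step S4: determining the amount of data to be migrated from the high-occupancy node;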

step S5: screening the low-occupancy nodes whose remaining storage capacity satisfies a second formula, wherein the second formula relates the second threshold, the amount of data currently stored on the low-occupancy node, the amount of data that the high-occupancy node needs to migrate, and the total capacity of the low-occupancy node;

step S6: selecting, from the low-occupancy nodes satisfying the second formula, the storage node most suitable for the high-occupancy node, and migrating the stored data of the high-occupancy node to that low-occupancy node;

step S7: repeating the step S2 to the step S6 until the high-occupancy-rate nodes do not exist in the storage system any more or all the low-occupancy-rate nodes are not suitable for migrating new storage data any more;

wherein, in step S6, selecting the storage node most suitable for the high-occupancy node comprises the following steps:

step S61: adding the historical resource utilization rates of the high-occupancy node and the low-occupancy node at corresponding time points to obtain the predicted resource utilization rate of the low-occupancy node at each time point after the stored data has been migrated to it, wherein the two summands respectively represent the resource utilization rates of the high-occupancy node and the low-occupancy node at the i-th time point of the past j-th day;

step S62: obtaining the average of the predicted resource utilization rates by a third formula, wherein m represents the total number of past days over which data is acquired and n represents the number of time points acquired per day;

step S63: setting a resource utilization rate threshold; establishing a rectangular coordinate system with time as the X axis and the resource utilization rate as the Y axis; plotting the resource utilization rate threshold and the predicted resource utilization rates in the coordinate system; fitting the coordinate points of the predicted resource utilization rates by a curve fitting method to obtain a curve function f(X); and calculating, by a fourth formula, the area S by which the region enclosed by the curve function and the X axis exceeds the region enclosed by the resource utilization rate threshold and the X axis, wherein the fourth formula is expressed in terms of the intersection points of the curve function with the resource utilization rate threshold, the resource utilization rate threshold itself, and a max function that returns the larger of its two parameters;

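step S64: calculating the conflict score of each low-occupancy node by a fifth formula having respective weighting coefficients, wherein the low-occupancy node with the lowest conflict score is the most suitable storage node.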

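2. A distributed data storage method according to claim 1, further comprising, before performing step S61, the step of:

step S061: predicting the migration speed of the stored data based on the current network state, the size of the stored data, the hardware configuration of the storage node and the resource utilization rate of the storage node, and eliminating the low-occupancy nodes whose migration speed is lower than a preset migration speed.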

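3. The distributed data storage method according to claim 2, further comprising, after said step S61, the step of:

step S611: if the predicted resource utilization rate exceeds the upper limit of the resource utilization rate of the low-occupancy node, rejecting that low-occupancy node.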

4. The distributed data storage method according to claim 1, wherein in a storage data migration process, if the resource utilization rates of the high-occupancy node and the low-occupancy node are greater than the preset resource utilization rate threshold, a migration rate of storage data is reduced.

5. The distributed data storage method according to claim 1, wherein the resource utilization rates of the high-occupancy nodes and the low-occupancy nodes are obtained at intervals of a first time when data migration is not performed, and the resource utilization rates of the high-occupancy nodes and the low-occupancy nodes are obtained at intervals of a second time when data migration is performed, the second time being less than the first time.

6. The distributed data storage method according to claim 1, wherein an upper migration value limit is set, and data migration of storage data with a data volume greater than the upper migration value limit is prohibited.

7. A distributed data storage method as claimed in claim 1, wherein said curve fitting method is a least squares method.

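8. A distributed data storage system for implementing a distributed data storage method as claimed in any one of claims 1 to 7, comprising:

storage nodes, each provided with a processor and a memory and connected to one another through a network;

a monitoring module, which monitors and records the capacity occupancy rate and the resource utilization rate of each storage node;

a calculation module, which calculates the idle time period of each storage node based on the historical resource utilization rate;

an evaluation module, which screens the storage nodes according to the data provided by the calculation module and the capacity occupancy rate, and determines the storage nodes that data is to be migrated out of and into; and

a migration module, which migrates the stored data.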