CN115599296A

CN115599296A - Automatic node expansion method and system for distributed storage system

Info

Publication number: CN115599296A
Application number: CN202211188692.8A
Authority: CN
Inventors: 李明; 韩德志; 吴中岱; 王骏翔
Original assignee: Shanghai Maritime University; Cosco Shipping Technology Co Ltd
Current assignee: Shanghai Maritime University; Cosco Shipping Technology Co Ltd
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2023-01-13

Abstract

The invention relates to a method and a system for automatically stretching and retracting nodes of a distributed storage system, wherein the method comprises the following steps: acquiring network flow data and making the network flow data into a data set; building a deep learning network model for regression prediction; training a deep learning network model by using a data set to obtain a traffic prediction model for predicting network traffic; the method comprises the steps of obtaining real-time network flow, obtaining future network flow through a flow prediction model, calculating storage data requirements corresponding to the future network flow, calculating the number of nodes required by the hadoop according to the storage data requirements, keeping the number of the nodes of the hadoop unchanged if the number of the current nodes of the hadoop is matched with the number of the required nodes, and adjusting the number of the current nodes of the hadoop to the number of the matched required nodes if the number of the current nodes of the hadoop is not matched with the number of the required nodes. Compared with the prior art, the distributed storage and processing are carried out based on hadoop, the elastic expansion of the number of the nodes is automatically carried out by learning and predicting the flow trend through a deep learning algorithm, the storage service quality is ensured, and the storage resources of a distributed storage system are effectively saved.

Description

Automatic node telescoping method and system for distributed storage system

Technical Field

The invention relates to the field of data analysis and distributed storage, in particular to a method and a system for automatically stretching and retracting nodes of a distributed storage system.

Background

The distributed File System is one of Hadoop core components and exists as a distributed storage service at the bottom layer. The problem solved by distributed file systems is large data storage. They are storage systems that span across multiple computers. Distributed file systems have a wide range of applications in the big data age, providing the needed scalability for storing and processing very large scale data.

The Hadoop cluster size and processing capacity are usually set according to the demand of cluster resources in the peak period, so that the cluster resources are inevitably wasted in the low peak period. At present, research on automatic stretching of distributed nodes exists, but most of the existing research at present sets an elastic stretching threshold value by depending on experience and requirements, dynamic stretching of the dynamic distributed nodes according to requirements is not really realized, and the stretching method causes resource waste to a certain extent.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provide a method and a system for automatically stretching nodes of a distributed storage system.

The purpose of the invention can be realized by the following technical scheme:

a node automatic scaling method of a distributed storage system comprises the following steps:

acquiring network flow data and making the network flow data into a data set;

building a deep learning network model for regression prediction;

training a deep learning network model by using the data set to obtain a traffic prediction model for predicting network traffic;

acquiring real-time network flow, obtaining future network flow through the flow prediction model, calculating a storage data requirement corresponding to the future network flow, calculating the number of nodes required by hadoop according to the storage data requirement, if the number of nodes required by hadoop is matched with the number of nodes required by hadoop, keeping the number of nodes of hadoop unchanged, repeating the step, otherwise, adjusting the number of nodes required by hadoop to the number of nodes required by matching, and repeating the step.

Further, the deep learning network model comprises an input layer, a first feature extraction layer, a second feature extraction layer, a first GRU layer, a second GRU layer, a loss function layer and an output layer, wherein the input layer is used for acquiring input data, the first feature extraction layer and the second feature extraction layer respectively obtain a first feature value and a second feature value after performing feature extraction on the input data, the first feature value and the second feature value are input into the first GRU layer after dimension addition and dimension transformation, the output of the first GRU layer is sent into the second GRU layer, the output of the second GRU layer is sent into the output layer, and the loss function layer is used for performing parameter optimization on the deep learning network model.

Further, the principle of the deep learning network model is as follows:

x _j2 ＝f(p(x _i ))

f(*)＝max(0，*)

x _j ＝cat(x _j1 ，x _j2 )

x _t ＝reshape(x _j )

y′ _i ＝GRU1(x _t )

y _i ＝GRU2(x _t )

wherein m is the number of neurons in the input layer, x _i Representing input data, w _i，j Representing weights in the first feature extraction layer, b _j Representing a threshold value, x, in a first feature extraction layer _j1 Representing a first characteristic value, x _j2 Representing a second characteristic value, p () representing a max pooling operation, x _j Denotes a value obtained by dimension-adding the first characteristic value and the second characteristic value, cat () denotes addition by dimension, x _t Represents values of the first and second feature values after dimension transformation, reshape () represents dimension transformation, GRU1 () and GRU2 () represent operations of the first and second GRU layers, y' _i Denotes the output of the first GRU layer, y _i Representing the output of the second GRU layer and MAE the loss function.

Further, the parameters of the deep learning network model are optimally determined by a modified gray wolf algorithm as follows:

(1) initializing wolf clusters

Wherein N represents the scale of the wolf colony,

represents the position vector of the ith wolf, i is more than or equal to 1 and less than or equal to N,

the dimensionality number of the initial coefficient vector is the same as the number of parameters to be optimized in the deep learning network model and corresponds to the parameters one by one, and the coefficient vector is initialized

Coefficient vector

And attenuation coefficient

Determining the maximum iteration times and the upper and lower bounds of the parameters to be optimized;

(2) calculating the fitness value of each wolf in the wolf group, sequentially setting three wolfs with the optimal fitness value as alpha wolfs, beta wolfs and delta wolfs, and setting the rest wolfs as omega wolfs, if the preset convergence condition is met, executing the step (5), otherwise, executing the step (3);

(3) updating the position of the omega wolf, wherein the position updating formula is as follows:

wherein the content of the first and second substances,

indicating the location of the tth iteration of the individual wolf,

coefficient vectors representing alpha wolf, beta wolf and delta wolf,

indicating the position of the t-th iteration of alpha wolf, beta wolf and delta wolf,

represents the distance between the individual grey wolf and alpha wolf at the t-th iteration,

represents the distance between the individual grey wolf and beta wolf at the t-th iteration,

represents the distance between the grey wolf individual and the delta wolf at the t-th iteration, subscripts i =1, 2, 3 correspond to alpha wolf, beta wolf and delta wolf, respectively, f (×) represents the fitness function;

(4) updating a coefficient vector

Coefficient vector

And (3) executing the step (2), and updating the formula as follows:

wherein iter represents the current number of iterations, max _ iter represents the maximum number of iterations,

and

each represents [0,1]A randomly generated vector;

(5) and taking the position of the alpha wolf as an optimal parameter value of the deep learning network model.

Further, the first GRU layer and the second GRU layer have the same structure, and the parameters to be optimized in the deep learning network model include: the number of neurons in the first GRU layer and the second GRU layer and the size of the convolution kernel.

Further, the calculation method of the fitness value is as follows: substituting parameters corresponding to the position vectors of the wolfs into the deep learning network model, training the deep learning network model by using a data set, calculating a loss function in each training, and solving a fitness value based on the loss function, wherein the calculation formula of the fitness value is as follows:

wherein epsilon is a set hyper-parameter, K represents the final training times of the deep learning network model, MAE _k Representing the loss function for the kth iteration.

Further, the output of the traffic prediction model is a network traffic prediction value { f) in a time period T ₁ ，f ₂ ，...，f _N N represents the traffic number, and the future network traffic is determined based on the network traffic predicted value in the time period T as follows:

taking the peak value f of the flow in the time period T _max Taking the minimum flow value f in the time period T _min Taking the average value f of the flow in the time period T _avg Then, the calculation formula of the future network traffic f is as follows:

f＝w ₁ f _max +w ₂ f _min +w ₃ f _avg

wherein, w ₁ 、w ₂ And w ₃ Representing the weight coefficients.

Further, w ₃ ＝0.5，w ₁ 、w ₂ The calculation formula of (2) is as follows:

wherein n is _h Is { f ₁ ，f ₂ ，...，f _N Higher than f _avg Number of flows of n _l Is { f ₁ ，f ₂ ，...，f _N Lower than f in _avg The number of flows of (c).

Further, the "number of nodes currently adapted to the required number of nodes by hadoop" is specifically:

acquiring the current node number D of hadoop ₁ Obtaining the number of required nodes D ₂ If D is ₁ -λ≤D ₂ ≤D ₁ + lambda, the number of hadoop current nodes is adapted to the number of required nodes, wherein lambda is a preset elastic expansion threshold;

the "adjusting the current node number of hadoop to the adaptation requirement node number" specifically includes:

obtaining an adjusting step length;

if the current node number of the hadoop is smaller than the required node number, increasing the current node number according to the adjusting step length, judging whether the increased current node number is matched with the required node number, if so, finishing the adjustment, otherwise, repeating the step;

if the current node number of the hadoop is larger than the required node number, reducing the current node number according to the adjustment step length, judging whether the reduced current node number is matched with the required node number, if so, finishing the adjustment, otherwise, repeating the step.

A node auto-scaling system for a distributed storage system, comprising:

the traffic prediction module is loaded with a trained traffic prediction model and used for obtaining future network traffic according to the real-time network traffic;

the demand node calculation module is used for calculating the storage data demand corresponding to the future network flow and calculating the demand node number of hadoop according to the storage data demand;

the adaptation judging module is used for judging whether the current node number of the hadoop is adapted to the required node number;

the node number adjusting module is used for adjusting the current node number of the hadoop;

the flow prediction model is trained as follows:

acquiring network flow data and making the network flow data into a data set;

building a deep learning network model for regression prediction;

and training a deep learning network model by using the data set to obtain a flow prediction model for network flow prediction.

Compared with the prior art, the invention has the following beneficial effects:

(1) The change trend of the network flow data is automatically sensed through a deep learning algorithm, and the number of nodes required by the distributed file system is automatically adjusted according to the change trend, so that the real dynamic expansion of the nodes along with the network flow data is realized, the storage service quality is ensured, and the storage resources of the distributed storage system are effectively saved.

(2) The improved wolf algorithm is utilized to carry out parameter optimization of the deep learning network model, the problems of long time and large required storage space during model training are effectively solved, and the prediction precision of the model can be effectively improved after the neuron number and the convolution kernel size of the GRU are optimized.

(3) When the future network flow is calculated, the maximum value, the minimum value and the average value of the flow are comprehensively considered, and the weight is determined according to the number of nodes which are higher than the average value of the flow and lower than the average value of the flow, so that the future network flow is more accurate.

Drawings

FIG. 1 is a flow chart of a method for node auto-scaling;

FIG. 2 is a schematic structural diagram of a deep learning network model;

FIG. 3 is a flow chart of the improved Grey wolf algorithm optimization parameters.

Detailed Description

The invention is described in detail below with reference to the figures and the specific embodiments. The present invention is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, it is obvious that the described embodiment is only a part of the embodiment of the present invention, not all embodiments, and the protection scope of the present invention is not limited to the following embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic may be included in at least one implementation of the invention. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another, and are not to be construed as indicating or implying relative importance.

The present specification provides method steps as in the examples or flow diagrams, but may include more or fewer steps based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or server product execution, the method shown in the embodiment or the figures can be executed sequentially or in parallel (for example, in the environment of parallel processors or multi-thread processing), or the execution sequence of steps without timing limitation can be adjusted.

Example 1:

the application provides a node automatic telescoping method of a distributed storage system, as shown in fig. 1, including the following steps:

s1, acquiring network flow data and making the network flow data into a data set;

acquiring network flow data in a period of time, recording related information including year, month, day, week, hour, minute, second, millisecond, flow number at the moment and the like, determining the length of a training sample, determining a predicted time step, intercepting the network flow data according to a certain time length, and carrying out preprocessing operations such as normalization, vacancy value processing and the like, thereby completing construction of a data set

S2, building a deep learning network model for regression prediction;

in this application, the deep learning network model that constructs includes the input layer, first characteristic extraction layer, the second characteristic extraction layer, first GRU layer, the second GRU layer, loss function layer and output layer, as shown in fig. 2, wherein, the input layer is used for acquireing the input data, first characteristic extraction layer and second characteristic layer obtain first eigenvalue and second eigenvalue respectively after carrying out the characteristic extraction to the input data, first eigenvalue and second eigenvalue input first GRU layer after dimension addition and dimension transform, the output of first GRU layer is sent into the second GRU layer, the output of second GRU layer is sent into the output layer, loss function layer is used for carrying out parameter optimization to the deep learning network model.

In the application, the first feature extraction layer is a maximum pooling layer, the second feature extraction layer is a convolution layer, the first GRU layer and the second GRU layer have the same structure, and a GRU (Gate recovery Unit) is one of the Recurrent neural networks, and is provided for solving the problems of Long-Term Memory, gradient in back propagation and the like, like an LSTM (Long-Short Term Memory). The input and output structure of the GRU is the same as that of a common RNN, with inputs: input x at time t ^t Hidden layer State H at time t-1 _t-1 And (3) outputting: output y of hidden node at time t ^t Hidden state h passed to the next node ^t 。

r _t ＝σ(x _t W _xr +H _t-1 W _hr +b _r )

z _t ＝σ(x _t W _xz +H _t-1 W _hz +b _z )

y ^t ＝W _o ·H _t

GRU has only two doors, r _t The updating gate is obtained by combining an input gate and a forgetting gate in the LSTM for the GRU, and is used for controlling the data size of the previous memory information which can be continuously reserved to the current moment, or determining how much information of the previous time step and the current time step is to be continuously transmitted to the future; z is a radical of formula _t The gate is reset for the other gate, which controls how much past information is to be forgotten.

Represents a candidate hidden state, H _t Representing the final hidden state.

Specifically, the principle of the deep learning network model is as follows:

x _j2 ＝f(p(x _i ))

f(*)＝max(0，*)

x _j ＝cat(x _j1 ，x _j2 )

x _t ＝reshape(x _j )

y′ _i ＝GRU1(x _t )

y _i ＝GRU2(x _t )

wherein m is the number of neurons in the input layer, x _i Representing input data, w _i，j Representing weights in the first feature extraction layer, b _j Representing a threshold value, x, in a first feature extraction layer _j1 Representing a first characteristic value, x _j2 Representing a second characteristic value, p () representing a max pooling operation, x _j Representing the value of the first and second eigenvalues after dimensional addition, cat () representing the addition by dimension, x _t Representing the first and second eigenvalues passing throughA dimension-transformed value, reshape () represents a dimension transformation, and GRU1 () and GRU2 () represent operations of the first and second GRU layers, y' _i Denotes the output of the first GRU layer, y _i Denotes the output of the second GRU layer, MAE denotes the loss function, f (x) _i ) Indicates the result of the prediction, y _i Representing the real result of the recording.

S3, training the deep learning network model by using the data set to obtain a traffic prediction model for predicting network traffic;

generally, parameters of a deep learning network model are set empirically or randomly, and often need to be tried for finding optimal parameters many times, and in order to find optimal parameters faster, many researchers have proposed that optimization algorithms such as particle swarm and genetic algorithms can be used for parameter optimization, so as to determine optimal network parameters. The gray wolf optimization (GWOO) is defined as a group intelligent optimization model which has the advantages of flexibility, simplicity and a non-derivative mechanism. Some studies have found that GWO has better numerical properties than other traditional optimization models, can prevent local optimization, and is considered a convenient stochastic approach to solving highly nonlinear, multivariate, and multimodality optimization problems. GWO is a new metaheuristic technique originally proposed by milr gialy et al and inspired by the hunting and social class of grey wolves. The gray wolf group is essentially divided into four classes, named alpha (α), beta (β), delta (σ), and omega (ω), respectively. The top level of the hierarchy is the alpha wolf. In addition, alpha wolf also makes critical decisions. The second level in the grey wolf's scale is the beta wolf, which acts as a mentor and provides feedback to authorize and assist the alpha wolf. On the other hand, the deltoid wolf obeys the commands of the alpha wolf and the beta wolf, and governs the omega wolf. The major stages of the grayish water hunting process include killing, surrounding and attacking the game. In the GWO algorithm, the hierarchy is the most suitable solution, the second best solution, the third best solution, and the remaining candidates.

However, the inventor finds that, in actual operation, when the traditional gray wolf algorithm is directly applied to perform parameter optimization on the deep learning network model designed by the application, the parameter optimization is easy to fall into a local optimal solution, and the optimization effect is poor. Aiming at the problems, the inventor improves the gray wolf algorithm, and utilizes the improved gray wolf algorithm to automatically perform parameter optimization, so that the parameter optimization is more reliable and accurate, and as shown in fig. 3, the parameter optimization of the deep learning network model by using the improved gray wolf algorithm is as follows:

(1) initializing wolf clusters

Wherein, N represents the scale of the wolf colony,

Coefficient vector

And attenuation coefficient

Determining the maximum iteration times and the upper and lower bounds of the parameters to be optimized; in this embodiment, in the deep learning network model, the parameters to be optimized include the number of neurons in the first GRU layer and the second GRU layer and the size of a convolution kernel, the number of graybugs is set to 50, the maximum iteration time is set to 100, the lower bound is set to-20, the upper bound is set to 20, and the number of dimensions is set to 3;

the calculation method of the fitness value is as follows: substituting parameters corresponding to the position vectors of the wolfs into the deep learning network model, training the deep learning network model by using a data set, calculating a loss function in each training, and solving a fitness value based on the loss function, wherein the calculation formula of the fitness value is as follows:

wherein epsilon is a set hyper-parameter, K represents the final training times of the deep learning network model, MAE _k Representing the loss function for the k-th iteration.

in the traditional gray wolf optimization algorithm, the position of omega wolf

Is updated by the formula

In the present application, the new location of the ω wolf in the population is determined according to the fitness scores of the alpha, beta, and delta wolf as follows:

wherein, the first and the second end of the pipe are connected with each other,

indicating the location of the tth iteration of the individual wolf,

coefficient vectors representing alpha wolf, beta wolf and delta wolf,

indicating the position of the t-th iteration of the alpha, beta and delta wolves,

denotes the distance between the individual grey wolfs and the delta wolf at the t-th iteration, the indices i =1, 2, 3 correspond to alpha wolfs, beta wolfs and delta wolfs, respectively, f (—) denotes the fitness function;

(4) updating a coefficient vector

Coefficient vector

And (3) executing the step (2), and updating the formula as follows:

and

each represents [0,1]In the traditional gray wolf optimization algorithm, attenuation coefficient

Is a linearly decreasing vector from 2 to 0, albeit with parameters

In the traditional GWO algorithm, linear decline exists, but in many problems, the exploration and development behaviors of the algorithm need to change in a non-linear way to be far away from the local optimal solution, and therefore the attenuation coefficient is adjusted by the method

The following:

The improvement of the grey wolf algorithm in the application is as follows:

1. when the position of each gray wolf is updated, the test vectors are not averaged, but the weighted sum of the test vectors is used for updating each position according to the scoring weights of the three leading gray wolfs, and the updating mechanism is favorable for improving the exploration and development capacity of the algorithm and the capacity of searching for the optimal GUR parameters, so that the accuracy of the deep learning network model provided by the application is improved.

2. Coefficient of attenuation

Adjusted, although this parameter drops linearly in the original GWO algorithm, in many problems the exploration and development behavior of the algorithm needs to change non-linearly to get away from the locally optimal solution, in order to avoid the algorithm getting into the local optimal, the present application redesigns the algorithm

The update rule of (1).

Through the modification of the two points, the improved wolf algorithm is more suitable for optimizing the deep learning network model provided by the application. And a large number of experiments prove that the improved GWO algorithm has a good outstanding optimization effect in optimizing the deep learning network model provided by the application, the original GWO algorithm has a poor optimization effect, the allowable exploration duration of the transition process can be prolonged by 30% (the exploration time of the original GWO process accounts for about 35% of the total time, and the exploration time accounts for 65% after improvement), the mean square error of about 1% of the prediction result is finally reduced, and the optimal solution can be found only by 90% of iterations of the original GWO.

In the prior art, although a traffic prediction algorithm based on deep learning is researched, the traffic prediction algorithm is usually limited to the prediction capability of a deep learning model, the accuracy is low, and the optimal parameters are difficult to train.

The network traffic data change contains a large amount of information, namely whether the network is attacked or not can be understood, the task amount needing to be processed in the network can also be understood, and the characteristic that the network traffic change is more and more complex along with the continuous development of the network is more and more difficult to understand, so that the traffic prediction is carried out by means of deep learning according to the deep mining of the potential information related to the traffic task of the traffic data change.

It can be understood that, in deep learning network model training, whether the deep learning network model is valid or not may be determined by using a difference between a predicted traffic sequence and a real traffic sequence, and a data set may be further divided into a training set, a verification set, a test set, and the like, which are not described herein again.

And S4, acquiring real-time network flow, obtaining future network flow through a flow prediction model, calculating a storage data requirement corresponding to the future network flow, calculating the number of nodes required by the hadoop according to the storage data requirement, if the number of the current nodes of the hadoop is matched with the number of the required nodes, keeping the number of the nodes of the hadoop unchanged, repeating the step, otherwise, adjusting the number of the current nodes of the hadoop to the number of the matched required nodes, and repeating the step.

Wherein, the output of the flow prediction model is a network flow prediction value { f) in a time period T ₁ ，f ₂ ，...，f _N And N represents the flow number, if T can take 1s, and the time granularity is set to 1ms, the flow number N is 1000, and each time granularity has a flow predicted value. Determining the future network flow based on the network flow predicted value in the time period T as follows:

f＝w ₁ f _max +w ₂ f _min +w ₃ f _avg

wherein w ₁ 、w ₂ And w ₃ Representing the weight coefficients.

w ₃ ＝0.5，w ₁ 、w ₂ The calculation formula of (2) is as follows:

Wherein n is _h Is { f ₁ ，f ₂ ，...，f _N Higher than f _avg Number of flows of n _l Is { f ₁ ，f ₂ ，...，f _N Lower than f in _avg The number of flows of (c). When the future network flow is calculated, the maximum value, the minimum value and the mean value of the flow are comprehensively considered, and the weight is determined according to the number of nodes higher than the mean value of the flow and lower than the mean value of the flow, so that the future network flow is more accurate

The specific judgment of the adaptive required node number of the current node number of hadoop is as follows: acquiring the current node number D of hadoop ₁ Obtaining the number of required nodes D ₂ If D is ₁ -λ≤D ₂ ≤D ₁ + lambda, the number of hadoop current nodes is adapted to the number of required nodes, where lambda is a preset elastic expansion threshold, and in this embodiment, lambda is set to be 3/4 of the number of hadoop current nodes;

the "adjusting the current number of nodes of hadoop to the number of nodes required for adaptation" specifically includes: acquiring an adjusting step length; if the current node number of the hadoop is smaller than the required node number, increasing the current node number according to the adjusting step length, judging whether the increased current node number is matched with the required node number, if so, finishing the adjustment, otherwise, repeating the step; if the current node number of hadoop is larger than the required node number, reducing the current node number according to the adjustment step length, judging whether the reduced current node number is matched with the required node number, if so, finishing the adjustment, otherwise, repeating the step.

Wherein, the adjusting step length can be set to be a proper size, so that the number of the nodes is gradually increased or decreased until the requirement is met.

The invention analyzes the network flow for the first time to predict the demand of data on the distributed storage nodes, automatically expands the nodes, and simultaneously carries out secondary adjustment on partial expansion results in order to save resources and not influence the storage requirement, thereby really realizing dynamic expansion of the distributed nodes and avoiding the waste of node resources as much as possible.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. As such, the software programs (including associated data structures) of the present application can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal bearing medium and/or stored in a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.

The present application further provides an automatic node expansion system of a distributed storage system, including:

the flow prediction model is trained as follows:

acquiring network flow data and making the network flow data into a data set;

building a deep learning network model for regression prediction;

and training the deep learning network model by using the data set to obtain a traffic prediction model for predicting network traffic.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A node automatic scaling method of a distributed storage system is characterized by comprising the following steps:

acquiring network flow data and making the network flow data into a data set;

building a deep learning network model for regression prediction;

training a deep learning network model by using the data set to obtain a flow prediction model for network flow prediction;

acquiring real-time network flow, obtaining future network flow through the flow prediction model, calculating a storage data requirement corresponding to the future network flow, calculating the number of nodes required by the hadoop according to the storage data requirement, if the number of the current nodes of the hadoop is matched with the number of the required nodes, keeping the number of the nodes of the hadoop unchanged, repeating the step, otherwise, adjusting the number of the current nodes of the hadoop to the number of the matched required nodes, and repeating the step.

2. The method according to claim 1, wherein the deep learning network model comprises an input layer, a first feature extraction layer, a second feature extraction layer, a first GRU layer, a second GRU layer, a loss function layer and an output layer, wherein the input layer is used for acquiring input data, the first feature extraction layer and the second feature layer perform feature extraction on the input data to respectively obtain a first feature value and a second feature value, the first feature value and the second feature value are input into the first GRU layer after dimension addition and dimension conversion, the output of the first GRU layer is sent into the second GRU layer, the output of the second GRU layer is sent into the output layer, and the loss function layer is used for performing parameter optimization on the deep learning network model.

3. The method of claim 2, wherein the deep learning network model is based on the following principles:

x _j2 ＝f(p(x _i ))

f(*)＝max(0，*)

x _j ＝cat(x _j1 ，x _j2 )

x _t ＝reshape(x _j )

y′ _i ＝GRU1(x _t )

y _i ＝GRU2(x _t )

wherein m is the number of neurons in the input layer, x _i Representing input data, w _i，j Representing weights in the first feature extraction layer, b _j Representing a first feature extractionThreshold in layer, x _j1 Representing a first characteristic value, x _j2 Representing a second characteristic value, p () representing a max pooling operation, x _j Representing the value of the first and second eigenvalues after dimensional addition, cat () representing the addition by dimension, x _t Represents values of the first and second feature values after dimension transformation, reshape () represents dimension transformation, GRU1 () and GRU2 () represent operations of the first and second GRU layers, y' _i Denotes the output of the first GRU layer, y _i The output of the second GRU layer is represented and MAE represents the penalty function.

4. The method of claim 3, wherein the parameters of the deep learning network model are determined by an improved Hurrill algorithm optimization as follows:

(1) initializing wolf clusters

Wherein, N represents the scale of the wolf colony,

Coefficient vector

And attenuation coefficient

indicating the position of the t-th iteration of the wolf individual,

coefficient vectors representing alpha wolf, beta wolf and delta wolf,

(4) updating a coefficient vector

Coefficient vector

And (3) executing the step (2), and updating the formula as follows:

and

each represents [0,1 ]]Medium randomA generated vector;

5. The method of claim 4, wherein the first GRU layer and the second GRU layer have the same structure, and the deep learning of the parameters to be optimized in the network model comprises: the number of neurons in the first GRU layer and the second GRU layer and the size of the convolution kernel.

6. The method according to claim 5, wherein the fitness value is calculated in a manner that: substituting parameters corresponding to the position vectors of the wolfs into the deep learning network model, training the deep learning network model by using a data set, calculating a loss function in each training, and solving a fitness value based on the loss function, wherein the calculation formula of the fitness value is as follows:

7. The method according to claim 1, wherein the output of the traffic prediction model is a predicted value { f } of network traffic in a time period T ₁ ，f ₂ ，...，f _N And N represents the flow number, and the future network flow is determined based on the network flow predicted value in the time period T as follows:

f＝w ₁ f _max +w ₂ f _min +w ₃ f _avg

wherein, w ₁ 、w ₂ And w ₃ Representing the weight coefficients.

8. The method of claim 7, wherein w is the number of nodes in the distributed storage system ₃ ＝0.5，w ₁ 、w ₂ The calculation formula of (c) is:

9. The method according to claim 1, wherein the "hadoop current node number adaptation required node number" is specifically:

acquiring the current node number D of hadoop ₁ Obtaining the number D of required nodes ₂ If D is ₁ -λ≤D ₂ ≤D ₁ + lambda, the number of hadoop current nodes is adapted to the number of required nodes, wherein lambda is a preset elastic expansion threshold;

obtaining an adjusting step length;

if the current node number of hadoop is larger than the required node number, reducing the current node number according to the adjustment step length, judging whether the reduced current node number is matched with the required node number, if so, finishing the adjustment, otherwise, repeating the step.

10. A node automatic telescoping system of a distributed storage system, comprising:

the flow prediction model is trained as follows:

acquiring network flow data and making the network flow data into a data set;

building a deep learning network model for regression prediction;