CN115061978A - Construction method of hadoop parameter optimization model - Google Patents

Construction method of hadoop parameter optimization model

Info

Publication number
CN115061978A
Authority
CN
China
Prior art keywords
node
time
hadoop
server
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210671845.8A
Other languages
Chinese (zh)
Inventor
付学良
罗小玲
潘新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia Agricultural University
Original Assignee
Inner Mongolia Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia Agricultural University filed Critical Inner Mongolia Agricultural University
Priority to CN202210671845.8A
Publication of CN115061978A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of distributed processing, and in particular to a method for constructing a hadoop parameter optimization model, comprising the following steps: collecting, with a server, the data volume generated by each data source within a certain time; analyzing the characteristics of each individual data source, assigning characteristic values in proportion to the data volume generated, and using the server to estimate the scale of the file to be processed from those characteristic values; having the server collect, with a certain time as the period, the resource storage amount of each preparation node in its normal running state and group the nodes accordingly; having the server estimate the number of nodes and the processing time from the scale of the file to be processed; and having the server adjust the hadoop parameters according to the estimated number of nodes and processing time. By analyzing the characteristics of the data sources to assign characteristic values, estimating the file scale from those values, grouping the hadoop distributed nodes, and adjusting the hadoop parameters according to the file scale and the node groups, the method saves the resources of hadoop projects.

Description

Construction method of hadoop parameter optimization model
Technical Field
The invention relates to a hadoop optimization method, in particular to a construction method of a hadoop parameter optimization model.
Background
As the volume of data generated by information systems keeps expanding, hadoop has been widely applied as an important means of processing large files, and in practice the adjustment of hadoop configuration parameters plays a crucial role in overall operating efficiency and resource utilization. Chinese patent publication No. CN104317610A discloses "a method and apparatus for automatic installation and deployment of a hadoop platform", which loads hadoop end nodes with a host cluster and adjusts the necessary parameters to default values. Chinese patent publication No. CN103064664A discloses "an automatic Hadoop parameter optimization method and system based on performance estimation", which adjusts Hadoop parameters by running simulated jobs on Hadoop projects, thereby reducing cost. Chinese patent publication No. CN104750780A discloses "a Hadoop configuration parameter optimization method based on statistical analysis", which classifies applications with strong characteristics and establishes a prediction model to guide Hadoop parameter optimization.
It can be seen that the above methods and systems share the following problem: when the project's information sources are in varied states, the scale of the project is difficult to judge, making it difficult for hadoop parameter optimization to achieve the goal of saving resources.
Disclosure of Invention
Therefore, the invention provides a method for constructing a hadoop parameter optimization model, which solves the problem in the prior art that, when the project's information sources are in varied states, the scale of the project is difficult to judge and hadoop parameter optimization can hardly achieve the goal of saving resources.
In order to achieve the above object, the present invention provides a method for constructing a hadoop parameter optimization model, comprising:
step S1, collecting, with a server, the data volume generated by each data source within a certain time, and analyzing the maximum and minimum amounts of data generated by a single data source within a preset time;
step S2, analyzing the characteristics of each single data source empirically and inputting them into the server, the server assigning characteristic values to the data sources in proportion to the data volume they generate and estimating the scale of the file to be processed from the characteristic values;
step S3, the server collecting, with a certain time as the period, the resource storage amount of each preparation node in its normal running state, and grouping the preparation nodes by resource storage amount and time;
step S4, the server pre-estimates the number of nodes and the processing time according to the scale of the file to be processed;
and step S5, adjusting hadoop parameters according to the node number and the processing time estimated by the server.
Further, the data volume generated by a data source in a preset period is D, and D changes regularly within a preset time T;
for the data amount Dij generated by the ith data source in the jth of the time periods into which the preset time T is evenly divided, there are a maximum value maxDij and a minimum value minDij, with i = 1, 2, 3, …, N and j = 1, 2, 3, …, m.
Further, the maximum data amount generated by the data sources within the preset time T is denoted maxDT, the minimum data amount generated within the preset time T is denoted minDT, and maxDT = maxDij × N × m, minDT = minDij × N × m;
taking minDT as the standard file scale, a hadoop standard parameter A corresponding to minDT is set, and the preparation nodes are grouped with A as the reference according to their running states.
Further, for a single preparation node, the resource storage amount R of the node has a highest value maxR within the preset time T; a first preset resource storage amount R1 and a second preset resource storage amount R2 are set, where R1 = 0.3 × maxR and R2 = 0.7 × maxR,
if R < R1, the server judges that the preparation node's resources are insufficient and records the node's time period under this condition as an unavailable period;
if R1 ≤ R < R2, the server judges that the preparation node's resource storage amount is low and records the node's time period under this condition as an inefficient period;
if R ≥ R2, the server judges that the preparation node's resource storage amount is high and records the node's time period under this condition as an efficient period.
Further, for the kth preparation node, its state Pkj within the jth time period is assigned a value, where k = 1, 2, 3, …, n:
if the period is recorded as an unavailable period, Pkj is assigned the value 0,
if the period is recorded as an inefficient period, Pkj is assigned the value 1,
if the period is recorded as an efficient period, Pkj is assigned the value 2,
and, using the standard parameter A, the preparation nodes are grouped into groups of the optimal node number NA corresponding to A, so that a group of nodes, given their Pkj states, completes an item of data size minDT within the optimal execution time tA.
Further, for the kth preparation node,
when j + t ≤ T,
if Pkj = Pk,j+1 = … = Pk,j+t, the server records the node as a stable node for the period (j, j + t) and includes it in a group;
if Pkj = Pk,j+1 = … = Pk,j+t = 0, the server regards the node as an unavailable node for the period (j, j + t);
and when j + t > T, the server determines that the node is unavailable.
Further, the hadoop parameter corresponding to the mean (maxDT + minDT)/2 of the maximum data amount maxDT and the minimum data amount minDT is denoted A', where the optimal node number for A' is NA' and the optimal execution time is tA'; a function of the optimal node number and the optimal execution time is set as f(D) = N × t,
and after the data volume D is obtained, the server judges the adjustment mode of the working hadoop parameters according to the function f(D).
Furthermore, the data amount Dij corresponds to a characteristic attribute of time period j, and this characteristic attribute influences the scale of Dij; within the preset time T, each Dij is multiplied by the proportion that the characteristic attribute of its period j takes within T, and the products are summed to obtain an approximation of DT, which is used as the estimated data amount and serves as reference data for hadoop parameter optimization.
Further, the node group operates normally provided that no single node is assigned more than 70% of the maximum data amount maxDT.
Further, the preset time has no continuity, and the processing time has continuity.
Compared with the prior art, the method has the advantages that the characteristic values are given by analyzing the characteristics of the data source, the file scale is estimated according to the characteristic values, the hadoop distributed nodes are grouped, the node groups are distributed according to the file scale, and the parameters of the hadoop are adjusted according to the file scale and the node groups, so that the resources of hadoop projects are saved.
Furthermore, by means of classifying the data volume generated by the data source according to time and scale, the data scale judgment error caused by unclear classification of the data source is avoided, and meanwhile, the classification efficiency of the data source is effectively improved, so that the resources of a hadoop project are further saved.
Furthermore, by means of the mode that the maximum scale and the minimum scale which can be reached by the acquired data source within a certain time are pre-estimated, and the minimum-scale data volume is set as a reference group, the calculation power waste caused by no reference in file segmentation is avoided, meanwhile, the reasonability of the distribution of calculation power resources is effectively improved, and therefore the resources of a hadoop project are further saved.
Furthermore, the nodes with larger workload except the hadoop project are separated from the nodes with smaller workload by utilizing the node classification mode, so that the working reliability of the nodes is effectively improved while the time waste caused by insufficient processing capacity of a single node is avoided, and the resources of the hadoop project are further saved.
Furthermore, the nodes are classified by assigning the working states of the nodes, so that the uneven processing capacity of the nodes is avoided, the working reliability of the nodes is effectively improved, and the resources of hadoop projects are further saved.
Furthermore, by comparing each node's assigned state values with its working periods, the period in which a node works most efficiently is determined and the nodes are grouped accordingly, which avoids project duration increases caused by differing node processing efficiency and effectively improves the reliability of node operation, thereby further saving the resources of the hadoop project.
Furthermore, the completion result of the hadoop project is estimated in a mode of setting functions of the number of nodes, the execution time and the data volume, the calculation amount is reduced, and meanwhile, resource shortage or waste caused by inaccurate estimation of the data scale is avoided, so that the resources of the hadoop project are further saved.
Furthermore, the pre-estimation attribute of the data source is adjusted through the time attribute, so that the problem that the pre-estimation is inaccurate due to abnormal increase of data amount caused by special time is avoided, the stability of the hadoop project is improved, and the resources of the hadoop project are further saved.
Furthermore, by obtaining the maximum data volume, uncertainty in project estimation caused by node groups that are too small is avoided, and the system's resistance to interference is improved, thereby saving the resources of hadoop projects.
Furthermore, through the continuity of the set time, the data volume is prevented from being abnormally increased due to continuous information collection, and meanwhile, the stability of the hadoop project is improved, so that the resources of the hadoop project are further saved.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing a hadoop parameter optimization model according to the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Fig. 1 is a schematic flow chart of a method for constructing a hadoop parameter optimization model according to the present invention, which includes:
step S1, collecting, with a server, the data volume generated by each data source within a certain time, and analyzing the maximum and minimum amounts of data generated by a single data source within a preset time;
step S2, analyzing the characteristics of each single data source empirically and inputting them into the server, the server assigning characteristic values to the data sources in proportion to the data volume they generate and estimating the scale of the file to be processed from the characteristic values;
step S3, the server collecting, with a certain time as the period, the resource storage amount of each preparation node in its normal running state, and grouping the preparation nodes by resource storage amount and time;
step S4, the server pre-estimates the number of nodes and the processing time according to the scale of the file to be processed;
and step S5, the server adjusts the hadoop parameters according to the number of nodes and the processing time estimated by the server.
The method comprises the steps of giving a characteristic value by analyzing the characteristics of a data source, estimating the file scale according to the characteristic value, grouping hadoop distributed nodes, distributing node groups according to the file scale, and adjusting the parameters of hadoops according to the file scale and the node groups, so that the resources of hadoop projects are saved.
Specifically, the data volume generated by a data source in a preset period is D, and D changes regularly within the preset time T;
for the data amount Dij generated by the ith data source in the jth of the time periods into which the preset time T is evenly divided, there are a maximum value maxDij and a minimum value minDij, with i = 1, 2, 3, …, N and j = 1, 2, 3, …, m.
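Purely as an illustration (not part of the claimed method), the collection of Dij and its per-source extremes could be sketched in Python as follows; the function name collect_extremes is hypothetical, and the sketch assumes the measured amounts are already available as a dictionary keyed by (source index i, period index j):

```python
def collect_extremes(samples, n_sources, m_periods):
    """samples: dict mapping (i, j) -> data amount Dij for source i in period j."""
    max_dij, min_dij = {}, {}
    for i in range(1, n_sources + 1):
        amounts = [samples[(i, j)] for j in range(1, m_periods + 1)]
        max_dij[i] = max(amounts)   # maxDij for source i
        min_dij[i] = min(amounts)   # minDij for source i
    return max_dij, min_dij
```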
By means of the mode that the data volume generated by the data source is classified according to time and scale, the data scale judgment error caused by unclear classification of the data source is avoided, meanwhile, the classification efficiency of the data source is effectively improved, and therefore resources of a hadoop project are further saved.
Specifically, the maximum data amount generated by the data sources within the preset time T is denoted maxDT, the minimum data amount generated within the preset time T is denoted minDT, and maxDT = maxDij × N × m, minDT = minDij × N × m;
taking minDT as the standard file scale, a hadoop standard parameter A corresponding to minDT is set, and the preparation nodes are grouped with A as the reference according to their running states.
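Continuing the illustration, the scale bounds and the standard parameter set A tied to minDT could be sketched as follows. Interpreting maxDij and minDij as the largest and smallest per-source, per-period amounts observed is an assumption, and the contents of STANDARD_PARAMS_A are placeholders, since the patent does not list which hadoop parameters A contains:

```python
def scale_bounds(max_dij, min_dij, n_sources, m_periods):
    # maxDT = maxDij * N * m, minDT = minDij * N * m (per the formulas above)
    max_dt = max(max_dij.values()) * n_sources * m_periods
    min_dt = min(min_dij.values()) * n_sources * m_periods
    return max_dt, min_dt

# Placeholder standard parameter set A chosen for the minDT scale.
STANDARD_PARAMS_A = {"optimal_node_number_NA": 4, "optimal_execution_time_tA": 60}
```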
The maximum scale and the minimum scale which can be reached by the data source within a certain time are obtained for pre-estimation, and the minimum-scale data volume is set as a reference group, so that the calculation power waste caused by no reference in file segmentation is avoided, and meanwhile, the reasonability of the allocation of calculation power resources is effectively improved, and the resources of a hadoop project are further saved.
Specifically, for a single preparation node, the resource storage amount R of the node has a highest value maxR within the preset time T; a first preset resource storage amount R1 and a second preset resource storage amount R2 are set, where R1 = 0.3 × maxR and R2 = 0.7 × maxR,
if R < R1, the server judges that the preparation node's resources are insufficient and records the node's time period under this condition as an unavailable period;
if R1 ≤ R < R2, the server judges that the preparation node's resource storage amount is low and records the node's time period under this condition as an inefficient period;
if R ≥ R2, the server judges that the preparation node's resource storage amount is high and records the node's time period under this condition as an efficient period.
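A minimal sketch of this classification with R1 = 0.3 × maxR and R2 = 0.7 × maxR; the function name classify_period is hypothetical:

```python
def classify_period(r, max_r):
    r1, r2 = 0.3 * max_r, 0.7 * max_r
    if r < r1:
        return "unavailable"   # resources insufficient
    if r < r2:
        return "inefficient"   # resource storage amount is low
    return "efficient"         # resource storage amount is high
```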
By means of node classification, nodes with larger workload except the hadoop project are separated from nodes with smaller workload, time waste caused by insufficient processing capacity of a single node is avoided, reliability of node work is effectively improved, and resources of the hadoop project are further saved.
Specifically, for the kth preparation node, its state Pkj within the jth time period is assigned a value, where k = 1, 2, 3, …, n:
if the period is recorded as an unavailable period, Pkj is assigned the value 0,
if the period is recorded as an inefficient period, Pkj is assigned the value 1,
if the period is recorded as an efficient period, Pkj is assigned the value 2,
and, using the standard parameter A, the preparation nodes are grouped into groups of the optimal node number NA corresponding to A, so that a group of nodes, given their Pkj states, can complete a project with data volume minDT within the optimal execution time tA.
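As an illustration only, the state assignment and the grouping into groups of NA nodes might be sketched as follows; STATE_VALUE, state_values and form_group are hypothetical names, and taking the first NA candidates is an assumption, since the patent does not specify how the NA nodes are chosen:

```python
# Map each recorded period label to the state value Pkj (0/1/2).
STATE_VALUE = {"unavailable": 0, "inefficient": 1, "efficient": 2}

def state_values(period_labels):
    """period_labels: dict mapping (k, j) -> label for node k in period j."""
    return {(k, j): STATE_VALUE[label] for (k, j), label in period_labels.items()}

def form_group(candidate_nodes, na):
    """Take NA candidate nodes as one group (selection policy assumed)."""
    return candidate_nodes[:na]
```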
The nodes are classified by assigning the working states of the nodes, so that the uneven processing capacity of the nodes is avoided, the working reliability of the nodes is effectively improved, and the resources of hadoop projects are further saved.
Specifically, for the kth preparation node,
when j + t ≤ T,
if Pkj = Pk,j+1 = … = Pk,j+t, the server records the node as a stable node for the period (j, j + t) and includes it in a group;
if Pkj = Pk,j+1 = … = Pk,j+t = 0, the server regards the node as an unavailable node for the period (j, j + t);
and when j + t > T, the server determines that the node is unavailable.
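A minimal sketch of this window check follows, assuming the per-period states Pkj of one node are stored in a dictionary keyed by the period index; the "unstable" label for a non-constant window is an assumption, since the patent only names the stable and unavailable cases:

```python
def window_status(p_k, j, t, preset_t):
    """p_k: dict mapping period index -> Pkj for a single node k."""
    if j + t > preset_t:
        return "unavailable"            # window runs past the preset time T
    window = [p_k[x] for x in range(j, j + t + 1)]
    if len(set(window)) != 1:
        return "unstable"               # state changes within (j, j + t)
    return "unavailable" if window[0] == 0 else "stable"
```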
By comparing each node's assigned state values with its working periods, the period in which a node works most efficiently is determined and the nodes are grouped accordingly, which avoids project duration increases caused by differing node processing efficiency and effectively improves the reliability of node operation, thereby further saving the resources of the hadoop project.
Specifically, the hadoop parameter corresponding to the mean (maxDT + minDT)/2 of the maximum data amount maxDT and the minimum data amount minDT is denoted A', where the optimal node number for A' is NA' and the optimal execution time is tA'; a function of the optimal node number and the optimal execution time is set as f(D) = N × t,
and after the data volume D is obtained, the adjustment mode of the working hadoop parameters can be judged from f(D).
The completion result of the hadoop project is estimated by setting functions of the number of nodes, the execution time and the data volume, so that the amount of calculation is reduced, and simultaneously, the resource shortage or waste caused by inaccurate estimation of the data scale is avoided, and the resource of the hadoop project is further saved.
Specifically, the data amount Dij corresponds to a characteristic attribute of time period j, and this characteristic attribute influences the scale of Dij; within the preset time T, each Dij is multiplied by the proportion that the characteristic attribute of its period j takes within T, and the products are summed to obtain an approximation of DT, which is used as the estimated data amount and serves as reference data for hadoop parameter optimization. For example, when the information of a shopping application over 7 consecutive days is processed, with 2 rest days, 1 special shopping festival, 1 travel discount day and 4 working days, the server may assign a characteristic value of 1 to a working day, 1.5 to a rest day, 5 to the special shopping festival and 0.5 to the travel discount day, so that the data volume Di' generated by the data source over the 7 consecutive days is Di' = Di × (1 × 4 + 1.5 × 2 + 0.5 × 1 + 5 × 1); if the travel discount day overlaps with one of the rest days, Di' is computed with the combined characteristic value of the overlapping day (formula given as an image in the original).
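As an illustration of the weighted estimate in the example above, a sketch; DAY_WEIGHTS and estimate_window_volume are hypothetical names:

```python
# Characteristic values from the 7-day shopping example above.
DAY_WEIGHTS = {"working": 1.0, "rest": 1.5, "festival": 5.0, "travel_discount": 0.5}

def estimate_window_volume(base_amount, day_types):
    """day_types: day-type labels for the window; returns Di' = Di * sum of weights."""
    return base_amount * sum(DAY_WEIGHTS[d] for d in day_types)

# Di' = Di * (1*4 + 1.5*2 + 0.5*1 + 5*1) for the example week.
week = ["working"] * 4 + ["rest"] * 2 + ["travel_discount", "festival"]
```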
The pre-estimation attribute of the data source is adjusted through the time attribute, so that the problem that pre-estimation is inaccurate due to abnormal increase of data amount caused by special time is avoided, the stability of the hadoop project is improved, and the resources of the hadoop project are further saved.
Specifically, the node group can operate normally provided that no single node is assigned more than 70% of the maximum data amount maxDT. By obtaining the maximum data volume, uncertainty in project estimation caused by node groups that are too small is avoided, and the system's resistance to interference is improved, thereby saving the resources of hadoop projects.
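A trivial sketch of this check, assuming node_shares holds the planned data amount per node (hypothetical name):

```python
def group_ok(node_shares, max_dt):
    """True if no single node's share exceeds 70% of maxDT."""
    return all(share <= 0.7 * max_dt for share in node_shares)
```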
Specifically, the preset time has no continuity, and the processing time has continuity. Through the continuity of the set time, the data volume is prevented from being abnormally increased due to continuous information collection, and meanwhile, the stability of the hadoop project is improved, so that the resources of the hadoop project are further saved.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for constructing a hadoop parameter optimization model is characterized by comprising the following steps:
step S1, collecting, with a server, the data volume generated by each data source within a certain time, and analyzing the maximum and minimum amounts of data generated by a single data source within a preset time;
step S2, analyzing the characteristics of each single data source empirically and inputting them into the server, the server assigning characteristic values to the data sources in proportion to the data volume they generate and estimating the scale of the file to be processed from the characteristic values;
step S3, the server collecting, with a certain time as the cycle, the resource storage amount of each preparation node in each time period under normal operation, and grouping the preparation nodes according to the resource storage amount and the time;
step S4, the server pre-estimates the number of nodes and the processing time according to the scale of the file to be processed;
and step S5, adjusting hadoop parameters according to the node number and the processing time estimated by the server.
2. The method for constructing the hadoop parameter optimization model according to claim 1, wherein the data volume generated by the data source in a preset period is D, and D changes regularly in a preset time T;
for the data amount Dij generated by the ith data source in the jth of the time periods into which the preset time T is evenly divided, there are a maximum value maxDij and a minimum value minDij, with i = 1, 2, 3, …, N and j = 1, 2, 3, …, m.
3. The method for constructing the hadoop parameter optimization model according to claim 2, wherein the maximum data amount generated by the data sources within the preset time T is denoted maxDT, the minimum data amount generated within the preset time T is denoted minDT, and maxDT = maxDij × N × m, minDT = minDij × N × m;
taking minDT as the standard file scale, a hadoop standard parameter A corresponding to minDT is set, and the preparation nodes are grouped with A as the reference according to their running states.
4. The method for constructing the hadoop parameter optimization model according to claim 3, wherein, for a single preparation node, the resource storage amount R of the node has a maximum value maxR within the preset time T, and a first preset resource storage amount R1 and a second preset resource storage amount R2 are set, where R1 = 0.3 × maxR and R2 = 0.7 × maxR,
if R < R1, the server judges that the preparation node's resources are insufficient and records the node's time period under this condition as an unavailable period;
if R1 ≤ R < R2, the server judges that the preparation node's resource storage amount is low and records the node's time period under this condition as an inefficient period;
and if R ≥ R2, the server judges that the preparation node's resource storage amount is high and records the node's time period under this condition as an efficient period.
5. The method for constructing the hadoop parameter optimization model according to claim 4, wherein, for the kth preparation node, its state Pkj within the jth time period is assigned a value, where k = 1, 2, 3, …, n:
if the period is recorded as an unavailable period, Pkj is assigned the value 0,
if the period is recorded as an inefficient period, Pkj is assigned the value 1,
if the period is recorded as an efficient period, Pkj is assigned the value 2,
and, using the standard parameter A, the preparation nodes are grouped into groups of the optimal node number NA corresponding to A, so that a group of nodes, given their Pkj states, completes an item of data size minDT within the optimal execution time tA.
6. The method for constructing the hadoop parameter optimization model according to claim 5, wherein, for the kth preparation node,
when j + t ≤ T,
if Pkj = Pk,j+1 = … = Pk,j+t, the server records the node as a stable node for the period (j, j + t) and includes it in a group;
if Pkj = Pk,j+1 = … = Pk,j+t = 0, the server regards the node as an unavailable node for the period (j, j + t);
and when j + t > T, the server determines that the node is unavailable.
7. The method for constructing the hadoop parameter optimization model according to claim 6, wherein the hadoop parameter corresponding to the mean (maxDT + minDT)/2 of the maximum data amount maxDT and the minimum data amount minDT is denoted A', the optimal node number for A' is NA', the optimal execution time is tA', and a function of the optimal node number and the optimal execution time is set as f(D) = N × t,
and after the data volume D is obtained, the server calculates the adjustment mode of the working hadoop parameters according to the function f(D).
8. The method for constructing the hadoop parameter optimization model according to claim 7, wherein the data amount Dij corresponds to a characteristic attribute of time period j, the characteristic attribute influences the scale of Dij, and within the preset time T, each Dij is multiplied by the proportion that the characteristic attribute of its period j takes within T, and the products are summed to obtain an approximation of DT, which is used as the estimated data amount and serves as reference data for hadoop parameter optimization.
9. The method for constructing the hadoop parameter optimization model according to claim 8, wherein the node group operates normally provided that no single node is assigned more than 70% of the maximum data volume maxDT.
10. The method of constructing a hadoop parameter optimization model according to claim 1, wherein the preset time is not continuous and the processing time is continuous.
CN202210671845.8A 2022-06-15 2022-06-15 Construction method of hadoop parameter optimization model Pending CN115061978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210671845.8A CN115061978A (en) 2022-06-15 2022-06-15 Construction method of hadoop parameter optimization model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210671845.8A CN115061978A (en) 2022-06-15 2022-06-15 Construction method of hadoop parameter optimization model

Publications (1)

Publication Number Publication Date
CN115061978A true CN115061978A (en) 2022-09-16

Family

ID=83200687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210671845.8A Pending CN115061978A (en) 2022-06-15 2022-06-15 Construction method of hadoop parameter optimization model

Country Status (1)

Country Link
CN (1) CN115061978A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149790A (en) * 2023-02-15 2023-05-23 北京景安云信科技有限公司 Method for starting multiple Springboot items based on JVM process reduction
CN116149790B (en) * 2023-02-15 2023-11-21 北京景安云信科技有限公司 Method for starting multiple Springboot items based on JVM process reduction

Similar Documents

Publication Publication Date Title
US20230004436A1 (en) Container scheduling method and apparatus, and non-volatile computer-readable storage medium
CN105005570A (en) Method and apparatus for mining massive intelligent power consumption data based on cloud computing
CN111414070B (en) Case power consumption management method and system, electronic device and storage medium
CN117036104B (en) Intelligent electricity utilization method and system based on electric power Internet of things
CN113010260A (en) Elastic expansion method and system for container quantity
CN115061978A (en) Construction method of hadoop parameter optimization model
CN113228574A (en) Computing resource scheduling method, scheduler, internet of things system and computer readable medium
CN113872813A (en) Full life cycle management method and system for carrier communication equipment
US20040117408A1 (en) Systems, methods and articles of manufacture for determining available space in a database
CN103607731A (en) Method and device for processing measurement reports
CN111314234B (en) Flow distribution method and device, storage medium and electronic equipment
CN117472652A (en) Data backup method, device and system of cloud computing operation and maintenance platform
CN115766473B (en) Resource capacity planning method suitable for cloud platform operation
CN109450672B (en) Method and device for identifying bandwidth demand burst
CN117593171B (en) Image acquisition, storage and processing method based on FPGA
WO2013128836A1 (en) Virtual server management device and method for determining destination of virtual server
CN114079997B (en) High-performance communication method based on WSN (Wireless sensor network) improved routing protocol
CN112383949B (en) Edge computing and communication resource allocation method and system
CN116974468B (en) Equipment data storage management method and system based on big data
CN117519913B (en) Method and system for elastically telescoping scheduling of container memory resources
CN102752122A (en) Device and method for acquiring multidimensional static performance data in network management
CN117909069A (en) Container storage performance optimization method, device, equipment and storage medium
CN116755887A (en) Automatic slot deployment system and method based on big data balanced acquisition
CN117370138A (en) High capacity distributed storage system
CN118118457A (en) Method for automatically registering working node of distributed system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination