CN110427356B

CN110427356B - Parameter configuration method and equipment

Info

Publication number: CN110427356B
Application number: CN201810385919.5A
Authority: CN
Inventors: 邵明路; 王蕊; 王衎; 宋哲; 张雨晴
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2018-04-26
Filing date: 2018-04-26
Publication date: 2021-08-13
Anticipated expiration: 2038-04-26
Also published as: CN110427356A

Abstract

The invention discloses a parameter configuration method and equipment, which are used for solving the problem of parameter configuration when batch SPARK data cleaning tasks are processed. The method and the device for processing the SPARK data cleaning task determine the data volume of the received SPARK data cleaning task and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task, train different atomic operation reference parameter values according to reference data and expected execution time, and then determine basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task and the at least one atomic operation in the SPARK data cleaning program. The method comprises the steps of training parameter configurations of different atomic operations through atomic operations contained in SPARK data cleaning programs, and determining the parameter configurations of different SPARK data cleaning tasks according to the parameter configuration of at least one atomic operation forming different SPARK data cleaning programs, so that when a batch of SPARK data cleaning tasks are processed, the parameters can be automatically configured, and the performance of SPARK is improved.

Description

Parameter configuration method and equipment

Technical Field

The invention relates to the technical field of electronic information, big data, cloud computing and the like, in particular to a parameter configuration method and equipment.

Background

Big data refers to a collection of data whose contents cannot be captured, managed and processed within a certain time by conventional software tools. Big data technology refers to the ability to quickly obtain valuable information from various types of data. To obtain valuable information from big data, the data must be processed, and data cleansing is one method for improving data quality.

Big data cleaning means performing noise reduction and conversion on data. Noise reduction mainly handles abnormal values, missing values, repeated values and the like in the data, and conversion turns data in different formats into the same format, so that data with higher consistency is obtained. Current big data cleansing systems are mainly built on the Hadoop (distributed computing) platform, where HDFS (distributed file system) makes it possible to store large amounts of data at low cost, and the data can be processed by MapReduce (distributed computing system). However, because of the bottlenecks of the MapReduce framework and its disk-based computing mechanism, MapReduce can process large amounts of data but its efficiency is increasingly inadequate.

In recent years, memory-based distributed systems have become more widely used, and SPARK (fast general-purpose computing engine) is one of the representative distributed computing frameworks. SPARK caches data in memory, and subsequent calculations are carried out directly on the data in memory. However, when the SPARK computing framework performs a data cleansing task, some applications cannot be executed correctly under SPARK's default parameters, or run very inefficiently, so the SPARK parameters need to be configured reasonably.

The conventional way to configure SPARK parameters is manual configuration, which is suitable only for single tasks and not for batch tasks.

Disclosure of Invention

The invention provides a parameter configuration method and equipment, which are used for solving the problem of parameter configuration when batch tasks are processed.

The method comprises the following steps:

In a first aspect, a parameter configuration method provided in an embodiment of the present invention includes: determining the data volume of a received fast general-purpose computing engine (SPARK) data cleaning task and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task; and determining basic resource configuration parameters of the SPARK data cleaning task according to reference parameter values, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the reference parameter values are the reference parameter values of different atomic operations under reference data, trained according to the reference data and an expected execution time in a limited cluster environment including at least one computing node.

The method comprises the steps of determining the data volume of a received SPARK data cleaning task and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task, training reference parameter values of different atomic operations under reference data according to the reference data and expected execution time under a limited cluster environment comprising at least one computing node, and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program. The SPARK data cleaning programs corresponding to the SPARK data cleaning tasks are divided into different atomic operations, parameter configurations of the different atomic operations are trained in a limited environment, and the parameter configurations of the different SPARK data cleaning tasks are determined according to the parameter configurations of one or more atomic operations forming the different SPARK data cleaning programs, so that when a batch of SPARK data cleaning tasks are processed, the parameters can be automatically configured, and the performance of the SPARK data cleaning is improved.

In one possible implementation manner, determining at least one atomic operation included in a SPARK data cleaning program corresponding to a received SPARK data cleaning task includes: carrying out generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result with the pre-configured atomic operations of the data cleaning field; and determining at least one atomic operation contained in the SPARK data cleaning program according to the matching result.

The method first carries out generalized analysis on the code of the SPARK data cleaning task, then matches the analysis result with the pre-configured atomic operations of the data cleaning field, and finally determines the atomic operations contained in the SPARK data cleaning program.

In a possible implementation manner, obtaining the basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task, and the at least one atomic operation included in the SPARK data cleaning program includes: determining the reference parameter value of each atomic operation contained in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter values of the at least one atomic operation contained in the SPARK data cleaning program; and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.

The method comprises the steps of firstly determining a reference parameter value of at least one atomic operation contained in an SPARK data cleaning program, then determining a reference parameter value of an SPARK data cleaning task, and finally determining basic resource configuration parameters of the SPARK data cleaning task, thereby completing parameter configuration.

In a possible implementation manner, determining a basic resource configuration parameter of the SPARK data cleansing task according to the reference data, the reference parameter value of the SPARK data cleansing task, and the data volume of the SPARK data cleansing task includes: determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.

The method provides a specific implementation mode of parameter configuration, and the basic resource configuration parameters of the SPARK data cleaning task are obtained by determining data adjusting parameter values according to the data volume and the reference data of the SPARK data cleaning task and adjusting the reference parameter values of the SPARK data cleaning task according to the data adjusting parameters.
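The adjustment described above can be sketched as follows. The text at this point does not state the formula, so this sketch assumes that the data adjusting parameter value is the ratio of the task's data volume to the reference data volume, and that executor memory and executor count scale linearly with it. The function names and the choice of which parameters are scaled are illustrative assumptions, not the method of the embodiment.

```python
import math


def data_adjust_value(task_gb: float, reference_gb: float) -> float:
    """Assumed form: ratio of task data volume to reference data volume."""
    if reference_gb <= 0:
        raise ValueError("reference data volume must be positive")
    return task_gb / reference_gb


def scale_reference(reference: dict, factor: float) -> dict:
    """Scale a task's reference parameter values by the adjust value.

    Assumption: memory (in GB) scales linearly, the executor count is
    rounded up to a whole number, and core counts are left unchanged.
    """
    scaled = dict(reference)
    scaled["executor.memory"] = reference["executor.memory"] * factor
    scaled["executor.instances"] = max(
        1, math.ceil(reference["executor.instances"] * factor)
    )
    return scaled
```

For a 4G task and 1G of reference data, the assumed adjust value is 4, so a reference of 0.5G executor memory and 2 executors would become 2G and 8 executors under this sketch.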

In a possible implementation manner, after obtaining the basic resource configuration parameters of the SPARK data cleansing task, the method further includes performing part or all of the following adjustment manners:

Adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;

Adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;

Adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.

In the method, after the basic resource configuration parameters of the SPARK data cleaning task are configured, some parameters, such as the parallelism parameter value, the executor memory parameter value and the executor number parameter value, need to be adjusted in order to improve SPARK performance and make full use of resources. The adjustment accounts for factors that cannot be controlled when SPARK data cleaning tasks are executed, such as the different data volumes of different tasks, the different numbers of atomic operations contained in different SPARK data cleaning programs, and special execution requirements of users.

In one possible implementation manner, adjusting the parallelism parameter value according to the data size of the SPARK data cleansing task includes: determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.

The method provides a specific implementation mode for adjusting the parallelism parameter value, firstly, the parallelism parameter value is determined according to the data volume of the SPARK data cleaning task, and then, the parallelism parameter is increased according to a preset proportion before the SPARK data cleaning task is executed each time.
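A minimal sketch of this two-step adjustment is given below. The text does not say how the initial parallelism is derived from the data volume, so the sketch assumes one partition per 128 MB of input; the partition size, the 10% default proportion and both function names are hypothetical.

```python
import math


def initial_parallelism(data_mb: float, partition_mb: float = 128.0) -> int:
    """Assumed rule: one partition per `partition_mb` of input data."""
    return max(1, math.ceil(data_mb / partition_mb))


def increase_parallelism(current: int, proportion: float = 0.1) -> int:
    """Raise the parallelism by a preset proportion before each run.

    The increment is rounded to a whole number and is at least 1, so the
    value always grows.
    """
    return current + max(1, round(current * proportion))
```

With these assumptions, 1024 MB of data starts at a parallelism of 8, which becomes 9 before the next execution.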

In a possible implementation manner, adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task includes: determining a first adjustment proportion according to the amount of memory of the cluster machines; according to the first adjustment proportion, increasing the size of the memory in the device that can be used for executing the SPARK data cleaning task; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task with the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task with the maximum memory value.

In this method, a first adjustment proportion is determined according to the amount of memory of the cluster machines, and the size of the memory that can be used for executing the SPARK data cleaning task is increased according to the first adjustment proportion. If the increased memory does not exceed the maximum memory value, the SPARK data cleaning task is executed with the increased memory value; otherwise, it is executed with the maximum memory value.

In a possible implementation manner, adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task includes: determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task with the increased executor number parameter value; and if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task with the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.

In this method, a second adjustment proportion is determined according to the memory size of the cluster machines and the number of machines, and the number of executors in the basic resource configuration parameters is then increased according to the second adjustment proportion. If the increased number of executors does not exceed the maximum number of executors, the SPARK data cleaning task is executed with the increased number of executors; otherwise, it is executed with the maximum number of executors.
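Both capped adjustments (executor memory and executor number) follow the same "increase, then clamp to a maximum" pattern, which can be sketched as below. How the two adjustment proportions are derived from the cluster is not specified here, so they are taken as inputs; the rule used for the maximum number of executors (machine memory divided by executor memory, times the number of machines) is an assumption made for illustration.

```python
import math


def increase_executor_memory(current_gb: float, first_ratio: float, max_gb: float) -> float:
    """Grow executor memory by the first adjustment proportion, capped at max_gb."""
    return min(current_gb * (1 + first_ratio), max_gb)


def max_executors(machine_memory_gb: float, machine_count: int, executor_memory_gb: float) -> int:
    """Assumed cap: how many executors of the given size fit in the cluster."""
    return math.floor(machine_memory_gb / executor_memory_gb) * machine_count


def increase_executor_instances(current: int, second_ratio: float, limit: int) -> int:
    """Grow the executor count by the second adjustment proportion, capped at limit."""
    increased = current + max(1, round(current * second_ratio))
    return min(increased, limit)
```

For example, on a hypothetical cluster of 4 machines with 64G each and 4G executors, the assumed cap is 64 executors, so a request that would grow from 60 to 90 executors is clamped to 64.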

In a second aspect, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: a processor and a transceiver, the device having functionality to implement the embodiments of the first aspect described above.

In a third aspect, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: at least one processing unit, and at least one memory unit, the apparatus having functionality to implement embodiments of the first aspect described above.

In a fourth aspect, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: a determining module and a processing module, the device having functionality to implement the embodiments of the first aspect described above.

In addition, for technical effects brought by any one implementation manner of the second aspect to the fourth aspect, reference may be made to technical effects brought by different implementation manners of the first aspect, and details are not described here.

These and other aspects of the present application will be more readily apparent from the following description of the embodiments.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a diagram illustrating a parameter configuration method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating values of reference parameters under reference data for different atomic operations according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a method for determining a reference parameter value of a SPARK task using a mean value formula according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a method for determining a reference parameter value of a SPARK task using a maximum value formula according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating basic resource configuration parameters for determining the SPARK task according to an embodiment of the present invention;

FIG. 6 is a schematic flow chart of parameter configuration according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a first parameter configuration device according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a second parameter configuration device according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a third parameter configuration device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto.

Big data, which refers to a data set that cannot be captured, managed and processed by a conventional software tool within a certain time range, is a massive, high-growth-rate and diversified information asset that needs a new processing mode to have stronger decision-making power, insight discovery power and process optimization capability. For example, when the big data is applied to e-commerce shopping, a user browses shopping pages, in order to recommend products to the user according to the shopping demand and preference of the user, the e-commerce platform collects the shopping pages browsed by the user in a recent period of time, and then counts the shopping pages, and the data collected by the e-commerce platform is big data because thousands of people browse the shopping pages every day.

Because the data volume is too large, when the e-commerce platform collects data, there may be problems with the data, such as duplicate data, erroneous data, missing data, etc., and at this time, the data needs to be subjected to deduplication processing, replacement processing, missing value processing, etc., and the processing of the data is called data cleansing. The data cleaning can be completed by using a distributed computing framework SPARK, and because the data processed each time are different, if the data is cleaned under the SPARK default parameters, the cleaning task cannot be executed correctly or the operation efficiency is very low, so the parameters of the SPARK need to be configured.

In implementation, in the field of data cleansing, common cleansing operations can be divided into atomic operations such as generating columns, default processing, condition replacement, deduplication processing, and the like. The data cleaning program corresponding to the data cleaning task can be formed by one or more atomic operations.

The parameter configuration method provided by the embodiment of the invention, as shown in fig. 1, specifically includes the following steps:

S100, determining the data volume of a received fast general-purpose computing engine (SPARK) task and at least one atomic operation contained in the SPARK task;

S101, determining basic resource configuration parameters of the SPARK task according to reference parameter values, the data volume of the SPARK task and the at least one atomic operation contained in the SPARK task.

The method comprises the steps of determining the data volume of a received SPARK data cleaning task and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task, training reference parameter values of different atomic operations under reference data according to the reference data and expected execution time under a limited cluster environment comprising at least one computing node, and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task. The SPARK data cleaning programs corresponding to the SPARK data cleaning tasks are divided into different atomic operations, parameter configurations of the different atomic operations are trained in a limited environment, and the parameter configurations of the different SPARK data cleaning tasks are determined according to the parameter configurations of one or more atomic operations forming the different SPARK data cleaning programs, so that when a batch of SPARK data cleaning tasks are processed, the parameters can be automatically configured, and the performance of the SPARK is improved.

The reference parameter values of the different atomic operations under the reference data are trained according to the reference data and the expected execution time in the limited cluster environment.

The limited cluster environment here is a cluster that includes at least one computing node. The reference data is the data size to be processed by the cluster; it is chosen according to the expected tasks and can be 1G, 2G or even higher. The expected execution time is determined by taking into account the number of concurrent tasks to be processed by the cluster: it is expected to be longer if the task concurrency is large, and may be relatively shorter if the task concurrency is small.

According to the embodiment of the invention, the reference parameter value of each atomic operation meeting the expected execution time under the reference data can be trained through the reference parameter training module in SPARK.

In implementation, parameter configuration for a SPARK data cleansing task generally only needs to configure the relevant parameters of the executor and the driver, and may specifically follow these training principles:

1. For operations with only a map (mapping) process, such as default processing, the driver parameters need not be configured; only the memory and cores parameter values of the executor need to be adjusted, and the ratio of memory to cores of one executor is generally kept at about 1:2 (the value recommended on the SPARK official website).

2. For operations that include a reduce process, such as deduplication processing, which involve more IO (input/output) tasks, the ratio of memory to cores of one executor is generally kept at about 1:1.

3. For operations that compute values such as the mean and median, for example conditional replacement processing, the operators include intermediate calculation steps that need to pull data to the driver side for computation, so the memory and cores parameter values of the driver also need to be adjusted.

After the reference data, the expected execution time, and the training rules are determined, the training process will be described below.

For different atomic operations to be trained, under the reference data, according to a training principle, combining a given expected execution time and default parameter configuration, observing the actual execution time, and continuously adjusting the parameters according to an optimization principle in the training principle until the actual execution time meets the requirement of the expected execution time, wherein the parameter value at the moment is a reference parameter value.
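The training process above is an observe-and-adjust loop, which can be sketched as follows. The sketch assumes a caller-supplied `run_operation` function that executes one atomic operation on the reference data with the given parameters and returns the actual execution time in hours; that function, the fixed memory step and the round limit are hypothetical, and a real optimization principle would choose which parameter to change based on the operation type.

```python
from typing import Callable, Dict


def train_reference_params(
    run_operation: Callable[[Dict[str, float]], float],
    default_params: Dict[str, float],
    expected_hours: float,
    memory_step_gb: float = 0.5,
    max_rounds: int = 20,
) -> Dict[str, float]:
    """Adjust parameters until the actual time meets the expected time."""
    params = dict(default_params)  # start from the default configuration
    for _ in range(max_rounds):
        actual_hours = run_operation(params)  # observe the actual execution time
        if actual_hours <= expected_hours:
            return params  # these values become the reference parameter values
        # Simplified adjustment: only executor memory is increased here.
        params["executor.memory"] += memory_step_gb
    raise RuntimeError("expected execution time not met within max_rounds")
```

The returned dictionary is what would be stored as the reference parameter values for that atomic operation.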

It is recommended to configure the spark.executor.extraJavaOptions parameter during training, so that memory information can be saved to a file when a task fails to execute, which makes it easier to analyze the cause of the failure.
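One way to do this, shown here only as a sketch, is to pass the standard JVM heap-dump flags through `spark.executor.extraJavaOptions`. The text does not say which Java options are used, so reading "memory information" as a heap dump on out-of-memory errors is an assumption, and the dump directory is a hypothetical path.

```python
def heap_dump_conf(dump_dir: str) -> dict:
    """Build a Spark conf entry that dumps the executor heap on OOM.

    Assumption: "memory information" is taken to mean a JVM heap dump.
    """
    java_opts = f"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath={dump_dir}"
    return {"spark.executor.extraJavaOptions": java_opts}


def to_submit_args(conf: dict) -> list:
    """Turn a conf dictionary into `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

Calling `to_submit_args(heap_dump_conf("/tmp/dumps"))` yields the argument pair that can be appended to a `spark-submit` command line for the training runs.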

FIG. 2 shows the reference parameter values of different atomic operations under the reference data, where T_b is the actual execution time that meets the expected execution time. For example, if the reference parameter configuration in FIG. 2 was trained with 1G of reference data, then for a default processing task with a data volume of 1G to reach an actual execution time of 0.2 hours, 0.5G of executor memory needs to be configured.

In practice, the trained reference parameter data may be stored in a reference parameter database and retrieved when a specific task needs to be processed.

When an SPARK data cleaning task needs to be processed, firstly, the data volume of the SPARK data cleaning task and atomic operations contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task are determined.

The data volume of the received SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task are determined as follows.

In implementation, the data volume of the task and the atomic operation included in the data cleaning program corresponding to the task can be obtained through a task feature extraction module in the SPARK.

In the field of data cleaning, the data cleaning program corresponding to a data cleaning task is divided into atomic operations according to the division currently agreed in the industry; for example, deduplication, condition replacement and column generation are all atomic operations in the field of data cleaning.

Optionally, determining at least one atomic operation included in the received SPARK data cleaning program includes: carrying out generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result with the pre-configured atomic operations of the data cleaning field; and determining at least one atomic operation contained in the SPARK data cleaning program according to the matching result.

Each data cleaning task corresponds to a data cleaning program, the data cleaning program is composed of atomic operations, and after the SPARK data cleaning tasks are received by the SPARK, the atomic operations contained in the SPARK data cleaning programs are determined.
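The matching step can be illustrated with a simple sketch. Here the generalized analysis is approximated by scanning the program text for DataFrame method calls, and the pre-configured catalogue maps each atomic operation to a few call patterns. The catalogue contents and the regex-based approach are assumptions chosen for illustration; the embodiment does not specify how the analysis is performed.

```python
import re

# Hypothetical pre-configured catalogue: atomic operation -> code patterns.
ATOMIC_PATTERNS = {
    "deduplication": [r"\.dropDuplicates\(", r"\.distinct\("],
    "default processing": [r"\.fillna\(", r"\.na\.fill\("],
    "conditional replacement": [r"\.replace\(", r"\bwhen\("],
    "column generation": [r"\.withColumn\("],
}


def extract_atomic_operations(code: str) -> list:
    """Return the catalogue operations whose patterns occur in the code."""
    found = []
    for name, patterns in ATOMIC_PATTERNS.items():
        if any(re.search(pattern, code) for pattern in patterns):
            found.append(name)
    return found
```

For a program containing `df.fillna(0).dropDuplicates()`, this sketch reports deduplication and default processing as the atomic operations of the task.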

After the data volume of the received SPARK data cleaning task and atomic operations contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task are determined, basic resource parameters of the SPARK data cleaning task are configured by combining reference parameter values.

The basic resource configuration parameters of the SPARK data cleaning task are then determined according to the reference parameter values, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program.

In implementation, the basic resource configuration parameters of the SPARK data cleansing task can be obtained in a task parameter configuration module in the SPARK.

The reference parameter value is trained in a reference parameter training module, and the data volume of the SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program are obtained in a task feature extraction module.

Optionally, determining the reference parameter value of each atomic operation included in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation included in the SPARK data cleaning program;

determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to a reference parameter value of at least one atomic operation contained in the SPARK data cleaning program;

and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.

The reference parameter values include reference parameter values of all atomic operations in the data cleansing field, and since the received SPARK data cleansing program may contain a part of atomic operations, the reference parameter values of the atomic operations contained in the received SPARK data cleansing program need to be determined.

For example, atomic operations in the field of data cleansing include column generation, record auditing, condition replacement, default processing, deduplication, and the like, and the reference parameter values of all of these atomic operations need to be trained. If the SPARK data cleansing program corresponding to a received SPARK data cleansing task only includes the atomic operations of condition replacement and default processing, then only the reference parameter values for condition replacement and default processing are needed when processing this task. The reference parameter values of the atomic operations included in the SPARK data cleansing program corresponding to the received SPARK data cleansing task therefore need to be determined.

After the reference parameter value of the atomic operation contained in the SPARK data cleaning program corresponding to the received SPARK data cleaning task is determined, the reference parameter value of the received SPARK data cleaning task needs to be determined.

In implementation, the reference parameter value of the SPARK data cleansing task may be determined using a mean value formula or a maximum value formula according to the reference parameter value of at least one atomic operation included in the SPARK data cleansing program.

For example, the SPARK data cleansing program corresponding to the SPARK data cleansing task includes two atomic operations of default processing and conditional replacement, and fig. 3 shows reference parameter values of the two atomic operations included in the SPARK data cleansing program and reference parameter values for determining the SPARK data cleansing task by using a mean formula.

In FIG. 3, the executor.memory parameter value of default processing is 0.5g and that of conditional replacement is 0.3g, so the executor.memory parameter value of the SPARK data cleaning task is 0.4g according to the mean value formula; the executor.cores parameter value of default processing is 1 and that of conditional replacement is 1, so the executor.cores parameter value of the SPARK data cleaning task is 1 according to the mean value formula; the executor.instances parameter value of default processing is 2 and that of conditional replacement is 2, so the executor.instances parameter value of the SPARK data cleaning task is 2 according to the mean value formula; the driver.memory parameter value of default processing is 1g and that of conditional replacement is 1g, so the driver.memory parameter value of the SPARK data cleaning task is 1g according to the mean value formula; the driver.cores parameter value of default processing is 1 and that of conditional replacement is 1, so the driver.cores parameter value of the SPARK data cleaning task is 1 according to the mean value formula.

FIG. 4 shows the reference parameter values for two atomic operations involved in the SPARK data cleansing task and the determination of the reference parameter values for the SPARK data cleansing task using a maximum formula.

In fig. 4, the executor.memory parameter value of the default processing is 0.5g, the executor.memory parameter value of the conditional replacement is 0.3g, and the executor.memory parameter value of the SPARK data cleaning task is 0.5g according to the maximum value formula; the executor.cores parameter value of the default processing is 1, the executor.cores parameter value of the conditional replacement is 1, and the executor.cores parameter value of the SPARK data cleaning task is 1 according to the maximum value formula; the executor.instances parameter value of the default processing is 2, the executor.instances parameter value of the conditional replacement is 2, and the executor.instances parameter value of the SPARK data cleaning task is 2 according to the maximum value formula; the driver.memory parameter value of the default processing is 1g, the driver.memory parameter value of the conditional replacement is 1g, and the driver.memory parameter value of the SPARK data cleaning task is 1g according to the maximum value formula; the driver.core parameter value of the default processing is 1, the driver.core parameter value of the conditional replacement is 1, and the driver.core parameter value of the SPARK data cleaning task is 1 according to the maximum value formula.
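The mean value and maximum value formulas described for fig. 3 and fig. 4 can be illustrated with a minimal Python sketch. The dictionary layout, the function name task_reference_params and the use of gigabytes as plain numbers are assumptions made for illustration only; the numeric values are those given in the example above.

```python
from statistics import mean

# Reference parameter values per atomic operation (memory in gigabytes),
# copied from the fig. 3 / fig. 4 example. The dictionary layout is hypothetical.
REFERENCE_PARAMS = {
    "default processing": {
        "executor.memory": 0.5, "executor.cores": 1, "executor.instances": 2,
        "driver.memory": 1.0, "driver.core": 1,
    },
    "conditional replacement": {
        "executor.memory": 0.3, "executor.cores": 1, "executor.instances": 2,
        "driver.memory": 1.0, "driver.core": 1,
    },
}


def task_reference_params(operations, formula="mean"):
    """Combine per-operation reference values into task-level reference values."""
    if formula not in ("mean", "max"):
        raise ValueError("formula must be 'mean' or 'max'")
    combine = mean if formula == "mean" else max
    names = REFERENCE_PARAMS[operations[0]].keys()
    return {
        name: combine([REFERENCE_PARAMS[op][name] for op in operations])
        for name in names
    }


ops = ["default processing", "conditional replacement"]
by_mean = task_reference_params(ops, "mean")  # executor.memory is about 0.4
by_max = task_reference_params(ops, "max")    # executor.memory is 0.5
```

The maximum value formula is the more conservative choice, since the task never receives less than its most demanding atomic operation requires under the reference data.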

After the reference parameter value of the SPARK data cleaning task is determined, the basic resource configuration parameters of the SPARK data cleaning task need to be determined.

Optionally, determining a data adjustment parameter value according to the data volume of the SPARK data cleaning task and the reference data;

and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.

In implementation, assume that the data volume of the SPARK data cleaning task is D_i, the reference data is D_b, and the reference parameter value of the SPARK data cleaning task is R_x, where R_x represents the parameter value of each reference parameter of the SPARK data cleaning task, such as executor.memory. According to the data volume D_i of the SPARK data cleaning task and the reference data D_b, the data adjustment parameter value is determined as D_i/D_b, and D_i/D_b * R_x gives the basic resource configuration parameters of the SPARK data cleaning task.

For example, as shown in FIG. 5, the data volume D_i of the SPARK data cleaning task is 2G and the reference data D_b is 1G. If, among the reference parameter values of the SPARK data cleaning task, the executor.memory parameter value is 0.4g, the executor.cores parameter value is 1, the executor.instances parameter value is 2, the driver.memory parameter value is 1g, and the driver.core parameter value is 1, then, among the basic resource configuration parameters of the SPARK data cleaning task, the executor.memory parameter value is 0.8g, the executor.cores parameter value is 2, the executor.instances parameter value is 4, the driver.memory parameter value is 2g, and the driver.core parameter value is 2.
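The scaling in the FIG. 5 example can be reproduced with a short Python sketch. The function name basic_resource_params and the representation of sizes as plain numbers in gigabytes are assumptions for illustration; the text does not state how fractional core or executor counts would be rounded, so the sketch leaves the scaled values unrounded.

```python
def basic_resource_params(reference_params, task_data_gb, reference_data_gb):
    """Scale every task-level reference value R_x by the factor D_i / D_b."""
    factor = task_data_gb / reference_data_gb  # data adjustment parameter value
    return {name: value * factor for name, value in reference_params.items()}


# Task-level reference values from the mean value formula example (memory in GB).
reference = {
    "executor.memory": 0.4,
    "executor.cores": 1,
    "executor.instances": 2,
    "driver.memory": 1.0,
    "driver.core": 1,
}

# D_i = 2G and D_b = 1G, so every reference value is doubled.
basic = basic_resource_params(reference, task_data_gb=2.0, reference_data_gb=1.0)
```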

After the basic resource configuration parameters of the SPARK data cleaning task are determined, some parameters still need to be adjusted in order to improve SPARK performance and make full use of resources, because the performance of a SPARK data cleaning task during execution is affected by factors that the basic configuration does not control: for example, different SPARK data cleaning tasks have different data volumes, different SPARK data cleaning tasks contain different numbers of atomic operations, and users may have special execution requirements.

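Optionally, after obtaining the basic resource configuration parameters of the SPARK data cleaning task, the method further includes performing part or all of the following adjustment modes:

adjusting the parallelism parameter value according to the data volume of the SPARK data cleaning task;

adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;

and adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.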

In an implementation, the adjustment of the partial parameters may be performed in the task parameter adjustment module.

The parameters to be adjusted comprise a parallelism parameter, an executor memory parameter and an executor number parameter. The executor memory parameter and the executor number parameter are two of the basic resource configuration parameters; that is, after they have been obtained through training, they are further adjusted according to the number of atomic operations of the SPARK data cleaning task and the number of computing nodes available when the SPARK data cleaning task is executed.

In a specific adjustment, any one or more of these parameters may be adjusted.

Specifically, adjusting the parallelism parameter value according to the data size of the SPARK task includes:

determining a parallelism parameter value according to the data volume of the SPARK data cleaning task;

and, each time before the parallelism parameter value is used to process the data of the SPARK data cleaning task, increasing the parallelism parameter value by a preset proportion.

Wherein, the preset proportion can be set manually.

When SPARK executes the SPARK data cleaning task, the task is divided into a plurality of sub-tasks according to the data volume of the SPARK data cleaning task and the block (data block) size of the cluster; the number of sub-tasks is the number of fragments, and the parallelism parameter value is the number of fragments divided by a preset value, such as 2, 3 or another value.

For example, if the data volume of the SPARK data cleaning task is 500G and the block (data block) size of the cluster is 128M, the number of fragments is the ratio of the data volume to the block size, that is, 4000, and the parallelism parameter may be 2000 (4000/2) or 4000/3, or may be larger.

The parallelism parameter is generally increased step by step by a certain proportion, because massive data is fragmented and executed in batches. If the parallelism is not set, SPARK sets it automatically according to the size of the cluster block.
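The parallelism calculation above can be sketched in Python as follows. The function names, the rounding up of partial fragments, and the multiplicative reading of "increasing according to a preset proportion" are assumptions for illustration; the text only fixes the ratio of data volume to block size and the division by a preset value.

```python
import math


def fragment_count(data_mb, block_mb=128):
    """Number of fragments: data volume divided by the cluster block size."""
    return math.ceil(data_mb / block_mb)


def initial_parallelism(fragments, divisor=2):
    """Parallelism: number of fragments divided by a preset value such as 2 or 3."""
    return max(1, fragments // divisor)


def increase_parallelism(parallelism, proportion):
    """Raise the parallelism by a preset proportion (assumed to be multiplicative)."""
    return math.ceil(parallelism * (1 + proportion))


fragments = fragment_count(500 * 1024)        # 500G in 128M blocks gives 4000
parallelism = initial_parallelism(fragments)  # 4000 / 2 gives 2000
```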

Specifically, adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task includes:

determining a first adjustment proportion according to the memory size of the cluster machines;

according to the first adjustment proportion, increasing the size of a memory which can be used for executing the SPARK data cleaning task in the equipment;

if the increased memory which can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task by adopting the increased memory value;

and if the increased memory which can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task by adopting the maximum memory value.

In implementation, the first adjustment proportion is determined according to the cluster machine memory parameter.

Assume that the memory available for executing tasks is 75% of the machine memory. On this basis, an acceptable reference value for the memory of a single executor is set, generally 1GB or 2GB; the specific value is configured by the basic parameter configuration module.

After the reference value is set, a maximum memory value also needs to be set, because the memory cannot be expanded without limit however many atomic operations there are; once a task reaches a certain complexity, its expected execution time should be lengthened instead. The maximum memory value is set according to the memory size of the machine: for example, if the machine memory is 64G, the maximum executor memory may be set to 4G or 8G; if the machine memory is 128G, the maximum executor memory may be set to 8G or 16G.
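The capped memory adjustment can be sketched in Python as follows. The function name adjust_executor_memory and the multiplicative application of the first adjustment proportion are assumptions for illustration; the text specifies only that the memory is increased and then limited to the maximum memory value.

```python
def adjust_executor_memory(base_memory_gb, first_proportion, max_memory_gb):
    """Increase executor memory by the first adjustment proportion, capped at the maximum."""
    increased = base_memory_gb * (1 + first_proportion)
    # Use the increased value unless it exceeds the maximum memory value.
    return min(increased, max_memory_gb)


# With a 64G machine the maximum executor memory may be set to 4G (from the example above).
within_limit = adjust_executor_memory(2.0, 0.5, max_memory_gb=4.0)  # 3.0
capped = adjust_executor_memory(3.0, 0.5, max_memory_gb=4.0)        # 4.0
```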

Specifically, adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task includes:

determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines;

increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion;

if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task using the increased executor number parameter value;

if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task using the maximum number of executors;

wherein the maximum number of executors is determined according to the memory size of the executors.

In implementation, the second adjustment proportion is determined according to the memory size of the cluster machine and the number of the machines.

For example, it is recommended that the number of executors per machine not exceed 5, so the number of executors assigned to a task should be less than or equal to the number of machines multiplied by 5.

If the executor number parameter value in the basic resource configuration parameters, after being increased according to the second adjustment proportion, is greater than the number of machines multiplied by 5, the received SPARK data cleaning task is executed with a number of executors equal to the number of machines multiplied by 5.
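The executor number adjustment can be sketched in the same way. The function name, the multiplicative use of the second adjustment proportion, and the rounding up to a whole executor are assumptions for illustration; the limit of 5 executors per machine is the example value given above.

```python
import math


def adjust_executor_count(base_count, second_proportion, machine_count,
                          per_machine_limit=5):
    """Increase the executor count, capped at machines times the per-machine limit."""
    increased = math.ceil(base_count * (1 + second_proportion))
    max_executors = machine_count * per_machine_limit
    return min(increased, max_executors)


# On a 3-machine cluster the cap is 15 executors.
within_limit = adjust_executor_count(4, 0.5, machine_count=3)  # 6
capped = adjust_executor_count(12, 0.5, machine_count=3)       # 15
```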

As shown in fig. 6, a schematic flow chart of parameter configuration according to an embodiment of the present invention is as follows.

Step 600, a reference parameter training module trains reference parameter values of each atomic operation meeting expected execution time under reference data;

Step 601, the reference parameter values are stored in a reference parameter database;

Step 602, a task is submitted;

Step 603, the task feature extraction module obtains the data volume of the submitted task and the atomic operations contained in the task;

Step 604, the task parameter configuration module obtains the basic resource configuration parameters according to the reference parameter values stored in the reference parameter database, the data volume of the task obtained by the task feature extraction module, and the atomic operations contained in the program corresponding to the task;

Step 605, the task parameter adjusting module adjusts one or more of the parallelism parameter, the executor memory parameter in the basic resource configuration parameters, and the executor number parameter in the basic resource configuration parameters;

Step 606, the processed task is submitted.
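Steps 603 to 605 of the flow can be tied together in one compact Python sketch. Everything here is a simplified assumption for illustration: the in-memory REFERENCE_DB stands in for the reference parameter database, the mean value formula is used, only the executor number cap and the parallelism are adjusted, and the block size of 128M and the divisor 2 are the example values from the description.

```python
import math

# Stand-in for the reference parameter database of step 601 (memory in GB).
REFERENCE_DB = {
    "default processing": {"executor.memory": 0.5, "executor.instances": 2},
    "conditional replacement": {"executor.memory": 0.3, "executor.instances": 2},
}
REFERENCE_DATA_GB = 1.0  # reference data D_b


def configure(task_data_gb, operations, machine_count):
    """Derive task parameters from data volume, atomic operations and cluster size."""
    # Step 604: mean of the per-operation reference values, scaled by D_i / D_b.
    factor = task_data_gb / REFERENCE_DATA_GB
    names = REFERENCE_DB[operations[0]]
    params = {
        name: factor * sum(REFERENCE_DB[op][name] for op in operations) / len(operations)
        for name in names
    }
    # Step 605: cap the executor count at machines * 5 and derive the parallelism.
    params["executor.instances"] = min(
        math.ceil(params["executor.instances"]), machine_count * 5
    )
    fragments = math.ceil(task_data_gb * 1024 / 128)
    params["parallelism"] = max(1, fragments // 2)
    return params


# Step 603 inputs for the running example: 2G of data and two atomic operations.
result = configure(2.0, ["default processing", "conditional replacement"], 3)
```

For the 2G example this yields an executor memory of about 0.8g, 4 executors and a parallelism of 8.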

Based on the same inventive concept, the embodiment of the present invention further provides a device for parameter configuration, and since the principle of the device for solving the problem is similar to the method for parameter configuration in the embodiment of the present invention, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.

As shown in fig. 7, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: processor 700 and transceiver 701:

the processor 700 is configured to determine a data size of a received fast general-purpose computing engine SPARK data cleansing task and at least one atomic operation included in a SPARK data cleansing program corresponding to the SPARK data cleansing task; determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the benchmark parameter value is a benchmark parameter value under the benchmark data for a different atomic operation trained from the benchmark data and the expected execution time in a defined cluster environment including at least one computing node.

Optionally, the processor 700 is specifically configured to:

performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result against the pre-configured atomic operations of the data cleaning field; and determining at least one atomic operation contained in the SPARK data cleaning program according to the matching result.

Optionally, the processor 700 is specifically configured to:

determining a reference parameter value of at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter value and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.

Optionally, the processor 700 is specifically configured to:

determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.

Optionally, the processor 700 is further configured to:

performing some or all of the following adjustment:

Optionally, the processor 700 is specifically configured to:

determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.

Optionally, the processor 700 is specifically configured to:

determining a first adjustment proportion according to the memory size of the cluster machines; according to the first adjustment proportion, increasing the size of the memory in the equipment that can be used for executing the SPARK data cleaning task; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task using the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task using the maximum memory value.

Optionally, the processor 700 is specifically configured to:

determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task using the increased executor number parameter value; if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task using the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.

As shown in fig. 8, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: at least one processing unit 800 and at least one memory unit 801, wherein the memory unit 801 stores program code that, when executed by the processing unit 800, causes the processing unit 800 to perform the following:

determining the data volume of a received SPARK data cleaning task of a quick and general computing engine and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task; determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the benchmark parameter value is a benchmark parameter value under the benchmark data for a different atomic operation trained from the benchmark data and the expected execution time in a defined cluster environment including at least one computing node.

Optionally, the processing unit 800 is specifically configured to:

Optionally, the processing unit 800 is further configured to:

performing some or all of the following adjustments:

Optionally, the processing unit 800 is specifically configured to:

As shown in fig. 9, an embodiment of the present invention further provides an apparatus for parameter configuration, which includes a determining module 900 and a processing module 901:

a determining module 900, configured to determine a data size of a received fast general-purpose computing engine SPARK data cleansing task and at least one atomic operation included in a SPARK data cleansing program corresponding to the SPARK data cleansing task;

a processing module 901, configured to determine a basic resource configuration parameter of the SPARK data cleaning task according to a reference parameter value, a data size of the SPARK data cleaning task, and at least one atomic operation included in a SPARK data cleaning program corresponding to the SPARK data cleaning task.

Optionally, the processing module 901 is specifically configured to:

Optionally, the processing module 901 is further configured to:

performing some or all of the following adjustment:

Optionally, the processing module 901 is specifically configured to:

An embodiment of the present invention further provides a device-readable storage medium for synchronized parameter configuration, where the readable storage medium is non-volatile and includes program code, and when the program code runs on a computing device, the program code is configured to enable the computing device to execute the steps of the method for configuring the parameters by the device.

The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.

Accordingly, the subject application may also be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for configuring parameters, the method comprising:

determining the data volume of a received SPARK data cleaning task of a quick and general computing engine and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task;

determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task;

wherein the benchmark parameter value is a benchmark parameter value under the benchmark data for a different atomic operation trained from the benchmark data and the expected execution time in a defined cluster environment including at least one computing node.

2. The method of claim 1, wherein the determining at least one atomic operation included in the SPARK data cleaning program comprises:

performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program;

matching the analysis result against the pre-configured atomic operations of the data cleaning field;

and determining at least one atomic operation contained in the SPARK data cleaning program according to the matching result.

3. The method of claim 1, wherein determining the basic resource configuration parameters for the SPARK data cleansing task based on the baseline parameter values, the amount of data for the SPARK data cleansing task, and at least one atomic operation included in the SPARK data cleansing program comprises:

determining a reference parameter value of at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter value and the at least one atomic operation contained in the SPARK data cleaning program;

4. The method of claim 3, wherein determining the basic resource configuration parameters of the SPARK data cleansing task based on the reference data, the reference parameter values of the SPARK data cleansing task, and the data volume of the SPARK data cleansing task comprises:

determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data;

5. The method of claim 1, wherein, after obtaining the basic resource configuration parameters of the SPARK data cleaning task, the method further comprises performing some or all of the following adjustments:

6. The method of claim 5, wherein adjusting the parallelism parameter value according to the data volume of the SPARK data cleaning task comprises:

7. The method of claim 5, wherein the adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task comprises:

8. The method of claim 5, wherein the adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task comprises:

9. An apparatus for parameter configuration, the apparatus comprising: a processor and a transceiver:

the processor is used for determining the data volume of the received SPARK data cleaning task of the quick and general computing engine and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task; determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the benchmark parameter value is a benchmark parameter value under the benchmark data for a different atomic operation trained from the benchmark data and the expected execution time in a defined cluster environment including at least one computing node.

10. The device of claim 9, wherein the processor is specifically configured to:

11. The device of claim 9, wherein the processor is specifically configured to:

determining a reference parameter value of at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter value and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.

12. The device of claim 11, wherein the processor is specifically configured to:

13. The device of claim 9, wherein the processor is further configured to:

performing some or all of the following adjustment:

14. The device of claim 13, wherein the processor is specifically configured to:

15. The device of claim 13, wherein the processor is specifically configured to:

16. The device of claim 13, wherein the processor is specifically configured to:

17. An apparatus for parameter configuration, the apparatus comprising: at least one processing unit and at least one memory unit, wherein the memory unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the method of any of claims 1 to 8.

18. A computer storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the steps of the method according to any one of claims 1 to 8.