CN110427356B - Parameter configuration method and equipment - Google Patents

Publication number
CN110427356B
Authority
CN
China
Prior art keywords
spark
data cleaning
spark data
data
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810385919.5A
Other languages
Chinese (zh)
Other versions
CN110427356A (en
Inventor
邵明路
王蕊
王衎
宋哲
张雨晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN201810385919.5A
Publication of CN110427356A
Application granted
Publication of CN110427356B
Legal status: Active
Anticipated expiration

Landscapes

  • Stored Programmes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parameter configuration method and equipment for solving the problem of parameter configuration when batches of SPARK data cleaning tasks are processed. The method determines the data volume of a received SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program corresponding to that task, trains reference parameter values for different atomic operations according to reference data and an expected execution time, and then determines the basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the task, and the at least one atomic operation in the program. Because parameter configurations are trained per atomic operation, and the configuration of each SPARK data cleaning task is derived from the configurations of the atomic operations that make up its program, parameters can be configured automatically when a batch of SPARK data cleaning tasks is processed, which improves SPARK performance.

Description

Parameter configuration method and equipment
Technical Field
The invention relates to the technical field of electronic information, big data, cloud computing and the like, in particular to a parameter configuration method and equipment.
Background
Big data refers to a collection of data whose contents cannot be captured, managed and processed within a certain time by conventional software tools. Big data technology refers to the ability to quickly obtain valuable information from various types of data. To obtain valuable information from big data, the data must be processed, and data cleaning is one method for improving data quality.
Big data cleaning means performing noise reduction and conversion on data. Noise reduction mainly handles abnormal values, missing values, repeated values and the like in the data; conversion turns data in different formats into the same format, yielding more consistent data. Current big data cleaning systems are mainly built on the Hadoop (distributed computing) platform, where HDFS (distributed file system) makes it possible to store large amounts of data at low cost and MapReduce (distributed computing system) processes that data. However, because of bottlenecks in the MapReduce framework and its disk-based computing mechanism, MapReduce can process large amounts of data but is increasingly inefficient at doing so.
In recent years, memory-based distributed systems have become more widely used, and SPARK (fast general-purpose computing engine) is one of the representative distributed computing frameworks. SPARK caches data in memory, and the next calculation is carried out directly on the data in memory. However, when the SPARK computing framework performs a data cleaning task, some application programs cannot be executed correctly under the SPARK default parameters, or run very inefficiently, so the SPARK parameters need to be configured reasonably.
The conventional way to configure SPARK parameters is manual configuration, which is suitable only for single tasks and not for batch tasks.
Disclosure of Invention
The invention provides a parameter configuration method and equipment, which are used for solving the problem of parameter configuration when batch tasks are processed.
The method comprises the following steps:
In a first aspect, a parameter configuration method provided in an embodiment of the present invention includes: determining the data volume of a received data cleaning task of the fast general-purpose computing engine SPARK and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task; determining basic resource configuration parameters of the SPARK data cleaning task according to reference parameter values, the data volume of the SPARK data cleaning task, and the at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the reference parameter values are the reference parameter values of different atomic operations under reference data, trained according to the reference data and an expected execution time in a defined cluster environment including at least one computing node.
The method comprises the steps of determining the data volume of a received SPARK data cleaning task and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task, training reference parameter values of different atomic operations under reference data according to the reference data and expected execution time under a limited cluster environment comprising at least one computing node, and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program. The SPARK data cleaning programs corresponding to the SPARK data cleaning tasks are divided into different atomic operations, parameter configurations of the different atomic operations are trained in a limited environment, and the parameter configurations of the different SPARK data cleaning tasks are determined according to the parameter configurations of one or more atomic operations forming the different SPARK data cleaning programs, so that when a batch of SPARK data cleaning tasks are processed, the parameters can be automatically configured, and the performance of the SPARK data cleaning is improved.
In one possible implementation manner, determining at least one atomic operation included in the SPARK data cleaning program corresponding to a received SPARK data cleaning task includes: performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result with the pre-configured atomic operations of the data cleaning field; and determining at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
The method first performs generalized analysis on the code of the SPARK data cleaning task, then matches the analysis result with the pre-configured atomic operations of the data cleaning field, and finally determines the atomic operations contained in the SPARK data cleaning program.
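The matching step can be illustrated with a minimal sketch. It assumes, hypothetically, that the analysis result is a list of operator names extracted from the program code and that the pre-configured atomic operations are kept as a lookup table; the table contents and the function name are illustrative and are not taken from the patent.

```python
# Hypothetical lookup table from parsed operator names to pre-configured
# atomic operations of the data cleaning field (illustrative entries only).
ATOMIC_OPERATIONS = {
    "dropDuplicates": "deduplication",
    "distinct": "deduplication",
    "fillna": "default processing",
    "withColumn": "generate column",
    "replace": "condition replacement",
}


def match_atomic_operations(parsed_operators):
    """Return the atomic operations found in a parsed program, in first-seen order."""
    matched = []
    for op in parsed_operators:
        atomic = ATOMIC_OPERATIONS.get(op)
        # Skip operators with no pre-configured match and avoid duplicates.
        if atomic is not None and atomic not in matched:
            matched.append(atomic)
    return matched
```

For instance, a program whose parsed operators are `fillna`, `select`, `distinct` and `dropDuplicates` would be reported as containing default processing and deduplication.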
In a possible implementation manner, obtaining the basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task, and at least one atomic operation included in the SPARK data cleaning program includes: determining the reference parameter value of each atomic operation contained in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining the basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
The method comprises the steps of firstly determining a reference parameter value of at least one atomic operation contained in an SPARK data cleaning program, then determining a reference parameter value of an SPARK data cleaning task, and finally determining basic resource configuration parameters of the SPARK data cleaning task, thereby completing parameter configuration.
In a possible implementation manner, determining a basic resource configuration parameter of the SPARK data cleansing task according to the reference data, the reference parameter value of the SPARK data cleansing task, and the data volume of the SPARK data cleansing task includes: determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.
The method provides a specific implementation mode of parameter configuration, and the basic resource configuration parameters of the SPARK data cleaning task are obtained by determining data adjusting parameter values according to the data volume and the reference data of the SPARK data cleaning task and adjusting the reference parameter values of the SPARK data cleaning task according to the data adjusting parameters.
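As a rough sketch of this adjustment, the code below assumes the data adjusting parameter value is simply the ratio of the task's data volume to the reference data, and that only executor memory is scaled by it. Both are assumptions made for illustration; this passage does not state the actual formula, and the key names are hypothetical.

```python
def data_adjustment_value(task_data_gb, reference_data_gb):
    """Assumed adjustment value: ratio of task data volume to reference data."""
    if reference_data_gb <= 0:
        raise ValueError("reference data size must be positive")
    return task_data_gb / reference_data_gb


def basic_resource_parameters(reference_params, task_data_gb, reference_data_gb,
                              scaled_keys=("executor_memory_gb",)):
    """Scale selected reference parameter values by the adjustment value.

    Which parameters scale with data volume is an assumption of this sketch.
    """
    factor = data_adjustment_value(task_data_gb, reference_data_gb)
    return {
        key: value * factor if key in scaled_keys else value
        for key, value in reference_params.items()
    }
```

With 1G of reference data and a 4G task, a reference executor memory of 0.4G would become roughly 1.6G under this linear assumption, while the core count is left unchanged.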
In a possible implementation manner, after obtaining the basic resource configuration parameters of the SPARK data cleaning task, the method further includes performing some or all of the following adjustment modes:
Adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
Adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
Adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
In this method, after the basic resource configuration parameters of the SPARK data cleaning task are configured, some parameters, such as the parallelism, executor memory and executor number parameter values, are adjusted to improve SPARK performance and make full use of resources. The adjustment accounts for factors that cannot be controlled in advance when SPARK data cleaning tasks are executed, such as differing data volumes across tasks, differing numbers of atomic operations across SPARK data cleaning programs, and special execution requirements of users.
In one possible implementation manner, adjusting the parallelism parameter value according to the data size of the SPARK data cleansing task includes: determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.
The method provides a specific implementation mode for adjusting the parallelism parameter value, firstly, the parallelism parameter value is determined according to the data volume of the SPARK data cleaning task, and then, the parallelism parameter is increased according to a preset proportion before the SPARK data cleaning task is executed each time.
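A minimal sketch of this parallelism adjustment follows. The rule used for the initial value (one partition per 128 MB of input) and the 20% growth proportion are assumptions chosen for illustration; the text only says that the value depends on the data volume and grows by a preset proportion before each run.

```python
import math


def initial_parallelism(data_mb, partition_mb=128):
    """Assumed rule: one partition per `partition_mb` of input, at least 1."""
    return max(1, math.ceil(data_mb / partition_mb))


def next_parallelism(current, proportion=0.2):
    """Increase the parallelism value by a preset proportion before each run."""
    return math.ceil(current * (1 + proportion))
```

A caller would compute `initial_parallelism` once per task and apply `next_parallelism` before each execution, passing the result to whatever parallelism setting the job uses.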
In a possible implementation manner, adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task includes: determining a first adjustment proportion according to the amount of memory of the cluster machines; increasing, according to the first adjustment proportion, the size of the memory that can be used for executing the SPARK data cleaning task in the equipment; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task with the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task with the maximum memory value.
In this method, a first adjustment proportion is determined according to the amount of memory of the cluster machines, and the size of the memory that can be used for executing the SPARK data cleaning task is increased according to that proportion. If the increased memory does not exceed the maximum memory value, the SPARK data cleaning task is executed with the increased memory value; otherwise, it is executed with the maximum memory value.
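The increase-then-cap logic can be sketched as below. How the first adjustment proportion is derived from the cluster memory is not specified in this passage, so the sketch takes it as an input; the function and parameter names are hypothetical.

```python
def adjusted_executor_memory(base_memory_gb, first_proportion, max_memory_gb):
    """Increase executor memory by a proportion, capped at the maximum value."""
    increased = base_memory_gb * (1 + first_proportion)
    # Use the increased value unless it exceeds the maximum memory value.
    return min(increased, max_memory_gb)
```

For example, a base of 3.5G with a 25% increase and a 4G ceiling would be capped at 4G.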
In a possible implementation manner, adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task includes: determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task with the increased executor number parameter value; if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task with the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.
In this method, a second adjustment proportion is determined according to the memory size of the cluster machines and the number of machines, and the number of executors in the basic resource configuration parameters is increased according to that proportion. If the increased number of executors does not exceed the maximum number of executors, the SPARK data cleaning task is executed with the increased number of executors; otherwise, it is executed with the maximum number of executors.
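The executor-count adjustment follows the same increase-then-cap pattern. In the sketch below, the maximum number of executors is assumed to be the total cluster memory divided by the executor memory size; the text only says the maximum depends on executor memory, so this formula and all names are illustrative assumptions.

```python
import math


def max_executor_count(cluster_memory_gb, executor_memory_gb):
    """Assumed cap: how many executors of this size fit in cluster memory."""
    return max(1, math.floor(cluster_memory_gb / executor_memory_gb))


def adjusted_executor_count(base_count, second_proportion,
                            cluster_memory_gb, executor_memory_gb):
    """Increase the executor count by a proportion, capped at the maximum."""
    increased = math.ceil(base_count * (1 + second_proportion))
    return min(increased, max_executor_count(cluster_memory_gb, executor_memory_gb))
```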
In a second aspect, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: a processor and a transceiver, the device having functionality to implement the embodiments of the first aspect described above.
In a third aspect, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: at least one processing unit, and at least one memory unit, the apparatus having functionality to implement embodiments of the first aspect described above.
In a fourth aspect, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: a determining module and a processing module, the device having functionality to implement the embodiments of the first aspect described above.
In addition, for technical effects brought by any one implementation manner of the second aspect to the fourth aspect, reference may be made to technical effects brought by different implementation manners of the first aspect, and details are not described here.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram illustrating a parameter configuration method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating values of reference parameters under reference data for different atomic operations according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a method for determining a reference parameter value of a SPARK task using a mean value formula according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a method for determining a reference parameter value of a SPARK task using a maximum value formula according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating basic resource configuration parameters for determining the SPARK task according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of parameter configuration according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a first parameter configuration device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a second parameter configuration device according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a third parameter configuration device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto.
Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range. It is a massive, fast-growing and diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery and process optimization capability. For example, in e-commerce shopping, in order to recommend products according to a user's shopping demands and preferences, the e-commerce platform collects the shopping pages the user browsed over a recent period and compiles statistics on them. Because thousands of people browse shopping pages every day, the data collected by the e-commerce platform is big data.
Because the data volume is so large, the data collected by the e-commerce platform may have problems such as duplicate data, erroneous data and missing data. The data then needs deduplication processing, replacement processing, missing value processing and the like, and this processing is called data cleaning. Data cleaning can be completed using the distributed computing framework SPARK. Because the data processed each time is different, a cleaning task run under the SPARK default parameters may fail to execute correctly or may run very inefficiently, so the SPARK parameters need to be configured.
In implementation, in the field of data cleansing, common cleansing operations can be divided into atomic operations such as generating columns, default processing, condition replacement, deduplication processing, and the like. The data cleaning program corresponding to the data cleaning task can be formed by one or more atomic operations.
The parameter configuration method provided by the embodiment of the invention, as shown in fig. 1, specifically includes the following steps:
S100: determining the data volume of a received task of the fast general-purpose computing engine SPARK and at least one atomic operation contained in the SPARK task;
S101: determining basic resource configuration parameters of the SPARK task according to reference parameter values, the data volume of the SPARK task and the at least one atomic operation contained in the SPARK task.
The method comprises the steps of determining the data volume of a received SPARK data cleaning task and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task, training reference parameter values of different atomic operations under reference data according to the reference data and expected execution time under a limited cluster environment comprising at least one computing node, and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter values, the data volume of the SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task. The SPARK data cleaning programs corresponding to the SPARK data cleaning tasks are divided into different atomic operations, parameter configurations of the different atomic operations are trained in a limited environment, and the parameter configurations of the different SPARK data cleaning tasks are determined according to the parameter configurations of one or more atomic operations forming the different SPARK data cleaning programs, so that when a batch of SPARK data cleaning tasks are processed, the parameters can be automatically configured, and the performance of the SPARK is improved.
First, the reference parameter values of different atomic operations under the reference data are trained according to the reference data and the expected execution time in the defined cluster environment.
The defined cluster environment here is a cluster that includes at least one computing node. The reference data is the data size to be processed by the cluster; it is chosen according to the expected tasks and can be 1G, 2G or even higher. The expected execution time is determined by considering the task concurrency the cluster must handle: it should be longer if concurrency is high and may be relatively shorter if concurrency is low.
According to the embodiment of the invention, the reference parameter value of each atomic operation meeting the expected execution time under the reference data can be trained through the reference parameter training module in SPARK.
In implementation, configuring parameters for a SPARK data cleaning task generally only requires configuring the relevant parameters of the executor and the driver, and the following training principles may be followed:
1. For operations with only a map (mapping) process, such as default processing, the driver parameters need not be configured; only the memory and core parameter values of the executor need to be adjusted, and the ratio of memory to cores of one executor is generally controlled to be about 1:2 (the value recommended on the SPARK official website).
2. For operations that include reduce, such as deduplication processing, which involve more IO (input/output) tasks, the ratio of memory to cores of one executor is generally controlled to be about 1:1.
3. For operations involving mean or median values, such as condition replacement processing, whose operators include intermediate calculations that pull data to the driver side, the memory and core parameter values of the driver need to be adjusted.
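The three principles above can be summarized as a small lookup, sketched here. The category labels (`map_only`, `reduce`, `driver_side`) and the function name are hypothetical; the property names are standard Spark configuration keys, and the ratios are the ones quoted in the principles.

```python
def tuning_plan(operation_kind):
    """Return which Spark properties to tune for a kind of atomic operation.

    `operation_kind` is a hypothetical label: "map_only", "reduce" or
    "driver_side". The ratio is the executor memory-to-cores ratio quoted in
    the text; None means the text states no ratio for that kind.
    """
    executor = ["spark.executor.memory", "spark.executor.cores"]
    driver = ["spark.driver.memory", "spark.driver.cores"]
    plans = {
        "map_only": {"tune": executor, "memory_to_cores": "1:2"},
        "reduce": {"tune": executor, "memory_to_cores": "1:1"},
        "driver_side": {"tune": driver, "memory_to_cores": None},
    }
    if operation_kind not in plans:
        raise ValueError(f"unknown operation kind: {operation_kind}")
    return plans[operation_kind]
```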
After the reference data, the expected execution time and the training principles are determined, the training process is as follows.
For each atomic operation to be trained, the operation is run on the reference data starting from the default parameter configuration, and the actual execution time is observed. The parameters are then adjusted repeatedly according to the training principles until the actual execution time meets the expected execution time; the parameter values at that point are the reference parameter values.
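This run, observe and adjust cycle can be sketched as a simple loop. The callables `run_task` and `adjust` are hypothetical stand-ins for executing the atomic operation on the reference data and for applying the training principles; the round limit is an added safeguard that the text does not mention.

```python
def train_reference_parameters(run_task, adjust, default_params,
                               expected_seconds, max_rounds=20):
    """Adjust parameters until the observed run time meets the expected time.

    run_task(params) -> actual execution time in seconds (hypothetical hook).
    adjust(params)   -> new parameter dict for the next attempt (hypothetical hook).
    """
    params = dict(default_params)
    for _ in range(max_rounds):
        actual = run_task(params)
        if actual <= expected_seconds:
            # These values become the reference parameter values.
            return params, actual
        params = adjust(params)
    raise RuntimeError("no configuration met the expected execution time")
```

The returned pair corresponds to a trained reference parameter configuration and the actual execution time that satisfied the expectation.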
It is recommended to configure the spark.executor.extraJavaOptions parameter during training so that memory information can be saved to a file when a task fails to execute, which makes it easier to analyze the cause of the failure.
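One way to do this is sketched below. The text does not say which JVM options to pass, so the sketch assumes the standard HotSpot heap dump flags; the helper name and the dump directory are hypothetical.

```python
def heap_dump_conf(dump_dir):
    """Build a spark-submit --conf argument that dumps the executor heap on OOM.

    The JVM flags are an assumption of this sketch, not taken from the patent.
    """
    java_opts = f"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath={dump_dir}"
    return ["--conf", f"spark.executor.extraJavaOptions={java_opts}"]
```

The returned list can be appended to the argument list of a `spark-submit` invocation used for the training runs.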
FIG. 2 shows the reference parameter values of different atomic operations under the reference data, where T_b is the actual execution time that meets the expected execution time. For example, if the reference parameter configuration in FIG. 2 is trained with 1G of reference data, then for a default processing task with a data volume of 1G to finish in an actual execution time of 0.2 hours, 0.5G of executor memory needs to be configured.
In practice, the trained reference parameter data may be stored in a reference parameter database and used later when a specific task is to be processed.
When an SPARK data cleaning task needs to be processed, firstly, the data volume of the SPARK data cleaning task and atomic operations contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task are determined.
And determining the data volume of the received SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task.
In implementation, the data volume of the task and the atomic operation included in the data cleaning program corresponding to the task can be obtained through a task feature extraction module in the SPARK.
In the field of data cleaning, the data cleaning program corresponding to a data cleaning task is divided into atomic operations according to the division currently agreed in the industry; for example, deduplication, condition replacement and column generation are all atomic operations in the field of data cleaning.
Optionally, determining at least one atomic operation included in the received SPARK data cleaning program includes: performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result with the pre-configured atomic operations of the data cleaning field; and determining at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
Each data cleaning task corresponds to a data cleaning program, the data cleaning program is composed of atomic operations, and after the SPARK data cleaning tasks are received by the SPARK, the atomic operations contained in the SPARK data cleaning programs are determined.
After the data volume of the received SPARK data cleaning task and atomic operations contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task are determined, basic resource parameters of the SPARK data cleaning task are configured by combining reference parameter values.
And determining basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program.
In implementation, the basic resource configuration parameters of the SPARK data cleansing task can be obtained in a task parameter configuration module in the SPARK.
The reference parameter value is trained in a reference parameter training module, and the data volume of the SPARK data cleaning task and at least one atomic operation contained in the SPARK data cleaning program are obtained in a task feature extraction module.
Optionally, determining the reference parameter value of each atomic operation contained in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation contained in the SPARK data cleaning program;
determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to a reference parameter value of at least one atomic operation contained in the SPARK data cleaning program;
and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
The reference parameter values include reference parameter values of all atomic operations in the data cleansing field, and since the received SPARK data cleansing program may contain a part of atomic operations, the reference parameter values of the atomic operations contained in the received SPARK data cleansing program need to be determined.
For example, atomic operations in the field of data cleansing include row generation, record auditing, condition replacement, default processing, duplicate removal, and the like, during training, reference parameter values of all atomic operations need to be trained, and if a SPARK data cleansing program corresponding to a received SPARK data cleansing task only includes atomic operations for condition replacement and default processing, only the reference parameter values for condition replacement and default processing need to be used when processing the SPARK data cleansing task of this time, so that the reference parameter values for the atomic operations included in the SPARK data cleansing program corresponding to the received SPARK data cleansing task need to be determined.
After the reference parameter value of the atomic operation contained in the SPARK data cleaning program corresponding to the received SPARK data cleaning task is determined, the reference parameter value of the received SPARK data cleaning task needs to be determined.
In implementation, the reference parameter value of the SPARK data cleansing task may be determined using a mean value formula or a maximum value formula according to the reference parameter value of at least one atomic operation included in the SPARK data cleansing program.
For example, the SPARK data cleaning program corresponding to the SPARK data cleaning task contains two atomic operations, default processing and conditional replacement, and fig. 3 shows the reference parameter values of these two atomic operations together with the reference parameter values of the SPARK data cleaning task determined by using the mean value formula.
In fig. 3, the executor.memory value of default processing is 0.5g and the executor.memory value of conditional replacement is 0.3g, so the executor.memory value of the SPARK data cleaning task is 0.4g according to the mean value formula; the executor.cores value of default processing is 1 and that of conditional replacement is 1, so the executor.cores value of the SPARK data cleaning task is 1; the executor.instances value of default processing is 2 and that of conditional replacement is 2, so the executor.instances value of the SPARK data cleaning task is 2; the driver.memory value of default processing is 1g and that of conditional replacement is 1g, so the driver.memory value of the SPARK data cleaning task is 1g; the driver.cores value of default processing is 1 and that of conditional replacement is 1, so the driver.cores value of the SPARK data cleaning task is 1.
FIG. 4 shows the reference parameter values of the two atomic operations contained in the SPARK data cleaning program together with the reference parameter values of the SPARK data cleaning task determined by using the maximum value formula.
In fig. 4, the executor.memory value of default processing is 0.5g and the executor.memory value of conditional replacement is 0.3g, so the executor.memory value of the SPARK data cleaning task is 0.5g according to the maximum value formula; the executor.cores value of default processing is 1 and that of conditional replacement is 1, so the executor.cores value of the SPARK data cleaning task is 1; the executor.instances value of default processing is 2 and that of conditional replacement is 2, so the executor.instances value of the SPARK data cleaning task is 2; the driver.memory value of default processing is 1g and that of conditional replacement is 1g, so the driver.memory value of the SPARK data cleaning task is 1g; the driver.cores value of default processing is 1 and that of conditional replacement is 1, so the driver.cores value of the SPARK data cleaning task is 1.
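The mean value and maximum value formulas described for fig. 3 and fig. 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `task_reference`, the dictionary layout and the operation keys are invented for the example, while the numeric values are those quoted above (memory in GB). How a non-integer mean of a core or instance count would be rounded is not stated in the text and is left open here.

```python
from statistics import mean

# Reference values per atomic operation, as quoted for fig. 3 and fig. 4.
# Memory values are in GB; the key names are illustrative assumptions.
REFERENCE = {
    "default_processing": {
        "executor.memory": 0.5, "executor.cores": 1, "executor.instances": 2,
        "driver.memory": 1.0, "driver.cores": 1,
    },
    "conditional_replacement": {
        "executor.memory": 0.3, "executor.cores": 1, "executor.instances": 2,
        "driver.memory": 1.0, "driver.cores": 1,
    },
}


def task_reference(operations, reference, formula="mean"):
    """Combine per-operation reference values into task-level values."""
    if formula not in ("mean", "max"):
        raise ValueError("formula must be 'mean' or 'max'")
    combine = mean if formula == "mean" else max
    names = reference[operations[0]].keys()
    return {
        name: combine([reference[op][name] for op in operations])
        for name in names
    }


ops = ["default_processing", "conditional_replacement"]
by_mean = task_reference(ops, REFERENCE, "mean")  # executor.memory is about 0.4
by_max = task_reference(ops, REFERENCE, "max")    # executor.memory is 0.5
```

The maximum value formula is the more conservative choice, since it never allocates less than the most demanding atomic operation requires.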
After the reference parameter value of the SPARK data cleaning task is determined, the basic resource configuration parameters of the SPARK data cleaning task need to be determined.
Optionally, determining a data adjustment parameter value according to the data volume of the SPARK data cleaning task and the reference data;
and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.
In implementation, assume that the data volume of the SPARK data cleaning task is $D_i$, the reference data is $D_b$, and the reference parameter value of the SPARK data cleaning task is $R_x$, where $R_x$ denotes the value of each reference parameter of the SPARK data cleaning task, such as an executor parameter. The data adjustment parameter value $D_i / D_b$ is determined according to the data volume $D_i$ of the SPARK data cleaning task and the reference data $D_b$, and $D_i / D_b \times R_x$ gives the basic resource configuration parameters of the SPARK data cleaning task.
For example, as shown in FIG. 5, the data volume $D_i$ of the SPARK data cleaning task is 2G and the reference data $D_b$ is 1G. Among the reference parameter values of the SPARK data cleaning task, the executor.memory value is 0.4g, the executor.cores value is 1, the executor.instances value is 2, the driver.memory value is 1g, and the driver.cores value is 1. The basic resource configuration parameters of the SPARK data cleaning task are therefore: executor.memory 0.8g, executor.cores 2, executor.instances 4, driver.memory 2g, and driver.cores 2.
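The scaling step can be checked with a short sketch. It assumes that every reference parameter is multiplied by the same ratio of task data volume to reference data volume, as in the FIG. 5 example; the function name `scale_reference` and the dictionary keys are illustrative only.

```python
def scale_reference(task_reference, task_volume_gb, reference_volume_gb):
    """Multiply every reference value R_x by the ratio D_i / D_b."""
    if reference_volume_gb <= 0:
        raise ValueError("reference data volume must be positive")
    factor = task_volume_gb / reference_volume_gb  # D_i / D_b
    return {name: value * factor for name, value in task_reference.items()}


# Task-level reference values from the FIG. 5 example (memory in GB).
reference = {
    "executor.memory": 0.4, "executor.cores": 1, "executor.instances": 2,
    "driver.memory": 1.0, "driver.cores": 1,
}
basic = scale_reference(reference, task_volume_gb=2, reference_volume_gb=1)
```

With a data volume of 2G against reference data of 1G, every value doubles, which reproduces the figures quoted in the example.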
After the basic resource configuration parameters of the SPARK data cleaning task are determined, some parameters still need to be adjusted in order to improve SPARK performance and make full use of resources, because the execution of a SPARK data cleaning task is affected by factors that the basic configuration does not control: for example, different SPARK data cleaning tasks have different data volumes, contain different numbers of atomic operations, and may be subject to special execution requirements of users.
Optionally, after obtaining the basic resource configuration parameters of the SPARK data cleansing task, the method further includes performing part or all of the following adjustment modes:
adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
In an implementation, the adjustment of the partial parameters may be performed in the task parameter adjustment module.
The parameters that may be adjusted include the parallelism parameter, the executor memory parameter and the executor number parameter. The executor memory parameter and the executor number parameter are two of the basic resource configuration parameters; that is, after they have been obtained from the trained reference values, they are further adjusted according to the number of atomic operations of the SPARK data cleaning task and the number of computing nodes available when the SPARK data cleaning task is executed.
In a specific adjustment, one or more of these parameters may be adjusted.
Specifically, adjusting the parallelism parameter value according to the data size of the SPARK task includes:
determining a parallelism parameter value according to the data volume of the SPARK data cleaning task;
and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.
Wherein, the preset proportion can be set manually.
When SPARK executes a SPARK data cleaning task, the task is divided into a plurality of tasks according to the data volume of the SPARK data cleaning task and the block (data block) size of the cluster. The number of tasks is the number of slices, and the parallelism parameter value is the number of slices divided by a preset value, such as 2, 3 or another value.
For example, if the data volume of the SPARK data cleaning task is 500G and the block (data block) size of the cluster is 128M, the number of slices is the ratio of the data volume to the block size, that is, 4000, and the parallelism parameter is 2000 (4000/2) or 4000/3, or may be larger.
The parallelism parameter is generally increased step by step by a certain proportion, because massive data can be sliced and executed in batches. If the parallelism is not set, SPARK sets it automatically according to the cluster block size.
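The slice and parallelism arithmetic above can be sketched as follows. The function names are illustrative, and three details are assumptions not fixed by the text: the slice count is rounded up, the division by the preset value is an integer division, and the increase by a preset proportion is taken to be multiplicative.

```python
import math


def slice_count(data_volume_mb, block_size_mb=128):
    """Number of slices: data volume divided by block size, rounded up."""
    return math.ceil(data_volume_mb / block_size_mb)


def initial_parallelism(data_volume_mb, block_size_mb=128, divisor=2):
    """Parallelism: number of slices divided by a preset value."""
    return max(1, slice_count(data_volume_mb, block_size_mb) // divisor)


def grow_parallelism(parallelism, ratio):
    """Increase the parallelism by a preset proportion (assumed multiplicative)."""
    return max(parallelism + 1, int(parallelism * (1 + ratio)))


volume_mb = 500 * 1024                                   # 500G expressed in MB
slices = slice_count(volume_mb)                          # 4000
parallelism = initial_parallelism(volume_mb, divisor=2)  # 2000
```

Calling `initial_parallelism(volume_mb, divisor=3)` gives the 4000/3 alternative mentioned in the example, truncated to an integer.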
Specifically, adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task includes:
determining a first adjustment proportion according to the memory size of the cluster machines;
increasing, according to the first adjustment proportion, the size of the memory in the device that can be used for executing the SPARK data cleaning task;
if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task with the increased memory value;
and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task with the maximum memory value.
In implementation, the first adjustment proportion is determined according to the cluster machine memory parameter.
Assume that the memory available for executing tasks is 75% of the machine memory. On this basis, an acceptable reference value for the memory of a single executor is set, generally 1GB or 2GB; the specific value is configured by the basic parameter configuration module.
After the reference value is set, a maximum memory value also needs to be set, because the memory cannot be expanded without limit however many atomic operations there are; once a task is complex enough, its expected execution time should be lengthened instead. The maximum memory value is set according to the memory size of the machine: for example, if the machine memory is 64GB, the maximum executor memory can be set to 4GB or 8GB; if the machine memory is 128GB, the maximum executor memory can be set to 8GB or 16GB.
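The increase-then-cap rule for executor memory can be illustrated with a small sketch. The text does not say how the first adjustment proportion is computed, so it is passed in as an input here; the 1/16 fraction used for the cap is only an assumption chosen to match the lower figures in the examples (4GB for a 64GB machine, 8GB for a 128GB machine), and both function names are hypothetical.

```python
def max_executor_memory_gb(machine_memory_gb, fraction=1 / 16):
    """Upper bound for executor memory; the 1/16 fraction is an assumption."""
    return machine_memory_gb * fraction


def adjust_executor_memory(base_gb, first_ratio, max_gb):
    """Increase executor memory by the first adjustment proportion, then cap it."""
    increased = base_gb * (1 + first_ratio)
    return min(increased, max_gb)


cap_gb = max_executor_memory_gb(64)                  # 4.0 for a 64GB machine
within_cap = adjust_executor_memory(0.8, 0.5, cap_gb)  # stays below the cap
capped = adjust_executor_memory(3.0, 1.0, cap_gb)      # limited to the cap
```

Capping rather than rejecting the request reflects the reasoning above: a sufficiently complex task is expected to run longer instead of receiving unlimited memory.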
Specifically, adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task includes:
determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines;
increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion;
if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task with the increased executor number parameter value;
if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task with the maximum number of executors;
wherein the maximum number of executors is determined according to the memory size of the executors.
In implementation, the second adjustment proportion is determined according to the memory size of the cluster machine and the number of the machines.
For example, it is recommended that the number of executors on each machine not exceed 5, so the number of executors assigned to a task should be less than or equal to the number of machines multiplied by 5.
If the executor number parameter value in the basic resource configuration parameters, after being increased according to the second adjustment proportion, is larger than the number of machines multiplied by 5, the received SPARK data cleaning task is executed with a number of executors equal to the number of machines multiplied by 5.
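The executor-count rule can be sketched in the same way. The limit of 5 executors per machine comes from the example above; the second adjustment proportion is treated as an input because its derivation is not spelled out, rounding up to a whole executor is an assumption, and the function names are hypothetical.

```python
import math


def max_executors(machine_count, per_machine_limit=5):
    """Upper bound on executors: machines multiplied by the per-machine limit."""
    return machine_count * per_machine_limit


def adjust_executor_count(base_count, second_ratio, machine_count,
                          per_machine_limit=5):
    """Increase the executor count by the second proportion, then cap it."""
    increased = math.ceil(base_count * (1 + second_ratio))
    return min(increased, max_executors(machine_count, per_machine_limit))


within_cap = adjust_executor_count(4, 0.5, machine_count=3)  # 6 executors
capped = adjust_executor_count(4, 4.0, machine_count=3)      # capped at 15
```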
As shown in fig. 6, a schematic flow chart of parameter configuration according to an embodiment of the present invention is as follows.
Step 600, the reference parameter training module trains, under the reference data, the reference parameter values of each atomic operation that meet the expected execution time;
step 601, the reference parameter values are stored in the reference parameter database;
step 602, a task is submitted;
step 603, the task feature extraction module obtains the data volume of the submitted task and the atomic operations contained in the task;
step 604, the task parameter configuration module obtains the basic resource configuration parameters according to the reference parameter values stored in the reference parameter database, the data volume of the task obtained by the task feature extraction module, and the atomic operations contained in the program corresponding to the task;
step 605, the task parameter adjustment module adjusts one or more of the parallelism parameter, the executor memory parameter in the basic resource configuration parameters, and the executor number parameter in the basic resource configuration parameters;
and step 606, the processed task is submitted.
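As an illustration of the final submission step, the sketch below turns a set of configured values into `--conf` arguments for `spark-submit`. The property names (`spark.executor.memory`, `spark.executor.cores`, `spark.executor.instances`, `spark.driver.memory`, `spark.driver.cores`, `spark.default.parallelism`) are standard Spark configuration keys; the helper names, the input dictionary layout and the conversion of fractional gigabytes to whole megabytes are assumptions made for the example, the last because Spark memory strings are understood not to accept fractional values.

```python
def to_spark_conf(params, parallelism):
    """Map configured values (memory in GB) to Spark configuration properties."""
    def memory_mb(gb):
        # Express memory as a whole number of megabytes, e.g. 0.8 GB -> "819m".
        return f"{int(gb * 1024)}m"

    return {
        "spark.executor.memory": memory_mb(params["executor.memory"]),
        "spark.executor.cores": str(int(params["executor.cores"])),
        "spark.executor.instances": str(int(params["executor.instances"])),
        "spark.driver.memory": memory_mb(params["driver.memory"]),
        "spark.driver.cores": str(int(params["driver.cores"])),
        "spark.default.parallelism": str(parallelism),
    }


def to_submit_args(conf):
    """Render the properties as a flat list of spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


params = {
    "executor.memory": 0.8, "executor.cores": 2, "executor.instances": 4,
    "driver.memory": 2.0, "driver.cores": 2,
}
conf = to_spark_conf(params, parallelism=2000)
submit_args = to_submit_args(conf)
```

The resulting list can be appended to a `spark-submit` command line ahead of the application file.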
Based on the same inventive concept, the embodiment of the present invention further provides a device for parameter configuration, and since the principle of the device for solving the problem is similar to the method for parameter configuration in the embodiment of the present invention, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 7, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: processor 700 and transceiver 701:
the processor 700 is configured to determine a data size of a received fast general-purpose computing engine SPARK data cleansing task and at least one atomic operation included in a SPARK data cleansing program corresponding to the SPARK data cleansing task; determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the benchmark parameter value is a benchmark parameter value under the benchmark data for a different atomic operation trained from the benchmark data and the expected execution time in a defined cluster environment including at least one computing node.
Optionally, the processor 700 is specifically configured to:
performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result against the preconfigured atomic operations of the data cleaning field; and determining the at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
Optionally, the processor 700 is specifically configured to:
determining the reference parameter value of each of the at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
Optionally, the processor 700 is specifically configured to:
determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.
Optionally, the processor 700 is further configured to:
performing some or all of the following adjustment modes:
adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
Optionally, the processor 700 is specifically configured to:
determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.
Optionally, the processor 700 is specifically configured to:
determining a first adjustment proportion according to the memory size of the cluster machines; increasing, according to the first adjustment proportion, the size of the memory in the device that can be used for executing the SPARK data cleaning task; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task with the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task with the maximum memory value.
Optionally, the processor 700 is specifically configured to:
determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task with the increased executor number parameter value; if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task with the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.
Based on the same inventive concept, the embodiment of the present invention further provides a device for parameter configuration, and since the principle of the device for solving the problem is similar to the method for parameter configuration in the embodiment of the present invention, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 8, an embodiment of the present invention further provides a device for parameter configuration, where the device includes: at least one processing unit 800 and at least one memory unit 801, wherein the memory unit 801 stores program code that, when executed by the processing unit 800, causes the processing unit 800 to perform the following:
determining the data volume of a received SPARK data cleaning task of a quick and general computing engine and at least one atomic operation contained in a SPARK data cleaning program corresponding to the SPARK data cleaning task; determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and at least one atomic operation contained in an SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the benchmark parameter value is a benchmark parameter value under the benchmark data for a different atomic operation trained from the benchmark data and the expected execution time in a defined cluster environment including at least one computing node.
Optionally, the processing unit 800 is specifically configured to:
performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result against the preconfigured atomic operations of the data cleaning field; and determining the at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
Optionally, the processing unit 800 is specifically configured to:
determining the reference parameter value of each of the at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
Optionally, the processing unit 800 is specifically configured to:
determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.
Optionally, the processing unit 800 is further configured to:
performing some or all of the following adjustment modes:
adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
Optionally, the processing unit 800 is specifically configured to:
determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.
Optionally, the processing unit 800 is specifically configured to:
determining a first adjustment proportion according to the memory size of the cluster machines; increasing, according to the first adjustment proportion, the size of the memory in the device that can be used for executing the SPARK data cleaning task; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task with the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task with the maximum memory value.
Optionally, the processing unit 800 is specifically configured to:
determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task with the increased executor number parameter value; if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task with the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.
Based on the same inventive concept, the embodiment of the present invention further provides a device for parameter configuration, and since the principle of the device for solving the problem is similar to the method for parameter configuration in the embodiment of the present invention, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 9, an embodiment of the present invention further provides an apparatus for parameter configuration, which includes a determining module 900 and a processing module 901:
a determining module 900, configured to determine a data size of a received fast general-purpose computing engine SPARK data cleansing task and at least one atomic operation included in a SPARK data cleansing program corresponding to the SPARK data cleansing task;
a processing module 901, configured to determine a basic resource configuration parameter of the SPARK data cleaning task according to a reference parameter value, a data size of the SPARK data cleaning task, and at least one atomic operation included in a SPARK data cleaning program corresponding to the SPARK data cleaning task.
Optionally, the processing module 901 is specifically configured to:
performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result against the preconfigured atomic operations of the data cleaning field; and determining the at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
Optionally, the processing module 901 is specifically configured to:
determining the reference parameter value of each of the at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter values and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
Optionally, the processing module 901 is specifically configured to:
determining a data adjusting parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjusting parameter value to obtain the basic resource configuration parameter of the SPARK data cleaning task.
Optionally, the processing module 901 is further configured to:
performing some or all of the following adjustment modes:
adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
Optionally, the processing module 901 is specifically configured to:
determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and before the parallelism parameter value is used for processing the data volume of the SPARK data cleaning task each time, increasing the parallelism parameter value according to a preset proportion.
Optionally, the processing module 901 is specifically configured to:
determining a first adjustment proportion according to the memory size of the cluster machines; increasing, according to the first adjustment proportion, the size of the memory in the device that can be used for executing the SPARK data cleaning task; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed the maximum memory value, executing the SPARK data cleaning task with the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task with the maximum memory value.
Optionally, the processing module 901 is specifically configured to:
determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task with the increased executor number parameter value; if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task with the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.
An embodiment of the present invention further provides a device-readable storage medium for parameter configuration, where the readable storage medium is non-volatile and includes program code which, when run on a computing device, causes the computing device to execute the steps of the parameter configuration method described above.
The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the subject application may also be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (18)

1. A method for configuring parameters, the method comprising:
determining the data volume of a received data cleaning task of SPARK, a fast and general-purpose computing engine, and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task;
determining basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task;
wherein the reference parameter value is a reference parameter value, under the reference data, for each different atomic operation, obtained by training according to the reference data and an expected execution time in a defined cluster environment including at least one computing node.
2. The method of claim 1, wherein determining the at least one atomic operation contained in the SPARK data cleaning program comprises:
performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program;
matching the analysis result against the pre-configured atomic operations of the data cleaning field;
and determining the at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
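As a non-limiting illustration, the matching step of claim 2 could be sketched as follows. The sketch substitutes simple regular-expression matching for the generalized analysis, which the claim does not specify; the catalogue `ATOMIC_OPERATIONS`, its operation names, and the function `find_atomic_operations` are hypothetical and are not taken from the claims.

```python
import re

# Hypothetical catalogue of pre-configured atomic operations for the data
# cleaning field. Names and patterns are illustrative assumptions only.
ATOMIC_OPERATIONS = {
    "filter": re.compile(r"\.(filter|where)\s*\("),
    "deduplicate": re.compile(r"\.(distinct|dropDuplicates)\s*\("),
    "fill_missing": re.compile(r"\.(fillna|na\.fill)\s*\("),
    "join": re.compile(r"\.join\s*\("),
}


def find_atomic_operations(source_code: str) -> list[str]:
    """Return the catalogued atomic operations whose pattern occurs in the code."""
    return [
        name
        for name, pattern in ATOMIC_OPERATIONS.items()
        if pattern.search(source_code)
    ]


# A one-line stand-in for the code of a data cleaning program.
program = 'df.filter("age > 0").dropDuplicates(["id"]).na.fill(0)'
operations = find_atomic_operations(program)
print(operations)
```

A real analysis would more plausibly work on a parsed representation of the program than on raw text; the regular expressions only keep the example short.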
3. The method of claim 1, wherein determining the basic resource configuration parameters of the SPARK data cleaning task according to the reference parameter value, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program comprises:
determining a reference parameter value of each of the at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter value and the at least one atomic operation contained in the SPARK data cleaning program;
determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program;
and determining the basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
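As a non-limiting illustration, combining per-operation reference values into a task-level reference value with a mean or a maximum could look like the sketch below. The table `REFERENCE_MEMORY_MB`, its numbers, and the function name are assumptions made for the example only.

```python
from statistics import mean

# Hypothetical per-operation reference values (executor memory in MB under the
# reference data). The figures are invented for illustration.
REFERENCE_MEMORY_MB = {"filter": 1024, "deduplicate": 2048, "join": 4096}


def task_reference_value(operations, per_operation, strategy="mean"):
    """Combine per-operation reference values into one task-level value."""
    values = [per_operation[op] for op in operations]
    if not values:
        raise ValueError("the task contains no known atomic operation")
    # "max" sizes for the most demanding operation, "mean" for the average.
    return max(values) if strategy == "max" else mean(values)


mean_value = task_reference_value(["filter", "join"], REFERENCE_MEMORY_MB)
max_value = task_reference_value(["filter", "join"], REFERENCE_MEMORY_MB, "max")
print(mean_value, max_value)
```

The maximum is the more conservative choice, since it provisions for the most demanding operation in the program.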
4. The method of claim 3, wherein determining the basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task comprises:
determining a data adjustment parameter value according to the data volume of the SPARK data cleaning task and the reference data;
and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjustment parameter value to obtain the basic resource configuration parameters of the SPARK data cleaning task.
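As a non-limiting illustration, one simple reading of claim 4 treats the data adjustment parameter value as the ratio of the task's data volume to the volume of the reference data, and scales the reference parameter value linearly. The claim does not fix the formula, so the linear scaling and both function names below are assumptions.

```python
def data_adjustment_factor(task_data_gb: float, reference_data_gb: float) -> float:
    """Assumed adjustment value: task data volume relative to the reference data."""
    if reference_data_gb <= 0:
        raise ValueError("reference data volume must be positive")
    return task_data_gb / reference_data_gb


def basic_resource_parameter(reference_value: float, factor: float) -> float:
    """Scale the task-level reference value by the adjustment factor."""
    return reference_value * factor


# Example: reference values were measured on 10 GB; the task holds 50 GB.
factor = data_adjustment_factor(task_data_gb=50, reference_data_gb=10)
memory_mb = basic_resource_parameter(reference_value=2560, factor=factor)
print(factor, memory_mb)
```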
5. The method of claim 1, wherein obtaining the basic resource configuration parameters of the SPARK data cleaning task further comprises performing some or all of the following adjustment modes:
adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
6. The method of claim 5, wherein adjusting the parallelism parameter value according to the data volume of the SPARK data cleaning task comprises:
determining a parallelism parameter value according to the data volume of the SPARK data cleaning task;
and each time before the parallelism parameter value is used to process the data of the SPARK data cleaning task, increasing the parallelism parameter value by a preset proportion.
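As a non-limiting illustration, claim 6 could be sketched as below. The rule of one partition per 128 MB of input, the 25% preset proportion, and the function names are assumptions chosen for the example; the claim names neither a formula nor a specific Spark setting.

```python
import math


def initial_parallelism(data_volume_mb: float, partition_size_mb: int = 128) -> int:
    """Assumed rule: one parallel task per partition_size_mb of input data."""
    return max(1, math.ceil(data_volume_mb / partition_size_mb))


def next_parallelism(current: int, preset_ratio: float = 0.25) -> int:
    """Increase the parallelism by the preset proportion before it is used."""
    return math.ceil(current * (1 + preset_ratio))


parallelism = initial_parallelism(1000)        # 1000 MB of input data
first_run = next_parallelism(parallelism)      # value used for the first run
second_run = next_parallelism(first_run)       # value used for the next run
print(parallelism, first_run, second_run)
```

In a Spark deployment the resulting value might be applied through a setting such as `spark.default.parallelism`, although that mapping is an assumption here.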
7. The method of claim 5, wherein adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task comprises:
determining a first adjustment proportion according to the amount of memory of the cluster machines;
increasing, according to the first adjustment proportion, the size of the memory in the device that can be used for executing the SPARK data cleaning task;
if the increased memory that can be used for executing the SPARK data cleaning task does not exceed a maximum memory value, executing the SPARK data cleaning task using the increased memory value;
and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task using the maximum memory value.
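As a non-limiting illustration, the increase-then-cap logic of claim 7 reduces to a few lines. How the first adjustment proportion is derived from the cluster machines' memory is not specified in the claim, so it is passed in as a plain number; the function name is hypothetical.

```python
def increase_executor_memory(current_mb: int, first_ratio: float, max_mb: int) -> int:
    """Grow the usable memory by first_ratio, never exceeding max_mb."""
    increased = int(current_mb * (1 + first_ratio))
    # Use the increased value if it fits, otherwise fall back to the maximum.
    return min(increased, max_mb)


within_limit = increase_executor_memory(4096, first_ratio=0.5, max_mb=8192)
capped = increase_executor_memory(6144, first_ratio=0.5, max_mb=8192)
print(within_limit, capped)
```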
8. The method of claim 5, wherein adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task comprises:
determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines;
increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion;
if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task using the increased executor number parameter value;
if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task using the maximum number of executors;
wherein the maximum number of executors is determined according to the memory size of the executors.
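As a non-limiting illustration, claim 8 could be sketched as below. The way the maximum number of executors is computed (how many executors of a given memory size fit on each machine, times the number of machines) is an assumption, as are the function names. The property names `spark.executor.instances` and `spark.executor.memory` are standard Spark configuration keys, but mapping the claimed parameters onto them is also an assumption.

```python
import math


def max_executors(machine_memory_mb: int, machine_count: int,
                  executor_memory_mb: int) -> int:
    """Assumed cap: executors that fit in each machine's memory, over all machines."""
    return (machine_memory_mb // executor_memory_mb) * machine_count


def increase_executor_count(current: int, second_ratio: float, limit: int) -> int:
    """Grow the executor count by second_ratio, never exceeding the limit."""
    return min(math.ceil(current * (1 + second_ratio)), limit)


limit = max_executors(machine_memory_mb=65536, machine_count=4,
                      executor_memory_mb=8192)
within_limit = increase_executor_count(10, second_ratio=0.5, limit=limit)
capped = increase_executor_count(30, second_ratio=0.5, limit=limit)

# Possible mapping onto Spark configuration properties.
spark_conf = {
    "spark.executor.instances": str(within_limit),
    "spark.executor.memory": "8192m",
}
print(limit, within_limit, capped, spark_conf)
```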
9. An apparatus for parameter configuration, the apparatus comprising: a processor and a transceiver:
the processor is configured to determine the data volume of a received data cleaning task of SPARK, a fast and general-purpose computing engine, and at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task; and to determine basic resource configuration parameters of the SPARK data cleaning task according to a reference parameter value, the data volume of the SPARK data cleaning task and the at least one atomic operation contained in the SPARK data cleaning program corresponding to the SPARK data cleaning task; wherein the reference parameter value is a reference parameter value, under the reference data, for each different atomic operation, obtained by training according to the reference data and an expected execution time in a defined cluster environment including at least one computing node.
10. The device of claim 9, wherein the processor is specifically configured to:
performing generalized analysis on the code of the SPARK data cleaning task in the SPARK data cleaning program; matching the analysis result against the pre-configured atomic operations of the data cleaning field; and determining the at least one atomic operation contained in the SPARK data cleaning program according to the matching result.
11. The device of claim 9, wherein the processor is specifically configured to:
determining a reference parameter value of each of the at least one atomic operation contained in the SPARK data cleaning program according to the reference parameter value and the at least one atomic operation contained in the SPARK data cleaning program; determining a reference parameter value of the SPARK data cleaning task by using a mean value formula or a maximum value formula according to the reference parameter value of the at least one atomic operation contained in the SPARK data cleaning program; and determining the basic resource configuration parameters of the SPARK data cleaning task according to the reference data, the reference parameter value of the SPARK data cleaning task and the data volume of the SPARK data cleaning task.
12. The device of claim 11, wherein the processor is specifically configured to:
determining a data adjustment parameter value according to the data volume of the SPARK data cleaning task and the reference data; and adjusting the reference parameter value of the SPARK data cleaning task according to the data adjustment parameter value to obtain the basic resource configuration parameters of the SPARK data cleaning task.
13. The device of claim 9, wherein the processor is further configured to:
perform some or all of the following adjustment modes:
adjustment mode 1: adjusting a parallelism parameter value according to the data volume of the SPARK data cleaning task;
adjustment mode 2: adjusting the executor memory parameter value in the basic resource configuration parameters according to the number of atomic operations of the SPARK data cleaning task;
adjustment mode 3: adjusting the executor number parameter value in the basic resource configuration parameters according to the execution requirements of the SPARK data cleaning task.
14. The device of claim 13, wherein the processor is specifically configured to:
determining a parallelism parameter value according to the data volume of the SPARK data cleaning task; and each time before the parallelism parameter value is used to process the data of the SPARK data cleaning task, increasing the parallelism parameter value by a preset proportion.
15. The device of claim 13, wherein the processor is specifically configured to:
determining a first adjustment proportion according to the amount of memory of the cluster machines; increasing, according to the first adjustment proportion, the size of the memory in the device that can be used for executing the SPARK data cleaning task; if the increased memory that can be used for executing the SPARK data cleaning task does not exceed a maximum memory value, executing the SPARK data cleaning task using the increased memory value; and if the increased memory that can be used for executing the SPARK data cleaning task exceeds the maximum memory value, executing the SPARK data cleaning task using the maximum memory value.
16. The device of claim 13, wherein the processor is specifically configured to:
determining a second adjustment proportion according to the memory size of the cluster machines and the number of machines; increasing the executor number parameter value in the basic resource configuration parameters according to the second adjustment proportion; if the increased executor number parameter value does not exceed the maximum number of executors, executing the SPARK data cleaning task using the increased executor number parameter value; if the increased executor number parameter value exceeds the maximum number of executors, executing the SPARK data cleaning task using the maximum number of executors; wherein the maximum number of executors is determined according to the memory size of the executors.
17. An apparatus for parameter configuration, the apparatus comprising: at least one processing unit and at least one memory unit, wherein the memory unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the method of any of claims 1 to 8.
18. A computer storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the steps of the method according to any one of claims 1 to 8.
CN201810385919.5A 2018-04-26 2018-04-26 Parameter configuration method and equipment Active CN110427356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810385919.5A CN110427356B (en) 2018-04-26 2018-04-26 Parameter configuration method and equipment

Publications (2)

Publication Number Publication Date
CN110427356A CN110427356A (en) 2019-11-08
CN110427356B true CN110427356B (en) 2021-08-13

Family

ID=68408248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810385919.5A Active CN110427356B (en) 2018-04-26 2018-04-26 Parameter configuration method and equipment

Country Status (1)

Country Link
CN (1) CN110427356B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609309A (en) * 2012-01-19 2012-07-25 中兴通讯股份有限公司 Strategy scheduling system for cloud computing and strategy scheduling method for cloud computing
CN103593323A (en) * 2013-11-07 2014-02-19 浪潮电子信息产业股份有限公司 Machine learning method for Map Reduce task resource allocation parameters
CN106126641A (en) * 2016-06-24 2016-11-16 中国科学技术大学 A kind of real-time recommendation system and method based on Spark
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
CN106202569A (en) * 2016-08-09 2016-12-07 北京北信源软件股份有限公司 A kind of cleaning method based on big data quantity
CN106294745A (en) * 2016-08-10 2017-01-04 东方网力科技股份有限公司 Big data cleaning method and device
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN107194490A (en) * 2016-03-14 2017-09-22 商业对象软件有限公司 Predict modeling optimization
CN107229693A (en) * 2017-05-22 2017-10-03 哈工大大数据产业有限公司 The method and system of big data system configuration parameter tuning based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
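"Electric power big data cleaning model based on the Spark framework"; Wang Chong et al.; Electrical Measurement & Instrumentation (电测与仪表); 20170822 (No. 14, 2017); 33-38 *
"Parallel algorithm for LIBSVM parameter optimization based on Spark"; Li Kun et al.; Journal of Nanjing University (Natural Sciences) (南京大学学报(自然科学)); 20160331; Vol. 52 (No. 2); 343-352 *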

Also Published As

Publication number Publication date
CN110427356A (en) 2019-11-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant