CN113157538B - Spark operation parameter determination method, device, equipment and storage medium - Google Patents

Spark operation parameter determination method, device, equipment and storage medium

Info

Publication number
CN113157538B
Authority
CN
China
Prior art keywords
operation parameter
parameter set
model
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110142577.6A
Other languages
Chinese (zh)
Other versions
CN113157538A (en
Inventor
童轩
李杨
孔庆云
潘登
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Tianhe Defense Technology Co ltd
Original Assignee
Xi'an Tianhe Defense Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Tianhe Defense Technology Co ltd filed Critical Xi'an Tianhe Defense Technology Co ltd
Priority to CN202110142577.6A priority Critical patent/CN113157538B/en
Publication of CN113157538A publication Critical patent/CN113157538A/en
Application granted granted Critical
Publication of CN113157538B publication Critical patent/CN113157538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Physiology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)
  • Combined Controls Of Internal Combustion Engines (AREA)

Abstract

The application is applicable to the technical field of computers, and provides a method, a device, equipment and a storage medium for determining Spark operating parameters, wherein the method comprises the following steps: acquiring a first operation parameter set and a second operation parameter set, wherein the first operation parameter set and the second operation parameter set are acquired in different modes; inputting the data size and the task type of the task data into a first model based on a first operation parameter set, and determining a target model in the first model according to a first output result, wherein the first model comprises a plurality of data models; and inputting the data size and the task type into the target model based on the second operation parameter set, and determining the target operation parameters of Spark according to the second output result. According to the method, firstly, a target model matched with the task type is determined based on a first operation parameter set, then, a target operation parameter matched with the task type is determined based on a second operation parameter set, the optimal operation parameter can be determined aiming at task data of different task types, and therefore the accuracy of Spark operation parameters is improved.

Description

Spark operation parameter determination method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of computers, and particularly relates to a method, a device, equipment and a storage medium for determining Spark operating parameters.
Background
Apache Spark (Spark for short) is a fast, general-purpose computing engine designed for large-scale data processing. With the exponential growth of data scale in recent years, Spark has become one of the main tools for data mining in industry. Spark has several hundred operating parameters, and their settings have a crucial influence on Spark's operating efficiency; setting them manually is time-consuming and labor-intensive.
In the conventional technology, a certain training set is usually adopted to train to obtain a network model, and then the network model is used to determine the operating parameters of Spark through a genetic algorithm. However, the accuracy of the operating parameters determined in the conventional art is not high.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for determining Spark operating parameters, and can solve the problem that the accuracy of the determined operating parameters in the prior art is not high.
In a first aspect, an embodiment of the present application provides a method for determining a Spark operating parameter, including:
acquiring a first operation parameter set and a second operation parameter set, wherein the first operation parameter set and the second operation parameter set are acquired in different modes;
inputting the data size and the task type of the task data into a first model based on the first operation parameter set, and determining a target model in the first model according to a first output result, wherein the first model comprises a plurality of data models;
and inputting the data size and the task type into the target model based on the second operation parameter set, and determining a target operation parameter of Spark according to a second output result.
In the above embodiment, the target model matched with the task type is determined based on the first operation parameter set, and then the target operation parameter matched with the task type is determined based on the second operation parameter set, so that the corresponding optimal operation parameter can be determined for the task data of different task types, and the accuracy of the determined Spark operation parameter is further improved.
In a possible implementation manner of the first aspect, inputting a data size of task data and a task type into a first model based on the first operation parameter set, and determining a target model in the first model according to a first output result includes:
for each data model, inputting the data size and the task type into the data model based on the first operation parameter set, and outputting a predicted operation time corresponding to each group of first operation parameters;
and determining a target model in the plurality of data models according to the predicted operation time corresponding to each group of first operation parameters.
In the above embodiment, the target model in the multiple data models is determined by analyzing the predicted running time output by each data model, that is, the target model matched with the current task type is obtained, and then the optimal running parameter when the task data of the type is processed is determined, so that the accuracy of the determined Spark running parameter is improved.
In a possible implementation manner of the first aspect, determining a target model in the plurality of data models according to the predicted operation time corresponding to each group of the first operation parameters includes:
determining a decision coefficient corresponding to each data model according to the predicted operation time corresponding to each group of first operation parameters for each data model;
and taking the data model corresponding to the maximum decision coefficient as the target model.
In the above embodiment, the decision coefficient of each data model is determined by analyzing the predicted running time output by each data model, so as to obtain the target model matched with the current task type and, in turn, to determine the optimal running parameters for processing task data of that type, thereby improving the accuracy of the determined Spark running parameters.
In a possible implementation manner of the first aspect, inputting the data size and the task type into the target model based on the second operation parameter set, and determining a target operation parameter of Spark according to a second output result includes:
inputting the data size and the task type into the target model based on the second operation parameter set to obtain predicted operation time corresponding to each group of second operation parameters;
and taking the second operation parameter corresponding to the shortest predicted operation time as the target operation parameter.
In the above embodiment, the target operation parameter matched with the task type is determined based on the second operation parameter set, and the corresponding optimal operation parameter may be determined for task data of different task types, so that the accuracy of the determined Spark operation parameter is improved.
In a possible implementation manner of the first aspect, the method further includes:
and adding the target operation parameters into the first operation parameter set, and storing the target operation parameters and the predicted operation time corresponding to the target operation parameters into a database.
In the above embodiment, when the target operation parameter needs to be determined for the task data of another task type, the extended first operation parameter set may be used to determine the target model, that is, the parameter set data is continuously enriched, so as to reduce the model error and improve the accuracy of the algorithm.
In a possible implementation manner of the first aspect, after the first operation parameter set is obtained, the method further includes:
and carrying out interpolation processing on null value data in the first operation parameter set by adopting a preset interpolation algorithm to obtain a processed first operation parameter set.
In the above embodiment, by performing interpolation processing on null data, the comprehensiveness of the first operation parameter set can be improved, which is helpful for improving the accuracy of the determined target model.
In a possible implementation manner of the first aspect, the method further includes:
if the first operation parameter set comprises non-numerical operation parameters, converting the non-numerical operation parameters into numerical operation parameters.
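The conversion of non-numerical operation parameters can be illustrated with a minimal sketch. It assumes two common encodings (ordinal and one-hot); the function names and the sample values for `spark.io.compression.codec` are illustrative and are not taken from the application.

```python
from typing import List, Sequence


def ordinal_encode(values: Sequence[str]) -> List[int]:
    """Map each distinct category to an integer index (categories sorted by name)."""
    index = {category: i for i, category in enumerate(sorted(set(values)))}
    return [index[value] for value in values]


def one_hot_encode(values: Sequence[str]) -> List[List[int]]:
    """Map each value to a 0/1 vector with a single 1 at its category position."""
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# Hypothetical sample: codec values seen across four groups of operation parameters.
codecs = ["lz4", "snappy", "lz4", "zstd"]
ordinal = ordinal_encode(codecs)
one_hot = one_hot_encode(codecs)
```

Ordinal encoding keeps one column per parameter but implies an order between categories; one-hot encoding avoids that implied order at the cost of one column per category.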
In a second aspect, an embodiment of the present application provides a device for determining a Spark operating parameter, including:
an acquisition module, configured to acquire a first operation parameter set and a second operation parameter set, where the first operation parameter set and the second operation parameter set are acquired in different manners;
a first determining module, configured to input a data size and a task type of task data into a first model based on the first operation parameter set, and determine a target model in the first model according to a first output result, where the first model includes multiple data models;
and the second determining module is used for inputting the data size and the task type into the target model based on the second operation parameter set and determining target operation parameters of Spark according to a second output result.
In a third aspect, an embodiment of the present application provides a computer device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for determining Spark operating parameters according to any of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for determining a Spark operating parameter according to any one of the above first aspects is implemented.
In a fifth aspect, an embodiment of the present application provides a computer program product, which when run on a computer device, causes the computer device to execute the method for determining a Spark operating parameter according to any one of the above first aspects.
It is to be understood that, for the beneficial effects of the second aspect to the fifth aspect, reference may be made to the relevant description in the first aspect, and details are not described herein again.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic structural diagram of a computer device provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for determining a Spark operating parameter according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for determining a Spark operating parameter according to another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a device for determining a Spark operating parameter according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The Spark operation parameter determining method provided in the embodiment of the present application may be applied to a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, a super-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), a server, and other computer devices, and the embodiment of the present application does not limit specific types of the computer devices.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in Fig. 1, the computer device 1 of this embodiment includes: at least one processor 10 (only one is shown in Fig. 1), a memory 11, and a computer program 12 stored in the memory 11 and executable on the at least one processor 10, wherein the processor 10, when executing the computer program 12, implements the steps of the method for determining Spark operating parameters in any of the method embodiments of the present application.
The computer device 1 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer device 1 may include, but is not limited to, a processor 10, a memory 11. Those skilled in the art will appreciate that fig. 1 is merely an example of the computer device 1, and does not constitute a limitation of the computer device 1, and may include more or less components than those shown, or combine some components, or different components, such as an input-output device, a network access device, etc.
The processor 10 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 11 may in some embodiments be an internal storage unit of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash memory card (flash card), and the like provided on the computer device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the computer apparatus 1. The memory 11 is used for storing an operating system, an application program, a boot loader (bootloader), data, and other programs, such as a program code of the computer program. The memory 11 may also be used to temporarily store data that has been output or is to be output.
At present, Spark has up to four hundred operating parameters for large-scale data processing, and setting them manually is time-consuming and labor-intensive, so it is necessary to study how to set Spark operating parameters automatically. In the conventional technology, the running times of Spark tasks under different Spark operating parameters are usually collected to obtain a training set; a neural network model is trained on this training set, following the idea of genetic evolution, to obtain a performance prediction model; and the performance prediction model is then used to search for the optimal Spark operating parameters through a genetic algorithm. However, different Spark tasks are usually of different types, and if the performance prediction model of the conventional technology is adopted, the determined operating parameters cannot adapt to the different types of Spark tasks, so the accuracy of the operating parameters determined in the conventional technology is not high. The method, device, equipment and storage medium for determining Spark operating parameters provided by the embodiments of the present application aim to solve this technical problem.
Fig. 2 shows a schematic flow chart of a method for determining Spark operating parameters provided in the present application, which may be applied to the computer device 1 described above by way of example and not limitation, and includes:
S101, acquiring a first operation parameter set and a second operation parameter set, wherein the first operation parameter set and the second operation parameter set are acquired in different manners.
Specifically, the first operation parameter set may be obtained by the computer device based on a big data benchmark test platform BigDataBench, and the operation script of the Spark task in the BigDataBench is adapted to enable the computer device to operate using the customized operation parameter set, so as to collect the first operation parameter set. The second set of operating parameters may be randomly generated by the computer device according to a range of Spark operating parameters. It should be noted that, in this embodiment, the first operation parameter set and the second operation parameter set are obtained in different manners, but the manners of obtaining the two parameter sets are not limited.
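The random generation of the second operation parameter set can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the three Spark configuration keys are real, but the value ranges, the integer-gigabyte representation of executor memory, and the function name `random_parameter_sets` are assumptions.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical value ranges; real limits depend on the cluster being tuned.
PARAM_RANGES: Dict[str, Tuple[int, int]] = {
    "spark.executor.cores": (1, 8),
    "spark.executor.memory": (1, 16),          # assumed unit: whole gigabytes
    "spark.sql.shuffle.partitions": (50, 500),
}


def random_parameter_sets(count: int, seed: int = 0) -> List[Dict[str, int]]:
    """Draw `count` groups of operation parameters uniformly within the ranges."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [
        {name: rng.randint(low, high) for name, (low, high) in PARAM_RANGES.items()}
        for _ in range(count)
    ]


second_set = random_parameter_sets(10)
```

Uniform sampling is only one option; the application does not restrict how the two parameter sets are obtained, only that they are obtained in different ways.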
S102, inputting the data size and the task type of the task data into a first model based on the first operation parameter set, and determining a target model in the first model according to a first output result, wherein the first model comprises a plurality of data models.
Specifically, the first model may be a performance prediction model, including a plurality of data models including, but not limited to, data models established by a linear regression algorithm, a gradient boosting tree regression algorithm, a decision tree regression algorithm, a random forest regression algorithm, or a neural network; the processing of the input data by the first model makes it possible to predict the performance parameters of the first operating parameters used. The task data is input data for executing a Spark task, and the task type of the task data can be marked by a developer and input into the computer equipment.
The first output result may include performance parameters that characterize how good the first operation parameters are, such as operation time and operation speed, and may also include a recommendation score derived from the performance parameters. When the computer device inputs the data size and the task type of the task data into the first model based on the first operation parameter set, the data size and the task type are input into each data model separately, and each data model outputs a predicted recommendation score for each group of first operation parameters. For example, data model 1 may output a recommendation score of 90 points for the first group of first operation parameters, 80 points for the second group, and so on. Optionally, for each data model, the computer device may sum and average the resulting recommendation scores and then take the data model with the highest average score as the target model. The target model can be understood as the data model with the best prediction performance for task data of the above task type under the first operation parameter set.
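The score-averaging selection described above can be sketched in a few lines. The model names and score values are hypothetical; only the rule (highest average recommendation score wins) comes from the text.

```python
from statistics import mean
from typing import Dict, List


def pick_by_average_score(scores: Dict[str, List[float]]) -> str:
    """Return the name of the data model with the highest mean recommendation score."""
    return max(scores, key=lambda name: mean(scores[name]))


# Hypothetical scores: one list per data model, one entry per group of
# first operation parameters.
scores = {
    "data_model_1": [90.0, 80.0],
    "data_model_2": [85.0, 95.0],
    "data_model_3": [70.0, 75.0],
}
target_model = pick_by_average_score(scores)
```

Here the averages are 85, 90, and 72.5, so `data_model_2` is selected.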
S103, inputting the data size and the task type into the target model based on the second operation parameter set, and determining a target operation parameter of Spark according to a second output result.
Specifically, on the basis of the second operation parameter set, the computer device inputs the data size and the task type of the task data into the determined target model, and the target model may also output the recommendation score (i.e., the second output result) corresponding to each predicted group of second operation parameters, and takes the second operation parameter corresponding to the highest recommendation score as the target operation parameter, which is the optimal parameter when Spark runs the task data of the task type. Optionally, the computer device may further determine the target operating parameter by inputting the second operating parameter into the target model through a random recursive search algorithm.
It should be noted that, for task data of another task type, the computer device may also use the above process of determining the target model and then determining the target operation parameters to determine the optimal parameters for Spark to operate the task data of this type.
According to the Spark operation parameter determining method, the computer device determines the target model matched with the task type based on the first operation parameter set, determines the target operation parameter matched with the task type based on the second operation parameter set, and can determine the corresponding optimal operation parameter according to the task data of different task types, so that the accuracy of the determined Spark operation parameter is improved.
In an embodiment, the step S102 of inputting the data size and the task type of the task data into the first model based on the first operation parameter set, and the determining the target model in the first model according to the first output result may include: for each data model, inputting the data size and the task type into the data model based on the first operation parameter set, and outputting a predicted operation time corresponding to each group of first operation parameters; and determining a target model in the plurality of data models according to the predicted operation time corresponding to each group of first operation parameters.
Each data model may establish, based on the first operation parameter set, a correspondence among the data size of the task data, the task type, the operation parameters, and the operation time, so that each data model can output the predicted operation time corresponding to each group of first operation parameters. Optionally, for each data model, the predicted operation times may be summed, and the data model with the shortest total operation time (or the shortest average operation time) may be used as the target model.
Optionally, for each data model, the computer device may further determine, according to the predicted operation time corresponding to each group of first operation parameters, the decision coefficient corresponding to the data model, and use the data model with the largest decision coefficient as the target model. By way of example, and not limitation, the computer device may calculate the decision coefficient corresponding to the data model using the formula
Figure BDA0002929468150000091
where $y^{(i)}$ is the predicted operation time corresponding to the $i$-th group of first operation parameters,
Figure BDA0002929468150000092
is the average predicted operation time of the $m$ groups of first operation parameters,
Figure BDA0002929468150000093
Figure BDA0002929468150000094
is the sum of the predicted operation times of the $m$ groups of first operation parameters,
Figure BDA0002929468150000101
and $R^2$ is the decision coefficient, whose value lies in the range $(-\infty, 1]$; the closer the value is to 1, the better the corresponding data model.
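Because the formula images are not reproduced in this text, the following sketch assumes the standard coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, computed from measured running times and each model's predicted running times. That choice, the function names, and the sample numbers are assumptions rather than the applicant's exact formula.

```python
from typing import Dict, Sequence, Tuple


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_measured) ** 2 for y in measured)
    if ss_tot == 0:
        raise ValueError("measured values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def pick_target_model(
    measured: Sequence[float], predictions: Dict[str, Sequence[float]]
) -> Tuple[str, Dict[str, float]]:
    """Score every data model and return the one with the largest R^2."""
    coefficients = {name: r_squared(measured, p) for name, p in predictions.items()}
    best = max(coefficients, key=coefficients.get)
    return best, coefficients


# Hypothetical running times (seconds) for four groups of first operation parameters.
measured_times = [10.0, 20.0, 30.0, 40.0]
predictions = {
    "linear_regression": [12.0, 18.0, 33.0, 37.0],
    "random_forest": [10.0, 21.0, 29.0, 40.0],
}
target, coefficients = pick_target_model(measured_times, predictions)
```

With these numbers the coefficients are 0.948 and 0.996, so `random_forest` becomes the target model. A model that predicts worse than the mean yields a negative value, which is consistent with the stated range $(-\infty, 1]$.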
According to the method for determining Spark operating parameters, the predicted operating time output by each data model is analyzed, the target model in the multiple data models is determined, the target model matched with the current task type is obtained, the optimal operating parameters when the task data of the type are processed are further determined, and the accuracy of the determined Spark operating parameters is improved.
In another embodiment, the step S103 of inputting the data size and the task type into the target model based on the second operation parameter set, and determining the target operation parameter of Spark according to the second output result includes: inputting the data size and the task type into the target model based on the second operation parameter set to obtain predicted operation time corresponding to each group of second operation parameters; and taking the second operation parameter corresponding to the shortest predicted operation time as the target operation parameter.
For example, assume that the operation parameters required for Spark operation include A, B and C. When A, B and C take different values, they form a first operation parameter set $\{(A_1, B_1, C_1), (A_2, B_2, C_2), \ldots, (A_{10}, B_{10}, C_{10})\}$ and a second operation parameter set $\{(A_{11}, B_{11}, C_{11}), (A_{12}, B_{12}, C_{12}), \ldots, (A_{20}, B_{20}, C_{20})\}$. It should be noted that, in this example, the first operation parameter set and the second operation parameter set each include 10 groups of parameter values, but the specific number is not limited thereto. After the computer device inputs the data size and the task type of the task data into the plurality of data models of the first model based on the first operation parameter set, data model 1 may output the predicted operation time corresponding to $(A_1, B_1, C_1)$, the predicted operation time corresponding to $(A_2, B_2, C_2)$, and so on up to the predicted operation time corresponding to $(A_{10}, B_{10}, C_{10})$. Data models 2 to n likewise output the predicted operation time corresponding to each group of operation parameters. The decision coefficient corresponding to each data model is then calculated from the predicted operation times corresponding to each group of operation parameters, and the target model is determined accordingly. Assuming that the target model is data model 2, the computer device inputs the data size and the task type into data model 2 based on the second operation parameter set, and data model 2 may output the predicted operation time corresponding to $(A_{11}, B_{11}, C_{11})$, the predicted operation time corresponding to $(A_{12}, B_{12}, C_{12})$, and so on up to the predicted operation time corresponding to $(A_{20}, B_{20}, C_{20})$. The second operation parameter corresponding to the shortest predicted operation time is then taken as the target operation parameter.
Assuming that the target operation parameter is $(A_{12}, B_{12}, C_{12})$, Spark can be considered to perform best with the parameters $(A_{12}, B_{12}, C_{12})$ when running task data of this task type.
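The two-stage selection in the example above can be sketched as follows. This is only an illustrative sketch: it assumes that the decision coefficient is the coefficient of determination (R²), that measured operation times are available for the first operation parameter set so that R² can be computed, and that each data model is a plain callable. The names `model_1`, `model_2`, `measured` and the task type `"sort"` are hypothetical and do not come from the original text.

```python
from typing import Callable, Sequence, Tuple

Params = Tuple[float, float, float]            # one group (A, B, C)
# Assumption: a "data model" is any callable
# (params, data_size, task_type) -> predicted operation time.
Model = Callable[[Params, float, str], float]


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the measured times are not all identical (SS_tot != 0).
    """
    mean = sum(actual) / len(actual)
    ss_tot = sum((y - mean) ** 2 for y in actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


def pick_target_model(models, first_set, measured, data_size, task_type):
    """Stage 1: score each data model on the first set, keep the highest R²."""
    def score(model):
        preds = [model(p, data_size, task_type) for p in first_set]
        return r_squared(measured, preds)
    return max(models, key=score)


def pick_target_params(model, second_set, data_size, task_type):
    """Stage 2: the group with the shortest predicted operation time wins."""
    return min(second_set, key=lambda p: model(p, data_size, task_type))


# Hypothetical toy models standing in for trained performance predictors.
def model_1(params, data_size, task_type):
    return data_size * 10.0 / params[0]


def model_2(params, data_size, task_type):
    return 60.0


models = [model_1, model_2]
first_set = [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0), (4.0, 2.0, 1.0)]
measured = [100.0, 55.0, 30.0]                 # hypothetical measured times
second_set = [(3.0, 1.0, 1.0), (8.0, 2.0, 2.0), (5.0, 1.0, 1.0)]

target_model = pick_target_model(models, first_set, measured, 10.0, "sort")
target_params = pick_target_params(target_model, second_set, 10.0, "sort")
```

Keeping the two stages as separate functions mirrors the split between the first and second operation parameter sets: the first set is used only to rank models, and the second set is used only to rank candidate parameter groups.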
In another embodiment, after the computer device determines the target operation parameter with which Spark runs task data of the task type, the target operation parameter may be configured on the Spark platform to execute task data of the same task type. In addition, the target operation parameter may be added to the first operation parameter set. When a target operation parameter needs to be determined for task data of another task type, the expanded first operation parameter set can be used to determine the target model. In other words, the parameter set data is continuously enriched, which helps reduce model error and improve algorithm accuracy. Further, the target operation parameter and the predicted operation time corresponding to the target operation parameter may also be stored in a database.
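The feedback step described above (expanding the first operation parameter set and persisting the result) might look like the sketch below. The table name `target_params`, its columns, the task type `"sort"` and the use of an in-memory SQLite database are all assumptions made for illustration; the original text does not specify a database or schema. `spark.executor.memory` and `spark.executor.cores` are standard Spark configuration keys used here only as sample parameters.

```python
import json
import sqlite3


def record_target(conn, first_set, task_type, params, predicted_runtime):
    """Append the chosen group to the first set and persist it with its prediction."""
    if params not in first_set:
        first_set.append(params)
    conn.execute(
        "INSERT INTO target_params (task_type, params, predicted_runtime) "
        "VALUES (?, ?, ?)",
        (task_type, json.dumps(params), predicted_runtime),
    )
    conn.commit()


# Hypothetical schema; an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target_params "
    "(task_type TEXT, params TEXT, predicted_runtime REAL)"
)

first_set = [{"spark.executor.memory": "2g", "spark.executor.cores": 2}]
target = {"spark.executor.memory": "4g", "spark.executor.cores": 4}
record_target(conn, first_set, "sort", target, 12.5)

rows = conn.execute(
    "SELECT task_type, predicted_runtime FROM target_params"
).fetchall()
```

Storing the parameters as a JSON string is one simple choice that avoids fixing the set of Spark parameters in the schema.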
In another embodiment, null-value data may exist in the first operation parameter set obtained in step S101. The computer device may therefore perform interpolation processing on the null-value data in the first operation parameter set by using a preset interpolation algorithm, so as to obtain a processed first operation parameter set. Optionally, the null-value data in the first operation parameter set may be interpolated by using algorithms such as Lagrange interpolation, Newton interpolation, KNN interpolation, or improved variants thereof. Preferably, barycentric Lagrange interpolation is adopted, with the following interpolation function:
Figure BDA0002929468150000111
where
Figure BDA0002929468150000112
n represents the number of parameters in the first operation parameter, x represents the position of the null value, x_i and x_j represent values (or positions) of the independent variable, ω_i represents the barycentric weight, and y_i represents the value when the independent variable (or position) is x_i. Interpolating the null-value data makes the first operation parameter set more complete, which helps improve the accuracy of the determined target model.
On the other hand, if the first operation parameter set includes non-numerical operation parameters, such as categorical operation parameters, the computer device may further convert the non-numerical operation parameters into numerical operation parameters. Optionally, the non-numerical operation parameters may be converted by using categorical encoding techniques such as ordinal encoding, one-hot encoding, or binary encoding.
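As an illustration of the interpolation step, the sketch below implements the standard barycentric form of Lagrange interpolation, $L(x) = \frac{\sum_i \frac{\omega_i}{x - x_i} y_i}{\sum_i \frac{\omega_i}{x - x_i}}$ with $\omega_i = 1 / \prod_{j \ne i} (x_i - x_j)$. The formula images of the original are not reproduced in this text, so this textbook form is an assumption and may differ in detail from the patent's own expression. Using the row index as the independent variable and `None` as the null marker are likewise assumptions, and the function names are hypothetical.

```python
def barycentric_weights(xs):
    """w_i = 1 / prod_{j != i} (x_i - x_j); the nodes xs must be distinct."""
    weights = []
    for i, xi in enumerate(xs):
        prod = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                prod *= xi - xj
        weights.append(1.0 / prod)
    return weights


def barycentric_interpolate(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    weights = barycentric_weights(xs)
    num = 0.0
    den = 0.0
    for xi, yi, wi in zip(xs, ys, weights):
        if x == xi:                 # exactly on a node: avoid division by zero
            return yi
        term = wi / (x - xi)
        num += term * yi
        den += term
    return num / den


def fill_nulls(values):
    """Replace None entries, using positions of non-null entries as nodes."""
    xs = [i for i, v in enumerate(values) if v is not None]
    ys = [values[i] for i in xs]
    return [
        v if v is not None else barycentric_interpolate(xs, ys, i)
        for i, v in enumerate(values)
    ]


# Hypothetical column sampled from (i + 1) ** 2 with one missing entry.
column = [1.0, 4.0, None, 16.0]
filled = fill_nulls(column)
```

With many equally spaced nodes, a single high-degree interpolating polynomial can oscillate strongly, so in practice it may be preferable to interpolate from a small window of neighbouring non-null values.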
To facilitate understanding of the whole process of the above Spark operation parameter determination method, the method is further described below through an embodiment. As shown in Fig. 3, the method includes:
S201, acquiring a first operation parameter set and a second operation parameter set.
S202, performing interpolation processing on null-value data in the first operation parameter set by using a preset interpolation algorithm to obtain a processed first operation parameter set.
S203, if the first operation parameter set includes a non-numerical operation parameter, converting the non-numerical operation parameter into a numerical operation parameter.
S204, for each data model, inputting the data size and the task type into the data model based on the first operation parameter set, and outputting the predicted operation time corresponding to each group of first operation parameters.
S205, determining a decision coefficient corresponding to the data model according to the predicted operation time corresponding to each group of first operation parameters, and taking the data model corresponding to the maximum decision coefficient as the target model.
S206, inputting the data size and the task type into the target model based on the second operation parameter set to obtain the predicted operation time corresponding to each group of second operation parameters.
S207, taking the second operation parameter corresponding to the shortest predicted operation time as the target operation parameter.
S208, adding the target operation parameters into the first operation parameter set, and storing the target operation parameters and the predicted operation time corresponding to the target operation parameters into a database.
For the implementation process of each step in this embodiment, reference may be made to the description of the above embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 4 shows a structural block diagram of a device for determining a Spark operating parameter provided in the embodiment of the present application, and for convenience of description, only a part related to the embodiment of the present application is shown.
Referring to fig. 4, the apparatus includes: an acquisition module 21, a first determination module 22 and a second determination module 23.
Specifically, the acquisition module 21 is configured to acquire a first operation parameter set and a second operation parameter set, where the first operation parameter set and the second operation parameter set are acquired in different manners.
The first determining module 22 is configured to input the data size and the task type of the task data into a first model based on the first operation parameter set, and determine a target model in the first model according to a first output result, where the first model includes a plurality of data models.
The second determining module 23 is configured to input the data size and the task type into the target model based on the second operation parameter set, and determine a target operation parameter of Spark according to a second output result.
In an embodiment, the first determining module 22 is specifically configured to, for each data model, input the data size and the task type into the data model based on the first operation parameter set, and output a predicted operation time corresponding to each group of first operation parameters; and determining a target model in the plurality of data models according to the predicted operation time corresponding to each group of first operation parameters.
In an embodiment, the first determining module 22 is specifically configured to determine, for each data model, a decision coefficient corresponding to the data model according to the predicted operation time corresponding to each set of the first operation parameters; and taking the data model corresponding to the maximum decision coefficient as the target model.
In an embodiment, the second determining module 23 is specifically configured to input the data size and the task type into the target model based on the second operation parameter set, so as to obtain a predicted operation time corresponding to each group of second operation parameters; and taking the second operation parameter corresponding to the shortest predicted operation time as the target operation parameter.
In an embodiment, the apparatus further includes a storage module, configured to add the target operation parameter into the first operation parameter set, and store the target operation parameter and a predicted operation time corresponding to the target operation parameter in a database.
In an embodiment, the apparatus further includes a processing module, configured to perform interpolation processing on null data in the first operation parameter set by using a preset interpolation algorithm, so as to obtain a processed first operation parameter set.
In an embodiment, the processing module is further configured to convert the non-numerical type operating parameter into a numerical type operating parameter when the first operating parameter set includes the non-numerical type operating parameter.
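The conversion of a non-numerical operation parameter into a numerical one can be sketched with two common categorical encodings, ordinal and one-hot. This is a minimal illustration, not the processing module's actual implementation: the function names are hypothetical, and the sample values are compression codec names of the kind accepted by Spark's `spark.io.compression.codec` setting, chosen only as an example of a categorical parameter.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    index = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [index[v] for v in values], index


def one_hot_encode(values):
    """Map each category to a 0/1 vector containing a single 1."""
    codes, index = ordinal_encode(values)
    width = len(index)
    return [[1 if k == c else 0 for k in range(width)] for c in codes]


# Hypothetical column of a categorical operation parameter.
codecs = ["lz4", "snappy", "lz4", "zstd"]
ordinal, mapping = ordinal_encode(codecs)
one_hot = one_hot_encode(codecs)
```

Ordinal codes are compact but impose an artificial order on the categories, whereas one-hot vectors avoid that at the cost of one column per category; which is preferable depends on the data model being used.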
It should be noted that the information interaction and execution processes between the above devices/units are based on the same concept as the method embodiments of the present application. For their specific functions and technical effects, reference may be made to the method embodiments, which are not repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
An embodiment of the present application further provides a computer device, where the computer device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when executed on a computer device, enables the computer device to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which implements the steps of the method embodiments described above when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying computer program code to the aforementioned apparatus/computer device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not be an electrical carrier signal or a telecommunications signal.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the above-described apparatus/computer device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and they should be construed as being included in the present application.

Claims (8)

1. A method for determining Spark operating parameters, comprising:
acquiring a first operation parameter set and a second operation parameter set, wherein the first operation parameter set and the second operation parameter set are acquired in different modes;
inputting the data size and the task type of the task data into a first model based on the first operation parameter set, and determining a target model in the first model according to a first output result, wherein the first model comprises a plurality of data models, and the first model is a performance prediction model;
inputting the data size and the task type into the target model based on the second operation parameter set, and determining a target operation parameter of Spark according to a second output result;
the step of inputting the data size and the task type of the task data into a first model based on the first operation parameter set, and determining a target model in the first model according to a first output result comprises the following steps:
for each data model, inputting the data size and the task type into the data model based on the first operation parameter set, and outputting a predicted operation time corresponding to each group of first operation parameters;
determining a target model in the plurality of data models according to the predicted operation time corresponding to each group of first operation parameters;
the inputting the data size and the task type into the target model based on the second operation parameter set, and determining a target operation parameter of Spark according to a second output result, including:
inputting the data size and the task type into the target model based on the second operation parameter set to obtain predicted operation time corresponding to each group of second operation parameters;
and taking the second operation parameter corresponding to the shortest predicted operation time as the target operation parameter.
2. The method of claim 1, wherein said determining a target model of said plurality of data models based on a predicted runtime for each set of first operating parameters comprises:
for each data model, determining a decision coefficient corresponding to the data model according to the predicted operation time corresponding to each group of first operation parameters;
and taking the data model corresponding to the maximum decision coefficient as the target model.
3. The method of claim 1, wherein the method further comprises:
and adding the target operation parameters into the first operation parameter set, and storing the target operation parameters and the predicted operation time corresponding to the target operation parameters into a database.
4. The method of claim 1, wherein after obtaining the first set of operating parameters, the method further comprises:
and carrying out interpolation processing on null value data in the first operation parameter set by adopting a preset interpolation algorithm to obtain a processed first operation parameter set.
5. The method of claim 4, wherein the method further comprises:
if the first operation parameter set comprises non-numerical operation parameters, converting the non-numerical operation parameters into numerical operation parameters.
6. A Spark operating parameter determining apparatus, comprising:
an acquisition module, configured to acquire a first operation parameter set and a second operation parameter set, wherein the first operation parameter set and the second operation parameter set are acquired in different modes;
a first determining module, configured to input a data size and a task type of task data into a first model based on the first operating parameter set, and determine a target model in the first model according to a first output result, where the first model includes multiple data models, and the first model is a performance prediction model;
a second determining module, configured to input the data size and the task type into the target model based on the second operation parameter set, and determine a target operation parameter of Spark according to a second output result;
the first determining module is specifically configured to, for each data model, input the data size and the task type into the data model based on the first operation parameter set, and output a predicted operation time corresponding to each group of first operation parameters; determining a target model in the plurality of data models according to the predicted operation time corresponding to each group of first operation parameters;
the second determining module is specifically configured to input the data size and the task type into the target model based on the second operation parameter set, so as to obtain a predicted operation time corresponding to each group of second operation parameters; and taking the second operation parameter corresponding to the shortest predicted operation time as the target operation parameter.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 5.
CN202110142577.6A 2021-02-02 2021-02-02 Spark operation parameter determination method, device, equipment and storage medium Active CN113157538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110142577.6A CN113157538B (en) 2021-02-02 2021-02-02 Spark operation parameter determination method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113157538A CN113157538A (en) 2021-07-23
CN113157538B true CN113157538B (en) 2023-04-18

Family

ID=76879152

Country Status (1)

Country Link
CN (1) CN113157538B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant