WO2018098670A1 - Method and apparatus for performing data processing - Google Patents

Method and apparatus for performing data processing

Info

Publication number
WO2018098670A1
Authority
WO
WIPO (PCT)
Prior art keywords
data processing
value
data
processing request
prediction model
Prior art date
Application number
PCT/CN2016/107948
Other languages
French (fr)
Chinese (zh)
Inventor
常玉立
王海彬
程捷
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2016/107948 (WO2018098670A1)
Priority to CN201680031201.5A (CN108463813B)
Publication of WO2018098670A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a method and apparatus for performing data processing.
  • A distributed data processing system can divide data into multiple blocks of a preset size and store them on multiple nodes, where each node may be one server of the distributed data processing system.
  • Each node can process a portion of the data in parallel.
  • A distributed data processing system includes a plurality of resource configuration parameters that affect how efficiently the system processes data. The resource configuration parameters may include the data block size, the total number of processes for processing data, the number of processes that process data on each node, and so on.
  • Before the distributed data processing system performs data processing for a given service, a technician can configure a value for each resource configuration parameter of the system according to experience. Then, whenever the distributed data processing system receives a data processing request corresponding to that service (for example, sorting data in the distributed data processing system, or finding the maximum value of that data), it performs the processing corresponding to the data processing request based on the resource configuration parameter values set in advance.
  • The technician must configure the resource configuration parameters, and the number of resource configuration parameters is often relatively large. In this case, the technician has to enter the resource configuration parameter values one by one, which makes configuration inefficient.
  • an embodiment of the present invention provides a method and apparatus for performing data processing.
  • the technical solution is as follows:
  • a method of performing data processing comprising:
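  • receiving a data processing request of a target service, where the data to be processed of the data processing request includes a plurality of key-value pairs; determining running feature values of the data processing request according to the identifier of the target service, the data volume of the data to be processed, and the distribution of the keys of the data to be processed, where the running features include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and determining the resource configuration parameter values of each data processing stage according to the running feature values.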
  • the user may send a data processing request of the target service to the server through the console.
  • After receiving the data processing request, the server may determine the running feature values of the data processing request according to the identifier of the target service, the data volume of the data to be processed, and the distribution of the keys of the data to be processed, and then determine the resource configuration parameter values of each data processing stage according to the determined running feature values.
  • In this way, the distributed data processing system can automatically calculate the resource configuration parameter values of each data processing stage of the data processing request, so the technician no longer needs to enter values for the many resource configuration parameters one by one before data processing, which improves configuration efficiency.
  • Determining the resource configuration parameter values of each data processing stage according to the running feature values includes: determining the resource configuration parameter values of each data processing stage according to a performance prediction model and the running feature values, where the performance prediction model is trained on historical data.
  • The performance prediction model may be pre-stored in the server. After the server receives the data processing request of the target service, it may determine the resource configuration parameter values of each data processing stage according to the pre-stored performance prediction model and the determined running feature values.
  • Determining the running feature values of the data processing request according to the identifier of the target service, the data volume of the data to be processed, and the distribution of the keys of the data to be processed, where the running features include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage, includes: determining the data feature values of the data to be processed for each data processing stage of the data processing request; determining the target running feature prediction model corresponding to the target service according to a pre-stored correspondence between services and running feature prediction models that take the data features and the running features as variables; and, for each data processing stage, calculating, according to the target running feature prediction model, the value of the running features when the data features take the data feature values of that stage, to obtain the running feature values of that data processing stage.
  • Determining the resource configuration parameter values of each data processing stage according to the running feature values includes: for each data processing stage, calculating, according to a pre-stored performance prediction model that takes the running features, the resource configuration parameters, and the service performance as variables, the values of the resource configuration parameters at which the service performance reaches its optimal value when the running features take the running feature values of that stage, to obtain the resource configuration parameter values of that data processing stage.
  • The server may determine the data feature values of each data processing stage of the data processing request, where the data feature values include the data amount of the data to be processed and the distribution of the keys of the data to be processed. Then, for each data processing stage, the server may take the data feature values of that stage as the values of the data features, substitute them into the target running feature prediction model corresponding to the target service, and calculate the values of the running features, which gives the running feature values of each data processing stage.
  • a performance prediction model with operational characteristics, resource configuration parameters, and business performance variables can be stored in the server.
  • The server may determine the running feature values of each data processing stage of the data processing request and then determine the resource configuration parameter values of each data processing stage. The server can therefore execute each data processing stage with the resource configuration parameter values calculated for that stage, which can improve the processing efficiency of each data processing stage.
  • the running feature value corresponding to each data processing stage may include the data amount of the data to be processed in each data processing stage, and the distribution of the keys of the data to be processed in each data processing stage.
  • The method further includes: running the data processing request according to the resource configuration parameter values; recording the actual running feature values and the actual service performance values of each data processing stage; and adjusting the performance prediction model according to the resource configuration parameter values, the actual running feature values, and the actual service performance values.
  • The data processing request may be run based on the resource configuration parameter values, and the actual running feature values and the actual service performance values of each data processing stage may be recorded.
  • The server may use the resource configuration parameter values, the actual running feature values, and the actual service performance values of each data processing stage as training data and retrain the existing performance prediction model. In this way, the accuracy of the performance prediction model can keep improving.
  • The method further includes: when receiving a data processing request, running the data processing request according to a first preset resource configuration parameter value of the current data processing request, and recording the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request; when the number of received data processing requests reaches a first preset number, training a preset performance prediction model reference formula on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests to obtain a performance prediction model; and storing the performance prediction model.
  • When receiving a data processing request, the server may perform the processing corresponding to the current data processing request based on the preset resource configuration parameter values, and record the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of that data processing request.
  • Based on these historical run records, the server may train a performance prediction model relating running features, resource configuration parameters, and service performance, and store it, so that when a subsequent data processing request is received, the server can automatically calculate the resource configuration parameter values for each data processing stage of the data processing request.
  • Storing the performance prediction model includes: when receiving a data processing request, running the data processing request based on a second preset resource configuration parameter value of the current data processing request, and recording a second actual running feature value and a second actual service performance value of the current data processing request; determining a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value; and determining the accuracy of the performance prediction model under the current data processing request according to the difference between the second actual service performance value and the determined third service performance value. When the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, the average accuracy of the performance prediction model is calculated from its accuracy under the second preset number of data processing requests; if the average accuracy reaches the preset accuracy threshold, the performance prediction model is stored.
  • The accuracy of the performance prediction model may be verified first. Once the accuracy reaches the preset accuracy threshold, the server automatically determines the resource configuration parameter values corresponding to a data processing request by using the performance prediction model, whose variables are the running features, the resource configuration parameters, and the service performance, and then performs the processing corresponding to the data processing request based on the automatically determined resource configuration parameter values. In this way, the accuracy of the finally stored performance prediction model can be improved, and so can the efficiency of the processing corresponding to the data processing request.
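  • The acceptance check described above can be sketched as follows. The text only says that accuracy is derived from the difference between the actual and the predicted service performance value; the relative-error formula used here, and the names `request_accuracy` and `should_store_model`, are assumptions made for illustration.

```python
def request_accuracy(actual, predicted):
    """Accuracy of one prediction, assumed here to be 1 minus the relative error."""
    if actual <= 0:
        raise ValueError("actual service performance value must be positive")
    return max(0.0, 1.0 - abs(actual - predicted) / actual)


def should_store_model(observations, required_count, threshold):
    """Decide whether the model is accurate enough to be stored.

    observations: list of (actual, predicted) service performance values,
    one pair per data processing request received after training.
    required_count: the "second preset number" of requests.
    threshold: the preset accuracy threshold, between 0 and 1.
    """
    if len(observations) < required_count:
        return False  # not enough validation requests yet
    checked = observations[:required_count]
    average = sum(request_accuracy(a, p) for a, p in checked) / required_count
    return average >= threshold
```

  • For instance, with observed run times of 10 s and 20 s against predictions of 9 s and 19 s, the per-request accuracies under this assumed formula are 0.9 and 0.95, so the model would pass a 0.9 threshold but not a 0.95 one.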
  • A server comprising a processor, a memory, and a transceiver, where the processor is configured to execute instructions stored in the memory and implements the data processing method provided by the first aspect by executing the instructions.
  • an apparatus for performing data processing comprising at least one module for implementing the method for data processing provided by the first aspect above.
  • A data processing request of a target service is received, where the data to be processed of the data processing request includes a plurality of key-value pairs; the running feature values of the data processing request are determined according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the running features include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and the resource configuration parameter values of each data processing stage are determined according to the running feature values.
  • In this way, the distributed data processing system can automatically calculate the resource configuration parameter values of the data processing request, so the technician no longer needs to enter values for the many resource configuration parameters one by one before data processing, which improves configuration efficiency.
  • FIG. 1 is a schematic diagram of a system framework provided by an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a server according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a method for performing data processing according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention.
  • An embodiment of the present invention provides a method for performing data processing. The method may be implemented by a server, where the server may be a server group composed of multiple servers, that is, a distributed data processing system, or may be a single server.
  • The distributed data processing system may divide the data to be processed into multiple data blocks according to a preset data block size, and each server can process a part of the data in parallel.
  • the server may include a transceiver 210, a processor 220, a memory 230, and the memory 230 and the transceiver 210 may be coupled to the processor 220, respectively, as shown in FIG.
  • the transceiver 210 can be used to receive messages or data.
  • The transceiver 210 can include, but is not limited to, at least one amplifier, a tuner, one or more oscillators, a coupler, an LNA (Low Noise Amplifier), a duplexer, etc. In the present invention, the transceiver 210 can be configured to receive a data processing request.
  • The processor 220 may include one or more processing units. The processor 220 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device.
  • the program can include program code, the program code including computer operating instructions.
  • the server may also include a memory 230 that may be used to store software programs and modules, and the processor 220 performs data processing by reading software code and modules stored in the memory 230.
  • Step 301: The distributed data processing system receives a data processing request of the target service, where the data to be processed of the data processing request includes a plurality of key-value pairs.
  • the target service may be any one of the services that the distributed data processing system can perform.
  • For example, the target service may be determining the maximum value in the data to be processed.
  • the data processing request of the target service may be a request for performing target service processing on the specific data to be processed, such as a request for sorting the specified data to be processed.
  • To process data faster, technicians often use a distributed data processing system.
  • The technician can send a data processing request corresponding to a certain service (which may be referred to as the target service) to the distributed data processing system through the console, where the data processing request may specify the data to be processed (for example, the data processing request may carry a storage address of the data to be processed).
  • The data to be processed corresponding to the data processing request may be a series of key-value pairs, and data processing of the data to be processed is generally performed on the keys.
  • The distributed data processing system may determine the data feature values of the data to be processed of the data processing request, where the data feature values may include the data amount of the data to be processed and the distribution of the keys of the data to be processed.
  • The distributed data processing system can determine the distribution of the keys of the data to be processed as follows. A data distribution model can be preset in the distributed data processing system; after the data to be processed is determined, the system determines the distribution of the data to be processed from the distribution of all of its keys.
  • Specifically, when the preset data distribution model is a uniform distribution, the distributed data processing system can calculate the mean of the keys and use it as the key distribution of the data to be processed.
  • When the preset data distribution model is a normal distribution, the distributed data processing system can calculate the mean and standard deviation of the keys and use them as the key distribution of the data to be processed.
  • the distributed data processing system can sample the data to be processed according to a preset sampling rate, and determine the data to be processed of the data processing request by determining the distribution of the keys of the sampled data to be processed after sampling. The distribution of the keys.
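  • A minimal sketch of this key-distribution step, assuming numeric keys. The function name `key_distribution`, its parameters, and the returned dictionary layout are illustrative choices, not part of the described system.

```python
import random
import statistics


def key_distribution(keys, model="normal", sampling_rate=1.0, seed=0):
    """Summarise the distribution of the keys of the data to be processed.

    model: the preset data distribution model, "uniform" or "normal".
    sampling_rate: fraction of keys inspected; 1.0 means all keys are used.
    """
    keys = list(keys)
    if not keys:
        raise ValueError("keys must not be empty")
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    if sampling_rate < 1.0:
        # Inspect only a random sample of the keys (at least one key).
        sample_size = max(1, int(len(keys) * sampling_rate))
        keys = random.Random(seed).sample(keys, sample_size)
    mean = statistics.fmean(keys)
    if model == "uniform":
        return {"mean": mean}
    if model == "normal":
        return {"mean": mean, "std": statistics.pstdev(keys)}
    raise ValueError(f"unsupported data distribution model: {model!r}")
```

  • Sampling trades accuracy of the summary for a cheaper pass over the data, which matters when the data to be processed is large.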
  • This processing step can be specifically implemented by a transceiver.
  • Step 302: The distributed data processing system determines the running feature values of the data processing request according to the identifier of the target service, the data volume of the data to be processed, and the distribution of the keys of the data to be processed, where the running features include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage.
  • The distributed data processing system may determine the identifier of the target service of the data processing request, and then determine the running feature values of the data processing request according to the determined identifier of the target service, the data volume of the data to be processed, and the distribution of the keys of the data to be processed.
  • The running feature value may be the ratio of the data amount of the output data produced by the processing corresponding to the data processing request to the data amount of the input data (that is, the data to be processed) of the data processing request. This ratio may be denoted ratio, and the ratio corresponding to a data processing request is the value of ratio.
  • The running feature value may also be the complexity of the processing corresponding to the data processing request. This may be denoted complexity, and the complexity of the processing corresponding to a data processing request is the value of complexity. The complexity represents the computational complexity of that processing, that is, the computational (or algorithmic) complexity of the business processing logic of the service corresponding to the data processing request.
  • The complexity corresponding to the data processing request may be the ratio of the effective CPU (Central Processing Unit) time t used by the distributed data processing system to perform the processing corresponding to the data processing request to the data amount of the data to be processed of the data processing request.
  • The distributed data processing system may divide the data to be processed into a plurality of data blocks of a preset data block size (where the default data block size may be 64 MB (megabytes)).
  • the distributed data processing system may store a plurality of data blocks in each of the servers.
  • the embodiment of the present invention does not limit the manner in which the data blocks are allocated. Specifically, the distributed data processing system may use the data blocks.
  • Each server can simultaneously perform the processing corresponding to the service on its locally stored data blocks (for example, sorting the data in its stored data blocks, or obtaining the maximum value of its locally stored data).
  • When processing data, each server can start multiple threads to process multiple data blocks at the same time.
  • Suppose the processing corresponding to the data processing request is divided among N parallel processes or threads, the time used by each process or thread is denoted t_i, and the data amount of the data to be processed corresponding to the data processing request is denoted size (where size is the sum of the data amounts of the data to be processed by each process). The formula for complexity can then be as shown in formula (1).
  • the complexity can be further expressed by the formula (2).
  • t_i represents the effective CPU time used by the i-th process or thread.
  • cpu_frequency_i represents the main frequency of the central processor of the server corresponding to the i-th process or thread.
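  • Formulas (1) and (2) are not reproduced in this text. One reconstruction that is consistent with the definitions of t_i, cpu_frequency_i, size, and N given here is shown below; the exact form, in particular weighting each time by the CPU frequency in formula (2), is an assumption rather than the original typesetting.

```latex
\begin{align}
\mathrm{complexity} &= \frac{\sum_{i=1}^{N} t_i}{\mathrm{size}} \tag{1} \\
\mathrm{complexity} &= \frac{\sum_{i=1}^{N} t_i \cdot \mathrm{cpu\_frequency}_i}{\mathrm{size}} \tag{2}
\end{align}
```

  • Under this reading, formula (2) converts each process's CPU time into an approximate cycle count, so that measurements taken on servers with different clock speeds remain comparable.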
  • The distributed data processing system may preset the data processing stages used to perform the processing corresponding to a data processing request, where the data processing stages corresponding to the same service are fixed. Once the processing of every data processing stage has been executed, the processing corresponding to the data processing request is complete.
  • The output data of one data processing stage is the data to be processed (that is, the input data) of the next data processing stage. For example, if the processing corresponding to a data processing request includes two data processing stages, the data to be processed specified by the data processing request is the input data of the first data processing stage, and the output data of the first data processing stage is the input data of the second data processing stage.
  • For example, in Hadoop, an application of a distributed data processing system capable of processing data in a distributed manner, the processing corresponding to a data processing request includes two data processing stages: map and reduce.
  • The distributed data processing system may first perform the processing of the map data processing stage on the data to be processed specified by the data processing request, and then perform the processing of the reduce data processing stage on the output data of the map data processing stage.
  • For example, suppose the processing corresponding to the data processing request currently received is determining the maximum value in the data to be processed. After receiving the data processing request, the distributed data processing system can start multiple map threads at the same time, each of which processes a part of the data to be processed to obtain the maximum value of that part. The reduce data processing stage then determines the maximum value among the outputs of the map threads, which is the final maximum value in the data to be processed.
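  • The two-stage maximum-value example can be illustrated in plain Python as follows. This is a single-machine sketch that uses a thread pool in place of distributed map tasks; it is not Hadoop code, and the function names, the block size, and the choice to take the maximum over the values of the key-value pairs are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Divide the data to be processed into blocks of a preset size."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_stage(block):
    """Map stage: local maximum of one block of (key, value) pairs."""
    return max(value for _key, value in block)


def reduce_stage(partial_maxima):
    """Reduce stage: maximum over the outputs of all map tasks."""
    return max(partial_maxima)


def find_max(records, block_size=4, workers=3):
    if not records:
        raise ValueError("records must not be empty")
    blocks = split_into_blocks(records, block_size)
    # Each block is handled by its own map task, running in parallel threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_maxima = list(pool.map(map_stage, blocks))
    return reduce_stage(partial_maxima)
```

  • Note that the output of the map stage (one value per block) is much smaller than its input, which is exactly what the ratio running feature of that stage captures.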
  • the distributed data processing system determines the running feature values corresponding to each data processing stage.
  • the running characteristics include the ratio and the complexity
  • The distributed data processing system may separately determine the value of ratio and the value of complexity for each data processing stage.
  • the running feature value of each data processing stage may be a ratio of the data amount of the output data of each data processing stage to the data amount of the input data (ie, the data to be processed), and the complexity of each data processing stage.
  • The complexity corresponding to each data processing stage may be the ratio of the total effective CPU time used by the distributed data processing system to perform the processing of that data processing stage to the data amount of the data to be processed in that stage. It may also be calculated according to formula (3), where x represents a given data processing stage, t_xi represents the time used by the i-th process of data processing stage x to process data, cpu_frequency_i represents the main frequency of the central processing unit of the server corresponding to the i-th process or thread, and size_x represents the data amount of the data to be processed in data processing stage x.
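  • Formula (3) itself is not reproduced in this text. The sketch below computes the two per-stage running features under the assumption that formula (3) is the sum of t_xi times cpu_frequency_i over all processes of stage x, divided by size_x; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ProcessRecord:
    cpu_seconds: float    # t_xi: effective CPU time of the i-th process in stage x
    cpu_frequency: float  # cpu_frequency_i: clock frequency of the hosting server


def stage_ratio(input_size, output_size):
    """ratio: data amount of the stage output divided by that of its input."""
    if input_size <= 0:
        raise ValueError("input_size must be positive")
    return output_size / input_size


def stage_complexity(processes, input_size, weight_by_frequency=True):
    """complexity of one stage.

    With weight_by_frequency=False this is total CPU time / size_x;
    with True it follows the assumed reading of formula (3).
    """
    if input_size <= 0:
        raise ValueError("input_size must be positive")
    total = sum(
        p.cpu_seconds * (p.cpu_frequency if weight_by_frequency else 1.0)
        for p in processes
    )
    return total / input_size
```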
  • The running feature values of each data processing stage may be predicted from the data feature values of the data to be processed in that stage.
  • The processing may be as follows: determine the data feature values of the data to be processed in each data processing stage of the data processing request; determine the target running feature prediction model corresponding to the target service according to the pre-stored correspondence between services and running feature prediction models that take the data features and the running features as variables; and, for each data processing stage, calculate, according to the target running feature prediction model, the value of the running features when the data features take the data feature values of that stage, to obtain the running feature values of that data processing stage.
  • the data feature may be a variable for characterizing data feature information of the data to be processed
  • the running feature may be a variable for characterizing the running feature information corresponding to the data processing request.
  • The data distribution model may be pre-stored in the distributed data processing system, where the data distribution model may be a uniform distribution, a normal distribution, or another data distribution commonly used in mathematics.
  • The distributed data processing system may pre-store the formulas for calculating the parameters of the data distribution model.
  • For example, the distributed data processing system may pre-store the formula for calculating the mean of the uniform distribution, or the formulas for calculating the mean and standard deviation of the normal distribution.
  • the data to be processed in each data processing stage of the data processing request may be determined, and further, the data feature information (ie, the data feature value) of the data to be processed may be determined.
  • the data feature value of each data processing stage may be the data amount of the data to be processed and/or the distribution of the keys in the data processing stage (the distribution of the keys may be a value corresponding to the data distribution keydist).
  • the distributed data processing system may pre-store the correspondence between the running feature prediction model and the service identifier with the data feature and the running feature as variables, as shown in Table 1, where the service identifier may be a service name or a service corresponding software program.
  • the identifier of the service may be manually named and maintained, or may be an identifier extracted from the service data processing request by a specific rule.
  • The target running feature prediction model corresponding to the target service may be determined from the correspondence shown in Table 1. Then, for each data processing stage, the distributed data processing system can take the data feature values of that stage as the values of the data features, substitute them into the target running feature prediction model, and calculate the values of the running features, which gives the running feature values of each data processing stage.
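  • The lookup-and-substitute step can be sketched as follows. The dictionary `RUNNING_FEATURE_MODELS` is a hypothetical stand-in for the correspondence in Table 1, and the per-stage model functions and their constants are invented placeholders; a real system would use models fitted for each service.

```python
import math


def _sort_map(size, keydist):
    # Placeholder model: sorting keeps all records, cost grows with log(size).
    return {"ratio": 1.0, "complexity": 2e-9 * math.log2(max(size, 2))}


def _sort_reduce(size, keydist):
    return {"ratio": 1.0, "complexity": 1e-9 * math.log2(max(size, 2))}


def _max_map(size, keydist):
    # Placeholder model: each map task emits far less data than it reads.
    return {"ratio": 0.001, "complexity": 1e-9}


def _max_reduce(size, keydist):
    return {"ratio": 0.001, "complexity": 1e-9}


# Hypothetical correspondence: service identifier -> stage -> prediction model.
RUNNING_FEATURE_MODELS = {
    "sort": {"map": _sort_map, "reduce": _sort_reduce},
    "max": {"map": _max_map, "reduce": _max_reduce},
}


def predict_running_features(service_id, stage_data_features):
    """Return {stage: {"ratio": ..., "complexity": ...}} for one request.

    stage_data_features maps each stage name to its data feature values,
    for example {"map": {"size": 1024, "keydist": (0.0, 1.0)}}.
    """
    if service_id not in RUNNING_FEATURE_MODELS:
        raise KeyError(f"no running feature prediction model for {service_id!r}")
    models = RUNNING_FEATURE_MODELS[service_id]
    return {
        stage: models[stage](**features)
        for stage, features in stage_data_features.items()
    }
```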
  • the process can be specifically implemented by a processor.
  • Step 303: The distributed data processing system determines a resource configuration parameter value of each data processing stage according to the running feature value.
  • After determining the running feature values of each data processing stage, the distributed data processing system may determine the resource configuration parameter values of each data processing stage based on those running feature values.
  • the distributed data processing system may further determine a resource configuration parameter value of each data processing stage according to the data amount of the data to be processed, that is, the distributed data processing system determines each according to the running feature value and the data amount of the data to be processed. Resource configuration parameter values for the data processing phase.
  • the distributed data processing system may determine resource configuration parameter values for each data processing stage according to the performance prediction model and the running feature values of each data processing stage, wherein the performance prediction model is trained by historical data.
  • the distributed data processing system calculates the running characteristics of the running feature as the data processing stage according to the pre-stored performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables.
  • the resource configuration parameter value of the resource configuration parameter obtains the resource configuration parameter value in the data processing phase.
  • The resource configuration parameter may be a configuration parameter of the distributed data processing system that affects how efficiently the system performs the processing corresponding to the data processing request. It may include, for example, the data block size and the number of processes that process data.
  • The service performance may be a parameter that characterizes how efficiently the distributed data processing system performs the processing corresponding to the data processing request, for example the time taken to perform that processing.
  • The distributed data processing system may pre-store a performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables. The performance prediction model has no direct correspondence with the service.
  • The performance prediction model may be a function with the running feature, the resource configuration parameter, and the service performance as variables.
  • The distributed data processing system can calculate the resource configuration parameter value at which the service performance reaches its optimal value (such as its minimum), and thereby obtain the resource configuration parameter value of the data processing stage.
  • the process can be specifically implemented by a processor.
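One way to read "the resource configuration parameter value at which the service performance reaches its optimal value" is as a search over candidate configurations for the one with the smallest predicted time. The source does not say how the optimum is found, so the exhaustive grid search below is an assumption, as are the names `choose_config` and `toy_model`, the parameter names `processes` and `block_mb`, and the formula inside `toy_model`.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Config = Dict[str, float]
Feature = Tuple[float, float]  # (complexity, output/input ratio) of one stage


def choose_config(
    predict_time: Callable[[Feature, Config], float],
    running_feature: Feature,
    search_space: Dict[str, Sequence[float]],
) -> Config:
    """Return the candidate configuration with the smallest predicted time."""
    names = list(search_space)
    best, best_time = None, float("inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        predicted = predict_time(running_feature, config)
        if predicted < best_time:
            best, best_time = config, predicted
    if best is None:
        raise ValueError("search space contains no candidate configuration")
    return best


def toy_model(feature: Feature, config: Config) -> float:
    # Hypothetical performance prediction model: parallelism helps, but each
    # process adds overhead, and small blocks add per-block cost.
    complexity, io_ratio = feature
    return (complexity * 100.0 / config["processes"]
            + 2.0 * config["processes"]
            + io_ratio * 64.0 / config["block_mb"])
```

A grid search is only practical when the number of candidate values is small; a trained model with a closed form could instead be minimized analytically or with a numerical optimizer.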
  • The corresponding processing may be as follows: the data processing request is run according to the resource configuration parameter values, and the actual running feature value and the actual service performance value of each data processing stage are collected; the performance prediction model is then adjusted according to the resource configuration parameter values, the actual running feature values, and the actual service performance values.
  • the resource configuration parameter values may be based on the data processing stage.
  • The specific processing may be as follows. When the distributed data processing system receives the data processing request of the target service, it may first determine the data feature value of the to-be-processed data of the first data processing stage (that is, the data feature value of the to-be-processed data of the data processing request). It can then determine the running feature value of that data processing stage for the target service and determine the resource configuration parameter value of the stage in the manner described above.
  • The distributed data processing system may perform the processing of this data processing stage based on the stage's resource configuration parameter value. Then, in the same manner, it determines the resource configuration parameter value of the next data processing stage and performs that stage's processing based on the determined value, until the processing of all data processing stages is complete.
  • The distributed data processing system can collect the actual running feature value and the actual service performance value of each data processing stage. The resource configuration parameter value, actual running feature value, and actual service performance value of each data processing stage of the data processing request can then be used as training data to retrain the performance prediction model.
  • the process can be specifically implemented by a processor.
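The stage-by-stage loop described above, in which each stage is configured just before it runs and its actual values are recorded for retraining, might look like the following sketch. It rests on assumptions not stated in the source: the output of one stage is treated as the input of the next, a single integer (a process count) stands in for the resource configuration parameter value, and the callables `pick_processes` and `execute_stage` are hypothetical placeholders for the prediction and execution machinery.

```python
from typing import Callable, List, Tuple

# One training sample: (actual running feature, configuration value, actual time).
Sample = Tuple[float, int, float]


def run_stage_by_stage(
    input_amount: float,
    num_stages: int,
    pick_processes: Callable[[float], int],
    execute_stage: Callable[[float, int], Tuple[float, float, float]],
) -> List[Sample]:
    """Configure and run each stage in turn, recording what actually happened.

    pick_processes(amount) stands in for "predict the running feature, then
    optimize the performance prediction model". execute_stage(amount, processes)
    returns (actual running feature, elapsed time, output data amount).
    """
    samples: List[Sample] = []
    amount = input_amount
    for _ in range(num_stages):
        processes = pick_processes(amount)  # configure this stage only
        actual_feature, elapsed, amount = execute_stage(amount, processes)
        samples.append((actual_feature, processes, elapsed))
    return samples
```

The returned samples would then be added to the training data used to adjust the performance prediction model.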
  • The training process of the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables may be as follows: each time a data processing request is received, the request is run according to the first preset resource configuration parameter value of the current data processing request, and the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current request are recorded; when the number of received data processing requests reaches a first preset number, the preset performance prediction model reference formula is trained on the recorded values to obtain the performance prediction model; the performance prediction model is stored.
  • Before receiving the data processing request of the target service, the distributed data processing system can operate in a model training phase. That is, each time a data processing request is received in this phase, the system may perform the processing of the request based on the first preset resource configuration parameter value of the resource configuration parameter for the current request. The first preset resource configuration parameter value may be a default value, or a value configured by the user for the current data processing request.
  • The data processing request may carry the first preset resource configuration parameter value, or a technician may configure the value before the data processing request is issued. The preset resource configuration parameter values of different data processing requests may be the same or different, as determined by the user's settings.
  • The distributed data processing system may record the first actual running feature value of the current data processing request (that is, the running feature value actually observed when running the request), the first preset resource configuration parameter value, and the first actual service performance value. The first actual service performance value may be the time the system takes to run the current data processing request (the duration from the start of processing to the end), or the cumulative total of the time used by each thread.
  • After the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of a data processing request are recorded, each recorded set of these three values can be taken as a sample for obtaining the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables.
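The two candidate definitions of the actual service performance value mentioned above (elapsed time from start to end, or the cumulative time used by each thread) can differ considerably when threads overlap. The short sketch below computes both from per-thread start and end timestamps; the function names and the `(start, end)` representation are illustrative assumptions.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) of one thread, in seconds


def wall_clock_duration(intervals: Sequence[Interval]) -> float:
    """Time from the earliest start to the latest end."""
    return max(end for _, end in intervals) - min(start for start, _ in intervals)


def cumulative_thread_time(intervals: Sequence[Interval]) -> float:
    """Sum of the time used by every thread, counting overlaps repeatedly."""
    return sum(end - start for start, end in intervals)
```

Whichever definition is chosen has to be used consistently for every recorded sample, since the model is trained to predict that quantity.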
  • The distributed data processing system can track the number of received data processing requests, that is, the number of data processing requests it has processed. When this number reaches the first preset number, the system may analyze the recorded actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests to determine a performance prediction model relating the running feature, the resource configuration parameter, and the service performance.
  • The distributed data processing system may use linear regression to determine the performance prediction model relating the running feature, the resource configuration parameter, and the service performance.
  • A performance prediction model reference formula can be pre-stored in the distributed data processing system. When the number of received data processing requests reaches the first preset number, the preset reference formula may be trained on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests. The reference formula can include the running feature, the resource configuration parameter, the service performance, and some parameters to be trained. During training, the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of each of the first preset number of data processing requests may be used as the values of the running feature, the resource configuration parameter, and the service performance, respectively.
  • The performance prediction model reference formula pre-stored by the distributed data processing system may have a preset functional form.
  • The preset functional form may be linear, or a curve such as a parabola. It is preset by the technician according to the functional relationship that the running feature, the resource configuration parameter, and the service performance are likely to satisfy.
  • The technician may also leave the functional form of the reference formula unspecified, for example by using a neural network model, in which case the parameters to be trained are the parameters of the neural network. That is, the distributed data processing system feeds the actual running feature values, the preset resource configuration parameter values, and the actually detected service performance values of the first preset number of data processing requests into the neural network model and trains its parameters, thereby obtaining the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables.
  • After obtaining the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables, the distributed data processing system may store it, so that when a data processing request is subsequently received, the resource configuration parameter value corresponding to that request can be determined.
  • the process can be specifically implemented by a processor.
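As an illustration of the linear-regression option mentioned above, the sketch below fits a linear reference formula of the form time = w0 + w1·(running feature) + w2·(resource configuration value) to recorded samples by ordinary least squares. It uses only the standard library and solves the normal equations by Gauss-Jordan elimination; the function names and the choice of solver are assumptions made for this example, not the method of the source.

```python
from typing import List, Sequence


def fit_linear(samples: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Least-squares fit of targets ~ w0 + w1*x1 + w2*x2 + ... ."""
    if not samples:
        raise ValueError("at least one sample is required")
    rows = [[1.0, *x] for x in samples]  # prepend the intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y] of the normal equations.
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique model")
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(n):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * p for v, p in zip(a[k], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Evaluate the fitted linear model at one (feature, configuration) point."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))
```

In this sketch the fit would be triggered once the number of recorded requests reaches the first preset number. A purely linear form cannot capture effects such as diminishing returns from more processes, which is one reason the text also allows curved forms or a neural network.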
  • The accuracy of the performance prediction model may be verified.
  • The performance prediction model may be stored when its accuracy meets the requirement.
  • The processing may be as follows: when a data processing request is received, the request is run based on the second preset resource configuration parameter value of the current data processing request, and the second actual running feature value and the second actual service performance value of the current data processing request are collected.
  • performance predictions with operational characteristics, resource configuration parameters, and business performance variables are obtained.
  • After the performance prediction model is obtained, the distributed data processing system can work in a model verification phase. That is, whenever a data processing request is received, the system can run it in the same way as in the model training phase, based on the second preset resource configuration parameter value of the resource configuration parameter for the current data processing request.
  • After processing ends, the distributed data processing system may record the second actual running feature value of the current data processing request (that is, the running feature value actually observed when running the request) and the second actual service performance value (the service performance actually detected). It may then substitute the second actual running feature value as the value of the running feature, and the second preset resource configuration parameter value as the value of the resource configuration parameter, into the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables, and calculate a third service performance value.
  • The system may then calculate the difference between the second actual service performance value and the calculated third service performance value, and use this difference as the accuracy of the performance prediction model under the current data processing request.
  • After the accuracy under each of a second preset number of data processing requests is obtained, the average of these accuracies may be calculated and compared with a preset accuracy threshold. If the average accuracy reaches the preset accuracy threshold, the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables may be stored. If the average accuracy does not reach the preset accuracy threshold, received data processing requests may continue to be processed in the manner of the model training phase, and the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of each request may be recorded.
  • The model is then retrained on the actual running feature values, preset resource configuration parameter values, and actual service performance values of all the data processing requests to obtain a new performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables, and its accuracy is verified again in the manner described above, until the accuracy of the resulting performance prediction model reaches the preset accuracy threshold.
  • the process can be specifically implemented by a processor.
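The verification step can be sketched as below. Because the text uses the difference between the actual and the predicted service performance as the "accuracy", this sketch assumes that a smaller average difference is better and that the model "reaches the threshold" when the mean absolute difference is at or below a limit; that interpretation, and the names `model_is_accurate` and `max_mean_error`, are assumptions.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# One verification record: (actual running feature, preset configuration value,
# actual service performance value).
Record = Tuple[float, float, float]


def model_is_accurate(
    predict: Callable[[float, float], float],
    records: Sequence[Record],
    max_mean_error: float,
) -> bool:
    """Compare predicted and actual performance over the verification requests."""
    errors = [
        abs(actual - predict(feature, config))
        for feature, config, actual in records
    ]
    # statistics.mean raises StatisticsError if no records were collected.
    return mean(errors) <= max_mean_error
```

If the function returns `False`, the system would fall back to the training phase, collect more samples, retrain, and verify again.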
  • The performance prediction model described above, with the running feature, the resource configuration parameter, and the service performance as variables, may also be obtained from historical data of each data processing stage of data processing requests run during a historical time period.
  • Before receiving the data processing request of the target service, the distributed data processing system can work in the model training phase. That is, each time a data processing request is received in this phase, the system can perform the processing of each data processing stage of the request based on the third preset resource configuration parameter value of the resource configuration parameter for the current request, and record the actual running feature value and the actual service performance value of each data processing stage of the current request.
  • These recorded values are used to train the preset performance prediction model reference formula to obtain a performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables, which may then be stored. In other words, during training, the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of each data processing stage of a third preset number of data processing requests serve as the training data for the performance prediction model.
  • The running feature prediction model with the data feature and the running feature as variables may likewise be obtained from the historical values of the data feature and the running feature recorded when data processing requests of each service were run.
  • A data processing request of a target service is received, where the data to be processed of the data processing request includes a plurality of key-value pairs; a running feature value of the data processing request is determined according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the running feature includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and the resource configuration parameter value of each data processing stage is determined according to the running feature value.
  • The distributed data processing system can automatically calculate the resource configuration parameter values of the data processing request, so a technician does not need to enter values for multiple resource configuration parameters one by one before data processing is performed, which improves configuration efficiency.
  • FIG. 4 is a block diagram of an apparatus for performing data processing according to an embodiment of the present invention.
  • the device for performing data processing may be implemented as part or all of the device by software, hardware, or a combination of both.
  • the apparatus for performing data processing according to the embodiment of the present invention may implement the process described in FIG. 3 of the embodiment of the present invention, where the apparatus for performing data processing includes:
  • a receiving module 410, configured to receive a data processing request of the target service, where the data to be processed of the data processing request includes a plurality of key-value pairs; this module may specifically implement the receiving function in the foregoing step 301 and other implicit steps;
  • a determining module 420, configured to determine a running feature value of the data processing request according to an identifier of the target service, a data amount of the data to be processed, and a distribution of the keys of the data to be processed, where the running feature value includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage, and to determine the resource configuration parameter value of each data processing stage according to the running feature value; this module may specifically implement the determining functions in the foregoing steps 302 and 303 and other implicit steps.
  • the determining module 420 is configured to:
  • the device further includes:
  • a running module 430, configured to run the data processing request according to the resource configuration parameter values after the resource configuration parameter value of each data processing stage is determined;
  • a statistics module 440, configured to collect the actual running feature value and the actual service performance value of each data processing stage;
  • an adjusting module 450, configured to adjust the performance prediction model according to the resource configuration parameter values, the actual running feature values, and the actual service performance values.
  • the apparatus further includes:
  • the recording module 460 is configured to: when receiving the data processing request, run the data processing request according to the first preset resource configuration parameter value of the current data processing request, and collect the first actual running characteristic value of the current data processing request.
  • the training module 470 is configured to: when the number of received data processing requests reaches a first preset number, train the preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;
  • the storage module 480 is configured to store the performance prediction model.
  • the apparatus further includes:
  • the verification module 490 is configured to: when receiving the data processing request, run the data processing request based on the second preset resource configuration parameter value of the current data processing request, and collect the second actual operation of the current data processing request.
  • the storage module 480 is configured to store the performance prediction model if the average accuracy reaches a preset accuracy threshold.
  • The foregoing determining module 420, running module 430, statistics module 440, adjusting module 450, recording module 460, training module 470, verification module 490, and storage module 480 may be implemented by a processor, by a processor in combination with a memory, or by a processor executing program instructions stored in a memory; the receiving module 410 may be implemented by a transceiver.
  • When the apparatus for performing data processing in the foregoing embodiment performs data processing, the division into the foregoing functional modules is used only as an example.
  • In practice, the foregoing functions may be assigned to different functional modules as needed; that is, the internal structure of the server may be divided into different functional modules to complete all or part of the functions described above.
  • The apparatus for performing data processing provided by the foregoing embodiment and the method embodiment for performing data processing belong to the same concept. The specific implementation process is described in detail in the method embodiment, and details are not repeated here.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Abstract

Disclosed are a method and apparatus for performing data processing, which belong to the technical field of the Internet. The method comprises: receiving a data processing request of a target service, wherein data to be processed of the data processing request includes a plurality of key-value pairs; determining an operation feature value of the data processing request according to an identifier of the target service, a data amount of the data to be processed and a distribution condition of keys of the data to be processed, wherein the operation feature includes the complexity of each data processing stage of the data processing request, and a ratio of data output to data input of each data processing stage; and according to the operation feature value, determining a resource configuration parameter value of each data processing stage. In the present invention, resource configuration parameter values of data processing stages of the data processing request are automatically calculated, without needing any technical personnel to input, before the data processing request is operated, the resource configuration parameter values one by one for a plurality of resource configuration parameters, and thus configuration efficiency may be improved.

Description

Method and apparatus for performing data processing
Technical field
The present invention relates to the field of Internet technologies, and in particular, to a method and apparatus for performing data processing.
Background
To improve the speed of data processing, technicians often use a distributed data processing system to process data. Such a system can split data into multiple data blocks of a preset size and store them on multiple nodes (the distributed data processing system is a server cluster comprising multiple servers, and a node can be one server in the system), and each node can process a portion of the data in parallel.
A distributed data processing system has many resource configuration parameters, and these parameters affect how efficiently the system processes data. They can include the data block size, the total number of processes that process data, the number of processes that process data on each node, and so on. Before the distributed data processing system performs data processing for a given service, a technician can, based on experience, set each resource configuration parameter to a value suited to that service. Then, whenever the system receives a data processing request for a given service (for example, sorting the data in the distributed data processing system, or determining the maximum value of the data in the system), it can perform the processing corresponding to the request based on the preset resource configuration parameter values.
In the course of implementing the present invention, the inventors found that the prior art has at least the following problem:
With the above approach, a technician must configure the resource configuration parameters before the distributed data processing system performs data processing. The number of resource configuration parameters is often large, and the technician has to enter them one by one, which makes configuration inefficient.
Summary of the invention
To improve configuration efficiency, embodiments of the present invention provide a method and apparatus for performing data processing. The technical solutions are as follows:
According to a first aspect, a method for performing data processing is provided. The method includes:
receiving a data processing request of a target service, where the data to be processed of the data processing request includes a plurality of key-value pairs; determining a running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the running feature value may include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and then determining the resource configuration parameter value of each data processing stage according to the running feature value.
In the solution shown in this embodiment of the present invention, a user may send a data processing request of a target service to a server through a console. After receiving the request, the server may determine the running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, and then determine the resource configuration parameter value of each data processing stage according to the determined running feature. In this way, whenever a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter value of each data processing stage of the request, so a technician does not need to enter values for multiple resource configuration parameters one by one before data processing is performed, which improves configuration efficiency.
In a possible implementation, determining the resource configuration parameter value of each data processing stage according to the running feature value includes: determining the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value, where the performance prediction model is obtained by training on historical data.
In the solution shown in this embodiment of the present invention, a performance prediction model may be pre-stored in the server. After receiving the data processing request of the target service, the server may determine the resource configuration parameter value of each data processing stage according to the pre-stored performance prediction model and the determined running feature value.
In a possible implementation, determining the running feature value of the data processing request according to the identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data, where the running feature includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage, includes: determining a data feature value of the to-be-processed data of each data processing stage of the data processing request; determining a target running feature prediction model corresponding to the target service according to a pre-stored correspondence between services and running feature prediction models that take the data feature and the running feature as variables; and, for each data processing stage, calculating, according to the target running feature prediction model, the value taken by the running feature when the data feature takes the data feature value of that data processing stage, to obtain the running feature value of the data processing stage. Determining the resource configuration parameter value of each data processing stage according to the running feature value includes: for each data processing stage, calculating, according to a pre-stored performance prediction model that takes the running feature, the resource configuration parameter, and service performance as variables, the resource configuration parameter value at which service performance reaches its optimal value when the running feature takes the running feature value of that data processing stage, to obtain the resource configuration parameter value corresponding to the data processing stage.
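The second step above, choosing the resource configuration at which predicted service performance is optimal for a given running feature value, can be sketched as a search over candidate configurations. The following Python sketch is illustrative only: the exhaustive grid search, the model interface `performance_model(running_features, config)`, the parameter names `block_size_mb` and `num_processes`, and the assumption that a higher score means better performance are all hypothetical and are not specified by the source.

```python
from itertools import product


def best_resource_config(performance_model, running_features, search_space):
    """Return the candidate configuration with the highest predicted performance.

    performance_model: callable(running_features, config) -> float (assumed interface)
    search_space: dict mapping parameter name -> list of candidate values
    """
    names = sorted(search_space)
    best_config, best_score = None, float("-inf")
    # Enumerate every combination of candidate parameter values.
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = performance_model(running_features, config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config


def toy_performance_model(running_features, config):
    # Hypothetical stand-in for a trained model: predicted performance peaks at
    # 128 MB blocks and at 4 processes per unit of stage complexity.
    ideal_processes = 4 * running_features["complexity"]
    return (-abs(config["block_size_mb"] - 128)
            - 10 * abs(config["num_processes"] - ideal_processes))


map_stage_features = {"complexity": 1.0, "ratio": 0.5}
candidates = {"block_size_mb": [64, 128, 256], "num_processes": [2, 4, 8]}
chosen = best_resource_config(toy_performance_model, map_stage_features, candidates)
```

A grid search is used here only because it is easy to follow; any optimizer that evaluates the stored model with the running feature fixed would fill the same role.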
In the solution shown in this embodiment of the present invention, after receiving the data processing request of the target service, the server may determine the data feature value of each data processing stage of the data processing request, where the data feature value includes the data volume of the to-be-processed data and the distribution of the keys of the to-be-processed data. Then, for each data processing stage, the server may substitute the data feature value of that stage, as the value of the data feature, into the target running feature prediction model corresponding to the target service and calculate the value of the running feature, thereby obtaining the running feature value of each data processing stage.
The server may store a performance prediction model that takes the running feature, the resource configuration parameter, and service performance as variables. When a data processing request of the target service is received, the server may determine, for each preset data processing stage, the running feature value of that stage of the data processing request, and may then determine the resource configuration parameter value of each data processing stage. The server can thus perform the processing of each data processing stage based on the resource configuration parameter value calculated for that stage, which can improve the processing efficiency of each data processing stage. The running feature value corresponding to each data processing stage may include the data volume of the to-be-processed data of that stage and the distribution of the keys of the to-be-processed data of that stage.
In a possible implementation, after the resource configuration parameter value of each data processing stage is determined, the method further includes: running the data processing request according to the resource configuration parameter values; collecting statistics on the actual running feature value and the actual service performance value of each data processing stage; and adjusting the performance prediction model according to the resource configuration parameter values, the actual running feature values, and the actual service performance values.
In the solution shown in this embodiment of the present invention, after the resource configuration parameter values are determined, the data processing request may be run based on them, and statistics may be collected on the actual running feature value and the actual service performance value of each data processing stage. Finally, the server may use the resource configuration parameter value, actual running feature value, and actual service performance value collected for each data processing stage as training data to retrain the existing performance prediction model. In this way, the accuracy of the performance prediction model can keep improving as more requests are processed.
In a possible implementation, the method further includes: whenever a data processing request is received, running the data processing request according to a first preset resource configuration parameter value of the current data processing request, and recording a first actual running feature value, the first preset resource configuration parameter value, and a first actual service performance value of the current data processing request; when the number of received data processing requests reaches a first preset number, training a preset base formula of the performance prediction model on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model; and storing the performance prediction model.
In the solution shown in this embodiment of the present invention, before enough samples have been obtained, the server may, on receiving a data processing request, perform the processing corresponding to the current data processing request based on preset resource configuration parameter values, and record the actual running feature value, preset resource configuration parameter value, and actual service performance value of that request. Once enough actual running feature values, preset resource configuration parameter values, and actual service performance values have been collected, the server may train, based on these historical run records, a performance prediction model relating the running feature, the resource configuration parameter, and service performance, and store it, so that when a data processing request is subsequently received, the server can automatically calculate the resource configuration parameter value of each data processing stage of the request.
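The bootstrap behaviour described above (run with preset values, record each run, train once the first preset number of records exists) might be organized as in the following Python sketch. The class name `BootstrapTrainer`, the `train_fn` interface, and the trivial mean-based trainer are hypothetical placeholders; the source does not specify the form of the model's base formula or the training procedure.

```python
class BootstrapTrainer:
    """Collects run records under preset configurations, then trains once."""

    def __init__(self, first_preset_number, train_fn):
        self.first_preset_number = first_preset_number
        self.train_fn = train_fn  # callable(list of records) -> model (assumed interface)
        self.records = []
        self.model = None

    def record_run(self, running_features, config, performance):
        """Store one run record; train the model when enough records exist."""
        self.records.append((running_features, config, performance))
        if self.model is None and len(self.records) >= self.first_preset_number:
            self.model = self.train_fn(self.records)
        return self.model


def train_mean_model(records):
    # Hypothetical trainer: the "model" predicts the mean observed performance.
    mean_performance = sum(perf for _features, _config, perf in records) / len(records)
    return lambda running_features, config: mean_performance


trainer = BootstrapTrainer(first_preset_number=3, train_fn=train_mean_model)
preset_config = {"block_size_mb": 64}
first = trainer.record_run({"complexity": 0.1}, preset_config, 10.0)
second = trainer.record_run({"complexity": 0.2}, preset_config, 20.0)
third = trainer.record_run({"complexity": 0.3}, preset_config, 30.0)
```

Until the third record arrives, `record_run` returns `None`, which a caller could treat as the signal to keep using the preset resource configuration parameter values.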
In a possible implementation, storing the performance prediction model includes: whenever a data processing request is received, running the data processing request based on a second preset resource configuration parameter value of the current data processing request, and recording a second actual running feature value and a second actual service performance value of the current data processing request; determining a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value; determining, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model for the current data processing request; when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculating the average accuracy of the performance prediction model from its accuracy for each of the second preset number of data processing requests; and if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.
In the solution shown in this embodiment of the present invention, after obtaining the performance prediction model that takes the running feature, the resource configuration parameter, and service performance as variables, the server may also verify its accuracy. Only when the accuracy reaches the preset accuracy threshold does the server use this performance prediction model to automatically determine the resource configuration parameter values corresponding to a data processing request and then perform the processing corresponding to the request based on the automatically determined values. This improves the accuracy of the performance prediction model that is finally stored and, in turn, the efficiency of the processing corresponding to data processing requests.
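The validation step can be illustrated with a short Python sketch. The source says only that accuracy is determined from the difference between the actual and predicted service performance values; the specific relative-error measure used below, and the function names, are assumptions made for illustration.

```python
def request_accuracy(actual, predicted):
    """Accuracy for one request as 1 minus relative error (assumed definition)."""
    if actual == 0:
        return 1.0 if predicted == 0 else 0.0
    return max(0.0, 1.0 - abs(actual - predicted) / abs(actual))


def should_store_model(pairs, second_preset_number, accuracy_threshold):
    """Decide whether the trained model is accurate enough to be stored.

    pairs: list of (actual_performance, predicted_performance), one per request
    received after the model was trained.
    """
    if len(pairs) < second_preset_number:
        return False  # not enough validation requests yet
    validation = pairs[:second_preset_number]
    average_accuracy = sum(request_accuracy(a, p) for a, p in validation) / second_preset_number
    return average_accuracy >= accuracy_threshold


observed_pairs = [(100.0, 90.0), (200.0, 210.0), (50.0, 50.0)]
store_at_90 = should_store_model(observed_pairs, 3, 0.90)
store_at_96 = should_store_model(observed_pairs, 3, 0.96)
```

With these three validation requests the per-request accuracies are about 0.90, 0.95, and 1.0, so the average of about 0.95 passes a 0.90 threshold but not a 0.96 threshold.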
In a second aspect, a server is provided. The server includes a processor, a memory, and a transceiver. The processor is configured to execute instructions stored in the memory, and by executing the instructions the processor implements the data processing method provided in the first aspect.
In a third aspect, an apparatus for performing data processing is provided. The apparatus includes at least one module, and the at least one module is configured to implement the data processing method provided in the first aspect.
The technical effects obtained by the second and third aspects of the embodiments of the present invention are similar to those obtained by the corresponding technical means of the first aspect, and are not described again here.
The technical solutions provided in the embodiments of the present invention bring the following beneficial effects:
In the embodiments of the present invention, a data processing request of a target service is received, where the to-be-processed data of the data processing request includes multiple key-value pairs; a running feature value of the data processing request is determined according to the identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data, where the running feature includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and a resource configuration parameter value of each data processing stage is determined according to the running feature value. In this way, whenever a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter values of the request. A technician does not need to enter a value for each of the multiple resource configuration parameters one by one before data processing is performed, which improves configuration efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system framework according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a server according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for performing data processing according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
An embodiment of the present invention provides a method for performing data processing. The method may be implemented by a server. The server may be a server group composed of multiple servers, that is, a distributed data processing system, as shown in FIG. 1, or it may be a single server. Preferably, the server is a distributed data processing system. When a data processing request is received, the distributed data processing system may divide the to-be-processed data into multiple data blocks according to a preset data block size, and each server may then process a portion of the data in parallel.
The server may include a transceiver 210, a processor 220, and a memory 230. The memory 230 and the transceiver 210 may each be connected to the processor 220, as shown in FIG. 2. The transceiver 210 may be used to receive messages or data and may include, but is not limited to, at least one amplifier, a tuner, one or more oscillators, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In the present invention, the transceiver 210 may be used to receive data processing requests. The processor 220 may include one or more processing units. The processor 220 may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP), or it may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device. Specifically, a program may include program code, and the program code includes computer operation instructions. The memory 230 may be used to store software programs and modules, and the processor 220 performs data processing by reading the software code and modules stored in the memory 230.
The processing flow shown in FIG. 3 is described in detail below with reference to specific implementations. The following takes the case in which the server is a distributed data processing system as an example; other cases are similar and are not described again. The content may be as follows:
Step 301: The distributed data processing system receives a data processing request of a target service, where the to-be-processed data of the data processing request includes multiple key-value pairs.
The target service may be any service that the distributed data processing system can perform; for example, the target service may be determining the maximum value in the to-be-processed data. The data processing request of the target service may be a request for performing target-service processing on specific to-be-processed data, for example, a request for sorting specified to-be-processed data.
In implementation, to increase data processing speed, technicians often use a distributed data processing system to process data. Specifically, a technician may send, through a console, a data processing request corresponding to a certain service (which may be called the target service) to the distributed data processing system. The data processing request may specify its to-be-processed data (for example, the request may carry the storage address of the to-be-processed data). In addition, the to-be-processed data corresponding to the data processing request may be a series of key-value pairs, and processing the to-be-processed data generally means processing the keys. After receiving the data processing request of the target service, the distributed data processing system may determine the data feature value of the to-be-processed data of the request, where the data feature value may include the data volume of the to-be-processed data and the distribution of the keys of the to-be-processed data.
Specifically, the distributed data processing system may determine the distribution of the keys of the to-be-processed data as follows. A data distribution model may be preset in the system, and once the to-be-processed data has been identified, the distribution of the to-be-processed data is determined from the distribution of all the keys. When the preset data distribution model is a uniform distribution, the system may calculate the mean of the keys and use it as the distribution of the keys of the to-be-processed data. When the preset data distribution model is a normal distribution, the system may calculate the mean and standard deviation of the keys and use them as the distribution of the keys of the to-be-processed data. In addition, to reduce the amount of computation, the system may sample the to-be-processed data at a preset sampling rate and determine the distribution of the keys of the to-be-processed data of the request from the distribution of the keys of the sampled portion.
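As a rough illustration of this step, the following Python sketch summarizes numeric keys under a preset distribution model, optionally on a sample. The function name, the model labels `"uniform"` and `"normal"`, the use of systematic (every k-th key) sampling, and the use of the population standard deviation are all assumptions; the source does not say how sampling is performed or which standard deviation estimator is used.

```python
import statistics


def key_distribution(keys, model="normal", sampling_rate=1.0):
    """Summarize numeric keys under a preset distribution model.

    sampling_rate: fraction of keys to inspect, in (0, 1]; systematic sampling
    (every k-th key) is an assumed choice that keeps the result deterministic.
    """
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    step = max(1, round(1 / sampling_rate))
    sample = keys[::step]
    mean = statistics.fmean(sample)
    if model == "uniform":
        return {"mean": mean}
    if model == "normal":
        # Population standard deviation of the inspected keys.
        return {"mean": mean, "std": statistics.pstdev(sample)}
    raise ValueError(f"unsupported distribution model: {model!r}")


keys = [2, 4, 4, 4, 5, 5, 7, 9]
normal_summary = key_distribution(keys, model="normal")
uniform_summary = key_distribution(keys, model="uniform")
sampled_summary = key_distribution(keys, model="uniform", sampling_rate=0.5)
```

With a sampling rate of 0.5 only every second key is inspected, which halves the work at the cost of a less precise estimate.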
This processing step may specifically be implemented by the transceiver.
Step 302: The distributed data processing system determines a running feature value of the data processing request according to the identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data, where the running feature value includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage.
In implementation, after receiving the data processing request, the distributed data processing system may determine the identifier of the target service of the request, and may then determine the running feature value of the request according to the determined identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data.
Specifically, the running feature value may be the ratio of the data volume of the output data produced by the processing corresponding to the data processing request to the data volume of the input data (that is, the to-be-processed data) of that processing. This ratio may be denoted by ratio, and the ratio for a given data processing request is the value of ratio. The running feature value may also be the complexity of the processing corresponding to the data processing request. Complexity may be denoted by complexity, and the complexity of the processing corresponding to a given request is the value of complexity. Here, complexity indicates the computational complexity of the processing corresponding to the data processing request, that is, the computational (or algorithmic) complexity of the service processing logic of the service corresponding to the request.
The complexity of a data processing request may be the ratio of the effective CPU (central processing unit) time t used by the distributed data processing system to perform the processing corresponding to the request to the data volume size of the to-be-processed data of the request (that is, complexity = t/size). In addition, when performing the processing corresponding to a data processing request, the distributed data processing system may divide the to-be-processed data into multiple data blocks of a preset data block size (the default data block size may be 64 MB) and store the data blocks across the servers. The embodiments of the present invention do not limit how the data blocks are allocated; for example, the system may distribute the data blocks evenly among the servers. Each server may then simultaneously perform the processing of the service on the data blocks it stores (for example, sorting the data in its stored data blocks, or finding the maximum value of the locally stored data), and while processing data each server may start multiple threads to process its data blocks concurrently. In this case, complexity may also be the ratio of the total effective CPU time T, accumulated over the time used by every thread on every server, to the data volume size of the to-be-processed data of the request (that is, complexity = T/size). For example, if the processing corresponding to the data processing request is divided among N parallel processes or threads, the time used by each process or thread is denoted by t_i, and the data volume of the to-be-processed data of the request is denoted by size (where size is the sum of the data volumes to be processed by all processes), then complexity may be given by formula (1).
$$\mathrm{complexity} = \frac{\sum_{i=1}^{N} t_i}{\mathrm{size}} \qquad (1)$$
In addition, because the clock frequencies of the servers' CPUs may differ, complexity may further be expressed by formula (2).
[Formula (2): Figure PCTCN2016107948-appb-000002]
Here, t_i denotes the effective CPU time used by the i-th process or thread, and cpu_frequency_i denotes the clock frequency of the CPU of the server on which the i-th process or thread runs.
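The complexity calculation can be sketched in Python as follows. The unweighted branch follows the definition complexity = T/size given in the text. The frequency-weighted branch is only an assumption: formula (2) survives in the source as an image reference, so the sketch guesses that each CPU time is multiplied by the corresponding clock frequency (giving a cycle count) before summing. The function name and argument names are hypothetical.

```python
def complexity(cpu_times, size, cpu_frequencies=None):
    """Complexity of a request from per-process (or per-thread) CPU times.

    cpu_times: effective CPU time t_i of each of the N processes or threads
    size: total data volume of the to-be-processed data
    cpu_frequencies: optional clock frequency of the CPU running each process;
        weighting t_i by frequency is an ASSUMED reading of formula (2).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if cpu_frequencies is None:
        # Total effective CPU time divided by data volume: sum(t_i) / size.
        return sum(cpu_times) / size
    if len(cpu_frequencies) != len(cpu_times):
        raise ValueError("need one frequency per CPU time")
    weighted = sum(t * f for t, f in zip(cpu_times, cpu_frequencies))
    return weighted / size


plain = complexity([2.0, 3.0, 5.0], size=100.0)
weighted = complexity([2.0, 3.0, 5.0], size=100.0, cpu_frequencies=[2.0, 2.0, 1.0])
```

Under this assumption, time spent on a faster CPU counts for more work, so runs measured on heterogeneous servers become comparable.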
Specifically, the distributed data processing system may preset the data processing stages used to perform the processing corresponding to a data processing request, where the data processing stages corresponding to a given service are fixed. When every data processing stage has completed its processing, the processing corresponding to the data processing request is complete. The output data of one data processing stage is the to-be-processed data (that is, the input data) of the next data processing stage. For example, when the processing corresponding to a data processing request comprises two data processing stages, the to-be-processed data specified by the request is the input data of the first data processing stage, and the output data of the first data processing stage is the input data of the second data processing stage. Take Hadoop, an application that enables a distributed data processing system to process data in a distributed manner, as an example. In Hadoop, the data processing stages used to perform the processing corresponding to a data processing request are the map stage and the reduce stage. Whenever the distributed data processing system receives a data processing request, it may first perform map-stage processing on the to-be-processed data specified by the request and then perform reduce-stage processing on the output data of the map stage. As a further example, suppose the processing corresponding to the currently received data processing request is to determine the maximum value in the to-be-processed data. After receiving the request, the system may, in the map stage, start multiple map threads at the same time, each of which processes a portion of the to-be-processed data and obtains the maximum value of that portion. Then, in the reduce stage, the system determines the maximum among the outputs of the map threads to obtain the final maximum value of the to-be-processed data.
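The maximum-value example can be mimicked in a few lines of Python. This sketch is not Hadoop code: it uses a local thread pool in place of map threads spread across servers, and the function names and the tiny block size are hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Divide key-value records into blocks of at most block_size records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_stage(block):
    """Map stage: maximum key within one block."""
    return max(key for key, _value in block)


def reduce_stage(partial_maxima):
    """Reduce stage: maximum over the outputs of all map tasks."""
    return max(partial_maxima)


def find_max_key(records, block_size=2, workers=4):
    blocks = split_into_blocks(records, block_size)
    # Each block is handled by a separate map task running in the thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_maxima = list(pool.map(map_stage, blocks))
    return reduce_stage(partial_maxima)


records = [(3, "a"), (9, "b"), (1, "c"), (7, "d"), (5, "e")]
largest = find_max_key(records)
```

Here the map stage turns five records into three partial maxima, so the reduce stage receives far less data than the map stage did, which is the kind of output-to-input ratio the running feature is meant to capture.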
In this case, after receiving the data processing request of the target service, the distributed data processing system determines the running feature value corresponding to each data processing stage separately. Where the running feature includes the ratio and the complexity, the system may, after receiving the data processing request, determine the value of the ratio and the value of the complexity for each data processing stage. Specifically, the running feature value of each data processing stage may be the ratio of the data volume of the output data of that stage to the data volume of its input data (that is, its to-be-processed data), together with the complexity of that stage. The complexity of a data processing stage may be the ratio of the total effective CPU time used by the system to perform the processing of that stage to the data volume of the to-be-processed data of that stage. The complexity of each data processing stage may also be calculated according to formula (3), where x denotes a data processing stage, t_xi denotes the time used by the i-th process of data processing stage x to process data, cpu_frequency_i denotes the clock frequency of the CPU of the server on which the i-th process or thread runs, and size_x denotes the data volume of the to-be-processed data of data processing stage x.
Figure PCTCN2016107948-appb-000003
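The image of formula (3) is not reproduced in this text. The sketch below computes the two running feature values under the assumption that formula (3) sums t_xi multiplied by cpu_frequency_i over the processes of stage x and divides the result by size_x. That reading is inferred from the symbol definitions and is not confirmed by the source; the function names and units are illustrative.

```python
def stage_ratio(output_size, input_size):
    """Ratio of a stage's output data amount to its input data amount."""
    return output_size / input_size


def stage_complexity(process_times, cpu_frequencies, input_size):
    """Assumed reading of formula (3): sum_i(t_xi * cpu_frequency_i) / size_x.

    process_times[i] is the time used by the i-th process of the stage and
    cpu_frequencies[i] is the clock frequency of the CPU it ran on.
    """
    weighted_time = sum(t * f for t, f in zip(process_times, cpu_frequencies))
    return weighted_time / input_size


if __name__ == "__main__":
    # Two processes: 2 s on a 2 GHz CPU and 3 s on a 1 GHz CPU, 1e9 bytes of input.
    print(stage_ratio(250, 1000))
    print(stage_complexity([2.0, 3.0], [2e9, 1e9], 1e9))
```

Weighting each process time by clock frequency makes measurements from servers of different speeds comparable, which is presumably why the frequency appears among the symbols.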
In addition, the running feature value of each data processing stage may be predicted from the data feature value of the to-be-processed data of that stage. Accordingly, the processing may be as follows: determine the data feature value of the to-be-processed data of each data processing stage of the data processing request; determine the target running feature prediction model corresponding to the target service according to a pre-stored correspondence between services and running feature prediction models that take the data feature and the running feature as variables; for each data processing stage, use the target running feature prediction model to calculate the running feature value of the running feature when the data feature takes the data feature value of that stage, thereby obtaining the running feature value of the data processing stage.
The data feature may be a variable that characterizes the data feature information of the to-be-processed data, and the running feature may be a variable that characterizes the running feature information corresponding to the data processing request.
In implementation, a data distribution model may be pre-stored in the distributed data processing system. The data distribution model may be a uniform distribution, a normal distribution, or another data distribution commonly used in mathematics. Specifically, the system may pre-store formulas for calculating the parameters of the data distribution model. For example, it may pre-store the formula for calculating the mean of the uniform distribution, or the formulas for calculating the mean and standard deviation of the normal distribution.
When the distributed data processing system receives a data processing request of the target service, it may determine the to-be-processed data of each data processing stage of the request, and then determine the data feature information (that is, the data feature value) of that data. Specifically, the data feature value of each data processing stage may be the data amount of the stage's to-be-processed data and/or the distribution of its keys (the key distribution may be the value corresponding to the data distribution keydist).
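As an illustration of how data feature values might be extracted, the sketch below computes a data amount and a key-distribution summary from a list of key-value pairs. It assumes, hypothetically, that keydist is summarized by the mean and standard deviation of the per-key record counts (the parameters of a normal distribution); the source does not fix how keydist is encoded.

```python
from collections import Counter
from statistics import mean, pstdev


def data_features(records):
    """Return the data amount and a key-distribution summary for (key, value) pairs.

    The data amount is approximated by the total length of keys and values.
    keydist is a hypothetical encoding: (mean, standard deviation) of the
    number of records per key.
    """
    size = sum(len(key) + len(value) for key, value in records)
    records_per_key = list(Counter(key for key, _ in records).values())
    return {"size": size, "keydist": (mean(records_per_key), pstdev(records_per_key))}


if __name__ == "__main__":
    sample = [("a", "xx"), ("a", "yy"), ("b", "zz"), ("a", "ww")]
    print(data_features(sample))
```

A heavily skewed key distribution shows up as a large standard deviation, which is the kind of signal a running feature prediction model could use.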
The distributed data processing system may pre-store a correspondence between service identifiers and running feature prediction models that take the data feature and the running feature as variables, as shown in Table 1. The service identifier may be a service name, the program name or class name of the software program corresponding to the service, or any other identifier that distinguishes the target service from other services. The service identifier may be named and maintained manually, or may be extracted from the data processing request of the service according to a specific rule. The running feature prediction model may be a function with the data feature and the running feature as variables: the data feature is the independent variable and the running feature is the dependent variable, that is, running feature = f(data feature), where f() denotes the functional relationship between the data feature and the running feature. Specifically, when the data feature includes the data amount and the data distribution, and the running feature includes the complexity and the ratio, the running feature prediction model may be (complexity, ratio) = f(size, keydist).
After determining the data feature value of each data processing stage, the distributed data processing system may look up the target running feature prediction model corresponding to the target service in the correspondence shown in Table 1. Then, for each data processing stage, the system may substitute the data feature value of that stage into the target running feature prediction model as the value of the data feature and calculate the running feature value, thereby obtaining the running feature value of each data processing stage.
Table 1
Service identifier | Running feature prediction model
Service identifier 1 | Prediction model 1
Service identifier 2 | Prediction model 2
Service identifier 3 | Prediction model 3
This processing may specifically be implemented by a processor.
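The lookup in Table 1 followed by per-stage prediction can be sketched as a dictionary from service identifier to a function of the form (complexity, ratio) = f(size, keydist). Everything concrete here is hypothetical: the service identifiers, the coefficients inside the model, and the choice of a single number for keydist.

```python
def example_model(size, keydist):
    """Hypothetical running feature prediction model with made-up coefficients."""
    complexity = 50.0 + 2.0 * keydist
    ratio = 0.1
    return complexity, ratio


# Hypothetical stand-in for Table 1: service identifier -> prediction model.
MODELS_BY_SERVICE = {"service-1": example_model}


def predict_running_features(service_id, stage_data_features):
    """Apply the service's model to the data feature values of each stage."""
    model = MODELS_BY_SERVICE[service_id]
    return {
        stage: model(features["size"], features["keydist"])
        for stage, features in stage_data_features.items()
    }


if __name__ == "__main__":
    stages = {"map": {"size": 1000, "keydist": 1.5}}
    print(predict_running_features("service-1", stages))
```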
Step 303: The distributed data processing system determines the resource configuration parameter value of each data processing stage according to the running feature value.
In implementation, after determining the running feature value of each data processing stage, the distributed data processing system may determine the resource configuration parameter value of each stage according to that stage's running feature value. In addition, the system may also take the data amount of the to-be-processed data into account, that is, it may determine the resource configuration parameter value of each data processing stage according to both the running feature value and the data amount of the to-be-processed data.
Optionally, the distributed data processing system may determine the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value of each stage, where the performance prediction model is obtained by training on historical data.
Specifically, for each data processing stage, the distributed data processing system uses a pre-stored performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables. With the running feature set to the running feature value of that stage, the system calculates the value of the resource configuration parameter at which the service performance reaches its optimal value. The result is the resource configuration parameter value of the data processing stage.
The resource configuration parameter may be a configuration parameter of the distributed data processing system that affects the efficiency with which the system performs the processing corresponding to a data processing request. Examples include the data block size, the total number of processes that process data, and the number of processes that process data on each node. The service performance may be a parameter that characterizes the efficiency with which the system performs the processing corresponding to the data processing request, for example the time used to perform that processing.
In implementation, the distributed data processing system may pre-store a performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables. This performance prediction model has no direct correspondence with a service, but is directly related to one specific relationship of a service. The model may be a function with the running feature, the resource configuration parameter, and the service performance as variables. After determining the running feature value of each data processing stage, for each stage the system may input the running feature value of that stage into the performance prediction model as the value of the running feature. At this point only the service performance and the resource configuration parameter remain as variables in the model. The system may then calculate the value of the resource configuration parameter at which the service performance reaches its optimal value (for example, its minimum), which gives the resource configuration parameter value for that data processing stage.
This processing may specifically be implemented by a processor.
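One simple way to find the resource configuration parameter values that optimize the predicted service performance is an exhaustive search over candidate values. The sketch below assumes that service performance is a running time to be minimized; the functional form of predicted_time, its constants, and the candidate lists are all made up for illustration and are not the source's model.

```python
from itertools import product


def predicted_time(complexity, ratio, block_size_mb, total_processes):
    """Hypothetical performance prediction model (illustrative form and constants)."""
    compute = complexity * 1000.0 / total_processes   # work shared across processes
    scheduling = 0.5 * total_processes                # overhead grows with processes
    io = ratio * 200.0 / block_size_mb + 0.01 * block_size_mb
    return compute + scheduling + io


def best_resource_config(complexity, ratio, block_sizes_mb, process_counts):
    """Fix the running feature values, then pick the candidate with minimum predicted time."""
    return min(
        product(block_sizes_mb, process_counts),
        key=lambda config: predicted_time(complexity, ratio, *config),
    )


if __name__ == "__main__":
    print(best_resource_config(2.0, 0.5, [64, 128, 256], [8, 16, 32, 64, 128]))
```

A grid search is adequate when there are few parameters with few candidate values; a larger parameter space would call for a numerical optimizer instead.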
Optionally, once the resource configuration parameter value of each data processing stage has been determined, the corresponding processing may be as follows: run the data processing request according to the resource configuration parameter values, and collect the actual running feature value and the actual service performance value of each data processing stage; adjust the performance prediction model according to the resource configuration parameter values, the actual running feature values, and the actual service performance values.
In implementation, after determining the resource configuration parameter value corresponding to each data processing stage, the distributed data processing system may perform the processing of each stage based on that stage's resource configuration parameter value. In addition, where the running feature value is determined from the data feature value of each data processing stage as described above, the specific processing may be as follows. When the system receives a data processing request of the target service, it may first determine the data feature value of the to-be-processed data of the first data processing stage (this is the data feature value of the to-be-processed data specified by the data processing request). It may then determine, in the manner described above, the running feature value of the target service for that stage and the resource configuration parameter value of that stage, and perform the processing of that stage based on that resource configuration parameter value.
Next, in the same manner, the system determines the resource configuration parameter value of the next data processing stage and performs the processing of that stage based on the determined value, and so on until the processing of all data processing stages is complete.
After running the data processing request, the distributed data processing system may collect the actual running feature value and the actual service performance value of each data processing stage. It may then use the resource configuration parameter value, the actual running feature value, and the actual service performance value of each data processing stage of the request as training data to retrain the performance prediction model.
This processing may specifically be implemented by a processor.
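The stage-by-stage flow described above (predict, configure, execute, record, then move to the next stage) can be sketched as a loop. The three callables are hypothetical hooks standing in for the prediction model, the optimizer, and the execution engine; their signatures are assumptions made for this illustration.

```python
def run_request(stages, first_data_features, predict_features, choose_config, execute_stage):
    """Run each stage with its own configuration and collect retraining samples.

    predict_features(stage, data_features) -> predicted running feature values
    choose_config(stage, predicted)        -> resource configuration parameter values
    execute_stage(stage, config)           -> (actual running feature values,
                                               actual service performance value,
                                               data feature values of the stage output)
    """
    samples = []
    data_features = first_data_features
    for stage in stages:
        predicted = predict_features(stage, data_features)
        config = choose_config(stage, predicted)
        actual_features, actual_performance, data_features = execute_stage(stage, config)
        # Each record can later serve as training data for the performance model.
        samples.append((stage, config, actual_features, actual_performance))
    return samples
```

The output data features of one stage feed the prediction for the next, which is why the configuration of the reduce stage cannot be fixed before the map stage has run.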
Optionally, the performance prediction model described above, which takes the running feature, the resource configuration parameter, and the service performance as variables, may be trained as follows: whenever a data processing request is received, run the data processing request according to the first preset resource configuration parameter value of the current request, and record the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current request; when the number of received data processing requests reaches a first preset number, train a preset performance prediction model baseline formula on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests to obtain the performance prediction model; store the performance prediction model.
In implementation, before receiving the data processing request of the target service, the distributed data processing system may operate in a model training phase. That is, before the data processing request of the target service is received, whenever the system receives a data processing request it may perform the corresponding processing based on the first preset resource configuration parameter value of the resource configuration parameter for the current request. The first preset resource configuration parameter value may be the default value of the resource configuration parameter, or a value that the user has configured for the current data processing request (in this case the value may be carried in the data processing request, or may have been configured by a technician before the request was issued). The preset resource configuration parameter values of different data processing requests may be the same or different, as determined by the user's settings.
After performing the processing corresponding to the current data processing request, the system may record the first actual running feature value of the request (that is, the running feature value actually observed for the current request), the first preset resource configuration parameter value, and the first actual service performance value (that is, the time the system used to run the current request, measured either as the duration from the start to the end of processing or as the sum of the time used by each thread). In this way, each time the system runs a received data processing request, it records the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of that request. Each recorded set of these three values may be used as a sample for obtaining the performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables.
Specifically, the distributed data processing system may track the number of received data processing requests, that is, the number of requests it has run. When this number reaches a preset number (referred to as the first preset number), the system may analyze the recorded actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests to determine the performance prediction model relating the running feature, the resource configuration parameter, and the service performance. The system may use linear regression to determine this model.
In addition, a performance prediction model baseline formula may be pre-stored in the distributed data processing system. When the number of received data processing requests reaches the first preset number, the system may train the preset baseline formula on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests. The baseline formula may contain the running feature, the resource configuration parameter, the service performance, and a number of parameters to be trained. During training, the recorded actual running feature values, preset resource configuration parameter values, and actual service performance values are used as the values of the running feature, the resource configuration parameter, and the service performance respectively, which yields the values of the parameters to be trained and thus the performance prediction model. The pre-stored baseline formula may have a preset functional form, for example a linear function, a curve, or a parabolic function, set in advance by a technician according to the functional relationship that the running feature, the resource configuration parameter, and the service performance are expected to satisfy.
The technician may also leave the functional form of the baseline unspecified, for example by using a neural network model, in which case the parameters to be trained are the parameters of the neural network. The system then uses the running feature values, the preset resource configuration parameter values, and the actually measured service performance values of the first preset number of data processing requests as the training inputs of the neural network model to obtain its parameters, which gives the performance prediction model. After obtaining the performance prediction model, the system may store it so that, when a data processing request is subsequently received, the resource configuration parameter value corresponding to the request can be determined.
This processing may specifically be implemented by a processor.
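For the linear-regression case, training amounts to a least-squares fit once enough samples have been recorded. The sketch below is a minimal pure-Python version under stated assumptions: each sample is a tuple of input values (running feature values and resource configuration parameter values) paired with a measured service performance value, the baseline formula is linear with a bias term, and all function names are hypothetical. It raises ZeroDivisionError if the samples do not determine the weights uniquely.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def train_performance_model(samples):
    """Least-squares fit of performance = w0 + w1*x1 + ...; returns [w0, w1, ...]."""
    rows = [[1.0, *inputs] for inputs, _ in samples]
    targets = [performance for _, performance in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # normal equations


def train_when_ready(samples, first_preset_number):
    """Train only once the first preset number of requests has been recorded."""
    if len(samples) < first_preset_number:
        return None
    return train_performance_model(samples)


def predict_performance(weights, inputs):
    """Evaluate the fitted linear model on one tuple of input values."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
```

A non-linear baseline (a parabola, or a neural network as mentioned above) would replace train_performance_model while leaving the sample collection and the threshold check unchanged.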
Optionally, after the performance prediction model is obtained, its accuracy may be verified, and the model is stored only when its accuracy reaches a preset accuracy threshold. Accordingly, the processing may be as follows: whenever a data processing request is received, run the data processing request based on the second preset resource configuration parameter value of the current request, and record the second actual running feature value and the second actual service performance value of the current request; determine a third service performance value of the current request according to the performance prediction model and the second actual running feature value; determine the accuracy of the performance prediction model for the current request according to the difference between the second actual service performance value and the determined third service performance value; when the number of data processing requests received after the performance prediction model was obtained reaches a second preset number, calculate the average accuracy of the performance prediction model from its accuracy for the second preset number of data processing requests; if the average accuracy reaches the preset accuracy threshold, store the performance prediction model.
In implementation, after the performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables is obtained, the distributed data processing system may operate in a model verification phase. That is, after the model is obtained, whenever a data processing request is received, the system may run the current request in the same way as in the model training phase, namely based on the second preset resource configuration parameter value of the resource configuration parameter for the current request. After the processing ends, the system may record the second actual running feature value of the current request (that is, the running feature value actually observed when running the request) and the second actual service performance value (that is, the actually measured service performance). It may then substitute the second actual running feature value as the value of the running feature, and the second preset resource configuration parameter value as the value of the resource configuration parameter, into the performance prediction model to calculate a third service performance value. The system may then calculate the difference between the measured second actual service performance value and the calculated third service performance value, and use that difference as the accuracy of the performance prediction model for the current data processing request.
When the number of data processing requests received after the performance prediction model was obtained reaches the second preset number, the system may calculate the average of the model's accuracy over the second preset number of data processing requests and compare this average accuracy with the preset accuracy threshold. If the average accuracy reaches the preset accuracy threshold, the performance prediction model may be stored. If the average accuracy does not reach the preset accuracy threshold, the system may continue to process received data processing requests in the same way as in the model training phase, recording the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of each request. It may then retrain on the actual running feature values, preset resource configuration parameter values, and actual service performance values of all data processing requests to obtain a new performance prediction model, and verify the accuracy of that model in the manner described above, repeating until the accuracy of the obtained performance prediction model reaches the preset accuracy threshold.
This processing may specifically be implemented by a processor.
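The verification phase can be sketched as a mean-error check. The source takes the difference between measured and predicted service performance as the accuracy and does not say which direction counts as "reaching" the threshold; the sketch assumes the absolute difference is used and that the model is accepted when the mean difference is at or below a maximum allowed error. Function and parameter names are hypothetical.

```python
def mean_prediction_error(predict, validation_samples):
    """Mean absolute difference between measured and predicted service performance.

    validation_samples is a list of (inputs, measured_performance) pairs, where
    inputs holds the running feature and resource configuration parameter values.
    """
    errors = [abs(measured - predict(inputs)) for inputs, measured in validation_samples]
    return sum(errors) / len(errors)


def should_store_model(predict, validation_samples, second_preset_number, max_mean_error):
    """Accept the model once enough requests were seen and the mean error is small enough."""
    if len(validation_samples) < second_preset_number:
        return False  # keep collecting verification requests
    return mean_prediction_error(predict, validation_samples) <= max_mean_error
```

If the check fails after enough requests, the description above calls for returning to the training phase with the enlarged set of samples.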
In addition, the performance prediction model described above, which takes the running features, resource configuration parameters, and service performance as variables, may also be obtained from historical data of each data processing stage of data processing requests run during a historical period. Specifically, before receiving the data processing request of the target service, the distributed data processing system may operate in a model training phase. That is, before receiving the data processing request of the target service, each time the distributed data processing system receives a data processing request, it may perform the processing of each data processing stage of that request based on the third preset resource configuration parameter value corresponding to the resource configuration parameters for the current request, and record, for each data processing stage of the current request, the actual running feature value, the third preset resource configuration parameter value, and the actual service performance value. When the number of received data processing requests reaches a third preset number, the preset performance prediction model baseline formula is trained on the actual running feature values, preset resource configuration parameter values, and actual service performance values of each data processing stage of the third preset number of data processing requests, yielding a performance prediction model with the running features, resource configuration parameters, and service performance as variables. The resulting performance prediction model may then be stored. In other words, during this training process, the actual running feature values, preset resource configuration parameter values, and actual service performance values of each data processing stage of the third preset number of data processing requests all serve as training data for the performance prediction model.
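The training phase described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the record layout (`StageRecord`), the `TrainingCollector` class, and the single-variable linear baseline formula `run_time = a * load + b` with `load = complexity * data_volume / num_processes` are all hypothetical, since the source does not specify the form of the baseline formula.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StageRecord:
    """One training sample for a single data processing stage (hypothetical layout)."""
    complexity: float     # actual running feature value
    data_volume: float    # amount of input data for the stage
    num_processes: int    # preset resource configuration parameter value
    run_time: float       # actual service performance value


def fit_baseline(records):
    """Fit run_time = a * load + b by ordinary least squares (assumed baseline form)."""
    xs = [r.complexity * r.data_volume / r.num_processes for r in records]
    ys = [r.run_time for r in records]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("training data has no variation in load")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    b = mean_y - a * mean_x
    return a, b


@dataclass
class TrainingCollector:
    """Accumulates per-request records and trains once a preset count is reached."""
    preset_number: int
    records: list = field(default_factory=list)
    requests_seen: int = 0
    model: Optional[tuple] = None

    def on_request_finished(self, stage_records):
        """Record the stages of one finished request; train when enough requests are seen."""
        self.records.extend(stage_records)
        self.requests_seen += 1
        if self.model is None and self.requests_seen >= self.preset_number:
            self.model = fit_baseline(self.records)  # kept as the stored model
        return self.model
```

In this sketch the collector returns `None` until the preset number of requests has been recorded, which mirrors the source's rule that training starts only once that count is reached.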
In addition, the running feature prediction model for each service in Table 1, which takes the data features and running features as variables, may also be obtained from the historical values of the data features and running features recorded when data processing requests of that service were run in the past.
In this embodiment of the present invention, a data processing request of a target service is received, where the to-be-processed data of the data processing request includes a plurality of key-value pairs; a running feature value of the data processing request is determined according to the identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data, where the running features include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and the resource configuration parameter value of each data processing stage is determined according to the running feature value. In this way, each time a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter values for that request, so a technician does not need to enter a value for each resource configuration parameter one by one before data processing is performed, which improves configuration efficiency.
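The overall flow summarized above (request, then running feature values, then per-stage resource configuration) can be sketched as follows. Everything named here is an assumption made for illustration: `key_skew` is one possible summary of the key distribution, `feature_models` stands in for the per-service running feature prediction model, `predict_time` stands in for the performance prediction model, and the number of processes is used as the only resource configuration parameter.

```python
from collections import Counter


def key_skew(keys):
    """Share of records held by the most frequent key (one possible distribution summary)."""
    counts = Counter(keys)
    return max(counts.values()) / len(keys)


def predict_running_features(service_id, data_volume, skew, feature_models):
    """Return per-stage (complexity, output/input ratio) from a per-service model (hypothetical)."""
    return feature_models[service_id](data_volume, skew)


def choose_num_processes(stage_features, data_volume, candidates, predict_time):
    """For each stage, pick the candidate process count with the lowest predicted run time."""
    chosen = []
    stage_input = data_volume
    for complexity, out_in_ratio in stage_features:
        best = min(candidates, key=lambda p: predict_time(complexity, stage_input, p))
        chosen.append(best)
        stage_input *= out_in_ratio  # the output of this stage is the input of the next
    return chosen
```

The output-to-input ratio matters because it determines how much data each later stage receives, so a stage that shrinks its input heavily lets the following stage run with fewer processes.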
FIG. 4 is a block diagram of an apparatus for performing data processing according to an embodiment of the present invention. The apparatus may be implemented as part or all of a device by software, hardware, or a combination of the two. The apparatus can implement the procedure shown in FIG. 3 of the embodiments of the present invention, and includes:
A receiving module 410, configured to receive a data processing request of a target service, where the to-be-processed data of the data processing request includes a plurality of key-value pairs. Specifically, it can implement the receiving function in step 301 above, as well as other implicit steps.
A determining module 420, configured to: determine a running feature value of the data processing request according to the identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data, where the running feature value includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and determine the resource configuration parameter value of each data processing stage according to the running feature value. Specifically, it can implement the determining functions in steps 302 and 303 above, as well as other implicit steps.
Optionally, the determining module 420 is configured to:
determine the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value, where the performance prediction model is obtained by training on historical data.
Optionally, as shown in FIG. 5, the apparatus further includes:
A running module 430, configured to run the data processing request according to the resource configuration parameter value after the resource configuration parameter value of each data processing stage is determined;
A statistics module 440, configured to collect the actual running feature value and the actual service performance value of each data processing stage;
An adjusting module 450, configured to adjust the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
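The run, collect, and adjust cycle handled by modules 430 to 450 can be sketched as below. The source does not say how the model is adjusted; this sketch assumes, purely for illustration, a one-variable least-squares line maintained from running sums, so that each finished request can be folded into the model without retraining from scratch. The names `OnlineLinearModel` and `adjust_after_run` and the `load` formula are hypothetical.

```python
class OnlineLinearModel:
    """Least-squares line y = a * x + b maintained from running sums (assumed model form)."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Fold one observed (load, actual performance) pair into the model."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        """Return (a, b) of the current least-squares line."""
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct load values")
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a, b

    def predict(self, x):
        a, b = self.coefficients()
        return a * x + b


def adjust_after_run(model, complexity, data_volume, num_processes, actual_time):
    """After a request has run, adjust the model with what was actually observed."""
    load = complexity * data_volume / num_processes  # hypothetical combined variable
    model.update(load, actual_time)
    return model
```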
Optionally, as shown in FIG. 6, the apparatus further includes:
A recording module 460, configured to: each time a data processing request is received, run the data processing request according to a first preset resource configuration parameter value of the current data processing request, and collect a first actual running feature value, the first preset resource configuration parameter value, and a first actual service performance value of the current data processing request;
A training module 470, configured to: when the number of received data processing requests reaches a first preset number, train a preset performance prediction model baseline formula based on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;
A storage module 480, configured to store the performance prediction model.
Optionally, as shown in FIG. 7, the apparatus further includes:
A verification module 490, configured to: each time a data processing request is received, run the data processing request based on a second preset resource configuration parameter value of the current data processing request, and collect a second actual running feature value and a second actual service performance value of the current data processing request;
determine a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value;
determine the accuracy of the performance prediction model according to the difference between the second actual service performance value and the determined third service performance value, to obtain the accuracy of the performance prediction model for the current data processing request;
when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate the average accuracy of the performance prediction model according to the accuracies of the performance prediction model for the second preset number of data processing requests;
The storage module 480 is configured to store the performance prediction model if the average accuracy reaches a preset accuracy threshold.
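The verification steps above can be sketched as follows. The source only says that accuracy is determined from the difference between the actual and the predicted service performance values; the specific formula used here (one minus the relative error, floored at zero) is an assumption, as are the function names and the `(features, actual)` sample layout.

```python
def request_accuracy(actual, predicted):
    """Accuracy for one request: 1 minus the relative error, floored at 0 (assumed formula)."""
    if actual <= 0:
        raise ValueError("actual performance value must be positive")
    return max(0.0, 1.0 - abs(actual - predicted) / actual)


def verify_model(samples, predict, preset_number, accuracy_threshold):
    """Return (average_accuracy, should_store) once preset_number requests exist, else None.

    samples: list of (running_feature_value, actual_performance_value) pairs,
             one per request received after the model was obtained.
    predict: callable mapping a running feature value to a predicted performance value.
    """
    if len(samples) < preset_number:
        return None  # not enough requests yet to judge the model
    window = samples[:preset_number]
    accuracies = [request_accuracy(actual, predict(features)) for features, actual in window]
    average = sum(accuracies) / len(accuracies)
    return average, average >= accuracy_threshold
```

If `should_store` is false, the description's fallback is to train again and repeat the verification until the threshold is met.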
It should be noted that the determining module 420, running module 430, statistics module 440, adjusting module 450, recording module 460, training module 470, verification module 490, and storage module 480 may be implemented by a processor, by a processor in cooperation with a memory, or by a processor executing program instructions stored in a memory, and the receiving module 410 may be implemented by a transceiver.
In this embodiment of the present invention, a data processing request of a target service is received, where the to-be-processed data of the data processing request includes a plurality of key-value pairs; a running feature value of the data processing request is determined according to the identifier of the target service, the data volume of the to-be-processed data, and the distribution of the keys of the to-be-processed data, where the running features include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and the resource configuration parameter value of each data processing stage is determined according to the running feature value. In this way, each time a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter values for that request, so a technician does not need to enter a value for each resource configuration parameter one by one before data processing is performed, which improves configuration efficiency.
It should be noted that, when the apparatus for performing data processing provided in the above embodiment performs data processing, the division into the functional modules described above is used only as an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the server may be divided into different functional modules to perform all or some of the functions described above. In addition, the apparatus for performing data processing provided in the above embodiment and the method embodiments for performing data processing are based on the same concept; for the specific implementation process, see the method embodiments. Details are not repeated here.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above is merely one embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (15)

  1. A method for performing data processing, wherein the method comprises:
    receiving a data processing request of a target service, wherein the to-be-processed data of the data processing request includes a plurality of key-value pairs;
    determining a running feature value of the data processing request according to an identifier of the target service, a data volume of the to-be-processed data, and a distribution of keys of the to-be-processed data, wherein the running feature value includes a complexity of each data processing stage of the data processing request and a ratio of data output to data input of each data processing stage; and
    determining a resource configuration parameter value of each data processing stage according to the running feature value.
  2. The method according to claim 1, wherein the determining a resource configuration parameter value of each data processing stage according to the running feature value comprises:
    determining the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value, wherein the performance prediction model is obtained by training on historical data.
  3. The method according to claim 2, wherein after the determining the resource configuration parameter value of each data processing stage, the method further comprises:
    running the data processing request according to the resource configuration parameter value;
    collecting an actual running feature value and an actual service performance value of each data processing stage; and
    adjusting the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
  4. The method according to claim 2, wherein the method further comprises:
    each time a data processing request is received, running the data processing request according to a first preset resource configuration parameter value of the current data processing request, and collecting a first actual running feature value, the first preset resource configuration parameter value, and a first actual service performance value of the current data processing request;
    when the number of received data processing requests reaches a first preset number, training a preset performance prediction model baseline formula based on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model; and
    storing the performance prediction model.
  5. The method according to claim 4, wherein the storing the performance prediction model comprises:
    each time a data processing request is received, running the data processing request based on a second preset resource configuration parameter value of the current data processing request, and collecting a second actual running feature value and a second actual service performance value of the current data processing request;
    determining a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value;
    determining an accuracy of the performance prediction model according to a difference between the second actual service performance value and the determined third service performance value, to obtain the accuracy of the performance prediction model for the current data processing request;
    when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculating an average accuracy of the performance prediction model according to the accuracies of the performance prediction model for the second preset number of data processing requests; and
    if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.
  6. A server, wherein the server comprises a transceiver and a processor, wherein:
    the transceiver is configured to receive a data processing request of a target service, wherein the to-be-processed data of the data processing request includes a plurality of key-value pairs; and
    the processor is configured to: determine a running feature value of the data processing request according to an identifier of the target service, a data volume of the to-be-processed data, and a distribution of keys of the to-be-processed data, wherein the running feature value includes a complexity of each data processing stage of the data processing request and a ratio of data output to data input of each data processing stage; and determine a resource configuration parameter value of each data processing stage according to the running feature value.
  7. The server according to claim 6, wherein the processor is configured to:
    determine the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value, wherein the performance prediction model is obtained by training on historical data.
  8. The server according to claim 7, wherein the processor is further configured to:
    after determining the resource configuration parameter value of each data processing stage, run the data processing request according to the resource configuration parameter value;
    collect an actual running feature value and an actual service performance value of each data processing stage; and
    adjust the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
  9. The server according to claim 7, wherein the processor is further configured to:
    each time a data processing request is received, run the data processing request according to a first preset resource configuration parameter value of the current data processing request, and collect a first actual running feature value, the first preset resource configuration parameter value, and a first actual service performance value of the current data processing request;
    when the number of received data processing requests reaches a first preset number, train a preset performance prediction model baseline formula based on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model; and
    store the performance prediction model.
  10. The server according to claim 9, wherein the processor is configured to:
    each time a data processing request is received, run the data processing request based on a second preset resource configuration parameter value of the current data processing request, and collect a second actual running feature value and a second actual service performance value of the current data processing request;
    determine a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value;
    determine an accuracy of the performance prediction model according to a difference between the second actual service performance value and the determined third service performance value, to obtain the accuracy of the performance prediction model for the current data processing request;
    when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate an average accuracy of the performance prediction model according to the accuracies of the performance prediction model for the second preset number of data processing requests; and
    if the average accuracy reaches a preset accuracy threshold, store the performance prediction model.
  11. An apparatus for performing data processing, wherein the apparatus comprises:
    a receiving module, configured to receive a data processing request of a target service, wherein the to-be-processed data of the data processing request includes a plurality of key-value pairs; and
    a determining module, configured to: determine a running feature value of the data processing request according to an identifier of the target service, a data volume of the to-be-processed data, and a distribution of keys of the to-be-processed data, wherein the running feature value includes a complexity of each data processing stage of the data processing request and a ratio of data output to data input of each data processing stage; and determine a resource configuration parameter value of each data processing stage according to the running feature value.
  12. The apparatus according to claim 11, wherein the determining module is configured to:
    determine the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value, wherein the performance prediction model is obtained by training on historical data.
  13. The apparatus according to claim 12, wherein the apparatus further comprises:
    a running module, configured to run the data processing request according to the resource configuration parameter value after the resource configuration parameter value of each data processing stage is determined;
    a statistics module, configured to collect an actual running feature value and an actual service performance value of each data processing stage; and
    an adjusting module, configured to adjust the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
  14. The apparatus according to claim 12, wherein the apparatus further comprises:
    a recording module, configured to: each time a data processing request is received, run the data processing request according to a first preset resource configuration parameter value of the current data processing request, and collect a first actual running feature value, the first preset resource configuration parameter value, and a first actual service performance value of the current data processing request;
    a training module, configured to: when the number of received data processing requests reaches a first preset number, train a preset performance prediction model baseline formula based on the actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model; and
    a storage module, configured to store the performance prediction model.
  15. The apparatus according to claim 14, wherein the apparatus further comprises:
    a verification module, configured to: each time a data processing request is received, run the data processing request based on a second preset resource configuration parameter value of the current data processing request, and collect a second actual running feature value and a second actual service performance value of the current data processing request;
    determine a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value;
    determine an accuracy of the performance prediction model according to a difference between the second actual service performance value and the determined third service performance value, to obtain the accuracy of the performance prediction model for the current data processing request; and
    when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate an average accuracy of the performance prediction model according to the accuracies of the performance prediction model for the second preset number of data processing requests;
    wherein the storage module is configured to store the performance prediction model if the average accuracy reaches a preset accuracy threshold.
PCT/CN2016/107948 2016-11-30 2016-11-30 Method and apparatus for performing data processing WO2018098670A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/107948 WO2018098670A1 (en) 2016-11-30 2016-11-30 Method and apparatus for performing data processing
CN201680031201.5A CN108463813B (en) 2016-11-30 2016-11-30 Method and device for processing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/107948 WO2018098670A1 (en) 2016-11-30 2016-11-30 Method and apparatus for performing data processing

Publications (1)

Publication Number Publication Date
WO2018098670A1 true WO2018098670A1 (en) 2018-06-07

Family

ID=62241012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/107948 WO2018098670A1 (en) 2016-11-30 2016-11-30 Method and apparatus for performing data processing

Country Status (2)

Country Link
CN (1) CN108463813B (en)
WO (1) WO2018098670A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104662530A (en) * 2012-10-30 2015-05-27 英特尔公司 Tuning for distributed data storage and processing systems
CN104750780A (en) * 2015-03-04 2015-07-01 北京航空航天大学 Hadoop configuration parameter optimization method based on statistic analysis
US20160048771A1 (en) * 2014-08-13 2016-02-18 Microsoft Corporation Distributed stage-wise parallel machine learning
CN106020719A (en) * 2016-05-13 2016-10-12 广东电网有限责任公司信息中心 Initial parameter configuration method of distributed storage system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222923A (en) * 2015-09-11 2019-09-10 福建师范大学 Dynamically configurable big data analysis system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104596A (en) * 2019-12-17 2020-05-05 腾讯科技(深圳)有限公司 Information processing method and device, electronic equipment and storage medium
CN111158798A (en) * 2019-12-27 2020-05-15 中国银行股份有限公司 Service data processing method and device
CN113452562A (en) * 2021-06-28 2021-09-28 中国建设银行股份有限公司 Configuration parameter calibration method and device
CN113452562B (en) * 2021-06-28 2022-07-12 中国建设银行股份有限公司 Configuration parameter calibration method and device

Also Published As

Publication number Publication date
CN108463813A (en) 2018-08-28
CN108463813B (en) 2020-12-04

Similar Documents

Publication Publication Date Title
WO2020093694A1 (en) Method for generating video analysis model, and video analysis system
CN110362612B (en) Abnormal data detection method and device executed by electronic equipment and electronic equipment
CN109120463B (en) Flow prediction method and device
CN112286644B (en) Elastic scheduling method, system, equipment and storage medium for GPU (Graphics Processing Unit) virtualization computing power
CN112751726B (en) Data processing method and device, electronic equipment and storage medium
CN112311611B (en) Data anomaly monitoring method and device and electronic equipment
CN111294812B (en) Resource capacity-expansion planning method and system
WO2018098670A1 (en) Method and apparatus for performing data processing
CN109672936B (en) Method and device for determining video evaluation set and electronic equipment
CN110572297A (en) Network performance evaluation method, server and storage medium
CN113746798B (en) Cloud network shared resource abnormal root cause positioning method based on multi-dimensional analysis
CN111311286A (en) Intelligent customer service data processing method and device, computing equipment and storage medium
WO2022142013A1 (en) Artificial intelligence-based ab testing method and apparatus, computer device and medium
CN111753875A (en) Power information system operation trend analysis method and device and storage medium
CN111680085A (en) Data processing task analysis method and device, electronic equipment and readable storage medium
WO2020233021A1 (en) Test result analysis method based on intelligent decision, and related apparatus
CN109992408B (en) Resource allocation method, device, electronic equipment and storage medium
CN110781950A (en) Message processing method and device
CN116185797A (en) Method, device and storage medium for predicting server resource saturation
CN110866831A (en) Asset activity level determination method and device and server
CN114595146A (en) AB test method, device, system, electronic equipment and medium
CN112800089B (en) Intermediate data storage level adjusting method, storage medium and computer equipment
CN113238911B (en) Alarm processing method and device
WO2021082939A1 (en) Virtual machine tuning method and apparatus
CN109086207B (en) Page response fault analysis method, computer readable storage medium and terminal device

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16922828; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16922828; Country of ref document: EP; Kind code of ref document: A1)