WO2018098670A1

WO2018098670A1 - Method and apparatus for performing data processing

Info

Publication number: WO2018098670A1
Application number: PCT/CN2016/107948
Authority: WO
Inventors: 常玉立; 王海彬; 程捷
Original assignee: 华为技术有限公司
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2018-06-07
Also published as: CN108463813A; CN108463813B

Abstract

Disclosed are a method and apparatus for performing data processing, which belong to the technical field of the Internet. The method comprises: receiving a data processing request of a target service, wherein data to be processed of the data processing request includes a plurality of key-value pairs; determining an operation feature value of the data processing request according to an identifier of the target service, a data amount of the data to be processed and a distribution condition of keys of the data to be processed, wherein the operation feature includes the complexity of each data processing stage of the data processing request, and a ratio of data output to data input of each data processing stage; and according to the operation feature value, determining a resource configuration parameter value of each data processing stage. In the present invention, resource configuration parameter values of data processing stages of the data processing request are automatically calculated, without needing any technical personnel to input, before the data processing request is operated, the resource configuration parameter values one by one for a plurality of resource configuration parameters, and thus configuration efficiency may be improved.

Description

Method and device for performing data processing

Technical field

The present invention relates to the field of Internet technologies, and in particular, to a method and apparatus for performing data processing.

Background art

In order to improve the speed of data processing, technicians often use distributed data processing systems to process data. A distributed data processing system can divide data into multiple blocks of a preset size and store them on multiple nodes (where the distributed data processing system is a server cluster comprising multiple servers, and each node may be one of the servers in the distributed data processing system), so that each node can process a portion of the data in parallel.

A distributed data processing system includes a plurality of resource configuration parameters that affect the efficiency with which the system processes data. The resource configuration parameters may include a data block size, a total number of processes for processing data, the number of processes for processing data on each node, and so on. Before the distributed data processing system performs the data processing corresponding to a certain service, a technician can configure, based on experience, a parameter value for each resource configuration parameter of that service. Then, whenever the distributed data processing system receives a data processing request corresponding to that service (for example, a request to sort data in the distributed data processing system, or to determine the maximum value of data in the distributed data processing system), the distributed data processing system may perform the processing corresponding to the data processing request based on the resource configuration parameter values set in advance.

In the process of implementing the present invention, the inventors have found that the prior art has at least the following problems:

Based on the above processing method, a technician needs to configure the resource configuration parameters before the distributed data processing system performs data processing, and the number of resource configuration parameters is often relatively large. In this case, the technician needs to input the resource configuration parameter values one by one, which leads to low configuration efficiency.

Summary of the invention

In order to achieve the purpose of improving configuration efficiency, an embodiment of the present invention provides a method and apparatus for performing data processing. The technical solution is as follows:

In a first aspect, a method of performing data processing is provided, the method comprising:

receiving a data processing request of a target service, wherein the data to be processed of the data processing request includes a plurality of key-value pairs; determining a running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, wherein the running feature value may include the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and determining a resource configuration parameter value of each data processing stage according to the running feature value.

In the solution shown in the embodiment of the present invention, the user may send a data processing request of the target service to the server through the console. After receiving the data processing request of the target service, the server may determine the running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, and may then determine the resource configuration parameter value of each data processing stage according to the determined running feature value. In this way, each time a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter value of each data processing stage of the data processing request, without a technician having to input the values of the plurality of resource configuration parameters one by one before the data processing is performed, thereby improving configuration efficiency.

In a possible implementation, determining the resource configuration parameter value of each data processing stage according to the running feature value includes: determining the resource configuration parameter value of each data processing stage according to a performance prediction model and the running feature value, where the performance prediction model is trained on historical data.

In the solution shown in the embodiment of the present invention, the performance prediction model may be pre-stored in the server. After receiving the data processing request of the target service, the server may determine the resource configuration parameter value of each data processing stage according to the pre-stored performance prediction model and the determined running feature value.

In a possible implementation, determining the running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the running feature includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage, includes: determining the data feature value of the data to be processed for each data processing stage of the data processing request; determining the target running feature prediction model corresponding to the target service according to a pre-stored correspondence between services and running feature prediction models that take the data feature and the running feature as variables; and, for each data processing stage, calculating, according to the target running feature prediction model, the value of the running feature when the value of the data feature is the data feature value of the data processing stage, to obtain the running feature value of the data processing stage. Determining the resource configuration parameter value of each data processing stage according to the running feature value includes: for each data processing stage, calculating, according to a pre-stored performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables, the value of the resource configuration parameter at which the service performance reaches its optimal value when the value of the running feature is the running feature value of the data processing stage, and using that value as the resource configuration parameter value of the data processing stage.

In the solution shown in the embodiment of the present invention, after receiving the data processing request of the target service, the server may determine the data feature value of each data processing stage of the data processing request, where the data feature value includes the data amount of the data to be processed and the distribution of the keys of the data to be processed. Further, for each data processing stage, the server may substitute the data feature value of the data processing stage, as the value of the data feature, into the target running feature prediction model corresponding to the target service and calculate the value of the running feature, thereby obtaining the running feature value of each data processing stage.

A performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables can be stored in the server. When receiving the data processing request of the target service, the server may determine, for each preset data processing stage, the running feature value of that stage of the data processing request, and may then determine the resource configuration parameter value of each data processing stage. The server can thus perform the processing of each data processing stage based on the resource configuration parameter value calculated for that stage, which can improve the processing efficiency of each data processing stage. The running feature value corresponding to each data processing stage may include the data amount of the data to be processed in that stage and the distribution of the keys of the data to be processed in that stage.
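The selection step described above can be sketched as a search over candidate resource configurations. This is a minimal sketch under stated assumptions, not the method of the embodiment: `toy_performance_model` is a hypothetical stand-in for a trained performance prediction model, service performance is assumed to be a predicted run time (lower is better), and the parameter names `block_size_mb` and `num_processes` are illustrative only.

```python
from itertools import product


def toy_performance_model(ratio, complexity, block_size_mb, num_processes):
    """Hypothetical stand-in for a trained model: predicted stage run time."""
    cpu_time = complexity * block_size_mb * 100.0 / num_processes
    overhead = 0.5 * num_processes + 64.0 / block_size_mb
    output_cost = ratio * block_size_mb * 0.01
    return cpu_time + overhead + output_cost


def best_configuration(model, run_features, search_space):
    """Return the candidate configuration with the lowest predicted run time.

    run_features: running feature values of one data processing stage.
    search_space: maps each resource configuration parameter to its candidates.
    """
    names = list(search_space)
    best_config, best_score = None, float("inf")
    # Exhaustively evaluate every combination of candidate parameter values.
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = model(**run_features, **config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    features = {"ratio": 0.25, "complexity": 0.005}
    space = {"block_size_mb": [32, 64, 128], "num_processes": [2, 4, 8]}
    print(best_configuration(toy_performance_model, features, space))
```

An exhaustive search is used here only because the candidate space is tiny; a real system could use any optimizer suited to its model.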

In a possible implementation, after determining the resource configuration parameter value of each data processing stage, the method further includes: running the data processing request according to the resource configuration parameter value; collecting the actual running feature value and the actual service performance value of each data processing stage; and adjusting the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.

After the resource configuration parameter value is determined, the data processing request may be run based on the resource configuration parameter value, and the actual running feature value and the actual service performance value of each data processing stage may be collected. Finally, the server may use the resource configuration parameter value, the actual running feature value, and the actual service performance value of each data processing stage as training data to retrain the performance prediction model. In this way, the accuracy of the performance prediction model can keep improving.

In a possible implementation, the method further includes: each time a data processing request is received, running the data processing request according to a first preset resource configuration parameter value of the current data processing request, and collecting the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request; when the number of received data processing requests reaches a first preset number, training a preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model; and storing the performance prediction model.

In the solution shown in the embodiment of the present invention, before sufficient samples are obtained, the server may, each time it receives a data processing request, perform the processing corresponding to the current data processing request based on the preset resource configuration parameter value, and record the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of the data processing request. When sufficient actual running feature values, preset resource configuration parameter values, and actual service performance values have been obtained, the server may train, based on these historical run records, a performance prediction model relating the running feature, the resource configuration parameter, and the service performance, and store it, so that when a subsequent data processing request is received, the server can automatically calculate the resource configuration parameter value of each data processing stage of the data processing request.

In a possible implementation, storing the performance prediction model includes: each time a data processing request is received, running the data processing request based on a second preset resource configuration parameter value of the current data processing request, and collecting a second actual running feature value and a second actual service performance value of the current data processing request; determining a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value; determining, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model under the current data processing request; when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculating the average accuracy of the performance prediction model according to its accuracy under the second preset number of data processing requests; and, if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.

In the solution shown in the embodiment of the present invention, after the server obtains the performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables, the server may verify the accuracy of the model. Once the accuracy reaches the preset accuracy threshold, the server uses the performance prediction model to automatically determine the resource configuration parameter value corresponding to a data processing request, and performs the processing corresponding to the data processing request based on the automatically determined resource configuration parameter value. In this way, the accuracy of the finally stored performance prediction model can be improved, and so can the efficiency of the processing corresponding to the data processing request.
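The verification step can be illustrated with a short sketch. The text only says that accuracy is derived from the difference between the actual and the predicted service performance values; the concrete metric below (one minus the relative error, floored at zero) and the function names are assumptions made for illustration.

```python
def prediction_accuracy(actual, predicted):
    """Assumed accuracy metric: 1 - relative error, never below 0."""
    if actual <= 0:
        raise ValueError("actual performance value must be positive")
    return max(0.0, 1.0 - abs(actual - predicted) / actual)


def should_store_model(observations, required_count, accuracy_threshold):
    """Decide whether the trained model is accurate enough to be stored.

    observations: (actual, predicted) service performance pairs, one per
    data processing request received after the model was trained.
    required_count: the second preset number of requests.
    """
    if len(observations) < required_count:
        return False  # not enough verification requests yet
    checked = observations[:required_count]
    average = sum(prediction_accuracy(a, p) for a, p in checked) / required_count
    return average >= accuracy_threshold


if __name__ == "__main__":
    runs = [(100.0, 90.0), (100.0, 110.0), (100.0, 100.0)]
    print(should_store_model(runs, required_count=3, accuracy_threshold=0.9))
```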

In a second aspect, a server is provided, the server comprising a processor, a memory, and a transceiver, wherein the processor is configured to execute instructions stored in the memory, and the processor implements the method for performing data processing provided by the first aspect by executing the instructions.

In a third aspect, there is provided an apparatus for performing data processing, the apparatus comprising at least one module for implementing the method for data processing provided by the first aspect above.

The technical effects obtained by the second to third aspects of the embodiments of the present invention are similar to those obtained by the corresponding technical means in the first aspect, and are not described herein again.

The beneficial effects brought by the technical solutions provided by the embodiments of the present invention are:

In the embodiment of the present invention, a data processing request of a target service is received, where the data to be processed of the data processing request includes a plurality of key-value pairs; a running feature value of the data processing request is determined according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the running feature includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and the resource configuration parameter value of each data processing stage is determined according to the running feature value. In this way, each time a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter values of the data processing request, without a technician having to input the values of the plurality of resource configuration parameters one by one before the data processing is performed, which can improve configuration efficiency.

Brief description of the drawings

FIG. 1 is a schematic diagram of a system framework according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a server according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for performing data processing according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an apparatus for performing data processing according to an embodiment of the present invention.

Detailed description

The embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

An embodiment of the present invention provides a method for performing data processing. The method may be implemented by a server, where the server may be a single server or a server cluster composed of multiple servers, that is, a distributed data processing system, as shown in FIG. 1. Preferably, the server is a distributed data processing system. When receiving a data processing request, the distributed data processing system may divide the data to be processed into multiple data blocks according to a preset data block size, and each server can process a part of the data in parallel.

The server may include a transceiver 210, a processor 220, and a memory 230, and the memory 230 and the transceiver 210 may each be coupled to the processor 220, as shown in the schematic structural diagram of the server. The transceiver 210 can be used to receive messages or data, and can include, but is not limited to, at least one amplifier, a tuner, one or more oscillators, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In the present invention, the transceiver 210 can be configured to receive a data processing request. The processor 220 may include one or more processing units. The processor 220 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device. Specifically, a program can include program code, and the program code includes computer operating instructions. The memory 230 may be used to store software programs and modules, and the processor 220 performs data processing by reading the software code and modules stored in the memory 230.

The processing flow shown in FIG. 3 is described in detail below with reference to specific embodiments, taking the case where the server is a distributed data processing system as an example. Other cases are similar and are not described again. The content may be as follows:

Step 301: The distributed data processing system receives a data processing request of the target service, where the data to be processed of the data processing request includes a plurality of key value pairs.

The target service may be any one of the services that the distributed data processing system can perform. For example, the target service may be determining the maximum value in the data to be processed. The data processing request of the target service may be a request to perform the target service processing on specific data to be processed, such as a request to sort the specified data to be processed.

In implementation, in order to improve the speed of processing data, technicians often use a distributed data processing system to process the data. Specifically, a technician can send a data processing request corresponding to a certain service (which may be referred to as the target service) to the distributed data processing system through the console, where the data processing request may specify the data to be processed (for example, the data processing request may carry a storage address of the data to be processed). In addition, the data to be processed corresponding to the data processing request may be a series of key-value pairs, and the data processing of the data to be processed is generally performed on the keys. After receiving the data processing request of the target service, the distributed data processing system may determine the data feature value of the data to be processed of the data processing request, where the data feature value may include the data amount of the data to be processed and the distribution of the keys of the data to be processed.

Specifically, the distributed data processing system can determine the distribution of the keys of the data to be processed in the following manner. A data distribution model may be preset in the distributed data processing system. After determining the data to be processed, the system can determine the distribution of the keys of the data to be processed by computing statistics over all the keys. When the preset data distribution model is a uniform distribution, the system can calculate the mean of the keys and use the mean as the distribution of the keys of the data to be processed. When the preset data distribution model is a normal distribution, the system can calculate the mean and standard deviation of the keys and use them as the distribution of the keys of the data to be processed. In addition, in order to reduce the amount of calculation, the distributed data processing system can sample the data to be processed at a preset sampling rate, and determine the distribution of the keys of the data to be processed of the data processing request from the distribution of the keys of the sampled data.
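The key-distribution step can be sketched as follows. This is an illustrative sketch only: the function name `key_distribution`, the model labels `"uniform"` and `"normal"`, the use of the population standard deviation, and the fixed random seed are assumptions that do not come from the text.

```python
import random
import statistics


def key_distribution(keys, model="uniform", sampling_rate=1.0, seed=0):
    """Summarise the distribution of numeric keys under a preset model.

    "uniform" returns (mean,); "normal" returns (mean, standard deviation).
    A sampling_rate below 1.0 estimates the statistics from a random sample.
    """
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    keys = list(keys)
    if sampling_rate < 1.0:
        rng = random.Random(seed)
        # Keep at least two keys so a standard deviation is meaningful.
        sample_size = max(2, int(len(keys) * sampling_rate))
        keys = rng.sample(keys, min(sample_size, len(keys)))
    mean = statistics.fmean(keys)
    if model == "uniform":
        return (mean,)
    if model == "normal":
        return (mean, statistics.pstdev(keys))
    raise ValueError(f"unknown distribution model: {model}")


if __name__ == "__main__":
    print(key_distribution([2, 4, 4, 4, 5, 5, 7, 9], model="normal"))
    print(key_distribution(range(1000), model="uniform", sampling_rate=0.1))
```

Sampling trades accuracy of the estimate for less computation, which is the motivation given in the text.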

This processing step can be specifically implemented by a transceiver.

Step 302: The distributed data processing system determines the running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the running feature value includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage.

In implementation, after receiving the data processing request, the distributed data processing system may determine the identifier of the target service of the data processing request, and then determine the running feature value of the data processing request according to the determined identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed.

Specifically, the running feature value may be the ratio of the data amount of the output data produced by the processing corresponding to the data processing request to the data amount of the input data (that is, the data to be processed) of the data processing request (where the ratio may be denoted by ratio, and the ratio corresponding to the data processing request is the value of ratio). It may also be the complexity of the processing corresponding to the data processing request (where the complexity may be denoted by complexity, and the complexity of the processing corresponding to the data processing request is the value of complexity). The complexity can be used to represent the computational complexity of that processing, that is, the computational (or algorithmic) complexity of the business processing logic of the service corresponding to the data processing request.

The complexity corresponding to the data processing request may be the ratio of the CPU (Central Processing Unit) effective time t used by the distributed data processing system to perform the processing corresponding to the data processing request to the data amount size of the data to be processed of the data processing request (that is, complexity=t/size). In addition, when performing the processing corresponding to a data processing request, the distributed data processing system may divide the data to be processed into a plurality of data blocks of a preset data block size (where the default data block size may be 64 MB (megabytes)), and may store the data blocks on the servers. The embodiment of the present invention does not limit the manner in which the data blocks are allocated; specifically, the distributed data processing system may distribute the data blocks evenly across the servers for storage. Each server can then perform the corresponding service processing on the data blocks it stores, in parallel with the other servers (for example, sorting the data in its stored data blocks, or obtaining the maximum value of the locally stored data), and each server can start multiple threads to process its data blocks at the same time. In this case, the complexity may also be the ratio of the cumulative total CPU effective time T used by all threads on all servers to the data amount size of the data to be processed corresponding to the data processing request (that is, complexity=T/size). For example, the processing corresponding to the data processing request is divided into N parallel processes or threads, the time used by each process or thread is denoted by t_i, and the data amount of the data to be processed corresponding to the data processing request is denoted by size (where size is the sum of the data amounts of the data to be processed by each process). The formula for complexity can then be as shown in formula (1).

In addition, considering the main frequency of the server's central processing unit, the complexity can be further expressed by the formula (2).

where t_i represents the CPU effective time used by the i-th process or thread, and cpu_frequency_i represents the main frequency of the central processing unit of the server on which the i-th process or thread runs.
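The images for formulas (1) and (2) are not reproduced in this text. A plausible reconstruction from the surrounding definitions is given below as an assumption, not as the published formulas. Formula (1) follows from complexity=T/size with T taken as the sum of the per-process times. The frequency-weighted product in formula (2) is one natural reading of "considering the main frequency" and may differ from the original.

```latex
\begin{align}
\mathrm{complexity} &= \frac{\sum_{i=1}^{N} t_i}{\mathrm{size}} \tag{1} \\
\mathrm{complexity} &= \frac{\sum_{i=1}^{N} t_i \cdot \mathrm{cpu\_frequency}_i}{\mathrm{size}} \tag{2}
\end{align}
```

Under this reading, formula (2) expresses complexity in CPU cycles per unit of data rather than seconds per unit of data, which makes values comparable across servers with different clock speeds.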

Specifically, the distributed data processing system may preset the data processing stages for performing the processing corresponding to a data processing request, where the data processing stages corresponding to the same service are fixed, and the processing corresponding to the data processing request is complete once the processing of every data processing stage has been performed. The output data of one data processing stage is the data to be processed (that is, the input data) of the next data processing stage. For example, when the processing corresponding to a data processing request includes two data processing stages, the data to be processed specified by the data processing request is the input data of the first data processing stage, and the output data of the first data processing stage is the input data of the second data processing stage. For example, in Hadoop, an application capable of distributed data processing, the data processing stages for performing the processing corresponding to a data processing request are the two stages map and reduce. Whenever the distributed data processing system receives a data processing request, it may first perform the processing of the map data processing stage on the data to be processed specified by the data processing request, and then perform the processing of the reduce data processing stage on the output data of the map data processing stage. For example, suppose the processing corresponding to the data processing request currently received by the distributed data processing system is to determine the maximum value in the data to be processed. After receiving the data processing request, in the map data processing stage, the distributed data processing system can start multiple map threads at the same time, each processing a part of the data to be processed to obtain the maximum value of that part. The reduce data processing stage then determines the maximum value among the outputs of the map threads, which is the final maximum value in the data to be processed.
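The two-stage maximum-value example can be mimicked in a few lines. This is an illustrative sketch using a thread pool in place of an actual Hadoop job; the function names are hypothetical, and `block_size` counts records rather than bytes for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Divide the data to be processed into blocks of block_size records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_stage(block):
    """Map stage: emit the maximum key found in one block."""
    return max(key for key, _value in block)


def reduce_stage(partial_maxima):
    """Reduce stage: combine the per-block maxima into the final maximum."""
    return max(partial_maxima)


def find_max_key(records, block_size=4, workers=4):
    blocks = split_into_blocks(records, block_size)
    # Each block is handled by its own map task, run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_maxima = list(pool.map(map_stage, blocks))
    # The output of the map stage is the input of the reduce stage.
    return reduce_stage(partial_maxima)


if __name__ == "__main__":
    data = [(k, f"value-{k}") for k in [3, 17, 5, 42, 8, 1, 29, 11, 7]]
    print(find_max_key(data, block_size=3))
```

The sketch assumes a non-empty input; `max` raises `ValueError` on an empty sequence.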

In this case, after receiving the data processing request of the target service, the distributed data processing system determines the running feature value corresponding to each data processing stage. Where the running features include ratio and complexity, the distributed data processing system may, after receiving the data processing request, separately determine the value of ratio and the value of complexity for each data processing stage. Specifically, the running feature value of each data processing stage may be the ratio of the data amount of the output data of the data processing stage to the data amount of its input data (that is, its data to be processed), and the complexity of the data processing stage. The complexity corresponding to each data processing stage may be the ratio of the total CPU effective time used by the distributed data processing system to perform the processing of the data processing stage to the data amount of the data to be processed of the data processing stage. The complexity of each data processing stage may also be calculated according to formula (3), where x in formula (3) represents a certain data processing stage, t_xi represents the time used by the i-th process of data processing stage x to process data, cpu_frequency_i represents the main frequency of the central processing unit of the server on which the i-th process or thread runs, and size_x represents the data amount of the data to be processed in data processing stage x.
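The per-stage running features can be computed from simple run statistics, as in the sketch below. It uses the unweighted definition from the text (total CPU effective time divided by the stage's input data amount); the `StageStats` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageStats:
    """Measured statistics of one data processing stage (hypothetical shape)."""
    name: str
    input_bytes: int          # data amount of the stage's data to be processed
    output_bytes: int         # data amount of the stage's output data
    cpu_seconds: List[float]  # CPU effective time of each process or thread


def run_features(stage):
    """Return the running feature values (ratio, complexity) of one stage."""
    ratio = stage.output_bytes / stage.input_bytes
    complexity = sum(stage.cpu_seconds) / stage.input_bytes
    return {"ratio": ratio, "complexity": complexity}


if __name__ == "__main__":
    map_stats = StageStats("map", 1000, 250, [2.0, 3.0])
    # The output of the map stage is the input of the reduce stage.
    reduce_stats = StageStats("reduce", map_stats.output_bytes, 50, [0.5])
    for stats in (map_stats, reduce_stats):
        print(stats.name, run_features(stats))
```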

In addition, the running feature value of each data processing stage may be predicted from the data feature value of the data to be processed in that stage. Accordingly, the processing may be as follows: determining the data feature value of the data to be processed in each data processing stage of the data processing request; determining the target running feature prediction model corresponding to the target service according to a pre-stored correspondence between services and running feature prediction models that take the data feature and the running feature as variables; and, for each data processing stage, calculating according to the target running feature prediction model the value of the running feature when the data feature takes the data feature value corresponding to that stage, thereby obtaining the running feature value of that data processing stage.

The data feature may be a variable for characterizing data feature information of the data to be processed, and the running feature may be a variable for characterizing the running feature information corresponding to the data processing request.

In implementation, a data distribution model may be pre-stored in the distributed data processing system, where the data distribution model may be a uniform distribution, a normal distribution, or another data distribution commonly used in mathematics. Specifically, the distributed data processing system may pre-store the formulas for calculating the parameters of the data distribution model. For example, it may pre-store the formula for calculating the mean of a uniform distribution, or the formulas for calculating the mean and standard deviation of a normal distribution.

When the distributed data processing system receives the data processing request of the target service, it may determine the data to be processed in each data processing stage of the data processing request and, from that, the data feature information (that is, the data feature value) of the data to be processed. Specifically, the data feature value of each data processing stage may be the data amount of the data to be processed in that stage and/or the distribution of its keys (the distribution of the keys may be a value corresponding to the data distribution, keydist).
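As a rough illustration of determining data feature values, the sketch below computes a data amount and a key distribution for a list of key-value pairs. It assumes numeric keys, a normal distribution model described by its mean and standard deviation, and a data amount measured as the number of pairs; none of these choices is fixed by the text, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def data_feature_values(pairs):
    """Return (size, keydist) for a non-empty list of (key, value) pairs."""
    keys = [key for key, _ in pairs]
    # Assumption: the data amount is the number of key-value pairs.
    size = len(pairs)
    # Assumption: keydist is the (mean, standard deviation) of the keys,
    # i.e. the parameters of a normal distribution model.
    keydist = (mean(keys), pstdev(keys))
    return size, keydist
```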

The distributed data processing system may pre-store the correspondence between service identifiers and running feature prediction models that take the data feature and the running feature as variables, as shown in Table 1. The service identifier may be a service name, the program name or class name of the software program corresponding to the service, or any other identifier that distinguishes the target service from other services. The service identifier may be named and maintained manually, or may be extracted from the data processing request of the service according to a specific rule. The running feature prediction model may be a function with the data feature and the running feature as variables: the data feature is the independent variable and the running feature is the dependent variable. That is, the running feature prediction model may be running feature = f(data feature), where f() represents the functional relationship between the data feature and the running feature. Specifically, when the data feature includes the data amount and the data distribution, and the running feature includes the complexity and the ratio, the running feature prediction model may be (complexity, ratio) = f(data amount, data distribution), that is, (complexity, ratio) = f(size, keydist).
After the distributed data processing system determines the data feature value of each data processing stage, it may determine the target running feature prediction model corresponding to the target service from the correspondence shown in Table 1. Then, for each data processing stage, the distributed data processing system may take the data feature value of that stage as the value of the data feature, substitute it into the target running feature prediction model, and calculate the value of the running feature, thereby obtaining the running feature value of each data processing stage.

Table 1

Service identifier	Running feature prediction model
Service identifier 1	Prediction model 1
Service identifier 2	Prediction model 2
Service identifier 3	Prediction model 3
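The lookup in Table 1 followed by evaluation of (complexity, ratio) = f(size, keydist) can be sketched as below. The service identifiers, the model functions, and all coefficients are hypothetical placeholders chosen for illustration, and keydist is simplified to a single number.

```python
# Hypothetical correspondence between service identifiers and running
# feature prediction models; each model maps (size, keydist) to
# (complexity, ratio). The coefficients are illustrative only.
RUNNING_FEATURE_MODELS = {
    "service_1": lambda size, keydist: (1.0 + 0.5 * keydist, 0.1),
    "service_2": lambda size, keydist: (0.001 * size, 0.5),
}


def predict_running_features(service_id, size, keydist):
    """Look up the target model by service identifier and evaluate it."""
    model = RUNNING_FEATURE_MODELS[service_id]
    return model(size, keydist)
```

An unknown service identifier raises KeyError in this sketch; a real system would need some policy for services without a stored model.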

The process can be specifically implemented by a processor.

Step 303: The distributed data processing system determines a resource configuration parameter value of each data processing stage according to the running feature value.

In implementation, after determining the running feature value of each data processing stage, the distributed data processing system may determine the resource configuration parameter value of each data processing stage based on that running feature value. In addition, the distributed data processing system may also take the data amount of the data to be processed into account, that is, determine the resource configuration parameter value of each data processing stage according to both the running feature value and the data amount of the data to be processed.

Optionally, the distributed data processing system may determine resource configuration parameter values for each data processing stage according to the performance prediction model and the running feature values of each data processing stage, wherein the performance prediction model is trained by historical data.

Specifically, for each data processing stage, the distributed data processing system uses a pre-stored performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables. With the running feature set to the running feature value of that data processing stage, it calculates the value of the resource configuration parameter at which the service performance reaches its optimal value, and thereby obtains the resource configuration parameter value of that data processing stage.

The resource configuration parameter may be a configuration parameter of the distributed data processing system that affects the efficiency with which the system performs the processing corresponding to the data processing request. It may include, for example, the data block size, the total number of processes that process data, and the number of processes with which each node processes data. The service performance may be a parameter that characterizes the efficiency with which the distributed data processing system performs the processing corresponding to the data processing request, for example the time used to perform that processing.

In implementation, the distributed data processing system may pre-store a performance prediction model that takes the running feature, the resource configuration parameter, and the service performance as variables. The performance prediction model has no direct correspondence with any particular service, and it may be a function of the running feature, the resource configuration parameter, and the service performance. After the distributed data processing system determines the running feature value of each data processing stage, it may, for each data processing stage, substitute the running feature value of that stage for the running feature in the performance prediction model. The performance prediction model then contains only the service performance and the resource configuration parameter as variables. The distributed data processing system can then calculate the value of the resource configuration parameter at which the service performance reaches its optimal value (for example, its minimum), which gives the resource configuration parameter value of that data processing stage.
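One simple way to find the resource configuration parameter value at which the predicted service performance is optimal is to evaluate the model over a set of candidate values and keep the best. The sketch below does this with an exhaustive search; the model predicted_time, its coefficients, and the candidate lists are hypothetical, and the text does not prescribe any particular optimization method.

```python
from itertools import product


def predicted_time(complexity, ratio, block_size, num_processes):
    """Hypothetical performance prediction model (predicted running time).

    Work shrinks as processes are added, while coordination overhead grows;
    very small blocks add a fixed per-block cost.
    """
    work = complexity * (1 + ratio) * 1000.0 / num_processes
    overhead = 2.0 * num_processes + 64.0 / block_size
    return work + overhead


def best_config(complexity, ratio, block_sizes, process_counts):
    """Fix the running feature values, then minimise over the candidates."""
    return min(
        product(block_sizes, process_counts),
        key=lambda cfg: predicted_time(complexity, ratio, cfg[0], cfg[1]),
    )
```

With the running feature values fixed, only the resource configuration parameters and the service performance remain as variables, which is the situation described above.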

The process can be specifically implemented by a processor.

Optionally, after the resource configuration parameter value of each data processing stage has been determined, the processing may continue as follows: running the data processing request according to the resource configuration parameter values, and collecting the actual running feature value and the actual service performance value of each data processing stage; and adjusting the performance prediction model according to the resource configuration parameter values, the actual running feature values, and the actual service performance values.

In implementation, after the distributed data processing system determines the resource configuration parameter value corresponding to each data processing stage, it may perform the processing of each data processing stage based on the resource configuration parameter value of that stage. Where the running feature value is determined from the data feature value of each data processing stage, the specific processing may be as follows. When the distributed data processing system receives the data processing request of the target service, it may first determine the data feature value of the data to be processed in the first data processing stage (this is the data feature value of the data to be processed specified by the data processing request). It may then determine the running feature value of that data processing stage for the target service and determine the resource configuration parameter value of that stage in the manner described above. The distributed data processing system then performs the processing of that data processing stage based on its resource configuration parameter value. Next, in the same manner, the distributed data processing system determines the resource configuration parameter value of the next data processing stage and performs the processing of that stage based on the determined value, until the processing of all data processing stages is complete.
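The stage-by-stage procedure can be sketched as a loop in which the configuration of each stage is chosen from the data that the stage is about to process. The function run_request and the callback choose_config are hypothetical names; choose_config stands for the whole chain of determining data feature values, running feature values, and resource configuration parameter values.

```python
def run_request(data, stages, choose_config):
    """Run each data processing stage in order with a per-stage configuration.

    stages is a list of (name, function) pairs; each function takes the
    stage's input data and its configuration and returns its output data.
    """
    used_configs = []
    for name, stage in stages:
        # The configuration is chosen from the data this stage will process.
        config = choose_config(name, data)
        used_configs.append((name, config))
        # The output of this stage becomes the input of the next stage.
        data = stage(data, config)
    return data, used_configs
```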

After running the data processing request, the distributed data processing system may collect the actual running feature value and the actual service performance value of each data processing stage. The resource configuration parameter value, the actual running feature value, and the actual service performance value of each data processing stage of the data processing request may then be used as training data to retrain the performance prediction model.

The process can be specifically implemented by a processor.

Optionally, the training process of the above performance prediction model, which takes the running feature, the resource configuration parameter, and the service performance as variables, may be as follows: each time a data processing request is received, running the data processing request according to the first preset resource configuration parameter value of the current data processing request, and collecting the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request; when the number of received data processing requests reaches a first preset number, training a preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model; and storing the performance prediction model.

In implementation, before receiving the data processing request of the target service, the distributed data processing system may operate in a model training stage. That is, each time it receives a data processing request during this stage, the distributed data processing system may perform the processing of the data processing request based on the first preset resource configuration parameter value of the current data processing request. The first preset resource configuration parameter value may be a default resource configuration parameter value, or a value configured by the user for the current data processing request. The data processing request may carry the first preset resource configuration parameter value, or a technician may configure the value before the data processing request is run. The preset resource configuration parameter values corresponding to different data processing requests may be the same or different, as determined by the user's settings. After performing the processing corresponding to the current data processing request, the distributed data processing system may record the first actual running feature value of the current data processing request (that is, the running feature value actually observed when running the current data processing request), the first preset resource configuration parameter value, and the first actual service performance value. The first actual service performance value may be the time actually used by the distributed data processing system to run the current data processing request (that is, the duration from the start of processing to its end), or the cumulative total of the time used by each thread.
In this way, each time the distributed data processing system runs a received data processing request, it records the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of that request. Each recorded group of actual running feature value, preset resource configuration parameter value, and actual service performance value can be used as a sample, from which the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables is obtained.

Specifically, the distributed data processing system may count the number of data processing requests it has received, that is, the number of data processing requests it has processed. When the number of received data processing requests reaches the first preset number, the distributed data processing system may analyze the recorded actual running feature values, preset resource configuration parameter values, and actual service performance values of the first preset number of data processing requests to determine a performance prediction model relating the running feature, the resource configuration parameter, and the service performance. The distributed data processing system may use linear regression to determine this performance prediction model. Alternatively, a performance prediction model reference formula may be pre-stored in the distributed data processing system. When the number of received data processing requests reaches the first preset number, the preset performance prediction model reference formula may be trained on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests. The performance prediction model reference formula may include the running feature, the resource configuration parameter, the service performance, and some parameters to be trained. During training, the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests are used as the values of the running feature, the resource configuration parameter, and the service performance, respectively.
This yields the values of the parameters to be trained, and hence a performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables. The performance prediction model reference formula pre-stored by the distributed data processing system may have a preset function form, for example a linear function, a curve, or a parabolic function, which the technician presets according to the functional relationship that the running feature, the resource configuration parameter, and the service performance are likely to satisfy. Alternatively, the technician may leave the function form of the reference formula unspecified and use, for example, a neural network model, in which case the parameters to be trained are the parameters of the neural network model. That is, the distributed data processing system feeds the running feature values, the preset resource configuration parameter values, and the actually measured service performance values of the first preset number of data processing requests to the neural network model as training data and obtains the parameters of the neural network model, which gives the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables. After obtaining the performance prediction model, the distributed data processing system may store it, so that when a data processing request is subsequently received, the resource configuration parameter value corresponding to that request can be determined.
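For the linear case, training the reference formula amounts to an ordinary least-squares fit. The sketch below assumes a reference formula of the form performance = w0 + w1*x1 + ... + wn*xn, where each x is a running feature value or a resource configuration parameter value and the w are the parameters to be trained. It solves the normal equations in plain Python; the function names are hypothetical and this is one possible realization of linear regression, not the specific procedure of the embodiment.

```python
def fit_linear(samples, targets):
    """Least-squares fit of targets ~ w0 + w1*x1 + ... + wn*xn.

    samples: list of feature tuples (running feature values and resource
    configuration parameter values); targets: measured service performance.
    """
    rows = [[1.0] + list(x) for x in samples]
    n = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique model")
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(n):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * p for v, p in zip(a[k], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, x):
    """Evaluate the trained model for one feature tuple."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))
```

At least as many linearly independent samples as parameters are needed; otherwise the fit raises ValueError, which corresponds to waiting until the preset number of requests has been collected.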

The process can be specifically implemented by a processor.

Optionally, after the performance prediction model is obtained, its accuracy may be verified, and the performance prediction model may be stored when its accuracy reaches a preset accuracy threshold. Correspondingly, the processing may be as follows: each time a data processing request is received, running the data processing request based on the second preset resource configuration parameter value of the current data processing request, and collecting the second actual running feature value and the second actual service performance value of the current data processing request; determining a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value; determining the accuracy of the performance prediction model according to the difference between the second actual service performance value and the determined third service performance value, to obtain the accuracy of the performance prediction model under the current data processing request; when the number of data processing requests received after the performance prediction model was obtained reaches a second preset number, calculating the average accuracy of the performance prediction model according to its accuracy under the second preset number of data processing requests; and storing the performance prediction model if the average accuracy reaches the preset accuracy threshold.

In implementation, after the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables is obtained, the distributed data processing system may operate in a model verification stage. That is, after obtaining the performance prediction model, each time it receives a data processing request, the distributed data processing system may run the current data processing request in the same way as in the model training stage, namely based on the second preset resource configuration parameter value of the current data processing request. After the processing ends, the distributed data processing system may record the second actual running feature value of the current data processing request (that is, the running feature value actually observed when running the current data processing request) and the second actual service performance value (that is, the service performance actually measured). It then takes the second actual running feature value as the value of the running feature and the second preset resource configuration parameter value as the value of the resource configuration parameter, substitutes them into the performance prediction model, and calculates the third service performance value. The distributed data processing system may then calculate the difference between the actually measured second actual service performance value and the calculated third service performance value.
This difference can be used as the accuracy of the performance prediction model under the current data processing request.

When the number of data processing requests received after the performance prediction model was obtained reaches the second preset number, the average of the accuracies of the performance prediction model under the second preset number of data processing requests may be calculated. After the average accuracy is obtained, it may be compared with the preset accuracy threshold. If the average accuracy reaches the preset accuracy threshold, the performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables may be stored. If the average accuracy does not reach the preset accuracy threshold, subsequently received data processing requests may be processed in the same way as in the model training stage, and the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of each such request may be recorded. The performance prediction model is then retrained on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of all the data processing requests, and its accuracy is verified again in the manner described above, until the accuracy of the resulting performance prediction model reaches the preset accuracy threshold.
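The verification decision can be sketched as follows. The text derives the per-request accuracy from the difference between the measured and the predicted service performance values; the sketch assumes one particular normalization, accuracy = 1 - |actual - predicted| / actual, so that a higher value is better and "reaching the threshold" means being at least as large as it. That normalization and the function names are assumptions, not part of the embodiment.

```python
def request_accuracy(actual, predicted):
    """Accuracy of the model for one request (assumed relative-error form)."""
    return 1.0 - abs(actual - predicted) / actual


def verify_model(pairs, threshold):
    """Average the per-request accuracies and compare with the threshold.

    pairs: list of (actual service performance, predicted service performance).
    Returns (store_model, average_accuracy); store_model is False when the
    model should be retrained instead of stored.
    """
    accuracies = [request_accuracy(a, p) for a, p in pairs]
    average = sum(accuracies) / len(accuracies)
    return average >= threshold, average
```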

The process can be specifically implemented by a processor.

In addition, the above performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables may also be obtained from the historical data of each data processing stage of the data processing requests run in a historical time period. Specifically, the distributed data processing system may operate in the model training stage before receiving the data processing request corresponding to the target service. That is, each time it receives a data processing request during this stage, the distributed data processing system may perform the processing of each data processing stage of the data processing request based on the third preset resource configuration parameter value of the current data processing request, and record, for each data processing stage of the current data processing request, the actual running feature value, the third preset resource configuration parameter value, and the actual service performance value. When the number of received data processing requests reaches a third preset number, the preset performance prediction model reference formula is trained on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values corresponding to each data processing stage of the third preset number of data processing requests, to obtain a performance prediction model with the running feature, the resource configuration parameter, and the service performance as variables. The resulting performance prediction model may then be stored.
That is, during training, the actual running feature value, the preset resource configuration parameter value, and the actual service performance value of each data processing stage of the third preset number of data processing requests are used as the training data for the performance prediction model.

In addition, for each service in Table 1, the running feature prediction model with the data feature and the running feature as variables may also be obtained from the historical values of the data feature and the running feature recorded when data processing requests of that service were run in the past.

FIG. 4 is a block diagram of an apparatus for performing data processing according to an embodiment of the present invention. The apparatus may be implemented as part or all of a device by software, hardware, or a combination of both. The apparatus for performing data processing according to the embodiment of the present invention may implement the process described in FIG. 3 of the embodiment of the present invention, and includes:

The receiving module 410 is configured to receive a data processing request of the target service, where the data to be processed of the data processing request includes a plurality of key-value pairs. Specifically, it may implement the receiving function in the foregoing step 301, and other implicit steps.

The determining module 420 is configured to determine a running feature value of the data processing request according to an identifier of the target service, a data amount of the data to be processed, and a distribution of keys of the data to be processed, where the running feature value includes a complexity of each data processing stage of the data processing request and a ratio of data output to data input of each data processing stage; and to determine a resource configuration parameter value of each data processing stage according to the running feature value. Specifically, it may implement the determining function in the foregoing steps 302 and 303, and other implicit steps.

Optionally, the determining module 420 is configured to:

determine a resource configuration parameter value of each data processing stage according to the performance prediction model and the running feature value, where the performance prediction model is trained on historical data.

Optionally, as shown in FIG. 5, the device further includes:

The running module 430 is configured to run the data processing request according to the resource configuration parameter value after the resource configuration parameter value of each data processing stage is determined;

The statistics module 440 is configured to collect the actual running feature value and the actual service performance value of each data processing stage;

The adjusting module 450 is configured to adjust the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.

Optionally, as shown in FIG. 6, the apparatus further includes:

The recording module 460 is configured to: each time a data processing request is received, run the data processing request according to the first preset resource configuration parameter value of the current data processing request, and collect the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request;

The training module 470 is configured to: when the number of received data processing requests reaches a first preset number, train the preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

The storage module 480 is configured to store the performance prediction model.

Optionally, as shown in FIG. 7, the apparatus further includes:

The verification module 490 is configured to: each time a data processing request is received, run the data processing request based on the second preset resource configuration parameter value of the current data processing request, and collect the second actual running feature value and the second actual service performance value of the current data processing request;

determine a third service performance value of the current data processing request according to the performance prediction model and the second actual running feature value;

determine the accuracy of the performance prediction model according to the difference between the second actual service performance value and the determined third service performance value, to obtain the accuracy of the performance prediction model under the current data processing request; and

when the number of data processing requests received after the performance prediction model was obtained reaches a second preset number, calculate the average accuracy of the performance prediction model according to its accuracy under the second preset number of data processing requests;

The storage module 480 is configured to store the performance prediction model if the average accuracy reaches a preset accuracy threshold.

It should be noted that the foregoing determining module 420, running module 430, statistics module 440, adjusting module 450, recording module 460, training module 470, verification module 490, and storage module 480 may be implemented by a processor, by a processor in combination with a memory, or by a processor executing program instructions stored in a memory, and the receiving module 410 may be implemented by a transceiver.

It should be noted that, when the apparatus for performing data processing in the foregoing embodiment performs data processing, the division into the foregoing functional modules is used only as an example. In an actual application, the foregoing functions may be allocated to different functional modules as needed; that is, the internal structure of the server may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for performing data processing provided by the foregoing embodiment is based on the same concept as the method embodiment for performing data processing. The specific implementation process is described in detail in the method embodiment, and details are not described herein again.

Those skilled in the art can understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above is only one embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims

A method for performing data processing, the method comprising:

Receiving a data processing request of a target service, wherein the data to be processed of the data processing request includes multiple key-value pairs;

Determining a running feature value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, wherein the running feature value includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage;

Determining, according to the running feature value, a resource configuration parameter value of each data processing stage.
The method according to claim 1, wherein the determining the resource configuration parameter value of each data processing stage according to the running feature value comprises:

Determining, according to a performance prediction model and the running feature value, the resource configuration parameter value of each data processing stage, wherein the performance prediction model is trained on historical data.
The method according to claim 2, wherein after the determining the resource configuration parameter value of each of the data processing stages, the method further comprises:

Running the data processing request according to the resource configuration parameter value;

Collecting statistics on the actual running feature value and the actual service performance value of each data processing stage;

Adjusting the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
The method of claim 2, wherein the method further comprises:

Each time a data processing request is received, running the data processing request according to the first preset resource configuration parameter value of the current data processing request, and recording the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request;

When the number of received data processing requests reaches a first preset number, training a preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

Storing the performance prediction model.
The method according to claim 4, wherein said storing said performance prediction model comprises:

Each time a data processing request is received, running the data processing request based on the second preset resource configuration parameter value of the current data processing request, and collecting the second actual running feature value and the second actual service performance value of the current data processing request;

Determining, according to the performance prediction model and the second actual running feature value, a third service performance value of the current data processing request;

Determining, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model under the current data processing request;

When the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculating the average accuracy of the performance prediction model according to its accuracy under each of the second preset number of data processing requests;

If the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.
A server, characterized in that the server comprises a transceiver and a processor, wherein:

a transceiver, configured to receive a data processing request of the target service, where the data to be processed of the data processing request includes multiple key value pairs;

a processor, configured to: determine a running feature value of the data processing request according to an identifier of the target service, a data amount of the data to be processed, and a distribution of keys of the data to be processed, wherein the running feature value includes a complexity of each data processing stage of the data processing request and a ratio of data output to data input of each data processing stage; and determine, according to the running feature value, a resource configuration parameter value of each data processing stage.
The server according to claim 6, wherein the processor is configured to:

Determine, according to a performance prediction model and the running feature value, the resource configuration parameter value of each data processing stage, wherein the performance prediction model is trained on historical data.
The server according to claim 7, wherein the processor is further configured to:

After determining the resource configuration parameter value of each data processing stage, run the data processing request according to the resource configuration parameter value;

Collect statistics on the actual running feature value and the actual service performance value of each data processing stage;

Adjust the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
The server according to claim 7, wherein the processor is further configured to:

Each time a data processing request is received, run the data processing request according to the first preset resource configuration parameter value of the current data processing request, and record the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request;

When the number of received data processing requests reaches a first preset number, train a preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

Store the performance prediction model.
The server according to claim 9, wherein said processor is configured to:

Each time a data processing request is received, run the data processing request based on the second preset resource configuration parameter value of the current data processing request, and collect the second actual running feature value and the second actual service performance value of the current data processing request;

Determine, according to the performance prediction model and the second actual running feature value, a third service performance value of the current data processing request;

Determine, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model under the current data processing request;

When the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate the average accuracy of the performance prediction model according to its accuracy under each of the second preset number of data processing requests;

If the average accuracy reaches a preset accuracy threshold, store the performance prediction model.
An apparatus for performing data processing, the apparatus comprising:

a receiving module, configured to receive a data processing request of the target service, where the data to be processed of the data processing request includes multiple key value pairs;

a determining module, configured to: determine a running feature value of the data processing request according to an identifier of the target service, a data amount of the data to be processed, and a distribution of keys of the data to be processed, wherein the running feature value includes a complexity of each data processing stage of the data processing request and a ratio of data output to data input of each data processing stage; and determine, according to the running feature value, a resource configuration parameter value of each data processing stage.
The apparatus according to claim 11, wherein the determining module is configured to:

Determine, according to a performance prediction model and the running feature value, the resource configuration parameter value of each data processing stage, wherein the performance prediction model is trained on historical data.
The apparatus according to claim 12, wherein the apparatus further comprises:

a running module, configured to: after the resource configuration parameter value of each data processing stage is determined, run the data processing request according to the resource configuration parameter value;

a statistics module, configured to collect statistics on the actual running feature value and the actual service performance value of each data processing stage;

an adjustment module, configured to adjust the performance prediction model according to the resource configuration parameter value, the actual running feature value, and the actual service performance value.
The apparatus according to claim 12, wherein the apparatus further comprises:

a recording module, configured to: each time a data processing request is received, run the data processing request according to the first preset resource configuration parameter value of the current data processing request, and record the first actual running feature value, the first preset resource configuration parameter value, and the first actual service performance value of the current data processing request;

a training module, configured to: when the number of received data processing requests reaches a first preset number, train a preset performance prediction model reference formula based on the actual running feature values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

a storage module, configured to store the performance prediction model.
The apparatus according to claim 14, wherein the apparatus further comprises:

a verification module, configured to: each time a data processing request is received, run the data processing request based on the second preset resource configuration parameter value of the current data processing request, and collect the second actual running feature value and the second actual service performance value of the current data processing request;

Determine, according to the performance prediction model and the second actual running feature value, a third service performance value of the current data processing request;

Determine, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model under the current data processing request;

When the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate the average accuracy of the performance prediction model according to its accuracy under each of the second preset number of data processing requests;

The storage module is configured to: if the average accuracy reaches a preset accuracy threshold, store the performance prediction model.