CN108463813A

CN108463813A - Method and apparatus for data processing

Info

Publication number: CN108463813A
Application number: CN201680031201.5A
Authority: CN
Inventors: 常玉立; 王海彬; 程捷
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2018-08-28
Anticipated expiration: 2036-11-30
Also published as: CN108463813B; WO2018098670A1

Abstract

An embodiment of the invention discloses a method and an apparatus for data processing, belonging to the field of internet technologies. The method includes: receiving a data processing request of a target service, wherein the data to be processed of the data processing request includes a plurality of key-value pairs; determining an operation characteristic value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, wherein the operation characteristic includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and determining a resource configuration parameter value of each data processing stage according to the operation characteristic value. Because the resource configuration parameter values of the data processing stages of the data processing request are calculated automatically, technicians do not need to input values for multiple resource configuration parameters one by one before the data processing request is run, which improves configuration efficiency.

Description

Method and device for processing data

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method and an apparatus for data processing.

Background

To increase the speed of data processing, technicians often use a distributed data processing system to process data, wherein the distributed data processing system can divide the data into a plurality of data blocks with preset sizes and store the data blocks on a plurality of nodes (wherein the distributed data processing system is a server cluster comprising a plurality of servers, and a node can be one server in the distributed data processing system), and each node can process a part of the data in parallel.

The distributed data processing system contains many resource configuration parameters that affect the efficiency of the distributed data processing system in processing data, and these resource configuration parameters may include the size of a data block, the total number of processes processing data, the number of processes processing data per node, and so on. A technician may configure, before the distributed data processing system performs data processing corresponding to a certain service, parameter values corresponding to the certain service for each resource configuration parameter in the distributed data processing system according to experience of the technician, and when the distributed data processing system receives a data processing request corresponding to the certain service (for example, data in the distributed data processing system may be sorted, or a maximum value of data in the distributed data processing system may be determined), the distributed data processing system may perform processing corresponding to the data processing request based on the resource configuration parameter values of the preset resource configuration parameters.

In the process of implementing the invention, the inventor finds that the prior art has at least the following problems:

Based on the above processing method, technicians must configure the resource configuration parameters before the distributed data processing system performs data processing. Because the number of resource configuration parameters is often large and technicians must input them one by one, configuration efficiency is low.

Disclosure of Invention

In order to achieve the purpose of improving configuration efficiency, embodiments of the present invention provide a method and an apparatus for data processing. The technical scheme is as follows:

in a first aspect, a method for data processing is provided, the method including:

receiving a data processing request of a target service, wherein the data to be processed of the data processing request comprises a plurality of key value pairs; determining an operation characteristic value of the data processing request according to the identification of the target service, the data volume of the data to be processed and the distribution condition of keys of the data to be processed, wherein the operation characteristic value can comprise the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; furthermore, the resource configuration parameter value of each data processing stage can be determined according to the operation characteristic value.

According to the scheme shown in the embodiment of the invention, a user can send a data processing request of a target service to the server through the console, after the server receives the data processing request of the target service, the server can determine the operation characteristic value of the data processing request according to the identification of the target service, the data quantity of the data to be processed and the distribution condition of keys of the data to be processed, and further determine the resource configuration parameter value of each data processing stage according to the determined operation characteristic. In this way, when a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter value of each data processing stage of the data processing request, and technicians do not need to input the resource configuration parameter values of a plurality of resource configuration parameters one by one before data processing is performed, so that the configuration efficiency can be improved.

In a possible implementation manner, determining the resource configuration parameter value of each data processing stage according to the operation characteristic value includes: determining the resource configuration parameter value of each data processing stage according to a performance prediction model and the operation characteristic value, wherein the performance prediction model is obtained by training on historical data.

According to the scheme shown in the embodiment of the invention, the performance prediction model can be stored in the server in advance, and after the server receives the data processing request of the target service, the resource configuration parameter value of each data processing stage can be determined according to the pre-stored performance prediction model and the determined operation characteristic value.

In a possible implementation manner, determining the operation characteristic value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution of the keys of the data to be processed, where the operation characteristic includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage, includes: determining a data characteristic value of the data to be processed of each data processing stage of the data processing request; determining a target operation characteristic prediction model corresponding to the target service according to a pre-stored correspondence between services and operation characteristic prediction models that take data characteristics and operation characteristics as variables; and, for each data processing stage, using the target operation characteristic prediction model to calculate the value of the operation characteristic when the data characteristic takes the data characteristic value of that data processing stage, thereby obtaining the operation characteristic value of the data processing stage.
Determining the resource configuration parameter value of each data processing stage according to the operation characteristic value includes: for each data processing stage, using a pre-stored performance prediction model that takes the operation characteristic, the resource configuration parameter, and the service performance as variables to calculate the value of the resource configuration parameter at which the service performance reaches its optimal value when the operation characteristic takes the operation characteristic value of that data processing stage, thereby obtaining the resource configuration parameter value corresponding to the data processing stage.

According to the scheme shown in the embodiment of the invention, after the server receives the data processing request of the target service, the data characteristic value of each data processing stage of the data processing request can be determined, wherein the data characteristic value comprises the data quantity of the data to be processed and the distribution condition of keys of the data to be processed, and further, for each data processing stage, the server can take the data characteristic value of the data processing stage as the value of the data characteristic, substitute the value into the target operation characteristic prediction model corresponding to the target service, calculate the operation characteristic value of the operation characteristic, and obtain the operation characteristic value of each data processing stage.

The server may store a performance prediction model that takes the operation characteristic, the resource configuration parameter, and the service performance as variables. When a data processing request of a target service is received, the server may determine the operation characteristic value of each preset data processing stage of the data processing request and then determine the resource configuration parameter value of each data processing stage. The server may then perform the processing of each data processing stage based on the resource configuration parameter value calculated for that stage, which improves the processing efficiency of each data processing stage. The operation characteristic value corresponding to each data processing stage may include the data amount of the data to be processed of that data processing stage and the distribution of the keys of that data.
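The selection step described above can be sketched as a search over candidate resource configurations. This is a minimal illustration, not the patent's stated procedure: it assumes that a higher predicted service performance is better, that candidate values are enumerated on a small grid, and that the performance prediction model is available as a callable. The names `best_configuration`, `toy_performance_model`, `block_size_mb`, and `num_processes` are hypothetical, and the toy model's formula is invented purely so that the example has a known optimum.

```python
from itertools import product


def best_configuration(performance_model, characteristics, parameter_grid):
    """Return the candidate configuration with the highest predicted performance.

    performance_model: callable(characteristics, config) -> predicted performance
    characteristics:   operation characteristic values of one data processing stage
    parameter_grid:    dict mapping parameter name -> list of candidate values
    """
    names = list(parameter_grid)
    best_config, best_score = None, float("-inf")
    # Enumerate every combination of candidate values (exhaustive grid search).
    for values in product(*(parameter_grid[name] for name in names)):
        config = dict(zip(names, values))
        score = performance_model(characteristics, config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_performance_model(characteristics, config):
    # ASSUMPTION: invented stand-in for a trained model; it peaks at
    # 128 MB blocks and at 8 * complexity processes.
    ideal_processes = 8 * characteristics["complexity"]
    return (-(config["num_processes"] - ideal_processes) ** 2
            - (config["block_size_mb"] - 128) ** 2 / 1000)


stage_characteristics = {"ratio": 0.01, "complexity": 1.0}
grid = {"block_size_mb": [64, 128, 256], "num_processes": [4, 8, 16]}
config, score = best_configuration(toy_performance_model, stage_characteristics, grid)
```

Running the search once per data processing stage, with that stage's operation characteristic values, yields one configuration per stage. An exhaustive grid is only practical for a handful of parameters; the text does not say how the optimum is located.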

In one possible implementation, after determining the resource configuration parameter value of each data processing stage, the method further includes: running a data processing request according to the resource configuration parameter value; counting the actual operation characteristic value and the actual service performance value of each data processing stage; and adjusting the performance prediction model according to the resource configuration parameter value, the actual operation characteristic value and the actual service performance value.

According to the scheme shown in the embodiment of the invention, after the resource configuration parameter value is determined, the data processing request can be run based on the resource configuration parameter value, and the actual operation characteristic value and the actual service performance value of each data processing stage can be counted. Finally, the server can use the resource configuration parameter values, actual operation characteristic values, and actual service performance values counted for each data processing stage as training data to retrain the performance prediction model. In this way, the accuracy of the performance prediction model can be progressively improved.

In one possible implementation, the method further includes: when a data processing request is received, operating the data processing request according to a first preset resource configuration parameter value of the current data processing request, and counting a first actual operation characteristic value, a first preset resource configuration parameter value and a first actual service performance value of the current data processing request; when the number of the received data processing requests reaches a first preset number, training a preset performance prediction model benchmark formula based on the actual operation characteristic values, the preset resource allocation parameter values and the actual service performance values of the data processing requests of the first preset number to obtain a performance prediction model; and storing the performance prediction model.

According to the scheme of the embodiment of the invention, before enough samples are obtained, the server can execute the processing corresponding to the current data processing request based on the preset resource configuration parameter values when receiving the data processing request, and correspondingly record the actual operation characteristic value, the preset resource configuration parameter value and the actual service performance value of the data processing request, and when the enough actual operation characteristic value, the preset resource configuration parameter value and the actual service performance value are obtained, the server can train the performance prediction model among the operation characteristic, the resource configuration parameter and the service performance based on the historical operation records and store the performance prediction model, so that the server can automatically calculate the resource configuration parameter value of each data processing stage of the data processing request when receiving the data processing request subsequently.
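The collect-then-train behaviour described above can be sketched as follows. The control flow follows the text (run with preset values, record each observation, train once the first preset number of requests is reached); everything else is an assumption. In particular, `BootstrapTrainer` and `fit_mean_by_config` are hypothetical names, and the fit shown is only a placeholder that averages observed performance per configuration. The text does not disclose the form of the benchmark formula, so it is not reproduced here.

```python
class BootstrapTrainer:
    """Collect observations until enough exist, then train a model once."""

    def __init__(self, first_preset_number, fit):
        self.first_preset_number = first_preset_number
        self.fit = fit          # callable(samples) -> trained model
        self.samples = []
        self.model = None

    def record(self, characteristics, config, performance):
        """Record one finished request; return the model once it exists."""
        if self.model is not None:
            return self.model
        self.samples.append((characteristics, config, performance))
        if len(self.samples) >= self.first_preset_number:
            self.model = self.fit(self.samples)
        return self.model


def fit_mean_by_config(samples):
    # ASSUMPTION: placeholder "training" that ignores the operation
    # characteristics and averages performance per configuration.
    totals = {}
    for _characteristics, config, performance in samples:
        key = tuple(sorted(config.items()))
        total, count = totals.get(key, (0.0, 0))
        totals[key] = (total + performance, count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


trainer = BootstrapTrainer(first_preset_number=3, fit=fit_mean_by_config)
preset = {"num_processes": 4}
first = trainer.record({"complexity": 1.0}, preset, 10.0)
second = trainer.record({"complexity": 1.1}, preset, 14.0)
third = trainer.record({"complexity": 0.9}, preset, 12.0)
```

Until the third call, `record` returns `None`, which corresponds to the period in which requests still run with preset resource configuration parameter values.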

In one possible implementation, storing the performance prediction model includes: when a data processing request is received, operating the data processing request based on a second preset resource configuration parameter value of the current data processing request, and counting a second actual operation characteristic value and a second actual service performance value of the current data processing request; determining a third service performance value of the current data processing request according to the performance prediction model and the second actual operation characteristic value; determining the accuracy of the performance prediction model according to the difference value between the second actual service performance value and the determined third service performance value, and obtaining the accuracy of the performance prediction model under the current data processing request; when the number of the received data processing requests reaches a second preset number after the performance prediction model is obtained, calculating the average accuracy of the performance prediction model according to the accuracy of the performance prediction model under the second preset number of data processing requests; and if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.

According to the scheme disclosed by the embodiment of the invention, after the server obtains the performance prediction model taking the operation characteristics, the resource configuration parameters and the service performance as variables, the accuracy of the performance prediction model can be verified, and only when the accuracy reaches the preset accuracy threshold value, the server automatically determines the resource configuration parameter values corresponding to the data processing request through the performance prediction model taking the operation characteristics, the resource configuration parameters and the service performance as variables, so that the processing corresponding to the data processing request is executed based on the automatically determined resource configuration parameter values. Therefore, the accuracy of the finally stored performance prediction model with the operation characteristics, the resource configuration parameters and the service performance as variables can be improved, and the processing efficiency of the processing corresponding to the execution data processing request is improved.
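The validation step can be sketched as below. The text states only that accuracy is derived from the difference between the actual and the predicted service performance value and that the average over a second preset number of requests is compared with a threshold. The specific accuracy measure used here (one minus the relative error, clamped to the range 0 to 1) is an assumption, and the function names are hypothetical.

```python
def prediction_accuracy(actual, predicted):
    """Accuracy of one prediction.

    ASSUMPTION: defined as 1 - relative error, clamped to [0, 1].
    """
    if actual == 0:
        return 1.0 if predicted == 0 else 0.0
    return max(0.0, 1.0 - abs(actual - predicted) / abs(actual))


def should_store_model(observations, second_preset_number, accuracy_threshold):
    """Decide whether the trained model is accurate enough to be stored.

    observations: list of (actual_performance, predicted_performance) pairs,
                  one per data processing request received after training.
    """
    if len(observations) < second_preset_number:
        return False  # not enough validation requests yet
    checked = observations[:second_preset_number]
    average = sum(prediction_accuracy(a, p) for a, p in checked) / len(checked)
    return average >= accuracy_threshold


observations = [(100.0, 90.0), (100.0, 110.0), (200.0, 190.0)]
store = should_store_model(observations, second_preset_number=3, accuracy_threshold=0.9)
```

With these three observations the per-request accuracies are 0.9, 0.9, and 0.95, so the model passes a 0.9 threshold but would fail a 0.95 threshold.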

In a second aspect, a server is provided, the server comprising a processor, a memory, a transceiver, the processor configured to execute instructions stored in the memory; the processor implements the method for data processing provided by the first aspect by executing the instructions.

In a third aspect, an apparatus for data processing is provided, where the apparatus includes at least one module, and the at least one module is configured to implement the method for data processing provided in the first aspect.

The technical effects obtained by the second to third aspects of the embodiments of the present invention are similar to the technical effects obtained by the corresponding technical means in the first aspect, and are not described herein again.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

in the embodiment of the invention, a data processing request of a target service is received, wherein data to be processed of the data processing request comprises a plurality of key value pairs; determining an operation characteristic value of the data processing request according to the identification of the target service, the data volume of the data to be processed and the distribution condition of keys of the data to be processed, wherein the operation characteristic comprises the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and determining the resource configuration parameter value of each data processing stage according to the operation characteristic value. Therefore, when a data processing request is received, the distributed data processing system can automatically calculate the resource configuration parameter value of the data processing request, and technicians do not need to input the resource configuration parameter values of a plurality of resource configuration parameters one by one before data processing is carried out, so that the configuration efficiency can be improved.

Drawings

FIG. 1 is a system framework diagram provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a server provided by the embodiment of the present invention;

FIG. 3 is a flow chart of a method for data processing according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for data processing according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for data processing according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for data processing according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an apparatus for data processing according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

An embodiment of the present invention provides a method for data processing. The method may be implemented by a server. The server may be a single server or a server group composed of a plurality of servers, that is, a distributed data processing system, as shown in FIG. 1; a distributed data processing system is preferred. When a data processing request is received, the distributed data processing system may divide the data to be processed into a plurality of data blocks according to a preset data block size, and each server may then process a part of the data in parallel.

The server may include a transceiver 210, a processor 220, and a memory 230, and the memory 230 and the transceiver 210 may each be connected to the processor 220, as shown in FIG. 2. The transceiver 210 may be used to receive messages or data and may include, but is not limited to, at least one amplifier, a tuner, one or more oscillators, a coupler, a low noise amplifier (LNA), a duplexer, and the like; in the present invention, the transceiver 210 may be used to receive data processing requests. The processor 220 may include one or more processing units. The processor 220 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like, or it may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device. The memory 230 may be used to store software programs and modules; specifically, a program may include program code, and the program code includes computer operating instructions. The processor 220 performs data processing by reading the software code and modules stored in the memory 230.

The following will describe the processing flow shown in fig. 3 in detail with reference to the specific embodiment, where the following takes a server as a distributed data processing system as an example, and other situations are similar and will not be described again, and the content may be as follows:

Step 301, a distributed data processing system receives a data processing request of a target service, wherein data to be processed of the data processing request includes a plurality of key-value pairs.

The target service may be any service that can be performed by the distributed data processing system; for example, the target service may be determining the maximum value in the data to be processed. The data processing request of the target service may be a request to perform the target service processing on specific data to be processed; for example, it may be a request to sort specified data to be processed.

In practice, to increase data processing speed, technicians often use distributed data processing systems to process data. Specifically, a technician may send a data processing request corresponding to a certain service (which may be referred to as a target service) to the distributed data processing system through the console, where the data processing request may specify the data to be processed (for example, a memory address of the data to be processed may be carried in the data processing request). In addition, the data to be processed corresponding to the data processing request may be a series of key-value pairs, and data processing on the data to be processed generally refers to processing of the keys. After receiving a data processing request of a target service, the distributed data processing system may determine a data characteristic value of the data to be processed of the data processing request, where the data characteristic value may include the data amount of the data to be processed and the distribution of the keys of the data to be processed.

Specifically, the distributed data processing system may determine the distribution of the keys of the data to be processed as follows. A data distribution model may be preset in the distributed data processing system, and after the data to be processed is determined, the distribution of the data to be processed may be determined from the distribution of all of its keys. When the preset data distribution model is a uniform distribution, the distributed data processing system may calculate the mean of the keys and use the mean as the distribution of the keys of the data to be processed. When the preset data distribution model is a normal distribution, the distributed data processing system may calculate the mean and the standard deviation of the keys and use them as the distribution of the keys of the data to be processed. In addition, to reduce the amount of calculation, the distributed data processing system may sample the data to be processed at a preset sampling rate and determine the distribution of the keys of the data to be processed of the data processing request from the distribution of the keys of the sampled part.
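The key-distribution calculation described above can be sketched as follows. This is an illustration under stated assumptions: keys are numeric, the model labels `"uniform"` (mean only) and `"normal"` (mean and standard deviation) are names chosen for this example, the population standard deviation is used, and sampling is simple random sampling with a fixed seed. The function name `key_distribution` is hypothetical.

```python
import random
import statistics


def key_distribution(pairs, model="uniform", sampling_rate=1.0, seed=0):
    """Summarize the keys of (key, value) pairs under a preset distribution model."""
    keys = [key for key, _value in pairs]
    if not keys:
        raise ValueError("no data to be processed")
    if sampling_rate < 1.0:
        # Sample a fraction of the keys to reduce the amount of calculation.
        rng = random.Random(seed)
        sample_size = max(1, int(len(keys) * sampling_rate))
        keys = rng.sample(keys, sample_size)
    mean = statistics.fmean(keys)
    if model == "uniform":
        return {"mean": mean}
    if model == "normal":
        # ASSUMPTION: population standard deviation of the (sampled) keys.
        return {"mean": mean, "std": statistics.pstdev(keys)}
    raise ValueError(f"unknown distribution model: {model}")


pairs = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
uniform_summary = key_distribution(pairs, model="uniform")
normal_summary = key_distribution(pairs, model="normal")
sampled_summary = key_distribution(pairs, model="uniform", sampling_rate=0.5)
```

Together with the data amount, the returned summary would form the data characteristic value of the request.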

The processing step may specifically be implemented by a transceiver.

Step 302, the distributed data processing system determines an operation characteristic value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution condition of the key of the data to be processed, where the operation characteristic value includes the complexity of each data processing stage of the data processing request and the ratio of the data output to the data input of each data processing stage.

In implementation, after receiving the data processing request, the distributed data processing system may determine an identifier of a target service of the data processing request, and further determine an operation characteristic value of the data processing request according to the determined identifier of the target service, the data amount of the data to be processed, and the distribution condition of keys of the data to be processed.

Specifically, the operation characteristic value may be the ratio of the data amount of the output data of the processing corresponding to the data processing request to the data amount of its input data (i.e., the data to be processed); this characteristic may be denoted ratio, and the ratio corresponding to a data processing request is the numerical value of ratio. The operation characteristic value may also be the complexity of the processing corresponding to the data processing request; this characteristic may be denoted complexity, and the complexity of the processing corresponding to a data processing request is the numerical value of complexity. Here, complexity represents the computational complexity of the processing corresponding to the data processing request, that is, the computational (or algorithmic) complexity of the service processing logic of the service corresponding to the data processing request.

The complexity corresponding to the data processing request may be the ratio of the central processing unit (CPU) effective time t used by the distributed data processing system to execute the processing corresponding to the data processing request to the data amount size of the data to be processed of the data processing request (i.e., complexity = t/size). In addition, when performing the processing corresponding to a data processing request, the distributed data processing system may divide the data to be processed into a plurality of data blocks of a preset data block size (the default data block size may be 64 MB (megabytes)) and store the data blocks on the servers. The embodiments of the present invention do not limit how the data blocks are distributed; for example, the distributed data processing system may distribute the data blocks equally among the servers for storage. Each server can then process its locally stored data blocks for a given service at the same time (for example, sort the data in its locally stored data blocks, or find the maximum value of the locally stored data), and while processing data each server can start a plurality of threads to process the data blocks simultaneously.

In this case, the complexity may be the ratio of the accumulated total CPU effective time T used by all threads on all servers to the data amount of the data to be processed corresponding to the data processing request (i.e., complexity = T/size). For example, the processing corresponding to the data processing request may be divided into N parallel processes or threads, the time used by the i-th process or thread may be denoted t_i, and the data amount of the data to be processed corresponding to the data processing request may be denoted size (where size is the sum of the data amounts processed by all processes). The complexity may then be as shown in formula (1):

complexity = \frac{\sum_{i=1}^{N} t_i}{size} \qquad (1)

In addition, considering the difference of the main frequencies of the central processors of the servers, the complexity can be further expressed by formula (2).

Here, t_i denotes the CPU effective time used by the i-th process or thread, and cpu_frequency_i denotes the main frequency of the central processor of the server on which the i-th process or thread runs.
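The complexity calculation can be sketched as below. The unweighted form follows the text directly (accumulated CPU effective time divided by the data amount). The frequency-weighted form is an assumption: the text says only that differences in CPU main frequency are taken into account, so multiplying each t_i by cpu_frequency_i (roughly, CPU cycles spent per unit of data) is one plausible reading, not the patent's stated formula. The function name and argument names are hypothetical.

```python
def complexity(cpu_times, size, cpu_frequencies=None):
    """CPU effort per unit of data to be processed.

    cpu_times:       CPU effective time t_i of each process or thread
    size:            total data amount of the data to be processed
    cpu_frequencies: optional main frequency for each process or thread's server
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if cpu_frequencies is None:
        # Accumulated CPU effective time T divided by size.
        return sum(cpu_times) / size
    if len(cpu_frequencies) != len(cpu_times):
        raise ValueError("one frequency is required per process or thread")
    # ASSUMPTION: weight each t_i by its server's main frequency so that time
    # measured on faster and slower CPUs becomes comparable.
    return sum(t * f for t, f in zip(cpu_times, cpu_frequencies)) / size


plain = complexity([2.0, 3.0, 5.0], size=100)
weighted = complexity([2.0, 4.0], size=8, cpu_frequencies=[2.0, 3.0])
```

In the first call three threads used 10 seconds in total on 100 units of data; in the second, the two threads' times are scaled by 2.0 and 3.0 before being summed.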

Specifically, the distributed data processing system may preset the data processing stages used to execute the processing corresponding to a data processing request. The data processing stages corresponding to the same service are fixed, and the processing corresponding to the data processing request is finished when every data processing stage has finished its processing. The output data of one data processing stage is the data to be processed (i.e., input data) of the next data processing stage. For example, when the processing corresponding to the data processing request includes two data processing stages in total, the data to be processed specified by the data processing request is the input data of the first data processing stage, and the output data of the first data processing stage is the input data of the second data processing stage.

For example, Hadoop, an application used in distributed data processing systems, can implement distributed data processing. In Hadoop, the processing corresponding to a data processing request includes two data processing stages, namely map (mapping) and reduce (reduction). When the distributed data processing system receives a data processing request, it may first perform the map data processing stage on the data to be processed specified by the request and then perform the reduce data processing stage on the output data of the map data processing stage. As a further example, suppose the processing corresponding to the data processing request currently received by the distributed data processing system is determining the maximum value in the data to be processed. In the map data processing stage, the distributed data processing system can start a plurality of map threads at the same time, each processing a part of the data to be processed to obtain the maximum value in that part; then, in the reduce data processing stage, it determines the maximum value among the outputs of the map threads to obtain the final maximum value in the data to be processed.
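The two-stage maximum example above can be sketched in plain Python. This is a single-machine stand-in, not Hadoop code: a thread pool plays the role of the parallel map threads, and the function names, the block size of 3 records, and the worker count are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Divide the data to be processed into blocks of a preset size."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_stage(block):
    """Map stage: find the largest key within one block."""
    return max(key for key, _value in block)


def reduce_stage(partial_maxima):
    """Reduce stage: find the largest value among the map outputs."""
    return max(partial_maxima)


def find_max_key(records, block_size=3, workers=4):
    """Run the map stage on all blocks in parallel, then the reduce stage."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_maxima = list(pool.map(map_stage, blocks))
    return reduce_stage(partial_maxima)


records = [(key, str(key)) for key in [7, 3, 42, 15, 8, 23, 4]]
largest = find_max_key(records)
```

The output of the map stage (one value per block) is the input of the reduce stage, which matches the chaining of stages described in the text. The sketch assumes a non-empty record list.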

In this case, after receiving the data processing request of the target service, the distributed data processing system determines the operation characteristic value corresponding to each data processing stage. For the case that the operation characteristics include the ratio and the complexity, after the distributed data processing system receives the data processing request, the value of the ratio and the value of the complexity of each data processing stage can be respectively determined. Specifically, the ratio of each data processing stage may be the ratio of the data amount of the output data to the data amount of the input data (i.e., the data to be processed) of that data processing stage. The complexity of each data processing stage may be the ratio of the total CPU effective time used by the distributed data processing system to perform the processing of the data processing stage to the data amount of the data to be processed corresponding to the data processing stage, or may be calculated according to formula (3), where x in formula (3) denotes a data processing stage, t_xi denotes the time used by the i-th process or thread of data processing stage x to process data, cpu_frequency_i denotes the main frequency of the central processing unit of the server corresponding to the i-th process or thread, and size_x denotes the data amount of the data to be processed in data processing stage x.
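A minimal sketch of the two operation characteristics follows. The ratio and the first form of the complexity follow the definitions in the text. Formula (3) itself is not reproduced in this text, so the frequency-weighted variant below is only an assumption built from the listed symbols (t_xi, cpu_frequency_i, size_x); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessRecord:
    cpu_seconds: float        # t_xi: time used by the i-th process or thread
    cpu_frequency_ghz: float  # cpu_frequency_i: main frequency of its server's CPU


@dataclass
class StageStats:
    input_bytes: int          # size_x: data amount of the data to be processed
    output_bytes: int         # data amount of the stage's output data
    processes: List[ProcessRecord]


def stage_ratio(stats: StageStats) -> float:
    """Ratio of output data amount to input data amount."""
    return stats.output_bytes / stats.input_bytes


def stage_complexity(stats: StageStats) -> float:
    """Total CPU effective time divided by the input data amount."""
    total_cpu = sum(p.cpu_seconds for p in stats.processes)
    return total_cpu / stats.input_bytes


def stage_complexity_weighted(stats: StageStats) -> float:
    """ASSUMED reading of formula (3): weight each time by the CPU frequency."""
    weighted = sum(p.cpu_seconds * p.cpu_frequency_ghz for p in stats.processes)
    return weighted / stats.input_bytes


stats = StageStats(
    input_bytes=1000,
    output_bytes=250,
    processes=[ProcessRecord(2.0, 2.0), ProcessRecord(3.0, 3.0)],
)
print(stage_ratio(stats), stage_complexity(stats), stage_complexity_weighted(stats))
```

Weighting by frequency would make times measured on servers with different clock speeds comparable, which is one plausible reason for cpu_frequency_i to appear among the symbols.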

In addition, the operation characteristic value of each data processing stage may be predicted from the data characteristic value of the data to be processed of each data processing stage, and accordingly, the processing procedure may be as follows: determining a data characteristic value of the data to be processed of each data processing stage of the data processing request; determining a target operation characteristic prediction model corresponding to the target service according to a pre-stored corresponding relation between an operation characteristic prediction model taking data characteristics and operation characteristics as variables and the service; and for each data processing stage, calculating the operation characteristic value of the operation characteristic when the value of the data characteristic is the data characteristic value corresponding to the data processing stage according to the target operation characteristic prediction model, and obtaining the operation characteristic value of the data processing stage.

The data characteristics may be variables of data characteristic information used for characterizing the data to be processed, and the operation characteristics may be variables of operation characteristic information corresponding to the data processing request.

In implementation, a data distribution model may be stored in advance in the distributed data processing system, where the data distribution model may be a uniform distribution, a normal distribution, or another data distribution commonly used in mathematics. Specifically, the distributed data processing system may store in advance a calculation formula for the parameters of the data distribution model; for example, it may store in advance a calculation formula for the mean value of the uniform distribution, or calculation formulas for the mean value and the standard deviation of the normal distribution.

When the distributed data processing system receives a data processing request of a target service, it may determine the to-be-processed data of each data processing stage of the data processing request, and further determine the data characteristic information (i.e., the data characteristic value) of the to-be-processed data. Specifically, the data characteristic value of each data processing stage may be the data amount of the to-be-processed data of the data processing stage and/or the distribution situation of its keys (the distribution situation of the keys may be a numerical value of the data distribution, keydist).
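As an illustration only, the data characteristic values of a set of key-value pairs might be computed as below. The text does not define how keydist is calculated, so this sketch assumes that the number of pairs stands in for the data amount and that the mean and standard deviation of the per-key counts stand in for the key distribution; the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def data_characteristics(pairs):
    """Return assumed data characteristic values for a list of (key, value) pairs."""
    counts = Counter(key for key, _ in pairs)
    per_key = list(counts.values())
    return {
        "size": len(pairs),          # stand-in for the data amount
        "key_mean": mean(per_key),   # mean number of values per key
        "key_std": pstdev(per_key),  # spread of values across keys
    }


pairs = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5), ("b", 6)]
features = data_characteristics(pairs)
print(features)
```

A large standard deviation relative to the mean would indicate skewed keys, which is the kind of distribution information that could influence how evenly work spreads across processes.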

The distributed data processing system may store in advance a correspondence between service identifiers and operation characteristic prediction models that use data characteristics and operation characteristics as variables, as shown in table 1. The service identifier may be a service name, a program name and a class name of a software program corresponding to the service, or another identifier capable of distinguishing the target service from other services; it may be manually named and maintained, or may be extracted from a service data processing request by using a specific rule. The operation characteristic prediction model may be a functional expression in which the data characteristic is the independent variable and the operation characteristic is the dependent variable, that is, operation characteristic = f(data characteristic), where f() represents the functional relationship satisfied between the data characteristic and the operation characteristic. Specifically, when the data characteristics include the data volume and the data distribution and the operation characteristics include the complexity and the ratio, the operation characteristic prediction model may be (complexity, ratio) = f(data volume, data distribution), i.e., (complexity, ratio) = f(size, keydist). 
After the distributed data processing system determines the data characteristic value of each data processing stage, a target operation characteristic prediction model corresponding to the target service can be determined in the corresponding relation shown in table 1, and then, for each data processing stage, the distributed data processing system can take the data characteristic value of the data processing stage as the value of the data characteristic, substitute the value into the target operation characteristic prediction model, calculate the operation characteristic value of the operation characteristic, and obtain the operation characteristic value of each data processing stage.

TABLE 1

Service identification	Operating characteristic prediction model
Service identification 1	Prediction model 1
Service identification 2	Prediction model 2
Service identification 3	Prediction model 3
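The correspondence in table 1 and its use can be sketched as a lookup from service identifier to a function (complexity, ratio) = f(size, keydist). The linear form of f, the coefficients and the service identifiers below are hypothetical placeholders; the text does not specify the shape of the prediction models.

```python
from typing import Callable, Dict, List, Tuple

# (size, keydist) -> (complexity, ratio)
OperationModel = Callable[[float, float], Tuple[float, float]]


def make_linear_model(c_size, c_dist, c0, r_size, r_dist, r0) -> OperationModel:
    """Build a hypothetical linear operation characteristic prediction model."""
    def predict(size: float, keydist: float) -> Tuple[float, float]:
        complexity = c_size * size + c_dist * keydist + c0
        ratio = r_size * size + r_dist * keydist + r0
        return complexity, ratio
    return predict


# Pre-stored correspondence between service identifier and prediction model.
MODELS_BY_SERVICE: Dict[str, OperationModel] = {
    "service-1": make_linear_model(0.0, 0.5, 1.0, 0.0, -0.125, 0.75),
    "service-2": make_linear_model(0.001, 0.0, 0.5, 0.0, 0.0, 1.0),
}


def predict_operation_characteristics(
    service_id: str, stage_features: List[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Look up the target model and apply it to each stage's data characteristics."""
    model = MODELS_BY_SERVICE[service_id]
    return [model(size, keydist) for size, keydist in stage_features]


predictions = predict_operation_characteristics("service-1", [(1000.0, 2.0)])
print(predictions)
```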

The process may specifically be implemented by a processor.

Step 303, the distributed data processing system determines the resource allocation parameter value of each data processing stage according to the operation characteristic value.

In implementation, after determining the operation characteristic value of each data processing stage, the distributed data processing system may determine the resource configuration parameter value of each data processing stage according to the operation characteristic value of that stage. In addition, the distributed data processing system may also take the data volume of the data to be processed into account, that is, it may determine the resource configuration parameter value of each data processing stage according to both the operation characteristic value and the data volume of the data to be processed.

Optionally, the distributed data processing system may determine the resource configuration parameter value of each data processing stage according to the performance prediction model and the operation characteristic value of each data processing stage, where the performance prediction model is trained from historical data.

Specifically, for each data processing stage, the distributed data processing system calculates, according to a pre-stored performance prediction model using the operation characteristic, the resource allocation parameter, and the service performance as variables, a resource allocation parameter value of the resource allocation parameter when the service performance reaches an optimal value under the condition that the value of the operation characteristic is the operation characteristic value of the data processing stage, to obtain the resource allocation parameter value of the data processing stage.

The resource configuration parameter may be a resource configuration parameter of the distributed data processing system, and the resource configuration parameter may be a parameter that affects processing efficiency of processing corresponding to the data processing request performed by the distributed data processing system, and may include, for example, a data block size, a total number of processes for processing data, a number of processes for processing data per node, and the like. The service performance may be a parameter for characterizing the processing efficiency of the distributed data processing system for performing the processing corresponding to the data processing request, such as the time used for performing the processing corresponding to the data processing request.

In implementation, a performance prediction model using the operation characteristics, the resource configuration parameters and the service performance as variables may be stored in advance in the distributed data processing system. The performance prediction model has no direct correspondence with any specific service, and may be a functional expression using the operation characteristics, the resource configuration parameters and the service performance as variables. After the distributed data processing system determines the operation characteristic value of each data processing stage, for each data processing stage it may use the operation characteristic value of the data processing stage as the value of the operation characteristic and substitute it into the performance prediction model. At this point, only the service performance and the resource configuration parameters remain as unknowns in the performance prediction model, so the distributed data processing system can calculate the value of the resource configuration parameter at which the service performance reaches an optimal value (for example, a minimum value), thereby obtaining the resource configuration parameter value of the data processing stage.
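One simple way to find the resource configuration parameter value at which the predicted service performance is optimal is an exhaustive search over candidate values, sketched below. The search strategy, the toy performance model and the parameter names (`processes`, `block_mb`) are all assumptions for illustration; the text does not prescribe how the optimum is computed.

```python
from itertools import product


def toy_performance_model(complexity, ratio, processes, block_mb):
    """Hypothetical model: predicted processing time for one stage."""
    compute = complexity * 100.0 / processes  # work is split across processes
    startup = 0.5 * processes                 # each process adds a fixed overhead
    output = ratio * 64.0 / block_mb          # assumed output-handling cost
    return compute + startup + output


def best_configuration(model, complexity, ratio, grid):
    """Fix the operation characteristic values, then minimise predicted time."""
    names = list(grid)
    best_config, best_time = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        predicted = model(complexity, ratio, **config)
        if predicted < best_time:
            best_config, best_time = config, predicted
    return best_config, best_time


grid = {"processes": [1, 2, 4, 8, 16, 32], "block_mb": [32, 64, 128]}
config, predicted_time = best_configuration(
    toy_performance_model, complexity=2.0, ratio=0.5, grid=grid
)
print(config, predicted_time)
```

With the operation characteristic values fixed, the model depends only on the resource configuration parameters, so the search returns the configuration with the smallest predicted time.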

The process may specifically be implemented by a processor.

Optionally, for the case that the resource configuration parameter value of each data processing stage is determined, the corresponding processing procedure may be as follows: according to the resource configuration parameter values, the data processing requests are operated, and the actual operation characteristic values and the actual service performance values of each data processing stage are counted; and adjusting the performance prediction model according to the resource configuration parameter value, the actual operation characteristic value and the actual service performance value.

In implementation, after the distributed data processing system determines the resource configuration parameter value corresponding to each data processing stage, it may perform the processing of each data processing stage based on the resource configuration parameter value of that stage. In addition, for the above case where the operation characteristic value is determined according to the data characteristic value corresponding to each data processing stage, a specific processing manner may be as follows: when the distributed data processing system receives a data processing request of a target service, it may first determine the data characteristic value of the to-be-processed data corresponding to the first data processing stage (that is, the data characteristic value of the to-be-processed data specified by the data processing request), then determine the operation characteristic value of the first data processing stage and the resource configuration parameter value of that stage in the manner described above, and then perform the processing of the first data processing stage based on that resource configuration parameter value. Then, in the same manner, the distributed data processing system determines the resource configuration parameter value of the next data processing stage and performs the processing of that stage based on the determined value, until the processing of all the data processing stages is completed.

After the data processing request is run, the distributed data processing system may count the actual running characteristic value and the actual service performance value of each data processing stage, and may further use the resource configuration parameter value, the actual running characteristic value and the actual service performance value of each data processing stage of the data processing request as training data to train the performance prediction model again.
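The stage-by-stage procedure and the collection of training data can be sketched as a loop in which each stage is configured just before it runs and its measured values are recorded afterwards. This is a simplified illustration: wall-clock time stands in for CPU effective time, element counts stand in for data amounts, and `choose_config` is a hypothetical placeholder for the prediction and optimisation steps described above.

```python
import time


def map_stage(values, config):
    """Map stage: maximum of each block of the configured size."""
    size = config["block_size"]
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def reduce_stage(values, config):
    """Reduce stage: maximum of all map outputs."""
    return [max(values)]


def choose_config(stage_name, stage_input):
    """Hypothetical stand-in for predicting characteristics and optimising."""
    return {"block_size": 4}


def run_request(stages, data, choose_config, samples):
    """Run the stages in order; the output of one stage feeds the next."""
    for name, stage_fn in stages:
        config = choose_config(name, data)
        started = time.perf_counter()
        output = stage_fn(data, config)
        elapsed = time.perf_counter() - started
        samples.append({
            "stage": name,
            "config": config,
            "ratio": len(output) / len(data),  # actual output/input ratio
            "complexity": elapsed / len(data),  # time per unit of input
            "elapsed": elapsed,                 # actual service performance
        })
        data = output
    return data


samples = []
result = run_request(
    [("map", map_stage), ("reduce", reduce_stage)],
    [7, 3, 19, 4, 11, 25, 2, 8, 16, 1],
    choose_config,
    samples,
)
print(result, [s["stage"] for s in samples])
```

The records accumulated in `samples` correspond to the resource configuration parameter values, actual operation characteristic values and actual service performance values that can later be used to retrain the performance prediction model.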

The process may specifically be implemented by a processor.

Optionally, the training process of the performance prediction model with the operating characteristics, the resource configuration parameters, and the service performance as variables may be as follows: when a data processing request is received, operating the data processing request according to a first preset resource configuration parameter value of the current data processing request, and counting a first actual operation characteristic value, a first preset resource configuration parameter value and a first actual service performance value of the current data processing request; when the number of the received data processing requests reaches a first preset number, training a preset performance prediction model benchmark formula based on the actual operation characteristic values, the preset resource allocation parameter values and the actual service performance values of the data processing requests of the first preset number to obtain a performance prediction model; and storing the performance prediction model.

In implementation, the distributed data processing system may operate in a model training phase before receiving a data processing request corresponding to a target service. That is, before receiving a data processing request of a target service, whenever the distributed data processing system receives a data processing request, it may execute the processing corresponding to the data processing request based on a first preset resource configuration parameter value of the resource configuration parameter for the current data processing request. The first preset resource configuration parameter value may be a default resource configuration parameter value, or a resource configuration parameter value configured by the user for the current data processing request (in this case, the data processing request may carry the first preset resource configuration parameter value, or a technician may configure the resource configuration parameter value before the data processing request is run). The preset resource configuration parameter values corresponding to different data processing requests may be the same or different, as determined by the user's settings. 
After performing the processing corresponding to the current data processing request, the distributed data processing system may record a first actual operation characteristic value corresponding to the current data processing request (that is, the operation characteristic value actually observed for the current data processing request), the first preset resource configuration parameter value, and a first actual service performance value (which may be the time actually used by the distributed data processing system to run the current data processing request, i.e., the duration from the start of processing to the end of processing, or the accumulated total of the time used by each thread). In this way, each time a data processing request is received, the distributed data processing system records the actual operation characteristic value, the preset resource configuration parameter value and the actual service performance value of that data processing request once, and each recorded set of these three values can be used as a sample for obtaining the performance prediction model that takes the operation characteristic, the resource configuration parameter and the service performance as variables.

Specifically, the distributed data processing system may detect the number of received data processing requests, that is, the number of executed data processing requests. When it detects that the number of received data processing requests reaches a preset number (which may be referred to as a first preset number), the distributed data processing system may analyze the recorded actual operation characteristic values, preset resource configuration parameter values and actual service performance values of the first preset number of data processing requests, and determine a performance prediction model relating the operation characteristic, the resource configuration parameter and the service performance, for example by using a linear regression method. In addition, a performance prediction model benchmark formula may be stored in the distributed data processing system in advance, and when it is detected that the number of received data processing requests reaches the first preset number, the preset performance prediction model benchmark formula may be trained according to the actual operation characteristic values, preset resource configuration parameter values and actual service performance values of the first preset number of data processing requests, where the performance prediction model benchmark formula may include the operation characteristics, the resource configuration parameters, the service performance and some parameters to be trained. 
In the training process, the actual operation characteristic values, the preset resource configuration parameter values and the actual service performance values of the first preset number of data processing requests can be respectively used as values of the operation characteristics, the resource configuration parameters and the service performance, so that the parameter values of the parameters to be trained are obtained, and further, a performance prediction model with the operation characteristics, the resource configuration parameters and the service performance as variables is obtained. The performance prediction model benchmark formula pre-stored in the distributed data processing system may have a preset function form, for example a linear function form, or a curve or parabolic function form, which may be preset by a technician according to the functional relationship that the operation characteristics, the resource configuration parameters and the service performance are likely to satisfy. Alternatively, the technician may not preset a function form for the performance prediction model and may instead use, for example, a neural network model, in which case the parameters to be trained are the parameters of the neural network model. That is, the distributed data processing system uses the actual operation characteristic values, preset resource configuration parameter values and actually detected service performance values corresponding to the first preset number of data processing requests as training data of the neural network model to obtain the parameters of the neural network model, and thus obtains the performance prediction model with the operation characteristic, the resource configuration parameter and the service performance as variables. 
After obtaining the performance prediction model with the operation characteristics, the resource configuration parameters, and the service performance as variables, the distributed data processing system may store the performance prediction model so as to determine resource configuration parameter values corresponding to the data processing request when subsequently receiving the data processing request.
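The training phase with a linear benchmark formula can be sketched as follows: samples are recorded per request, and once the first preset number is reached the coefficients are fitted by ordinary least squares. The class name, the choice of a single resource configuration parameter (`processes`) and the synthetic sample values are hypothetical; the least-squares solver is written out in plain Python so the sketch has no external dependencies.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_model(samples):
    """Least-squares fit via the normal equations; the intercept is the last weight."""
    rows = [list(features) + [1.0] for features, _ in samples]
    targets = [target for _, target in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


class PerformanceModelTrainer:
    def __init__(self, first_preset_number):
        self.first_preset_number = first_preset_number
        self.samples = []
        self.weights = None  # stays None until the model has been trained

    def record(self, complexity, ratio, processes, elapsed):
        """Record one request; train once enough requests have been seen."""
        self.samples.append(((complexity, ratio, processes), elapsed))
        if self.weights is None and len(self.samples) >= self.first_preset_number:
            self.weights = fit_linear_model(self.samples)

    def predict(self, complexity, ratio, processes):
        features = (complexity, ratio, processes, 1.0)
        return sum(w * f for w, f in zip(self.weights, features))


trainer = PerformanceModelTrainer(first_preset_number=6)
observed = [(1.0, 0.5, 1), (2.0, 0.25, 2), (3.0, 0.75, 4),
            (1.5, 0.1, 8), (2.5, 0.9, 2), (0.5, 0.4, 4)]
for c, r, p in observed:
    # Synthetic "actual" time generated from a known linear relationship.
    trainer.record(c, r, p, elapsed=3.0 * c + 2.0 * r - 0.5 * p + 10.0)
print(trainer.weights)
```

Because the synthetic times are generated from an exactly linear relationship, the fitted weights should recover its coefficients up to rounding; real measurements would only be approximated.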

The process may specifically be implemented by a processor.

Optionally, after the performance prediction model is obtained, the accuracy of the performance prediction model may be verified, and when the accuracy of the performance prediction model reaches a preset accuracy threshold, the performance prediction model is stored, and accordingly, the processing procedure may be as follows: when a data processing request is received, operating the data processing request based on a second preset resource configuration parameter value of the current data processing request, and counting a second actual operation characteristic value and a second actual service performance value of the current data processing request; determining a third service performance value of the current data processing request according to the performance prediction model and the second actual operation characteristic value; determining the accuracy of the performance prediction model according to the difference value between the second actual service performance value and the determined third service performance value, and obtaining the accuracy of the performance prediction model under the current data processing request; when the number of the received data processing requests reaches a second preset number after the performance prediction model is obtained, calculating the average accuracy of the performance prediction model according to the accuracy of the performance prediction model under the second preset number of data processing requests; and if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.

In implementation, after obtaining the performance prediction model with the operation characteristics, the resource configuration parameters and the service performance as variables, the distributed data processing system may operate in a model verification stage. That is, after the performance prediction model is obtained, when a data processing request is received, the distributed data processing system may run the current data processing request in the same way as in the model training stage, that is, based on a second preset resource configuration parameter value of the resource configuration parameter for the current data processing request. After the processing is finished, the distributed data processing system may record a second actual operation characteristic value corresponding to the current data processing request (that is, the operation characteristic value actually observed when running the current data processing request) and a second actual service performance value corresponding to the actually detected service performance. The second actual operation characteristic value can then be used as the value of the operation characteristic and the second preset resource configuration parameter value as the value of the resource configuration parameter; both are substituted into the performance prediction model taking the operation characteristic, the resource configuration parameter and the service performance as variables, and a third service performance value is calculated. 
After obtaining the difference between the second actual service performance value and the third service performance value, the difference can be used to determine the accuracy of the performance prediction model under the current data processing request.

When the number of the received data processing requests after the performance prediction model is obtained reaches a second preset number, the average accuracy of the accuracy corresponding to the performance prediction model under the second preset number of data processing requests can be calculated; after the average accuracy is obtained, the magnitude relation between the average accuracy and a preset accuracy threshold can be judged, and if the average accuracy reaches the preset accuracy threshold, a performance prediction model with the operation characteristics, the resource configuration parameters and the service performance as variables can be stored. If the average accuracy does not reach the preset accuracy threshold, the processing corresponding to the received data processing request can be executed continuously according to the mode of executing the processing corresponding to the data processing request in the model training stage, the actual operation characteristic value, the preset resource configuration parameter value and the actual service performance value corresponding to each data processing request can be recorded, the training is carried out again according to the actual operation characteristic values, the preset resource configuration parameter values and the actual service performance values of all the data processing requests, a performance prediction model with the operation characteristic, the resource configuration parameter and the service performance as variables is obtained, and the accuracy of the performance prediction model is verified according to the mode until the accuracy of the obtained performance prediction model reaches the preset accuracy threshold.
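The verification step can be sketched as below. The text only states that the accuracy is determined from the difference between the actual and predicted service performance values; the relative-error formula used here, the function names and the sample numbers are assumptions for illustration.

```python
def request_accuracy(actual, predicted):
    """Assumed accuracy measure: one minus the relative prediction error."""
    return max(0.0, 1.0 - abs(actual - predicted) / actual)


def verify_model(pairs, threshold):
    """Average the per-request accuracies and compare with the preset threshold."""
    accuracies = [request_accuracy(actual, predicted) for actual, predicted in pairs]
    average = sum(accuracies) / len(accuracies)
    return average, average >= threshold


# (second actual service performance value, third predicted service performance value)
pairs = [(10.0, 9.0), (20.0, 22.0), (40.0, 40.0)]
average_accuracy, store_model = verify_model(pairs, threshold=0.9)
print(average_accuracy, store_model)
```

If `store_model` is false, the procedure described above returns to the training stage, collects further samples and retrains before verifying again.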

The process may specifically be implemented by a processor.

In addition, the performance prediction model described above, which uses the operation characteristics, the resource allocation parameters, and the service performance as variables, may also be obtained from historical data of each data processing stage in which the data processing request is operated in a historical period. Specifically, the distributed data processing system may operate in a model training phase before receiving a data processing request corresponding to a target service. That is, before receiving the data processing request of the target service, each time the distributed data processing system receives the data processing request, the distributed data processing system may perform processing of each data processing stage of the data processing request based on a third preset resource configuration parameter value corresponding to the resource configuration parameter under the current data processing request, and correspondingly record an actual operation characteristic value, the third preset resource configuration parameter value, and a service performance value corresponding to the actual service performance corresponding to each data processing stage of the current data processing request. When the number of the received data processing requests reaches a third preset number, training a preset performance prediction model benchmark formula based on an actual operation characteristic value, a preset resource configuration parameter value and an actual service performance value which respectively correspond to each data processing stage of the data processing requests of the third preset number to obtain a performance prediction model taking the operation characteristic, the resource configuration parameter and the service performance as variables, and further storing the obtained performance prediction model taking the operation characteristic, the resource configuration parameter and the service performance as variables. 
That is, in the training process, the actual operation characteristic value, the preset resource allocation parameter value and the actual service performance value of each data processing stage of the third preset number of data processing requests are all used as training data of the training performance prediction model.

In addition, the operation characteristic prediction model corresponding to each service in table 1, which takes the data characteristic and the operation characteristic as variables, may be obtained from the historical values of the data characteristic and the operation characteristic recorded when data processing requests of that service were run in the past.

Fig. 4 is a block diagram of an apparatus for data processing according to an embodiment of the present invention. The apparatus for data processing may be implemented as part or all of a device in software, hardware, or a combination of both. The apparatus for data processing provided in the embodiment of the present invention may implement the process described in fig. 3 in the embodiment of the present invention, where the apparatus for data processing includes:

a receiving module 410, configured to receive a data processing request of a target service, where the to-be-processed data of the data processing request includes a plurality of key-value pairs; the receiving module may specifically implement the receiving function in step 301 above and other implicit steps.

a determining module 420, configured to determine an operation characteristic value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and the distribution situation of the keys of the data to be processed, where the operation characteristic includes the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and to determine the resource configuration parameter value of each data processing stage according to the operation characteristic value; the determining module may specifically implement the determining function in steps 302 and 303 above and other implicit steps.

Optionally, the determining module 420 is configured to:

and determining the resource configuration parameter value of each data processing stage according to a performance prediction model and the operation characteristic value, wherein the performance prediction model is obtained by training historical data.

Optionally, as shown in fig. 5, the apparatus further includes:

the operation module 430, configured to, after determining the resource configuration parameter value of each data processing stage, operate the data processing request according to the resource configuration parameter value;

a statistic module 440, configured to count the actual operation characteristic value and the actual service performance value at each data processing stage;

an adjusting module 450, configured to adjust the performance prediction model according to the resource configuration parameter value, the actual operation characteristic value, and the actual service performance value.
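The feedback loop formed by the three modules above can be sketched as a single online update of the model. The sketch assumes a linear model and a least-mean-squares style update; the embodiment does not state which model form or update rule is used, and the feature layout and numbers are hypothetical.

```python
def predict(weights, features):
    """Linear prediction of the service performance value (assumed model form)."""
    return sum(w * x for w, x in zip(weights, features))


def adjust(weights, features, actual_performance, learning_rate=0.01):
    """Move the weights one small step toward the observed performance."""
    error = actual_performance - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]


# Features: resource configuration parameter values followed by the
# actual operation characteristic values collected during the run.
config_values = [4.0]                # hypothetical number of parallel tasks
actual_characteristics = [2.0, 0.5]  # complexity, output/input ratio
features = config_values + actual_characteristics

weights = [0.0, 0.0, 0.0]
actual_performance = 12.0
new_weights = adjust(weights, features, actual_performance)
```

After the update, the prediction for the same run lies closer to the measured value than before, which is the intended effect of adjusting the model with each completed request.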

Optionally, as shown in fig. 6, the apparatus further includes:

a recording module 460, configured to, when a data processing request is received, run the data processing request according to a first preset resource configuration parameter value of the current data processing request, and record a first actual operation characteristic value, the first preset resource configuration parameter value, and a first actual service performance value of the current data processing request;

a training module 470, configured to, when the number of received data processing requests reaches a first preset number, train a preset reference formula of the performance prediction model based on the actual operation characteristic values, the preset resource configuration parameter values, and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

the storage module 480 is configured to store the performance prediction model.
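The record-then-train flow can be sketched as follows. The sketch assumes a one-variable linear reference formula, performance = a * (characteristic / tasks) + b, fitted by ordinary least squares; the formula, the class name `ModelTrainer`, and the sample numbers are assumptions for illustration only.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


class ModelTrainer:
    """Records runs under preset configurations and trains once enough exist."""

    def __init__(self, first_preset_number):
        self.first_preset_number = first_preset_number
        self.records = []
        self.model = None  # (a, b) once trained

    def record(self, characteristic, tasks, performance):
        # One record per request: characteristic value, preset config, performance.
        self.records.append((characteristic / tasks, performance))
        if self.model is None and len(self.records) >= self.first_preset_number:
            self.model = fit_line(self.records)  # kept in memory as the stored model


trainer = ModelTrainer(first_preset_number=4)
for characteristic, tasks, performance in [
    (8.0, 2, 9.0), (8.0, 4, 5.0), (16.0, 2, 17.0), (16.0, 8, 5.0),
]:
    trainer.record(characteristic, tasks, performance)
a, b = trainer.model
```

Training is deliberately deferred until the first preset number of requests has been recorded, so the fit never runs on too few samples.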

Optionally, as shown in fig. 7, the apparatus further includes:

a verification module 490, configured to: when a data processing request is received, run the data processing request based on a second preset resource configuration parameter value of the current data processing request, and record a second actual operation characteristic value and a second actual service performance value of the current data processing request;

determine a third service performance value of the current data processing request according to the performance prediction model and the second actual operation characteristic value;

determine, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model under the current data processing request;

when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate the average accuracy of the performance prediction model according to the accuracy of the performance prediction model under each of the second preset number of data processing requests;

the storage module 480 is configured to store the performance prediction model if the average accuracy reaches a preset accuracy threshold.
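The verification step can be sketched as below. The accuracy measure used here, 1 minus the relative error and floored at 0, is an assumption: the embodiment only states that accuracy is derived from the difference between the actual and the predicted service performance values. The model, the runs, and the threshold are hypothetical, and actual performance values are assumed to be non-zero.

```python
def accuracy(actual, predicted):
    """Assumed accuracy measure: 1 minus the relative error, floored at 0."""
    return max(0.0, 1.0 - abs(actual - predicted) / abs(actual))


def verify(model, runs, threshold):
    """Average the per-request accuracy and decide whether to store the model."""
    scores = [accuracy(actual, model(feature)) for feature, actual in runs]
    average = sum(scores) / len(scores)
    return average, average >= threshold


def model(feature):
    """Hypothetical trained model: predicted performance = 2 * feature + 1."""
    return 2.0 * feature + 1.0


# Each run pairs the second actual operation characteristic value with the
# second actual service performance value; the model supplies the third value.
runs = [(4.0, 10.0), (2.0, 5.0), (8.0, 20.0)]
average, should_store = verify(model, runs, threshold=0.9)
```

Here the three per-request accuracies are 0.9, 1.0 and 0.85, so the average is slightly above the assumed threshold of 0.9 and the model would be stored.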

It should be noted that the determining module 420, the operation module 430, the statistics module 440, the adjusting module 450, the recording module 460, the training module 470, the storage module 480, and the verification module 490 may be implemented by a processor, by a processor together with a memory, or by a processor executing program instructions stored in a memory, and the receiving module 410 may be implemented by a transceiver.

It should be noted that, when the apparatus for data processing provided in the above embodiment performs data processing, the division into the functional modules described above is merely an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the server may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for data processing and the method for data processing provided by the above embodiments belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiments and is not described herein again.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only an example of the present invention and should not be taken as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of data processing, the method comprising:

receiving a data processing request of a target service, wherein to-be-processed data of the data processing request comprises a plurality of key value pairs;

determining an operation characteristic value of the data processing request according to the identification of the target service, the data volume of the data to be processed and the distribution condition of keys of the data to be processed, wherein the operation characteristic value comprises the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage;

and determining the resource configuration parameter value of each data processing stage according to the operation characteristic value.
2. The method of claim 1, wherein the determining the resource configuration parameter value of each data processing stage according to the operation characteristic value comprises:

determining the resource configuration parameter value of each data processing stage according to a performance prediction model and the operation characteristic value, wherein the performance prediction model is obtained by training on historical data.
3. The method of claim 2, wherein after determining the resource configuration parameter value for each data processing stage, the method further comprises:

running the data processing request according to the resource configuration parameter value;

counting the actual operation characteristic value and the actual service performance value of each data processing stage;

and adjusting the performance prediction model according to the resource configuration parameter value, the actual operation characteristic value and the actual service performance value.
4. The method of claim 2, further comprising:

when a data processing request is received, running the data processing request according to a first preset resource configuration parameter value of the current data processing request, and recording a first actual operation characteristic value, the first preset resource configuration parameter value and a first actual service performance value of the current data processing request;

when the number of received data processing requests reaches a first preset number, training a preset reference formula of the performance prediction model based on the actual operation characteristic values, the preset resource configuration parameter values and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

and storing the performance prediction model.
5. The method of claim 4, wherein storing the performance prediction model comprises:

when a data processing request is received, running the data processing request based on a second preset resource configuration parameter value of the current data processing request, and recording a second actual operation characteristic value and a second actual service performance value of the current data processing request;

determining a third service performance value of the current data processing request according to the performance prediction model and the second actual operation characteristic value;

determining the accuracy of the performance prediction model according to the difference between the second actual service performance value and the determined third service performance value, so as to obtain the accuracy of the performance prediction model under the current data processing request;

when the number of the received data processing requests reaches a second preset number after the performance prediction model is obtained, calculating the average accuracy of the performance prediction model according to the accuracy of the performance prediction model under the second preset number of data processing requests;

and if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.
6. A server, comprising a transceiver and a processor, wherein:

the transceiver is configured to receive a data processing request of a target service, wherein to-be-processed data of the data processing request comprises a plurality of key value pairs;

the processor is configured to determine an operation characteristic value of the data processing request according to the identification of the target service, the data volume of the data to be processed and the distribution condition of keys of the data to be processed, wherein the operation characteristic value comprises the complexity of each data processing stage of the data processing request and the ratio of data output to data input of each data processing stage; and to determine the resource configuration parameter value of each data processing stage according to the operation characteristic value.
7. The server of claim 6, wherein the processor is configured to:

determine the resource configuration parameter value of each data processing stage according to a performance prediction model and the operation characteristic value, wherein the performance prediction model is obtained by training on historical data.
8. The server of claim 7, wherein the processor is further configured to:

after the resource configuration parameter value of each data processing stage is determined, run the data processing request according to the resource configuration parameter value;

counting the actual operation characteristic value and the actual service performance value of each data processing stage;

and adjusting the performance prediction model according to the resource configuration parameter value, the actual operation characteristic value and the actual service performance value.
9. The server of claim 7, wherein the processor is further configured to:

when a data processing request is received, run the data processing request according to a first preset resource configuration parameter value of the current data processing request, and record a first actual operation characteristic value, the first preset resource configuration parameter value and a first actual service performance value of the current data processing request;

when the number of received data processing requests reaches a first preset number, train a preset reference formula of the performance prediction model based on the actual operation characteristic values, the preset resource configuration parameter values and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

and storing the performance prediction model.
10. The server of claim 9, wherein the processor is configured to:

when a data processing request is received, run the data processing request based on a second preset resource configuration parameter value of the current data processing request, and record a second actual operation characteristic value and a second actual service performance value of the current data processing request;

determining a third service performance value of the current data processing request according to the performance prediction model and the second actual operation characteristic value;

determining the accuracy of the performance prediction model according to the difference between the second actual service performance value and the determined third service performance value, so as to obtain the accuracy of the performance prediction model under the current data processing request;

when the number of the received data processing requests reaches a second preset number after the performance prediction model is obtained, calculating the average accuracy of the performance prediction model according to the accuracy of the performance prediction model under the second preset number of data processing requests;

and if the average accuracy reaches a preset accuracy threshold, storing the performance prediction model.
11. An apparatus for data processing, the apparatus comprising:

a receiving module, configured to receive a data processing request of a target service, wherein to-be-processed data of the data processing request comprises a plurality of key value pairs;

a determining module, configured to determine an operation characteristic value of the data processing request according to the identifier of the target service, the data amount of the data to be processed, and a distribution condition of keys of the data to be processed, where the operation characteristic value includes complexity of each data processing stage of the data processing request, and a ratio of data output to data input of each data processing stage; and to determine the resource configuration parameter value of each data processing stage according to the operation characteristic value.
12. The apparatus of claim 11, wherein the determining module is configured to:

determine the resource configuration parameter value of each data processing stage according to a performance prediction model and the operation characteristic value, wherein the performance prediction model is obtained by training on historical data.
13. The apparatus of claim 12, further comprising:

an operation module, configured to run the data processing request according to the resource configuration parameter value after the resource configuration parameter value of each data processing stage is determined;

a statistics module, configured to collect the actual operation characteristic value and the actual service performance value of each data processing stage;

and the adjusting module is used for adjusting the performance prediction model according to the resource configuration parameter value, the actual operation characteristic value and the actual service performance value.
14. The apparatus of claim 12, further comprising:

a recording module, configured to, when a data processing request is received, run the data processing request according to a first preset resource configuration parameter value of the current data processing request, and record a first actual operation characteristic value, the first preset resource configuration parameter value and a first actual service performance value of the current data processing request;

a training module, configured to, when the number of received data processing requests reaches a first preset number, train a preset reference formula of the performance prediction model based on the actual operation characteristic values, the preset resource configuration parameter values and the actual service performance values of the first preset number of data processing requests, to obtain the performance prediction model;

and the storage module is used for storing the performance prediction model.
15. The apparatus of claim 14, further comprising:

a verification module, configured to: when a data processing request is received, run the data processing request based on a second preset resource configuration parameter value of the current data processing request, and record a second actual operation characteristic value and a second actual service performance value of the current data processing request;

determine a third service performance value of the current data processing request according to the performance prediction model and the second actual operation characteristic value;

determine, according to the difference between the second actual service performance value and the determined third service performance value, the accuracy of the performance prediction model under the current data processing request;

when the number of data processing requests received after the performance prediction model is obtained reaches a second preset number, calculate the average accuracy of the performance prediction model according to the accuracy of the performance prediction model under each of the second preset number of data processing requests;

wherein the storage module is configured to store the performance prediction model if the average accuracy reaches a preset accuracy threshold.