WO2016173351A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus Download PDF

Info

Publication number
WO2016173351A1
Authority
WO
WIPO (PCT)
Prior art keywords
subtask
computing framework
task
candidate
framework
Prior art date
Application number
PCT/CN2016/077379
Other languages
English (en)
French (fr)
Inventor
谭卫国
邵刚
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP16785782.0A priority Critical patent/EP3267310B1/en
Publication of WO2016173351A1 publication Critical patent/WO2016173351A1/zh
Priority to US15/728,879 priority patent/US10606654B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5055 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering software capabilities, i.e. software resources associated or available to the machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/501 Performance criteria

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
  • the trend of big data processing is to process big data through a data processing platform that integrates multiple computing frameworks, that is, to accommodate multiple computing frameworks through a resource management system in a cluster of computers, such as Mesos and YARN.
  • the embodiment of the invention provides a data processing method and device, which are used to solve the problem that, in the prior art, a resource management system integrating multiple computing frameworks does not select a computing framework based on running time and resource consumption when processing data tasks, resulting in low data processing efficiency and reduced system performance.
  • a data processing method includes:
  • the target computing framework for executing the sub-task is selected;
  • the corresponding subtask is executed based on the selected target computing framework for each subtask in the subtask set.
  • the task request further carries input data of the task
  • the input data for executing each subtask is determined according to the input data of the task carried in the task request.
  • before receiving the task request, the method further includes:
  • the application programming interfaces (APIs) that execute the same task type in all computing frameworks performing that task type are encapsulated in a preset programming language to form a unified API;
  • a computing framework having the ability to perform the subtask is determined as a candidate computing framework in all computing frameworks of the system configuration, including:
  • obtaining a prediction model corresponding to the candidate computing framework includes:
  • taking the running time and the resource consumption as the target features, and training on the other features in the training sample set to obtain the prediction model corresponding to the candidate computing framework.
  • according to the predicted running time and resource consumption of each candidate computing framework when executing the subtask, the target computing framework for executing the subtask is selected, including:
  • selecting the candidate computing frameworks whose predicted resource consumption is less than the available resources of the system as first candidate computing frameworks;
  • the first candidate computing framework with the smallest predicted running time is selected as the target computing framework.
  • in a fifth possible implementation manner, after executing the corresponding subtasks based on the determined target computing framework for each subtask in the subtask set, the method further includes:
  • taking the features generated when the subtask is executed in the target computing framework of the subtask as a new training sample.
  • a data processing apparatus includes:
  • a receiving unit configured to receive a task request, where the task request carries a task submitted by a user
  • a generating unit configured to generate a subtask set including at least one subtask according to the task in the task request
  • a determining unit for determining input data for executing each subtask
  • a processing unit for performing the following operations for each subtask in the subtask set:
  • the target computing framework for executing the sub-task is selected;
  • a running unit configured to execute the corresponding subtask based on the selected target computing framework for each subtask in the subtask set.
  • the task request received by the receiving unit further carries input data of the task
  • the determining unit is configured to:
  • the input data for executing each subtask is determined according to the input data of the task carried in the task request.
  • a configuration unit configured to, before the task request is received, encapsulate the application programming interfaces (APIs) that execute the same task type in all computing frameworks configured in the system in a preset programming language to form a unified API;
  • the processing unit, when determining, among all computing frameworks configured in the system, a computing framework having the capability to execute the subtask as a candidate computing framework, is configured to:
  • when obtaining the prediction model corresponding to the candidate computing framework, the processing unit is configured to:
  • take the running time and the resource consumption as the target features, and train on the other features in the training sample set to obtain the prediction model corresponding to the candidate computing framework.
  • when selecting the target computing framework for executing the subtask, the processing unit is configured to:
  • select the candidate computing frameworks whose predicted resource consumption is less than the available resources of the system as first candidate computing frameworks;
  • select the first candidate computing framework with the smallest predicted running time as the target computing framework.
  • the operating unit is further configured to:
  • the target computing framework is selected among multiple computing frameworks, based on running time and resource consumption, to execute each subtask, thereby improving data processing efficiency and system performance.
  • FIG. 1 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure
  • FIG. 2 is a specific flowchart of a data processing method according to an embodiment of the present invention.
  • FIG. 3 is a schematic exploded view of a task according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
  • a computing framework having the capability to execute the subtask is determined as a candidate computing framework among all computing frameworks configured in the system, where the number of candidate computing frameworks is greater than or equal to 2; according to the input data of the subtask and the prediction model corresponding to each candidate computing framework, the running time and resource consumption of each candidate computing framework when executing the subtask are separately predicted;
  • according to the predicted running time and resource consumption of each candidate computing framework when executing the subtask, the target computing framework for executing the subtask is selected from the candidate computing frameworks; finally, each subtask is executed based on the determined target computing framework for that subtask. In this way, the target computing framework for executing each subtask is selected among multiple computing frameworks based on running time and resource consumption, which improves data processing efficiency and system performance.
  • Embodiments of the present invention provide a data processing method, apparatus, and terminal device, which are applied to a resource management system that integrates multiple computing frameworks. Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
  • the embodiment of the invention further provides a terminal device, which is a computer or the like that combines multiple computing frameworks.
  • the terminal device 100 includes a transceiver 101, a processor 102, a bus 103, and a memory 104, wherein:
  • the transceiver 101, the processor 102, and the memory 104 are connected to each other through a bus 103.
  • the bus 103 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 1, but it does not mean that there is only one bus or one type of bus.
  • the transceiver 101 is configured to communicate with other connected devices, such as receiving task requests and the like.
  • the processor 102 is configured to implement the data processing method shown in FIG. 2 of the embodiment of the present invention, including:
  • the target computing framework for executing the sub-task is selected;
  • the corresponding subtask is executed based on the selected target computing framework for each subtask in the subtask set.
  • the task request further carries input data of the task
  • the input data for executing each subtask is determined based on the input data of the task carried in the task request.
  • before receiving the task request, the method further includes:
  • the application programming interfaces (APIs) that execute the same task type in all computing frameworks performing that task type are encapsulated in a preset programming language to form a unified API;
  • a computing framework having the ability to perform the subtask is determined as a candidate computing framework in all computing frameworks of the system configuration, including:
  • the determined calculation framework is used as a candidate calculation framework.
  • obtaining a prediction model corresponding to the candidate computing framework including:
  • the running time and the resource consumption are taken as the target features, and the other features except the running time and the resource consumption are trained in the training sample set, and the prediction model corresponding to the candidate computing framework is obtained.
  • the target computing framework for executing the subtask is selected, including:
  • selecting the candidate computing frameworks whose predicted resource consumption is less than the available resources of the system as first candidate computing frameworks;
  • selecting the first candidate computing framework with the smallest predicted running time as the target computing framework.
  • the method further includes:
  • the features generated when the subtask is executed in the target computing framework of the subtask are taken as a new training sample, and the new training sample is added to the training sample set preset for the capability of the target computing framework to execute the subtask.
  • the terminal device 100 further includes a memory 104 for storing programs, a prediction model of each computing framework, and a training sample set corresponding to each prediction model obtained during training.
  • the program can include program code, the program code including computer operating instructions.
  • the memory 104 may include a random access memory (RAM), and may also include a non-volatile memory such as at least one disk storage.
  • the processor 102 executes an application stored in the memory 104 to implement the above data processing method.
  • a specific processing procedure of a data processing method provided by an embodiment of the present invention includes:
  • Step 201 Receive a task request, where the task request carries a task submitted by a user.
  • the task submitted by the user is a task of processing a large amount of data, such as filtering out the data meeting a set condition from a data table in a database.
  • the task request may also carry the input data of the task submitted by the user.
  • the method further includes:
  • an application program interface executing the same task type in all computing frameworks having the same task type is encapsulated by a preset programming language to form a unified API.
  • each computing framework provides a corresponding API for each function that can be implemented.
  • developers need to master multiple programming languages, so the programming threshold is high and the development efficiency of the system is low.
  • an API with the same function in different computing frameworks is encapsulated into a unified API. Therefore, when implementing the function, only the corresponding unified API needs to be invoked, without specifying a particular computing framework among the different computing frameworks that can implement this function.
  • the unified API can be implemented in any programming language.
  • the following is an example of JAVA as a unified API programming language.
  • when the API of a computing framework is encapsulated into a unified API, the following cases apply:
  • if the programming language of the API provided by the computing framework is JAVA, the parameters need to be reorganized according to the requirements of that API before it can be encapsulated into the unified API;
  • if the programming language of the API provided by the computing framework is a programming language other than JAVA, the API needs to be invoked through JAVA's cross-programming-language mechanisms to be encapsulated into the unified API.
  • JAVA's cross-programming language specification is prior art.
  • an API written in the Scala programming language runs on the JAVA Virtual Machine (JVM), so the JAVA programming language can call it directly; an API written in C/C++ can be called from JAVA through the JAVA Native Interface (JNI); and an API written in Python can be called from JAVA through Jython (jython.jar).
  • the APIs that execute the same task type in different computing frameworks are encapsulated in a preset programming language to generate a unified API, which shields the differences between programming languages, greatly lowers the developer's programming threshold, and improves the flexibility and adaptability of each computing framework.
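The unified-API idea above can be sketched minimally as a dispatch table: one entry point per task type, behind which the framework-specific implementations are registered. This is an illustration, not the patent's implementation (which encapsulates real framework APIs in JAVA); all names here (`UnifiedAPI`, `run_sort_hadoop`, `run_sort_spark`) are hypothetical.

```python
def run_sort_hadoop(rows):
    # stand-in for a Hadoop-specific sort API (hypothetical)
    return sorted(rows)

def run_sort_spark(rows):
    # stand-in for a Spark-specific sort API (hypothetical)
    return sorted(rows)

class UnifiedAPI:
    """Maps a task type to the framework-specific implementations behind it."""
    def __init__(self):
        self._impls = {}  # task_type -> {framework_name: callable}

    def register(self, task_type, framework, fn):
        self._impls.setdefault(task_type, {})[framework] = fn

    def call(self, task_type, framework, *args):
        # The caller names only the task type; the scheduler, not the
        # developer, decides which framework's implementation runs.
        return self._impls[task_type][framework](*args)

api = UnifiedAPI()
api.register("sort", "Hadoop", run_sort_hadoop)
api.register("sort", "Spark", run_sort_spark)
print(api.call("sort", "Spark", [3, 1, 2]))  # [1, 2, 3]
```

The point of the indirection is that callers depend only on the task type, so the target framework can be chosen per subtask at scheduling time.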
  • the task submitted by the user and carried in the received task request may be described in the same programming language as the unified API.
  • Step 202 Generate a subtask set including at least one subtask according to the task in the task request.
  • the task submitted by the user is formed by combining multiple subtasks, and the execution order of the subtasks in the task forms a directed acyclic graph.
  • for example, the task T includes five subtasks: T1, T2, T3, T4, and T5. Task decomposition of T yields T1->T2->T3->T4->T5 or T1->T3->T2->T4->T5, where T2 and T3 can be executed in parallel and the other steps are executed sequentially.
  • the task is decomposed, and the generation of the subtask set is a prior art, which is not described in detail in the embodiment of the present invention.
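The example decomposition can be sketched with Python's standard `graphlib`, assuming the dependency structure suggested by the example (T1 before T2 and T3, which are independent of each other, then T4, then T5). This is purely an illustration of the DAG ordering, not the patent's decomposition algorithm.

```python
from graphlib import TopologicalSorter

# predecessors of each subtask, per the example: T1 -> {T2, T3} -> T4 -> T5
deps = {"T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}, "T5": {"T4"}}

ts = TopologicalSorter(deps)
ts.prepare()
order = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # all subtasks that may run in parallel now
    order.append(ready)
    ts.done(*ready)
print(order)  # [['T1'], ['T2', 'T3'], ['T4'], ['T5']]
```

The second batch contains both T2 and T3, matching the observation that they can be executed in parallel while the other steps are sequential.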
  • Step 203 Determine input data for executing each subtask.
  • the input data of each subtask is directly determined according to the input data of the task, or the corresponding input data is directly determined according to each subtask.
  • Step 204: For each subtask in the subtask set, determine, among all computing frameworks configured in the system, the computing frameworks having the capability to execute the subtask as candidate computing frameworks, where the number of candidate computing frameworks is greater than or equal to 2; according to the input data of the subtask and the prediction model corresponding to each candidate computing framework, separately predict the running time and resource consumption of each candidate computing framework when executing the subtask; and, according to the predicted running time and resource consumption, select from the candidate computing frameworks the target computing framework for executing the subtask.
  • a computing framework having the capability to execute the subtask is determined as a candidate computing framework in all computing frameworks of the system configuration, including:
  • the prediction model may be preset by the user, or may be obtained through machine learning.
  • the predictive model corresponding to a candidate computing framework having the capability of executing the subtask is obtained through machine learning, including:
  • Reading a preset training sample set which is preset for the capability of the candidate computing framework to execute the subtask
  • the running time and resource consumption are taken as the target features, and other features except the running time and resource consumption in the training sample set are trained to obtain a corresponding prediction model.
  • the resource consumption may be a memory usage, a CPU usage, an I/O occupation, and the like, which is not limited in this embodiment of the present invention.
  • Running time = 0.83 × (number of rows of input data) + 0.24 × (number of columns of input data) + 0.1
  • Memory usage = 0.24 × (number of rows of input data) + 0.14 × (number of columns of input data) + 0.15
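These two linear models can be evaluated directly. A minimal sketch, with coefficients taken from the two formulas above (the function names are illustrative):

```python
def predict_runtime(rows, cols):
    # Running time = 0.83 * rows + 0.24 * cols + 0.1
    return 0.83 * rows + 0.24 * cols + 0.1

def predict_memory(rows, cols):
    # Memory usage = 0.24 * rows + 0.14 * cols + 0.15
    return 0.24 * rows + 0.14 * cols + 0.15

print(predict_runtime(100, 10))  # about 85.5
print(predict_memory(100, 10))   # about 25.55
```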
  • the target computing framework for executing the subtask is selected, including:
  • selecting the candidate computing frameworks whose predicted resource consumption is less than the available resources of the system as first candidate computing frameworks;
  • selecting the first candidate computing framework with the smallest predicted running time as the target computing framework.
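The two-step selection can be sketched as follows. This is a minimal illustration that treats memory as the only resource compared against availability; the function and variable names are hypothetical.

```python
def select_target(predictions, available_memory):
    # predictions: {framework: (predicted_runtime, predicted_memory)}
    # Step 1: keep frameworks whose predicted consumption fits the system
    feasible = {fw: (t, m) for fw, (t, m) in predictions.items()
                if m < available_memory}  # "first candidate" frameworks
    if not feasible:
        return None  # no framework fits within available resources
    # Step 2: among those, pick the one with the smallest predicted runtime
    return min(feasible, key=lambda fw: feasible[fw][0])

preds = {"Hadoop": (2541.6, 3812.4), "Spark": (254.2, 13978.8)}
print(select_target(preds, 15000))  # Spark
```

With less available memory (say 5000 MB), Spark would be filtered out in step 1 and Hadoop would be chosen despite its longer predicted runtime.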
  • Step 205 Perform a corresponding sub-task based on the target computing framework of each sub-task in the selected sub-task set.
  • the target computing framework of each subtask executes the subtask according to the unified API corresponding to the task type of the subtask.
  • the result is the result of the task submitted by the user.
  • in step 205, after the corresponding subtasks are executed based on the selected target computing framework for each subtask in the subtask set, the method further includes:
  • taking the features generated when the subtask is executed in the target computing framework of the subtask as a new training sample
  • the new training sample is added to a training sample set that is preset for the ability of the target computing framework to perform the subtask.
  • the features generated when a computing framework actually executes a task are used as new training samples to continuously supplement the training sample set, and machine learning is performed on the supplemented training sample set to generate a new prediction model, so that the prediction model becomes increasingly accurate.
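This feedback loop can be sketched as follows: each executed subtask contributes a new (features, observed runtime) sample, and the model is refit on the grown sample set. The sketch uses a one-feature least-squares fit in pure Python; the sample values and names are made up for illustration and are not from the patent.

```python
# (input rows, observed runtime) samples; values are illustrative only
samples = [(10, 8.4), (20, 16.7), (30, 25.1)]

def fit(samples):
    # closed-form least squares for y = a*x + b
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b

a, b = fit(samples)              # initial prediction model
samples.append((40, 33.5))       # new training sample from a real execution
a2, b2 = fit(samples)            # refit on the supplemented sample set
print(a2, b2)
```

In the patent's scheme the target features are running time and resource consumption, and the refit model replaces the old prediction model for that framework/subtask capability.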
  • the target computing framework is selected among multiple computing frameworks, based on running time and resource consumption, to execute each subtask, thereby improving data processing efficiency and system performance. In addition, the APIs that execute the same task type in different computing frameworks are encapsulated in a preset programming language to generate a unified API, which shields the differences between programming languages, greatly lowers the developer's programming threshold, and improves the flexibility and adaptability of each computing framework.
  • Example 1 receiving a task request, wherein the task T submitted by the user is a Structured Query Language (SQL) statement as follows:
  • the above SQL statement represents the task of selecting students' scores in the 'database', 'English', and 'algorithm' courses and displaying them in descending order of 'average score', with the columns: student ID, database, English, algorithm, number of valid courses, and average score, where the student table is shown in Table 1, the course table in Table 2, and the score table in Table 3:
  • the above task T is decomposed to obtain a subtask set including three subtasks: T1, selecting the relevant data; T2, forming a new data table from the result of T1; and T3, sorting the result of T2 by 'average score' from high to low. The SQL statement corresponding to each subtask is shown below:
  • the computing framework with the resource consumption less than the available resources and the least running time is usually selected.
  • each subtask has two computing frameworks to choose from: Hadoop and Spark.
  • the input data corresponding to the task submitted by the user is the score table, and the features of the input data include the number of rows, the number of columns, and so on; assume the score table has 25,415,996 rows and 3 columns.
  • the prediction model of each computing framework for each subtask is shown in Table 5.
  • the prediction model includes a runtime prediction model and a memory occupancy prediction model:
  • Runtime prediction model (ms) and memory occupancy prediction model (MB), where r is the number of rows and c the number of columns of the input data:

    Framework  Subtask  Runtime prediction model (ms)  Memory occupancy prediction model (MB)
    Hadoop     T1       0.0001r + 0.00034c             0.00015r + 0.00024c
    Hadoop     T2       0.00002r + 0.00064c            0.00005r + 0.00004c
    Hadoop     T3       0.00005r + 0.00004c            0.00003r + 0.00009c
    Spark      T1       0.00001r + 0.00004c            0.00055r + 0.00064c
    Spark      T2       0.0002r + 0.0016c              0.00035r + 0.00084c
    Spark      T3       0.00005r + 0.0004c             0.00093r + 0.0009c
  • For T1, the predicted memory usage when running T1 through Hadoop is 3812.40012 MB, while the memory usage when running T1 through Spark is 13978.79972 MB; both are lower than the system's available memory of 15000 MB.
  • the predicted running time in Spark is 254.16008 milliseconds, which is lower than the 2541.60062 milliseconds in Hadoop; therefore, Spark is chosen for T1;
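The T1 selection in this example can be recomputed directly from the Table 5 models with r = 25,415,996 and c = 3 (a sketch; the two-step rule and the coefficients are as given above, the variable names are illustrative):

```python
r, c = 25_415_996, 3
models = {
    # framework: (runtime model in ms, memory model in MB), for subtask T1
    "Hadoop": (0.0001 * r + 0.00034 * c, 0.00015 * r + 0.00024 * c),
    "Spark":  (0.00001 * r + 0.00004 * c, 0.00055 * r + 0.00064 * c),
}
available_memory = 15000  # MB

# Step 1: keep frameworks whose predicted memory fits available memory
feasible = {fw: rt for fw, (rt, mem) in models.items() if mem < available_memory}
# Step 2: pick the feasible framework with the smallest predicted runtime
target = min(feasible, key=feasible.get)
print(target)  # Spark
```

Both frameworks fit in memory (about 3812.40 MB and 13978.80 MB against 15000 MB), so the decision falls to runtime, where Spark's roughly 254.16 ms beats Hadoop's roughly 2541.60 ms.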
  • Table 9 is the final result of task T.
  • the present invention further provides a data processing apparatus.
  • the apparatus 400 includes: a receiving unit 401, a generating unit 402, a determining unit 403, a processing unit 404, and an operating unit 405, where
  • the receiving unit 401 is configured to receive a task request, where the task request carries a task submitted by the user;
  • the generating unit 402 is configured to generate a subtask set including at least one subtask according to the task in the task request;
  • a determining unit 403 configured to determine input data for executing each subtask
  • the processing unit 404 is configured to perform the following operations for each subtask in the subtask set:
  • the target computing framework for executing the sub-task is selected;
  • the running unit 405 is configured to execute the corresponding subtask based on the selected target computing framework for each subtask in the subtask set.
  • the task request received by the receiving unit 401 also carries the input data of the task;
  • a determining unit 403 is configured to:
  • the input data for executing each subtask is determined based on the input data of the task carried in the task request.
  • the data processing apparatus 400 further includes a configuration unit 406, configured to, before the task request is received, encapsulate the APIs that execute the same task type in all computing frameworks configured in the system in a preset programming language to form a unified API;
  • the processing unit 404, when determining, among all computing frameworks configured in the system, a computing framework having the capability to execute the subtask as a candidate computing framework, is configured to:
  • the processing unit 404 is configured to: when obtaining the prediction model corresponding to the candidate calculation framework:
  • the running time and the resource consumption are taken as the target features, and the other features except the running time and the resource consumption are trained in the training sample set, and the prediction model corresponding to the candidate computing framework is obtained.
  • when selecting the target computing framework for executing the subtask, the processing unit 404 is configured to:
  • select the candidate computing frameworks whose predicted resource consumption is less than the available resources of the system as first candidate computing frameworks;
  • select the first candidate computing framework with the smallest predicted running time as the target computing framework.
  • the running unit 405 is further configured to:
  • take the features generated when the subtask is executed in the target computing framework of the subtask as a new training sample, and add the new training sample to the training sample set preset for the capability of the target computing framework to execute the subtask.
  • In the data processing method and apparatus provided above, after a task request carrying a task submitted by a user is received, a subtask set including at least one subtask is generated according to the task, and the input data for executing each subtask is determined. For each subtask in the subtask set, the target computing framework for executing it is determined as follows: among all computing frameworks configured in the system, the computing frameworks having the capability to execute the subtask are determined as candidate computing frameworks, where the number of candidate computing frameworks is greater than or equal to 2; according to the input data of the subtask and the prediction model corresponding to each candidate computing framework, the running time and resource consumption of each candidate computing framework when executing the subtask are separately predicted; and according to the predicted running time and resource consumption, the target computing framework for executing the subtask is selected from the candidate computing frameworks. Finally, the corresponding subtasks are executed based on the determined target computing framework for each subtask in the subtask set.
  • embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, where the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • The computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, where the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

一种数据处理方法及装置,用以解决现有技术中融合多种计算框架的资源管理系统在处理数据任务时,不是通过运行时间和资源消耗选择计算框架,导致数据处理效率较低、系统工作性能下降的问题。该方法为:针对子任务集中的每个子任务,确定候选计算框架,预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗,并根据预测结果在候选计算框架中筛选出执行该子任务的目标计算框架(204),执行该子任务(205)。这样,资源管理系统通过运行时间和资源消耗在多个计算框架中选择目标计算框架执行每个子任务,提高了数据处理效率,以及系统的工作性能。

Description

一种数据处理方法及装置
本申请要求于2015年4月29日提交中国专利局、申请号为201510212439.5,发明名称为“一种数据处理方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及计算机技术领域,尤其涉及一种数据处理方法及装置。
背景技术
近年来,随着社会信息化的快速发展,无论在科学研究、工业生产、商业和互联网领域,数据都呈现出爆炸式增长。目前,很多应用中的数据已从太字节(Terabyte,TB)级迅速发展到拍字节(Petabyte,PB)级甚至更高的数量级。因此,面向大数据处理的计算框架已经成为热点话题,其中具有代表性的计算框架包括海杜普(Hadoop)和Spark等。Hadoop和Spark等框架都在计算机技术领域获得了广泛的应用,但是每个计算框架均存在不足:例如,Hadoop提供的映射规约(MapReduce)模型虽然简单易用,但是其计算模型具有局限性,表达能力有限,在解决诸如迭代计算、图分析等复杂问题时,很难将算法映射到MapReduce模型中,且开发的工作量大,运行效率低下;而Spark虽然迭代运算性能好,但是对内存要求很高。
因此,大数据处理的发展趋势为通过融合多种计算框架的数据处理平台处理大数据,即在计算机的集群中通过资源管理系统来容纳多种计算框架,典型的资源管理系统如Mesos以及YARN等。
然而,由于这类资源管理系统允许包含的多种计算框架共享一个集群资源,且每个计算框架的编程语言不尽相同,因此,收到待处理的数据任务时,用户通常根据经验指定计算框架执行该待处理数据任务,而不是通过运行时间和资源消耗选择计算框架,数据处理效率较低,降低了系统的工作性能。
发明内容
本发明实施例提供一种数据处理方法及装置,用以解决现有技术中融合多种计算框架的资源管理系统在处理数据任务时,不是通过运行时间和资源消耗选择计算框架,导致数据处理效率较低、系统工作性能下降的问题。
本发明实施例提供的具体技术方案如下:
第一方面,一种数据处理方法,包括:
接收任务请求,所述任务请求中携带有用户提交的任务;
根据所述任务请求中的所述任务,生成包含至少一个子任务的子任务集;
确定执行每个子任务的输入数据;
针对所述子任务集中的每个子任务执行以下操作:
在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;
根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;
根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架;
基于筛选出的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务。
结合第一方面,在第一种可能的实现方式中,所述任务请求中还携带有所述任务的输入数据;
确定执行每个子任务的输入数据,包括:
根据所述任务请求中携带的所述任务的输入数据,确定执行每个子任务的输入数据。
结合第一方面,在第二种可能的实现方式中,接收任务请求之前,还包括:
在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的、执行所述相同任务类型的应用程序接口API通过预设的编程语言进行封装,形成统一API;
在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,包括:
确定该子任务的任务类型;
确定该子任务的任务类型对应的统一API;
根据确定的所述统一API,确定具有执行该子任务的任务类型的所有计算框架,并将确定的计算框架作为候选计算框架。
结合第一方面或第一方面的以上任一种可能的实现方式,在第三种可能的实现方式中,获得候选计算框架对应的预测模型,包括:
读取预设的训练样本集合,所述训练样本集合是针对所述候选计算框架执行该子任务的能力预设的;
分别以运行时间、资源消耗为目标特征,对所述训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到所述候选计算框架对应的预测模型。
结合第一方面或第一方面的以上任一种可能的实现方式,在第四种可能的实现方式中,根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架,包括:
在所述候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
在所述第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
结合第一方面的第三种可能的实现方式,在第五种可能的实现方式中,基于确定的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,还包括:
将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
将所述新的训练样本添加至所述训练样本集合。
第二方面,一种数据处理装置,包括:
接收单元,用于接收任务请求,所述任务请求中携带有用户提交的任务;
生成单元,用于根据所述任务请求中的所述任务,生成包含至少一个子任务的子任务集;
确定单元,用于确定执行每个子任务的输入数据;
处理单元,用于针对所述子任务集中的每个子任务执行以下操作:
在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;
根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;
根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架;
运行单元,用于基于筛选出的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务。
结合第二方面,在第一种可能的实现方式中,所述接收单元接收的所述任务请求中还携带有所述任务的输入数据;
所述确定单元,用于:
根据所述任务请求中携带的所述任务的输入数据,确定执行每个子任务的输入数据。
结合第二方面,在第二种可能的实现方式中,还包括:配置单元,用于在接收任务请求之前,在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的、执行所述相同任务类型的应用程序接口API通过预设的编程语言进行封装,形成统一API;
所述处理单元,在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架时,用于:
确定该子任务的任务类型;
确定该子任务的任务类型对应的统一API;
根据确定的所述统一API,确定具有执行该子任务的任务类型的所有计算框架,并将确定的计算框架作为候选计算框架。
结合第二方面或第二方面的以上任一种可能的实现方式,在第三种可能的实现方式中,所述处理单元在获得候选计算框架对应的预测模型时,用于:
读取预设的训练样本集合,所述训练样本集合是针对所述候选计算框架执行该子任务的能力预设的;
分别以运行时间、资源消耗为目标特征,对所述训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到所述候选计算框架对应的预测模型。
结合第二方面或第二方面的以上任一种可能的实现方式,在第四种可能的实现方式中,所述处理单元,在筛选出执行该子任务的目标计算框架时,包括:
在所述候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
在所述第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
结合第二方面的第三种可能的实现方式,在第五种可能的实现方式中,所述运行单元,还用于:
基于确定的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
将所述新的训练样本添加至所述训练样本集合。
采用本发明技术方案,在多个计算框架均可以执行同一子任务时,通过运行时间和资源消耗在多个计算框架中选择目标计算框架执行该子任务,提高了数据处理效率,以及系统的工作性能。
附图说明
图1为本发明实施例提供的一种终端设备结构示意图;
图2为本发明实施例提供的一种数据处理方法的具体流程图;
图3为本发明实施例提供的一种任务分解示意图;
图4为本发明实施例提供的一种数据处理装置的结构示意图。
具体实施方式
采用本发明提供的数据处理方法,在接收携带有用户提交的任务的任务请求后,根据该任务,生成包含至少一个子任务的子任务集;并确定执行每个子任务的输入数据,针对所述子任务集中的每个子任务执行以下操作,确定执行每个子任务的目标计算框架:在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架;最终基于确定的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务。这样,资源管理系统通过运行时间和资源消耗在多个计算框架中选择目标计算框架执行每个子任务,提高了数据处理效率,以及系统的工作性能。
本发明实施例提供了一种数据处理方法、装置及终端设备,应用于融合多种计算框架的资源管理系统。下面结合附图对本发明优选的实施方式进行详细说明。
本发明实施例还提供了一种终端设备,该终端设备为融合多种计算框架的计算机等设备。参阅图1所示,该终端设备100包括:收发器101、处理器102、总线103以及存储器104,其中:
收发器101、处理器102以及存储器104通过总线103相互连接;总线103可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图1中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
收发器101用于与相连的其它设备进行通信,如接收任务请求等。
处理器102用于实现本发明实施例图2所示的数据处理方法,包括:
接收任务请求,任务请求中携带有用户提交的任务;
根据任务请求中的任务,生成包含至少一个子任务的子任务集;
确定执行每个子任务的输入数据;
针对子任务集中的每个子任务执行以下操作:
在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,候选计算框架的数目大于或等于2;
根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;
根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在候选计算框架中,筛选出执行该子任务的目标计算框架;
基于筛选出的执行子任务集中的每个子任务的目标计算框架,执行对应的子任务。
可选的,任务请求中还携带有该任务的输入数据;
确定执行每个子任务的输入数据,包括:
根据任务请求中携带的该任务的输入数据,确定执行每个子任务的输入数据。
可选的,接收任务请求之前,还包括:
在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的,执行该相同任务类型的应用程序接口(Application Program Interface,API)通过预设的编程语言进行封装,形成统一API;
在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,包括:
确定该子任务的任务类型;
确定该子任务的任务类型对应的统一API;
根据确定的统一API,确定具有执行该子任务的任务类型的所有计算框架, 并将确定的计算框架作为候选计算框架。
可选的,获得候选计算框架对应的预测模型,包括:
读取预设的训练样本集合,训练样本集合是针对该候选计算框架执行该子任务的能力预设的;
分别以运行时间、资源消耗为目标特征,对该训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到该候选计算框架对应的预测模型。
可选的,根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在候选计算框架中,筛选出执行该子任务的目标计算框架,包括:
在候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
在第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
可选的,基于确定的执行子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,还包括:
将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
将新的训练样本添加至针对目标计算框架执行该子任务的能力预设的训练样本集合。
该终端设备100还包括存储器104,用于存放程序,每个计算框架的预测模型,以及训练时得到的每个预测模型对应的训练样本集合等。具体地,程序可以包括程序代码,该程序代码包括计算机操作指令。存储器104可能包含随机存取存储器(random access memory,RAM),也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。处理器102执行存储器104所存放的应用程序,实现如上数据处理方法。
参阅图2所示,本发明实施例提供的一种数据处理方法的具体处理流程包括:
步骤201:接收任务请求,该任务请求中携带有用户提交的任务。
用户提交的任务为对大量数据进行处理的任务,如在数据库中某数据表中的数据中筛选设定条件的数据等,可选的,任务请求中还可以携带有用户提交的任务的输入数据。
可选的,在执行步骤201之前,还包括:
在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的,执行所述相同任务类型的应用程序接口(Application Program Interface,API)通过预设的编程语言进行封装,形成统一API。
融合在资源管理系统中的计算框架包括Hadoop、Spark等。不同的计算框架使用的编程语言可能不同,如Spark使用Scala编程语言,而Hadoop使用Java编程语言;且每个计算框架可以实现多种不同的功能,即具有执行多种任务类型的任务的能力,每个计算框架为其能实现的每个功能提供对应的API。显然,开发者需要掌握多种编程语言,编程门槛较高,且系统的开发效率较低。相对于现有技术,在本发明实施例中,将不同计算框架中相同功能的API进行封装,成为统一API,因此,在实现该功能时,只需要调用对应的统一API即可,而无需在可以实现该功能的不同计算框架中指定一个计算框架。
统一API可以用任意一种编程语言来实现,下面以JAVA为统一API的编程语言为例,在将计算框架的API封装为统一API时,包括:
若计算框架提供的API的编程语言为JAVA,那么只需要按照该计算框架的API的要求来重组参数,即可封装为统一API;
若计算框架提供的API的编程语言为除JAVA以外的其它编程语言,需要通过JAVA的跨编程语言规范来调用该API,从而封装为统一API。
其中,JAVA的跨编程语言规范是现有技术,比如,通过JAVA编程语言调用使用可伸缩编程语言(Scala)的API时,由于使用Scala编程语言的API是基于JAVA虚拟机(JAVA Virtual Machine,JVM)的,因此,JAVA编程语言可以直接调用使用Scala编程语言的API;通过JAVA编程语言调用使用C/C++编程语言的API时,可以通过JAVA本地接口(JAVA Native Interface,JNI)方式调用;通过JAVA编程语言调用使用Python编程语言的API,可以通过Jython.jar 调用。
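统一API的一种极简示意如下(Python示意,仅作说明:此处以进程内的函数注册与分发代替真实的跨语言封装,其中的类名、函数名与"sort"任务类型均为举例假设,并非本发明的实际实现):

```python
class UnifiedAPI:
    """将不同计算框架中执行相同任务类型的接口注册到统一入口,
    调用方只需指定任务类型,而无需关心各框架API在编程语言上的差异。"""

    def __init__(self):
        # {任务类型: {框架名: 可调用对象}}
        self._impls = {}

    def register(self, task_type, framework, impl):
        """将某计算框架中执行该任务类型的API封装后注册进来。"""
        self._impls.setdefault(task_type, {})[framework] = impl

    def frameworks_for(self, task_type):
        """返回具有执行该任务类型能力的所有计算框架(即候选计算框架)。"""
        return sorted(self._impls.get(task_type, {}))

    def invoke(self, task_type, framework, *args):
        """通过统一入口,在指定的目标计算框架上执行该任务类型。"""
        return self._impls[task_type][framework](*args)


api = UnifiedAPI()
# 假设 Hadoop 与 Spark 均提供"排序"能力,这里用内置 sorted 代替真实实现
api.register("sort", "Hadoop", sorted)
api.register("sort", "Spark", sorted)
```

这样,后续步骤204只需按子任务的任务类型查询统一API,即可得到全部候选计算框架。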
将不同计算框架中执行相同任务类型的API通过预设的编程语言进行封装,生成统一API,对编程语言的差异进行了屏蔽,极大降低开发者的编程门槛,提高了每个计算框架的灵活性和适应性。
可选的,在将不同计算框架中执行相同任务类型的API通过预设的编程语言进行封装生成统一API后,接收的任务请求中携带的用户提交的任务可以为采用与统一API的编程语言相同的编程语言描述的任务。
步骤202:根据任务请求中的任务,生成包含至少一个子任务的子任务集。
可选的,用户提交的任务,是多个子任务组合形成的,该任务中的子任务分布顺序类似于有向无环图,参阅图3所示,任务T包括5个子任务,分别为T1,T2,T3,T4,T5。对任务T进行任务分解处理,可以将该任务T分解成T1->T2->T3->T4->T5,或者T1->T3->T2->T4->T5,T2与T3可以并行执行,其它步骤顺序执行。将类似有向无环图的任务,进行任务分解,生成子任务集是现有技术,本发明实施例对此不再赘述。
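上述有向无环图式的任务分解及其执行顺序,可用拓扑排序示意如下(Python示意,图3的依赖关系为举例,并非本发明限定的分解算法):

```python
from collections import deque

def topo_order(deps):
    """deps: {子任务: 其前驱子任务集合};返回一个可行的执行顺序。"""
    indeg = {t: len(p) for t, p in deps.items()}   # 各子任务的前驱数
    succ = {t: [] for t in deps}                   # 各子任务的后继列表
    for t, ps in deps.items():
        for p in ps:
            succ[p].append(t)
    queue = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for s in succ[t]:                          # 前驱完成后,后继才可执行
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    return order

# 图3:T1 先执行,T2 与 T3 依赖 T1 且可并行,T4 依赖 T2、T3,T5 依赖 T4
deps = {"T1": set(), "T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}, "T5": {"T4"}}
order = topo_order(deps)
```

由此得到的顺序即为 T1->T2->T3->T4->T5 或 T1->T3->T2->T4->T5 之一,与正文描述一致。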
步骤203:确定执行每个子任务的输入数据。
具体的,在任务请求中携带有用户提交的任务的输入数据时,直接根据该任务的输入数据,确定每个子任务的输入数据,或者,直接根据每个子任务确定对应的输入数据。
步骤204:针对子任务集中的每个子任务执行以下操作:在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架。
具体的,在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,包括:
确定该子任务的任务类型;
确定该子任务的任务类型对应的统一API;
根据确定的该统一API,确定具有执行该子任务的任务类型的所有计算框架,并将确定的计算框架作为候选计算框架。
可选的,预测模型可以是用户预设的,可以是通过机器学习得到的,具体的,通过机器学习获得具有执行该子任务的能力的一个候选计算框架对应的预测模型,包括:
读取预设的训练样本集合,该训练样本集合是针对该候选计算框架的执行该子任务的能力预设的;
分别以运行时间、资源消耗为目标特征,对该训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到对应的预测模型。
其中,资源消耗可以为内存占用、CPU占用、I/O占用等,本发明实施例对此不做限定。
对训练样本集合进行训练时,可以采用多种机器学习算法,例如线性回归算法、支持向量机算法等,得到的预测模型例如:
运行时间=0.83*输入数据的行数+0.24*输入数据的列数+0.1
内存占用=0.24*输入数据的行数+0.14*输入数据的列数+0.15。
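上述线性预测模型的使用可按如下方式示意(Python示意,系数即正文举例,特征名与数据结构为举例假设):

```python
def predict(model, features):
    """按线性模型预测目标值:model 为 {特征名: 系数},
    其中 'bias' 为截距项;features 为 {特征名: 取值}。"""
    return model.get("bias", 0.0) + sum(
        coef * features[name] for name, coef in model.items() if name != "bias"
    )

# 正文中的示例模型
runtime_model = {"rows": 0.83, "cols": 0.24, "bias": 0.1}   # 运行时间
memory_model = {"rows": 0.24, "cols": 0.14, "bias": 0.15}   # 内存占用

# 对行数 100、列数 5 的输入数据进行预测:0.83*100 + 0.24*5 + 0.1 = 84.3
t = predict(runtime_model, {"rows": 100, "cols": 5})
```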
具体的,根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在候选计算框架中,筛选出执行该子任务的目标计算框架,包括:
在候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
在第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
这样,动态选择资源消耗小于系统的可用资源、且运行时间最小的候选计算框架执行子任务,提高了执行子任务的效率,使系统工作性能更好。
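步骤204中的两级筛选逻辑可示意如下(Python示意,其中的函数名与数据结构均为举例假设):

```python
def select_target_framework(predictions, available_resource):
    """predictions: {框架名: (预测运行时间, 预测资源消耗)}。
    先过滤出预测资源消耗小于系统可用资源的第一候选计算框架,
    再在其中选出预测运行时间最小者作为目标计算框架。"""
    feasible = {fw: (t, r) for fw, (t, r) in predictions.items()
                if r < available_resource}
    if not feasible:
        return None  # 无满足资源约束的候选计算框架
    return min(feasible, key=lambda fw: feasible[fw][0])


# 两个候选框架均满足内存约束时,选运行时间更小的 Spark
target = select_target_framework(
    {"Hadoop": (2541.6, 3812.4), "Spark": (254.2, 13978.8)},
    available_resource=15000)
```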
步骤205:基于筛选出的执行子任务集中的每个子任务的目标计算框架,执行对应的子任务。
每个子任务的目标计算框架,根据该子任务的任务类型对应的统一API,执行该子任务。
执行子任务集中的每个子任务后,得到的结果即为用户提交的任务的结果。
在步骤205,基于筛选出的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,还包括:
将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
将该新的训练样本添加至训练样本集合,该训练样本集合是针对该目标计算框架的执行该子任务的能力预设的。
这样,将计算框架实际的执行任务产生各个特征作为新的训练样本,不断对训练样本集合进行补充,并根据补充后的训练样本集合进行机器学习,产生新的预测模型,使预测模型越来越准确。
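训练样本的补充与预测模型的重训练可示意如下(Python示意,这里用正规方程做最小二乘拟合线性模型,仅为说明机器学习过程;实际也可替换为支持向量机等其它算法,样本数据为举例假设):

```python
def fit_linear(samples):
    """samples: [(特征向量, 目标值)];最小二乘拟合线性模型,
    返回 (各特征系数列表, 截距)。纯 Python 实现,避免外部依赖。"""
    X = [list(f) + [1.0] for f, _ in samples]   # 增广常数列作为截距项
    y = [t for _, t in samples]
    n = len(X[0])
    # 构造正规方程 (X^T X) w = X^T y
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    for i in range(n):                          # 列主元高斯消元
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            m = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= m * A[i][c]
            b[r] -= m * b[i]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):              # 回代
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w[:-1], w[-1]


# 初始训练样本集合;执行子任务后,将实际产生的各个特征作为新样本补充
samples = [((1.0, 1.0), 2.0), ((2.0, 1.0), 3.0), ((1.0, 2.0), 3.0)]
samples.append(((2.0, 2.0), 4.0))        # 新的训练样本
coefs, bias = fit_linear(samples)        # 重训练得到新的预测模型
```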
采用本发明实施例提供的数据处理方法,在多个计算框架均可以执行同一子任务时,通过运行时间和资源消耗在多个计算框架中选择目标计算框架执行该子任务,提高了数据处理效率,以及系统的工作性能。且将不同计算框架中,执行相同任务类型的API通过预设的编程语言进行封装,生成统一API,对编程语言的差异进行了屏蔽,极大降低开发者的编程门槛,提高了每个计算框架的灵活性和适应性。
例1,接收到任务请求,其中携带的用户提交的任务T为如下结构化查询语言(Structured Query Language,SQL)语句:
“SELECT S# AS 学生ID
,(SELECT score FROM SC WHERE SC.S# = t.S# AND C# = '001') AS 数据库
,(SELECT score FROM SC WHERE SC.S# = t.S# AND C# = '002') AS 英语
,(SELECT score FROM SC WHERE SC.S# = t.S# AND C# = '003') AS 算法
,COUNT(t.C#) AS 有效课程数
,AVG(t.score) AS 平均成绩
FROM SC AS t
GROUP BY t.S#
ORDER BY AVG(t.score)”
上述SQL语句表示的任务为“按照‘平均成绩’从高到低的顺序,显示所有学生的‘数据库’、‘英语’、‘算法’三门课程的成绩,并按照如下形式显示:学生ID,数据库,英语,算法,有效课程数,平均成绩”,其中,学生表如表1所示,课程表如表2所示,成绩表如表3所示:
表1学生表
S# Sname
1 张三
2 李四
3 赵五
表2课程表
C# Cname
001 数据库
002 英语
003 算法
表3成绩表
S# C# Score
1 001 85
2 001 90
3 001 95
1 002 90
2 002 80
3 002 70
1 003 60
2 003 80
3 003 100
将上述任务T进行任务分解处理,得到包含3个子任务的子任务集,分别为:T1,选择相关数据;T2,根据T1的结果形成新的数据表;T3,对T2的结果按照‘平均成绩’从高到低的顺序排序,每个子任务对应的SQL语句如表4所示:
表4子任务对应的SQL语句
[表4:各子任务对应的SQL语句(原文以图像形式给出)]
在针对每个子任务,动态选择目标计算框架时,通常选择资源消耗小于可用资源,且运行时间最小的计算框架,在本例中,假设每个子任务都有两个计算框架可供选择:Hadoop和Spark。用户提交的任务对应的输入数据为成绩表,输入数据的特征包括行数、列数等,假设该成绩表的行数为25415996行,列数为3列。每个计算框架执行每个子任务对应的预测模型为表5所示,其中在本实施例中,预测模型包括运行时间预测模型、内存占用预测模型:
表5计算框架针对不同子任务对应的预测模型
计算框架 子任务 运行时间预测模型(毫秒) 内存占用预测模型(MB)
Hadoop T1 0.0001r+0.00034c 0.00015r+0.00024c
Hadoop T2 0.00002r+0.00064c 0.00005r+0.00004c
Hadoop T3 0.00005r+0.00004c 0.00003r+0.00009c
Spark T1 0.00001r+0.00004c 0.00055r+0.00064c
Spark T2 0.0002r+0.0016c 0.00035r+0.00084c
Spark T3 0.00005r+0.0004c 0.00093r+0.0009c
其中,r为输入数据的行数,c为输入数据的列数;
将r=25415996,c=3代入表5中每个计算框架针对不同子任务对应的预测模型,得到每个计算框架执行每个子任务的运行时间和内存占用,如表6所示:
表6计算框架执行每个子任务的运行时间和内存占用
计算框架 子任务 运行时间(毫秒) 内存占用(MB)
Hadoop T1 2541.60062 3812.40012
Hadoop T2 508.32184 1270.79992
Hadoop T3 1270.79992 762.48015
Spark T1 254.16008 13978.79972
Spark T2 5083.204 8895.60112
Spark T3 1270.801 23636.87898
假设系统的可用内存为15000MB:
针对T1,预测通过Hadoop执行T1时的内存占用为3812.40012MB,而通过Spark执行T1时的内存占用为13978.79972MB,显然,两个内存占用均低于系统的可用内存15000MB。而预测的运行时间,在Spark中为254.16008毫秒,低于在Hadoop中的2541.60062毫秒,因此对于T1,选择使用Spark;
针对T2,预测通过Hadoop执行T2时的内存占用为1270.79992MB,而通过Spark执行T2时的内存占用为8895.60112MB,都低于系统的可用内存15000MB。而预测的运行时间,在Spark中为5083.204毫秒,高于在Hadoop中的508.32184毫秒,因此对于T2,选择使用Hadoop;
针对T3,预测通过Spark执行T3时的内存占用为23636.87898MB,高于系统的可用内存15000MB,因此不能选择Spark。预测通过Hadoop执行T3时的内存占用为762.48015MB,低于系统的可用内存15000MB,因此对于T3,选择使用Hadoop;
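例1中的预测与筛选过程可按下述方式复现(Python示意,系数取自表5,数据结构与函数名为举例假设):

```python
R, C = 25415996, 3          # 成绩表:行数 r、列数 c
MEM_LIMIT = 15000.0         # 系统的可用内存(MB)

# 表5:每个 (框架, 子任务) 的 ((运行时间系数), (内存占用系数)),模型均为 a*r + b*c
MODELS = {
    ("Hadoop", "T1"): ((0.0001, 0.00034),  (0.00015, 0.00024)),
    ("Hadoop", "T2"): ((0.00002, 0.00064), (0.00005, 0.00004)),
    ("Hadoop", "T3"): ((0.00005, 0.00004), (0.00003, 0.00009)),
    ("Spark",  "T1"): ((0.00001, 0.00004), (0.00055, 0.00064)),
    ("Spark",  "T2"): ((0.0002,  0.0016),  (0.00035, 0.00084)),
    ("Spark",  "T3"): ((0.00005, 0.0004),  (0.00093, 0.0009)),
}

def choose(subtask):
    """对一个子任务:先按内存约束过滤候选框架,再取预测运行时间最小者。"""
    feasible = {}
    for (fw, t), ((ar, ac), (mr, mc)) in MODELS.items():
        if t != subtask:
            continue
        runtime, memory = ar * R + ac * C, mr * R + mc * C
        if memory < MEM_LIMIT:          # 对应表6中的内存占用判断
            feasible[fw] = runtime
    return min(feasible, key=feasible.get)

plan = {t: choose(t) for t in ("T1", "T2", "T3")}
# 与正文结论一致:T1 由 Spark 执行,T2、T3 由 Hadoop 执行
```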
确定T1的目标计算框架为Spark,T2的目标计算框架为Hadoop,T3的目标计算框架为Hadoop,依次通过Spark,Hadoop,Hadoop执行T1,T2,T3,得到最终的结果:
其中,通过Spark执行T1后得到的结果如表7所示,通过Hadoop执行T2得到的结果如表8所示,通过Hadoop执行T3得到的结果如表9所示:
表7 T1的结果
学生ID 数据库 英语 算法 课程ID
1 85 90 60 001
1 85 90 60 002
1 85 90 60 003
2 90 80 80 001
2 90 80 80 002
2 90 80 80 003
3 95 70 100 001
3 95 70 100 002
3 95 70 100 003
表8 T2的结果
学生ID 数据库 英语 算法 有效课程数
1 85 90 60 3
2 90 80 80 3
3 95 70 100 3
表9 T3的结果
学生ID 数据库 英语 算法 有效课程数 平均成绩
1 85 90 60 3 78.333
2 90 80 80 3 83.333
3 95 70 100 3 88.333
表9即为任务T的最终结果。
基于以上实施例,本发明还提供了一种数据处理装置,参阅图4所示,该装置400包括:接收单元401、生成单元402、确定单元403、处理单元404,以及运行单元405,其中,
接收单元401,用于接收任务请求,任务请求中携带有用户提交的任务;
生成单元402,用于根据任务请求中的任务,生成包含至少一个子任务的子任务集;
确定单元403,用于确定执行每个子任务的输入数据;
处理单元404,用于针对子任务集中的每个子任务执行以下操作:
在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,候选计算框架的数目大于或等于2;
根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;
根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在候选计算框架中,筛选出执行该子任务的目标计算框架;
运行单元405,用于基于筛选出的执行子任务集中的每个子任务的目标计算框架,执行对应的子任务。
接收单元401接收的任务请求中还携带有该任务的输入数据;
确定单元403,用于:
根据任务请求中携带的该任务的输入数据,确定执行每个子任务的输入数据。
该数据处理装置400还包括:配置单元406,用于在接收任务请求之前,在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的、执行相同任务类型的API通过预设的编程语言进行封装,形成统一API;
处理单元404,在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架时,用于:
确定该子任务的任务类型;
确定该子任务的任务类型对应的统一API;
根据确定的统一API,确定具有执行该子任务的任务类型的所有计算框架,并将确定的计算框架作为候选计算框架。
处理单元404在获得候选计算框架对应的预测模型时,用于:
读取预设的训练样本集合,该训练样本集合是针对该候选计算框架执行该子任务的能力预设的;
分别以运行时间、资源消耗为目标特征,对该训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到该候选计算框架对应的预测模型。
处理单元404,在筛选出执行该子任务的目标计算框架时,包括:
在候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
在第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
运行单元405,还用于:
基于确定的执行子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
将新的训练样本添加至针对目标计算框架执行该子任务的能力预设的训练样本集合。
综上所述,本发明实施例提供了一种数据处理方法及装置,该方法在接收携带有用户提交的任务的任务请求后,根据所述任务,生成包含至少一个子任务的子任务集;并确定执行每个子任务的输入数据,针对所述子任务集中的每个子任务执行以下操作,确定执行每个子任务的目标计算框架:在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架;最终基于确定的执行所述子任务集中的每个子任务的目标计算框架,分别执行对应的子任务。这样,资源管理系统通过运行时间和资源消耗在多个计算框架中选择目标计算框架执行每个子任务,提高了数据处理效率,以及系统的工作性能。
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本发明的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。
显然,本领域的技术人员可以对本发明实施例进行各种改动和变型而不脱离本发明实施例的精神和范围。这样,倘若本发明实施例的这些修改和变型属于本发明权利要求及其等同技术的范围之内,则本发明也意图包含这些改动和变型在内。

Claims (12)

  1. 一种数据处理方法,其特征在于,包括:
    接收任务请求,所述任务请求中携带有用户提交的任务;
    根据所述任务请求中的所述任务,生成包含至少一个子任务的子任务集;
    确定执行每个子任务的输入数据;
    针对所述子任务集中的每个子任务执行以下操作:
    在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;
    根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;
    根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架;
    基于筛选出的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务。
  2. 如权利要求1所述的方法,其特征在于,所述任务请求中还携带有所述任务的输入数据;
    确定执行每个子任务的输入数据,包括:
    根据所述任务请求中携带的所述任务的输入数据,确定执行每个子任务的输入数据。
  3. 如权利要求1所述的方法,其特征在于,接收任务请求之前,还包括:
    在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的、执行所述相同任务类型的应用程序接口API通过预设的编程语言进行封装,形成统一API;
    在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,包括:
    确定该子任务的任务类型;
    确定该子任务的任务类型对应的统一API;
    根据确定的所述统一API,确定具有执行该子任务的任务类型的所有计算框架,并将确定的计算框架作为候选计算框架。
  4. 如权利要求1-3任一项所述的方法,其特征在于,获得候选计算框架对应的预测模型,包括:
    读取预设的训练样本集合,所述训练样本集合是针对所述候选计算框架执行该子任务的能力预设的;
    分别以运行时间、资源消耗为目标特征,对所述训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到所述候选计算框架对应的预测模型。
  5. 如权利要求1-4任一项所述的方法,其特征在于,根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架,包括:
    在所述候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
    在所述第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
  6. 如权利要求4所述的方法,其特征在于,基于确定的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,还包括:
    将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
    将所述新的训练样本添加至所述训练样本集合。
  7. 一种数据处理装置,其特征在于,包括:
    接收单元,用于接收任务请求,所述任务请求中携带有用户提交的任务;
    生成单元,用于根据所述任务请求中的所述任务,生成包含至少一个子任务的子任务集;
    确定单元,用于确定执行每个子任务的输入数据;
    处理单元,用于针对所述子任务集中的每个子任务执行以下操作:
    在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架,其中,所述候选计算框架的数目大于或等于2;
    根据该子任务的输入数据、以及每个候选计算框架对应的预测模型,分别预测每个候选计算框架执行该子任务时对应的运行时间和资源消耗;
    根据预测的每个候选计算框架执行该子任务时对应的运行时间和资源消耗,在所述候选计算框架中,筛选出执行该子任务的目标计算框架;
    运行单元,用于基于筛选出的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务。
  8. 如权利要求7所述的装置,其特征在于,所述接收单元接收的所述任务请求中还携带有所述任务的输入数据;
    所述确定单元,用于:
    根据所述任务请求中携带的所述任务的输入数据,确定执行每个子任务的输入数据。
  9. 如权利要求7所述的装置,其特征在于,还包括:配置单元,用于在接收任务请求之前,在系统配置的所有计算框架中,将具有执行相同任务类型的所有计算框架中的、执行所述相同任务类型的应用程序接口API通过预设的编程语言进行封装,形成统一API;
    所述处理单元,在系统配置的所有计算框架中确定具有执行该子任务的能力的计算框架作为候选计算框架时,用于:
    确定该子任务的任务类型;
    确定该子任务的任务类型对应的统一API;
    根据确定的所述统一API,确定具有执行该子任务的任务类型的所有计算框架,并将确定的计算框架作为候选计算框架。
  10. 如权利要求7-9任一项所述的装置,其特征在于,所述处理单元在获得候选计算框架对应的预测模型时,用于:
    读取预设的训练样本集合,所述训练样本集合是针对所述候选计算框架执行该子任务的能力预设的;
    分别以运行时间、资源消耗为目标特征,对所述训练样本集合中除运行时间、资源消耗以外的其它特征进行训练,得到所述候选计算框架对应的预测模型。
  11. 如权利要求7-10任一项所述的装置,其特征在于,所述处理单元,在筛选出执行该子任务的目标计算框架时,包括:
    在所述候选计算框架中,选择预测的资源消耗小于系统的可用资源的候选计算框架作为第一候选计算框架;
    在所述第一候选计算框架中,筛选出预测的运行时间最小的第一候选计算框架作为目标计算框架。
  12. 如权利要求10所述的装置,其特征在于,所述运行单元,还用于:
    基于确定的执行所述子任务集中的每个子任务的目标计算框架,执行对应的子任务之后,将在该子任务的目标计算框架中执行该子任务产生的各个特征,作为新的训练样本;
    将所述新的训练样本添加至所述训练样本集合。
PCT/CN2016/077379 2015-04-29 2016-03-25 一种数据处理方法及装置 WO2016173351A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16785782.0A EP3267310B1 (en) 2015-04-29 2016-03-25 Data processing method and device
US15/728,879 US10606654B2 (en) 2015-04-29 2017-10-10 Data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510212439.5A CN104834561B (zh) 2015-04-29 2015-04-29 一种数据处理方法及装置
CN201510212439.5 2015-04-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/728,879 Continuation US10606654B2 (en) 2015-04-29 2017-10-10 Data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2016173351A1 true WO2016173351A1 (zh) 2016-11-03

Family

ID=53812469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/077379 WO2016173351A1 (zh) 2015-04-29 2016-03-25 一种数据处理方法及装置

Country Status (4)

Country Link
US (1) US10606654B2 (zh)
EP (1) EP3267310B1 (zh)
CN (1) CN104834561B (zh)
WO (1) WO2016173351A1 (zh)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834561B (zh) * 2015-04-29 2018-01-19 华为技术有限公司 一种数据处理方法及装置
CN105045607B (zh) * 2015-09-02 2019-03-29 广东创我科技发展有限公司 一种实现多种大数据计算框架统一接口的方法
CN105892996A (zh) * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 一种批量数据处理的流水线作业方法及装置
CN107870815B (zh) * 2016-09-26 2020-06-26 中国电信股份有限公司 一种分布式系统的任务调度方法以及系统
CN107885595B (zh) 2016-09-30 2021-12-14 华为技术有限公司 一种资源分配方法、相关设备及系统
CN108073582B (zh) * 2016-11-08 2021-08-06 中移(苏州)软件技术有限公司 一种计算框架选择方法和装置
CN108268529B (zh) * 2016-12-30 2020-12-29 亿阳信通股份有限公司 一种基于业务抽象和多引擎调度的数据汇总方法和系统
CN106980509B (zh) * 2017-04-05 2021-01-19 智恒科技股份有限公司 计算总线的计算方法和装置
CN107451663B (zh) * 2017-07-06 2021-04-20 创新先进技术有限公司 算法组件化、基于算法组件建模方法、装置以及电子设备
CN109697118A (zh) * 2017-10-20 2019-04-30 北京京东尚科信息技术有限公司 流式计算任务管理方法、装置、电子设备及存储介质
CN107741882B (zh) * 2017-11-22 2021-08-20 创新先进技术有限公司 分配任务的方法及装置和电子设备
CN108248641A (zh) * 2017-12-06 2018-07-06 中国铁道科学研究院电子计算技术研究所 一种城市轨道交通数据处理方法及装置
CN108052394B (zh) * 2017-12-27 2021-11-30 福建星瑞格软件有限公司 基于sql语句运行时间的资源分配的方法及计算机设备
CN108388470B (zh) * 2018-01-26 2022-09-16 福建星瑞格软件有限公司 一种大数据任务处理方法及计算机设备
CN108804378A (zh) * 2018-05-29 2018-11-13 郑州易通众联电子科技有限公司 一种计算机数据处理方法及系统
CN109034394B (zh) * 2018-07-02 2020-12-11 第四范式(北京)技术有限公司 一种机器学习模型的更新方法和装置
CN108985367A (zh) * 2018-07-06 2018-12-11 中国科学院计算技术研究所 计算引擎选择方法和基于该方法的多计算引擎平台
CN109933306B (zh) * 2019-02-11 2020-07-14 山东大学 一种基于作业类型识别的自适应混合云计算框架生成方法
CN110109748B (zh) * 2019-05-21 2020-03-17 星环信息科技(上海)有限公司 一种混合语言任务执行方法、装置及集群
KR102718301B1 (ko) 2019-05-28 2024-10-15 삼성에스디에스 주식회사 이종 언어 함수를 포함하는 워크플로우 실행 방법 및 그 장치
CN110928907A (zh) * 2019-11-18 2020-03-27 第四范式(北京)技术有限公司 目标任务处理方法、装置及电子设备
CN113742028A (zh) * 2020-05-28 2021-12-03 伊姆西Ip控股有限责任公司 资源使用方法、电子设备和计算机程序产品
CN111723112B (zh) * 2020-06-11 2023-07-07 咪咕文化科技有限公司 数据任务执行方法、装置、电子设备及存储介质
CN112052253B (zh) * 2020-08-12 2023-12-01 网宿科技股份有限公司 数据处理方法、电子设备及存储介质
CN112199544B (zh) * 2020-11-05 2024-02-27 北京明略软件系统有限公司 全图挖掘预警方法、系统、电子设备及计算机可读存储介质
US11836534B2 (en) * 2021-01-26 2023-12-05 International Business Machines Corporation Prediction of proficient resources for performing unprecedented workloads using models
CN113326116A (zh) * 2021-06-30 2021-08-31 北京九章云极科技有限公司 一种数据处理方法和系统
CN113239243A (zh) * 2021-07-08 2021-08-10 湖南星汉数智科技有限公司 基于多计算平台的图数据分析方法、装置和计算机设备
CN115866063A (zh) * 2021-09-24 2023-03-28 中国电信股份有限公司 需求调度方法、装置、计算机可读存储介质及电子设备
CN117075930B (zh) * 2023-10-17 2024-01-26 之江实验室 一种计算框架管理系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140229221A1 (en) * 2013-02-11 2014-08-14 Amazon Technologies, Inc. Cost-minimizing task scheduler
CN104361091A (zh) * 2014-11-18 2015-02-18 浪潮(北京)电子信息产业有限公司 一种大数据系统
US20150066646A1 (en) * 2013-08-27 2015-03-05 Yahoo! Inc. Spark satellite clusters to hadoop data stores
CN104834561A (zh) * 2015-04-29 2015-08-12 华为技术有限公司 一种数据处理方法及装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477928B2 (en) * 2010-07-14 2016-10-25 Fujitsu Limited System and method for comparing software frameworks
AU2011293350B2 (en) * 2010-08-24 2015-10-29 Solano Labs, Inc. Method and apparatus for clearing cloud compute demand
EP2629247B1 (en) * 2012-02-15 2014-01-08 Alcatel Lucent Method for mapping media components employing machine learning
US9159028B2 (en) * 2013-01-11 2015-10-13 International Business Machines Corporation Computing regression models
CN103488775B (zh) 2013-09-29 2016-08-10 中国科学院信息工程研究所 一种用于大数据处理的计算系统及计算方法
US10467569B2 (en) * 2014-10-03 2019-11-05 Datameer, Inc. Apparatus and method for scheduling distributed workflow tasks
US20160300157A1 (en) * 2015-04-08 2016-10-13 Nec Laboratories America, Inc. LambdaLib: In-Memory View Management and Query Processing Library for Realizing Portable, Real-Time Big Data Applications

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140229221A1 (en) * 2013-02-11 2014-08-14 Amazon Technologies, Inc. Cost-minimizing task scheduler
US20150066646A1 (en) * 2013-08-27 2015-03-05 Yahoo! Inc. Spark satellite clusters to hadoop data stores
CN104361091A (zh) * 2014-11-18 2015-02-18 浪潮(北京)电子信息产业有限公司 一种大数据系统
CN104834561A (zh) * 2015-04-29 2015-08-12 华为技术有限公司 一种数据处理方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DONG, CHUNTAO ET AL.: "Research on the Framework and Resource Scheduling Mechanisms of Hadoop YARN", INFORMATION AND COMMUNICATIONS TECHNOLOGIES, vol. 9, no. 1, 15 February 2015 (2015-02-15), pages 77 - 84, XP009501671, ISSN: 1674-1285 *
GU, LEI ET AL.: "Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark", 2013 IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2013 IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING, 31 December 2013 (2013-12-31), pages 721 - 727, XP032606358 *

Also Published As

Publication number Publication date
EP3267310A4 (en) 2018-04-25
CN104834561B (zh) 2018-01-19
CN104834561A (zh) 2015-08-12
EP3267310A1 (en) 2018-01-10
EP3267310B1 (en) 2022-07-20
US20180032375A1 (en) 2018-02-01
US10606654B2 (en) 2020-03-31

Similar Documents

Publication Publication Date Title
WO2016173351A1 (zh) 一种数据处理方法及装置
CN110168516B (zh) 用于大规模并行处理的动态计算节点分组方法及系统
CN108205469B (zh) 一种基于MapReduce的资源分配方法及服务器
Machado YBYRÁ facilitates comparison of large phylogenetic trees
GB2508503A (en) Batch evaluation of remote method calls to an object oriented database
JP2013500543A (ja) データ並列スレッドを有する処理論理の複数のプロセッサにわたるマッピング
US9569221B1 (en) Dynamic selection of hardware processors for stream processing
US9063918B2 (en) Determining a virtual interrupt source number from a physical interrupt source number
US10452686B2 (en) System and method for memory synchronization of a multi-core system
KR101703328B1 (ko) 이종 멀티 프로세서 환경에서의 데이터 처리 최적화 장치 및 방법
CN111708639A (zh) 任务调度系统及方法、存储介质及电子设备
Li et al. Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time
Abbasi et al. A preliminary study of incorporating GPUs in the Hadoop framework
Carvalho et al. Kernel concurrency opportunities based on GPU benchmarks characterization
Li et al. Concurrent query processing in a GPU-based database system
CN113407333B (zh) Warp级别调度的任务调度方法、系统、GPU及设备
Coronado‐Barrientos et al. AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
CN111831425B (zh) 一种数据处理方法、装置及设备
CN114003359A (zh) 基于弹性持久的线程块的任务调度方法、系统及gpu
Karnagel et al. Limitations of intra-operator parallelism using heterogeneous computing resources
Acosta et al. Towards the optimal execution of Renderscript applications in Android devices
Posner et al. A combination of intra-and inter-place work stealing for the APGAS library
Cojean et al. Resource aggregation for task-based Cholesky Factorization on top of heterogeneous machines
Gannouni A Gamma-calculus GPU-based parallel programming framework
Watson et al. Memory customisations for image processing applications targeting MPSoCs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16785782

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2016785782

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE