WO2020233262A1 - A Spark-based stream processing method for multi-center collaborative data computing - Google Patents

A Spark-based stream processing method for multi-center collaborative data computing

Info

Publication number
WO2020233262A1
WO2020233262A1 (PCT/CN2020/083593)
Authority
WO
WIPO (PCT)
Prior art keywords
computing
task
client
queue
thread
Prior art date
Application number
PCT/CN2020/083593
Other languages
English (en)
French (fr)
Inventor
李劲松
李润泽
陆遥
王昱
赵英浩
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to JP2021533418A (granted as JP6990802B1)
Publication of WO2020233262A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Definitions

  • The invention belongs to the technical field of stream processing, and in particular relates to a Spark-based stream processing method for multi-center collaborative data computing.
  • Stream processing is a computer programming paradigm, also called dataflow programming or interactive programming, that lets computing applications run more efficiently under a limited parallel-processing model. Techniques of this type can run on a variety of computing units, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), without explicitly managing memory allocation, synchronization, or communication between units.
  • Spark Streaming is an extension of Spark's core API that offers scalable, high-throughput, fault-tolerant processing of real-time data streams. Its main interfaces are StreamingContext creation for the context, stream start, stream stop, cache, and checkpointing.
  • Multi-center collaborative data computing is an application scenario that has emerged with big data: multiple data centers coordinate their data resources and data-processing demands in order to give each individual user an easier-to-use and more powerful data-processing platform. A single user may choose to pool his or her own data with data from several other parties for centralized analysis, may select several kinds of computation, and may compute in parallel in this multi-center setting.
  • The purpose of the present invention is to overcome the shortcomings of the prior art by providing a Spark-based stream processing method for multi-center collaborative data computing. Through resource-management logs and Spark's stream computing, the invention turns multi-center collaborative computation into stream processing, coupling the resource-allocation advantages of stream processing with the heterogeneous computing demands of multiple centers; this improves the fairness of resource allocation and the data-analysis efficiency of multi-center collaborative computing, and reduces the waiting time of tasks in the computing queue.
  • The Spark-based stream processing method for multi-center collaborative data computing is implemented on a multi-center collaborative computing system comprising several clients and one computing end. A client generates a user's computing-task request and submits it to the computing end; the computing end parses the request and generates and executes computing instructions. The method includes the following steps:
  • The computing end parses the computing-task request sent by client c_k to obtain (c_k, t_k, nt_k, nm_k, D_k).
  • The computing end inserts (c_k, t_k, nt_k, nm_k, D_k) as an element into the computing-task queue Q and then initiates the Scheduling computation, in which the resource demands of every element of the task queue Q are optimized per client under the max-min principle, updating the nt_k and nm_k of each element.
  • Stream 1: load data D_1 and execute computing task t_1 on it, with nt_1 threads and nm_1 memory allocated;
  • Stream 2: load data D_2 and execute computing task t_2 on it, with nt_2 threads and nm_2 memory allocated;
  • Stream L: load data D_L and execute computing task t_L on it, with nt_L threads and nm_L memory allocated.
  • StreamingContext.CheckPointing, the data-persistence instruction interface for stream-processing tasks under the Spark framework, persists the data stream in the four steps of reading data into HDFS, caching preprocessed data, computing, and returning results, saving intermediate results and task metadata to D_l.
  • At the same time the queue is monitored for updates. If a queue update is observed, the stream is stopped with StreamingContext.stop (the stream-termination instruction interface under the Spark framework) and the method returns to step (4); if the computing task of a stream completes, the result is returned to the client that owns that stream task and the task is popped from queue Q.
  • The client-based Scheduling computation proceeds as follows:
  • Let L be the length of queue Q. If a client has multiple records, the records are first summed per client to obtain a new client-level queue Q_mid of length L_mid, where s_j is the number of tasks initiated by each client and snt_j and snm_j are the total thread and memory resources requested by client c_j.
  • The thread resources allocated to a client are divided equally among all of that client's tasks: the threads allocated to a single task t_z actually submitted by client c_j are the client's total thread allocation, obtained in step (3.2.2), divided by s_j, the number of tasks c_j initiated.
  • The present invention turns the demands and operations of multi-center data computing into stream-processed execution, improving program performance and the efficiency of resource allocation. It sets up resource-management logs and RESTful services to accurately regulate and record the memory and thread resources occupied and requested by Spark tasks from the multiple centers, and it applies a max-min fairness strategy to allocate resources at every step of the stream computation.
  • The invention thereby resolves the large-scale thread-blocking delays of multi-center collaborative computing, reduces the waiting time of individual users, and improves the flexibility and fairness of resource allocation.
  • Figure 1 is a flow chart of the multi-center collaborative computing stream processing method of the invention.
  • In a concrete application of the method on a multi-center medical data collaborative computing platform, three hospitals act as clients and one data center acts as the computing end; the task queue initially holds Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",8,8,"path2"), ("hospital2","task3",4,8,"path3")], so L = 3.
  • The third hospital, "hospital3", initiates a new computing-task request "task4" to the computing end; the request includes a thread-resource demand of 16, a computing-memory demand of 16, and the data to be computed for this task, "path4" (a sketch of such a request appears after this list).
  • The computing end parses the request sent by the client, obtains ("hospital3","task4",16,16,"path4"), inserts it into the task queue Q, and initiates the Scheduling computation.
  • In the Scheduling computation the resource demands of every element of the task queue Q are optimized per client under the max-min principle, and the nt_k and nm_k of each element are updated, so the queue becomes Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",5,6.5,"path2"), ("hospital2","task3",6,6.5,"path3"), ("hospital3","task4",13,15,"path4")].
  • Spark.StreamingContext is the stream-task creation instruction interface under the Spark framework, and Spark.Conf is the stream-task configuration instruction interface under the Spark framework; with them, four streams are created and the resources of each stream are declared.
  • Stream 4: load the data "path4" and execute computing task "task4" on it, with 13 threads and 15 memory units allocated as scheduled above.
  • StreamingContext.CheckPointing, the data-persistence instruction interface for stream tasks under the Spark framework, persists the data stream in the four steps of reading data into HDFS, caching preprocessed data, computing, and returning results, saving intermediate results and task metadata to path1, path2, path3 and path4. The queue is monitored for updates at the same time: if an update is observed, the stream is stopped with StreamingContext.stop (the stream-termination instruction interface under the Spark framework) and the method returns to step (4); when a stream's computing task completes, the result is returned to the corresponding client and the task is popped from queue Q.
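For concreteness, the sketch below shows what a task-submission request of this form could look like on the client side. It is only an illustration: the endpoint URL and JSON field names are assumptions, since the patent specifies only that the request carries the client, the task, the thread demand, the memory demand, and the data path.

    # Hypothetical client-side submission of a computing-task request to the
    # computing end's RESTful service. The URL and field names are illustrative
    # assumptions; only the five fields themselves come from the method.
    import requests

    payload = {
        "client": "hospital3",  # c_k
        "task": "task4",        # t_k
        "threads": 16,          # nt_k: requested thread resources
        "memory": 16,           # nm_k: requested memory resources
        "data": "path4",        # D_k: path of the data to be computed
    }
    resp = requests.post("http://computing-end.example/tasks", json=payload, timeout=30)
    print(resp.status_code, resp.json())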

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a Spark-based stream processing method for multi-center collaborative data computing. Several clients generate users' computing-task requests and submit them to a computing end, which parses the requests and generates and executes computing instructions. The invention turns the demands and operations of multi-center data computing into stream-processed execution, improving program performance and resource-allocation efficiency; it sets up resource-management logs and RESTful services to accurately regulate and record the memory and thread resources occupied and requested by Spark tasks from the multiple centers; and it applies a max-min fairness strategy to allocate resources at every step of the stream computation. The invention resolves the large-scale thread-blocking delays of multi-center collaborative data computing, reduces the waiting time of individual users, and improves the flexibility and fairness of resource allocation.

Description

A Spark-based stream processing method for multi-center collaborative data computing

Technical Field

The invention belongs to the technical field of stream processing, and in particular relates to a Spark-based stream processing method for multi-center collaborative data computing.

Background

Stream processing is a computer programming paradigm, also called dataflow programming or interactive programming, that lets computing applications run more efficiently under a limited parallel-processing model. Techniques of this type can run on many kinds of computing units, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), without explicitly managing memory allocation, synchronization, or communication between units. Spark Streaming is an extension of Spark's core API that offers scalable, high-throughput, fault-tolerant processing of real-time data streams. Its main interfaces are StreamingContext creation for the context, stream start, stream stop, cache, and Checkpointing.
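As a minimal illustration of these interfaces in Spark's Python API (the socket source, host, port, and checkpoint path are placeholder assumptions, not part of the invention):

    # Minimal PySpark sketch of the Spark Streaming interfaces named above:
    # context creation (StreamingContext), Checkpointing, cache, start and stop.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(conf=SparkConf().setAppName("demo-stream"))
    ssc = StreamingContext(sc, batchDuration=5)        # create the context
    ssc.checkpoint("hdfs:///tmp/checkpoints")          # Checkpointing
    lines = ssc.socketTextStream("localhost", 9999)    # an example input stream
    lines.cache()                                      # cache
    lines.count().pprint()
    ssc.start()                                        # stream start
    ssc.awaitTerminationOrTimeout(60)
    ssc.stop(stopSparkContext=True, stopGraceFully=True)  # stream stop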
Multi-center collaborative data computing is an application scenario that has emerged with big data: multiple data centers coordinate their data resources and data-processing demands in order to give each individual user an easier-to-use and more powerful data-processing platform. A single user may choose to pool his or her own data with data from several other parties for centralized analysis, may select several kinds of computation, and may compute in parallel in this multi-center setting.

Most existing multi-center collaborative analysis platforms are in essence single-center: the several databases are cached at one data node and the various analysis demands are processed one after another, which is effectively equivalent to forcing all concurrency onto a single stream. This causes large-scale thread-blocking delays, increases the queueing time of every batch, makes it hard to give newly arriving users' computing demands timely feedback, and makes data freshness hard to maintain.
Summary of the Invention

The purpose of the present invention is to overcome the shortcomings of the prior art by providing a Spark-based stream processing method for multi-center collaborative data computing. Through resource-management logs and Spark's stream computing, the invention turns multi-center collaborative computation into stream processing, coupling the resource-allocation advantages of stream processing with the heterogeneous computing demands of multiple centers, improving the fairness of resource allocation and the data-analysis efficiency of multi-center collaborative computing, and reducing the waiting time of tasks in the computing queue.
The purpose of the invention is achieved through the following technical solution: a Spark-based stream processing method for multi-center collaborative data computing, implemented on a multi-center collaborative computing system comprising several clients and one computing end; a client generates a user's computing-task request and submits it to the computing end, and the computing end parses the request and generates and executes computing instructions. The method includes the following steps:

(1) Establish RESTful services at the clients and the computing end, and denote the computing-task queue by Q = [(c_k, t_k, nt_k, nm_k, D_k)], 1 ≤ k ≤ L, where L is the length of Q. Any client c_k initiates a new computing-task request t_k to the computing end; the request includes the thread-resource demand nt_k of the computation, the memory demand nm_k of the computation, and the data D_k to be computed for this task.

(2) The computing end parses the computing-task request sent by client c_k and obtains (c_k, t_k, nt_k, nm_k, D_k).

(3) The computing end inserts (c_k, t_k, nt_k, nm_k, D_k) as an element into the task queue Q and then initiates the Scheduling computation, in which the resource demands of every element of queue Q are optimized per client under the max-min principle and the nt_k and nm_k of each element are updated.

(4) Compute the queue length len(Q) = L and, looping over it, create L streams with Spark.StreamingContext (the stream-task creation instruction interface under the Spark framework) and declare the resources assigned to each stream with Spark.Conf (the stream-task configuration instruction interface under the Spark framework). The actual stream tasks are then launched on Spark in turn: load the data D_k and execute the computing task t_k on it, with nt_k threads and nm_k memory allocated. If D_k already contains intermediate results and task metadata, the computation resumes directly from the corresponding step.

Stream 1: load data D_1 and execute computing task t_1 on it, with nt_1 threads and nm_1 memory allocated;

Stream 2: load data D_2 and execute computing task t_2 on it, with nt_2 threads and nm_2 memory allocated;

...

Stream L: load data D_L and execute computing task t_L on it, with nt_L threads and nm_L memory allocated.

(5) For a task (c_l, t_l, nt_l, nm_l, D_l) already in stream processing, StreamingContext.CheckPointing (the data-persistence instruction interface for stream tasks under the Spark framework) persists the data stream in the four steps of reading data into HDFS, caching preprocessed data, computing, and returning results, saving intermediate results and task metadata to D_l. At the same time the queue is monitored for updates: if an update is observed, the stream is stopped with StreamingContext.stop (the stream-termination instruction interface under the Spark framework) and the method returns to step (4); if the computing task of the stream completes, the result is returned to the client that owns the stream task and the task is popped from queue Q.
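A minimal sketch of launching one such stream is given below. It assumes each flow runs as its own separately configured Spark application, and that the thread and memory grants map onto the standard spark.cores.max and spark.executor.memory settings; the patent itself only states that Spark.Conf declares the resources of each stream.

    # Sketch of launching flow k with the resources granted by the Scheduling
    # computation. The mapping of nt_k/nm_k onto these config keys is an
    # assumption for illustration.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    def launch_flow(c_k, t_k, nt_k, nm_k, d_k):
        conf = (SparkConf()
                .setAppName(f"{c_k}:{t_k}")
                .set("spark.cores.max", str(nt_k))          # thread resources nt_k
                .set("spark.executor.memory", f"{nm_k}g"))  # memory resources nm_k
        sc = SparkContext(conf=conf)
        ssc = StreamingContext(sc, batchDuration=5)
        ssc.checkpoint(d_k)  # persist intermediate results and metadata to D_k
        # ... load the data D_k and wire computing task t_k onto the stream ...
        ssc.start()
        return ssc

    ssc = launch_flow("hospital3", "task4", 13, 15, "path4")
    # On a queue update the computing end stops the flow and re-enters step (4):
    # ssc.stop(stopSparkContext=True, stopGraceFully=True)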
Further, in step (3), the client-based Scheduling computation proceeds as follows:

(3.1) For the queue Q = [(c_k, t_k, nt_k, nm_k, D_k)], 1 ≤ k ≤ L, with L the length of Q: if a client has multiple records, first sum the records per client to obtain a new client-level queue

Q_mid = [(c_j, snt_j, snm_j, s_j)], 1 ≤ j ≤ L_mid,

where L_mid is the length of Q_mid, s_j is the number of tasks initiated by each client, and snt_j = Σ_{c_k = c_j} nt_k and snm_j = Σ_{c_k = c_j} nm_k are the total thread resources and total memory resources requested by client c_j.

(3.2) For thread resources, run the following optimized allocation procedure:

(3.2.1) Sort the per-client thread-request queue [snt_j], 1 ≤ j ≤ L_mid, by size to obtain the sorted queue [snt_{m_j}] and the index mapping M = [m_j]. Let NT be the total thread resources of the computing center's resource pool; provisionally grant each sorted client an equal share p_j of NT (in the worked example below the equal shares are rounded so that Σ_j p_j = NT).

(3.2.2) If there are clients whose provisional share exceeds their request, i.e. the set J = {j : p_j > snt_{m_j}} is non-empty, go to step (3.2.3). Otherwise output the final thread-allocation policy P_mid = [p_j] and use the index mapping M to restore the pre-sorting order, obtaining the thread-allocation policy P = [pnt_j]; go to step (3.2.4).

(3.2.3) The thread resources to be redistributed are R = Σ_{j ∈ J} (p_j − snt_{m_j}); set p_j = snt_{m_j} for every j ∈ J and add R / (L_mid − |J|) to each of the remaining shares, where |J| is the number of elements of J. Return to step (3.2.2).

(3.2.4) The thread resources allocated to a client are divided equally among all of that client's tasks: for the task set T_j = {t_z | 1 ≤ z ≤ s_j} of a client c_j,

nt_z = pnt_j / s_j,

where nt_z is the thread allocation of a single task t_z actually submitted by client c_j, pnt_j is the client's total thread allocation obtained in (3.2.2), and s_j is the number of tasks initiated by c_j.

(3.3) For memory resources, run the following optimized allocation procedure:

(3.3.1) Sort the per-client memory-request queue [snm_j], 1 ≤ j ≤ L_mid, by size to obtain the sorted queue [snm_{m_j}] and the index mapping M = [m_j]. Let NM be the total memory resources of the computing center's resource pool; provisionally grant each sorted client an equal share q_j of NM.

(3.3.2) If the set J = {j : q_j > snm_{m_j}} is non-empty, go to step (3.3.3). Otherwise output the final memory-allocation policy and use the index mapping M to restore the pre-sorting order, obtaining the memory-allocation policy [pnm_j]; go to step (3.3.4).

(3.3.3) The memory resources to be redistributed are R = Σ_{j ∈ J} (q_j − snm_{m_j}); set q_j = snm_{m_j} for every j ∈ J and add R / (L_mid − |J|) to each of the remaining shares, where |J| is the number of elements of J. Return to step (3.3.2).

(3.3.4) The memory resources allocated to a client are divided equally among all of that client's tasks: for the task set T_j = {t_z | 1 ≤ z ≤ s_j} of a client c_j,

nm_z = pnm_j / s_j,

where nm_z is the memory allocation of a single task t_z actually submitted by client c_j, pnm_j is the client's total memory allocation obtained in (3.3.2), and s_j is the number of tasks initiated by c_j.

(3.4) From the [nt_k] and [nm_k] obtained in (3.2) and (3.3), recompose Q = [(c_k, t_k, nt_k, nm_k, D_k)].
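The client-level allocation of steps (3.2.1)-(3.2.3) (and identically (3.3.1)-(3.3.3)) can be summarized in a short sketch. This is a reconstruction from the steps above, not code from the patent; following the worked example below, the rounding remainder of the initial equal split is given to the largest requester.

    # Max-min fair allocation of a resource pool `total` among per-client
    # requests, reconstructing steps (3.2.1)-(3.2.3). Returns the allocations
    # in the original client order (undoing the index mapping M).
    def max_min_allocate(requests, total):
        n = len(requests)
        order = sorted(range(n), key=lambda j: requests[j])  # index mapping M
        req = [requests[j] for j in order]                   # sorted requests
        share = [total // n] * n                             # equal provisional shares
        share[-1] += total - sum(share)                      # remainder to the largest
        while True:
            J = [j for j in range(n) if share[j] > req[j]]   # over-served clients
            if not J:
                break
            surplus = sum(share[j] - req[j] for j in J)      # R in step (3.2.3)
            for j in J:
                share[j] = req[j]                            # cap at the request
            under = [j for j in range(n) if share[j] < req[j]]
            if not under:                                    # everyone satisfied
                break
            for j in under:
                share[j] += surplus / len(under)             # redistribute R equally
        alloc = [0] * n
        for pos, j in enumerate(order):                      # undo the sort
            alloc[j] = share[pos]
        return alloc

    def split_per_task(client_alloc, s_j):
        # Steps (3.2.4)/(3.3.4): divide a client's allocation equally among its tasks.
        return client_alloc / s_j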
The beneficial effects of the invention are as follows: the invention turns the demands and operations of multi-center data computing into stream-processed execution, improving program performance and the efficiency of resource allocation; it sets up resource-management logs and RESTful services to accurately regulate and record the memory and thread resources occupied and requested by Spark tasks from the multiple centers; it applies a max-min fairness strategy to allocate resources at every step of the stream computation; and it thereby resolves the large-scale thread-blocking delays of multi-center collaborative data computing, reduces the waiting time of individual users, and improves the flexibility and fairness of resource allocation.
Brief Description of the Drawings

Figure 1 is a flow chart of the multi-center collaborative computing stream processing method of the invention.

Detailed Description

The invention is described in further detail below with reference to the drawing and a specific embodiment.

As shown in Figure 1, the invention provides a Spark-based stream processing method for multi-center collaborative data computing, implemented on a multi-center collaborative computing system comprising several clients and one computing end; a client generates a user's computing-task request and submits it to the computing end, and the computing end parses the request and generates and executes computing instructions. The method includes the following steps:
(1) Establish RESTful services at the clients and the computing end, and denote the computing-task queue by Q = [(c_k, t_k, nt_k, nm_k, D_k)], 1 ≤ k ≤ L, where L is the length of Q. Any client c_k initiates a new computing-task request t_k to the computing end; the request includes the thread-resource demand nt_k, the memory demand nm_k, and the data D_k to be computed for this task.

(2) The computing end parses the computing-task request sent by client c_k and obtains (c_k, t_k, nt_k, nm_k, D_k).

(3) The computing end inserts (c_k, t_k, nt_k, nm_k, D_k) as an element into the task queue Q and then initiates the Scheduling computation, in which the resource demands of every element of queue Q are optimized per client under the max-min principle and the nt_k and nm_k of each element are updated.

(4) Compute the queue length len(Q) = L and, looping over it, create L streams with Spark.StreamingContext and declare the resources assigned to each stream with Spark.Conf. The actual stream tasks are then launched on Spark in turn: load the data D_k and execute the computing task t_k on it, with nt_k threads and nm_k memory allocated. If D_k already contains intermediate results and task metadata, the computation resumes directly from the corresponding step.

Stream 1: load data D_1 and execute computing task t_1 on it, with nt_1 threads and nm_1 memory allocated;

Stream 2: load data D_2 and execute computing task t_2 on it, with nt_2 threads and nm_2 memory allocated;

...

Stream L: load data D_L and execute computing task t_L on it, with nt_L threads and nm_L memory allocated.

(5) For a task (c_l, t_l, nt_l, nm_l, D_l) already in stream processing, StreamingContext.CheckPointing persists the data stream in the four steps of reading data into HDFS, caching preprocessed data, computing, and returning results, saving intermediate results and task metadata to D_l. At the same time the queue is monitored for updates: if an update is observed, the stream is stopped with StreamingContext.stop and the method returns to step (4); if the computing task of the stream completes, the result is returned to the client that owns the stream task and the task is popped from queue Q.
Further, in step (3), the client-based Scheduling computation proceeds as follows:

(3.1) For the queue Q = [(c_k, t_k, nt_k, nm_k, D_k)], 1 ≤ k ≤ L, with L the length of Q: if a client has multiple records, first sum the records per client to obtain a new client-level queue

Q_mid = [(c_j, snt_j, snm_j, s_j)], 1 ≤ j ≤ L_mid,

where L_mid is the length of Q_mid, s_j is the number of tasks initiated by each client, and snt_j = Σ_{c_k = c_j} nt_k and snm_j = Σ_{c_k = c_j} nm_k are the total thread resources and total memory resources requested by client c_j.

(3.2) For thread resources, run the following optimized allocation procedure:

(3.2.1) Sort the per-client thread-request queue [snt_j], 1 ≤ j ≤ L_mid, by size to obtain the sorted queue [snt_{m_j}] and the index mapping M = [m_j]. Let NT be the total thread resources of the computing center's resource pool; provisionally grant each sorted client an equal share p_j of NT (in the worked example below the equal shares are rounded so that Σ_j p_j = NT).

(3.2.2) If there are clients whose provisional share exceeds their request, i.e. the set J = {j : p_j > snt_{m_j}} is non-empty, go to step (3.2.3). Otherwise output the final thread-allocation policy P_mid = [p_j] and use the index mapping M to restore the pre-sorting order, obtaining the thread-allocation policy P = [pnt_j]; go to step (3.2.4).

(3.2.3) The thread resources to be redistributed are R = Σ_{j ∈ J} (p_j − snt_{m_j}); set p_j = snt_{m_j} for every j ∈ J and add R / (L_mid − |J|) to each of the remaining shares, where |J| is the number of elements of J. Return to step (3.2.2).

(3.2.4) The thread resources allocated to a client are divided equally among all of that client's tasks: for the task set T_j = {t_z | 1 ≤ z ≤ s_j} of a client c_j,

nt_z = pnt_j / s_j,

where nt_z is the thread allocation of a single task t_z actually submitted by client c_j, pnt_j is the client's total thread allocation obtained in (3.2.2), and s_j is the number of tasks initiated by c_j.

(3.3) For memory resources, run the following optimized allocation procedure:

(3.3.1) Sort the per-client memory-request queue [snm_j], 1 ≤ j ≤ L_mid, by size to obtain the sorted queue [snm_{m_j}] and the index mapping M = [m_j]. Let NM be the total memory resources of the computing center's resource pool; provisionally grant each sorted client an equal share q_j of NM.

(3.3.2) If the set J = {j : q_j > snm_{m_j}} is non-empty, go to step (3.3.3). Otherwise output the final memory-allocation policy and use the index mapping M to restore the pre-sorting order, obtaining the memory-allocation policy [pnm_j]; go to step (3.3.4).

(3.3.3) The memory resources to be redistributed are R = Σ_{j ∈ J} (q_j − snm_{m_j}); set q_j = snm_{m_j} for every j ∈ J and add R / (L_mid − |J|) to each of the remaining shares, where |J| is the number of elements of J. Return to step (3.3.2).

(3.3.4) The memory resources allocated to a client are divided equally among all of that client's tasks: for the task set T_j = {t_z | 1 ≤ z ≤ s_j} of a client c_j,

nm_z = pnm_j / s_j,

where nm_z is the memory allocation of a single task t_z actually submitted by client c_j, pnm_j is the client's total memory allocation obtained in (3.3.2), and s_j is the number of tasks initiated by c_j.

(3.4) From the [nt_k] and [nm_k] obtained in (3.2) and (3.3), recompose Q = [(c_k, t_k, nt_k, nm_k, D_k)].
A concrete example of applying the Spark-based stream processing method for multi-center collaborative data computing on a multi-center medical data collaborative computing platform is given below. Its implementation includes the following steps:

(1) Establish RESTful services at the clients (three hospitals) and the computing end (a data center), and denote the computing-task queue by

Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",8,8,"path2"), ("hospital2","task3",4,8,"path3")],

so L = 3. The third hospital, "hospital3", initiates a new computing-task request "task4" to the computing end; the request includes a thread-resource demand of 16, a computing-memory demand of 16, and the data to be computed for this task, "path4".

(2) The computing end parses the computing-task request sent by the client and obtains ("hospital3","task4",16,16,"path4").

(3) The computing end inserts ("hospital3","task4",16,16,"path4") as an element into the task queue Q:

Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",8,8,"path2"), ("hospital2","task3",4,8,"path3"), ("hospital3","task4",16,16,"path4")].

It then initiates the Scheduling computation, in which the resource demands of every element of queue Q are optimized per client under the max-min principle and the nt_k and nm_k of each element are updated, so the queue becomes:

Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",5,6.5,"path2"), ("hospital2","task3",6,6.5,"path3"), ("hospital3","task4",13,15,"path4")].

The Scheduling computation runs as follows:

(3.1) For the queue

Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",8,8,"path2"), ("hospital2","task3",4,8,"path3"), ("hospital3","task4",16,16,"path4")],

the queue length is L = 4. Since client "hospital2" has multiple records, the records are first summed per client, giving

Q_mid = [("hospital1",8,4,1), ("hospital2",12,16,2), ("hospital3",16,16,1)],

with L_mid = 3 the length of Q_mid.
(3.2) For thread resources, run the following optimized allocation procedure:

(3.2.1) For the per-client thread-request queue [8, 12, 16] of all clients, sorting by size gives [8, 12, 16] and the index mapping M = [1, 2, 3]. The total thread resources of the computing center's pool are NT = 32, so [8, 12, 16] is provisionally granted the shares [10, 10, 12].

(3.2.2) There is a client whose share exceeds its request (10 > 8), so the set is J = {1}; go to step (3.2.3).

(3.2.3) The thread resources to be redistributed are R = 10 − 8 = 2; the share of client 1 is reduced to its request, 8, and R / (L_mid − |J|) = 2 / 2 = 1 is added to each of the two remaining shares, giving [8, 11, 13], where |J| = 1 is the number of elements of J. Return to step (3.2.2).

(3.2.2) Now no share exceeds its request (11 < 12 and 13 < 16), so the final thread-allocation policy P_mid = [8, 11, 13] is output; the index mapping restores the pre-sorting order, giving the thread-allocation policy P = [8, 11, 13]. Go to step (3.2.4).

(3.2.4) For the two tasks z = 2, 3 of the same client "hospital2", the client's 11 threads are divided between them as nt_2 = 5 and nt_3 = 6 (the equal share 11/2 = 5.5, rounded down and up respectively so that the total stays 11).
(3.3) For memory resources, run the following optimized allocation procedure:

(3.3.1) For the per-client memory-request queue [4, 16, 16] of all clients, sorting by size gives [4, 16, 16] and the index mapping M = [1, 2, 3]. The total memory resources of the computing center's pool are NM = 32, so [4, 16, 16] is provisionally granted the shares [10, 10, 12].

(3.3.2) There is a client whose share exceeds its request (10 > 4), so the set is J = {1}; go to step (3.3.3).

(3.3.3) The memory resources to be redistributed are R = 10 − 4 = 6; the share of client 1 is reduced to its request, 4, and R / (L_mid − |J|) = 6 / 2 = 3 is added to each of the two remaining shares, giving [4, 13, 15], where |J| = 1 is the number of elements of J. Return to step (3.3.2).

(3.3.2) Now no share exceeds its request (13 < 16 and 15 < 16), so the final memory-allocation policy P_mid = [4, 13, 15] is output; the index mapping restores the pre-sorting order, giving the memory-allocation policy P = [4, 13, 15]. Go to step (3.3.4).

(3.3.4) For the two tasks z = 2, 3 of the same client "hospital2", the client's 13 memory units are divided equally between them: nm_2 = nm_3 = 6.5.

(3.4) From the [nt_k] and [nm_k] obtained in (3.2) and (3.3), the queue is recomposed as

Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",5,6.5,"path2"), ("hospital2","task3",6,6.5,"path3"), ("hospital3","task4",13,15,"path4")].
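Running the allocation sketch given earlier (after step (3.4) of the summary) on this example reproduces the scheduled values:

    # Verifying the worked example with the max_min_allocate sketch above.
    print(max_min_allocate([8, 12, 16], 32))  # threads -> [8, 11.0, 13.0]
    print(max_min_allocate([4, 16, 16], 32))  # memory  -> [4, 13.0, 15.0]
    # Per-task split for "hospital2" (s_j = 2): 13 memory units -> 6.5 and 6.5;
    # the 11 threads are reported in Q as the integer split 5 and 6.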
(4) The queue length is len(Q) = 4; with 4 as the loop bound, create 4 streams with Spark.StreamingContext (the stream-task creation instruction interface under the Spark framework) and declare the resources assigned to each stream with Spark.Conf (the stream-task configuration instruction interface under the Spark framework). The actual stream tasks are then launched on Spark in turn:

Stream 1: load the data "path1" and execute computing task "task1" on it, with 8 threads and 4 memory units allocated;

Stream 2: load the data "path2" and execute computing task "task2" on it, with 5 threads and 6.5 memory units allocated;

Stream 3: load the data "path3" and execute computing task "task3" on it, with 6 threads and 6.5 memory units allocated;

Stream 4: load the data "path4" and execute computing task "task4" on it, with 13 threads and 15 memory units allocated.

If intermediate results and task metadata are found for stream 1, stream 2 or stream 3, the computation resumes directly from the corresponding step.
(5) For the tasks already in stream processing,

Q = [("hospital1","task1",8,4,"path1"), ("hospital2","task2",5,6.5,"path2"), ("hospital2","task3",6,6.5,"path3"), ("hospital3","task4",13,15,"path4")],

StreamingContext.CheckPointing (the data-persistence instruction interface for stream tasks under the Spark framework) persists the data stream in the four steps of reading data into HDFS, caching preprocessed data, computing, and returning results, saving intermediate results and task metadata to path1, path2, path3 and path4. At the same time the queue is monitored for updates: if an update is observed, the stream is stopped with StreamingContext.stop (the stream-termination instruction interface under the Spark framework) and the method returns to step (4); if the computing task of a stream completes, the result is returned to the client that owns the stream task and the task is popped from queue Q.
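The monitor-and-stop behavior of step (5) can be sketched as a simple polling loop. The polling mechanism, interval, and helper names are illustrative assumptions; the patent only requires that a queue update stops the streams and returns control to step (4).

    # Illustrative queue-update monitor for step (5): when the version of task
    # queue Q changes, stop every running flow gracefully and re-enter step (4).
    import time

    def monitor_queue(initial_version, running_contexts, get_version, reschedule):
        while True:
            time.sleep(1)                         # hypothetical polling interval
            if get_version() != initial_version:  # a queue update was observed
                for ssc in running_contexts:
                    ssc.stop(stopSparkContext=True, stopGraceFully=True)
                reschedule()                      # back to step (4)
                return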
The above is only an implementation example of the invention and is not intended to limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention without creative effort falls within its scope of protection.

Claims (2)

  1. A Spark-based stream processing method for multi-center collaborative data computing, characterized in that the method is implemented on a multi-center collaborative computing system comprising several clients and one computing end, the clients being used to generate users' computing-task requests and submit them to the computing end, and the computing end being used to parse the requests and to generate and execute computing instructions; the method includes the following steps:
    (1) Establish RESTful services at the clients and the computing end, and denote the computing-task queue by Q = [(c_k, t_k, nt_k, nm_k, D_k)], 1 ≤ k ≤ L, where L is the length of Q. Any client c_k initiates a new computing-task request t_k to the computing end; the request includes the thread-resource demand nt_k of the computation, the memory demand nm_k of the computation, and the data D_k to be computed for this task.
    (2) The computing end parses the computing-task request sent by client c_k and obtains (c_k, t_k, nt_k, nm_k, D_k).
    (3) The computing end inserts (c_k, t_k, nt_k, nm_k, D_k) as an element into the task queue Q and then initiates the Scheduling computation, in which the resource demands of every element of queue Q are optimized per client under the max-min principle and the nt_k and nm_k of each element are updated.
    (4) Compute the queue length len(Q) = L and, looping over it, create L streams with Spark.StreamingContext and declare the resources assigned to each stream with Spark.Conf. For each actual stream task k launched on Spark in turn, load the data D_k, execute the computing task t_k, allocate a number of threads satisfying the thread demand nt_k, and allocate memory satisfying the memory demand nm_k. If D_k contains intermediate results and task metadata, the computation resumes directly from the corresponding step.
    (5) For a task (c_l, t_l, nt_l, nm_l, D_l) already in stream processing, StreamingContext.CheckPointing persists the data stream in the four steps of reading data into HDFS, caching preprocessed data, computing, and returning results, saving intermediate results and task metadata to D_l. At the same time the queue is monitored for updates: if an update is observed, the stream is stopped with StreamingContext.stop and the method returns to step (4); if the computing task of the stream completes, the result is returned to the client that owns the stream task and the task is popped from queue Q.
  2. The Spark-based stream processing method for multi-center collaborative data computing according to claim 1, characterized in that in step (3) the client-based Scheduling computation proceeds as follows:
    (3.1) For the queue Q = [(c_k, t_k, nt_k, nm_k, D_k)], 1 ≤ k ≤ L, with L the length of Q: if a client has multiple records, first sum the records per client to obtain a new client-level queue Q_mid = [(c_j, snt_j, snm_j, s_j)], 1 ≤ j ≤ L_mid, where L_mid is the length of Q_mid, s_j is the number of tasks initiated by each client, and snt_j = Σ_{c_k = c_j} nt_k and snm_j = Σ_{c_k = c_j} nm_k are the total thread resources and total memory resources requested by client c_j.
    (3.2) For thread resources, run the following optimized allocation procedure:
    (3.2.1) Sort the per-client thread-request queue [snt_j], 1 ≤ j ≤ L_mid, by size to obtain the sorted queue [snt_{m_j}] and the index mapping M = [m_j]. Let NT be the total thread resources of the computing center's resource pool; provisionally grant each sorted client an equal share p_j of NT, 1 ≤ j ≤ L_mid.
    (3.2.2) If the set J = {j : p_j > snt_{m_j}} is non-empty, go to step (3.2.3); otherwise output the final thread-allocation policy P_mid = [p_j] and use the index mapping, m_i ∈ M, to restore the pre-sorting order, obtaining the thread-allocation policy P = [pnt_j]; go to step (3.2.4).
    (3.2.3) The thread resources to be redistributed are R = Σ_{j ∈ J} (p_j − snt_{m_j}); set p_j = snt_{m_j} for every j ∈ J and add R / (L_mid − |J|) to each of the remaining shares, where |J| is the number of elements of J; return to step (3.2.2).
    (3.2.4) Divide the thread resources allocated to a client equally among all of that client's tasks: for the task set T_j = {t_z | 1 ≤ z ≤ s_j} of a client c_j, nt_z = pnt_j / s_j, where nt_z is the thread allocation of a single task t_z actually submitted by client c_j, pnt_j is the client's total thread allocation obtained in (3.2.2), and s_j is the number of tasks initiated by c_j.
    (3.3) For memory resources, run the following optimized allocation procedure:
    (3.3.1) Sort the per-client memory-request queue [snm_j], 1 ≤ j ≤ L_mid, by size to obtain the sorted queue [snm_{m_j}] and the index mapping M = [m_j]. Let NM be the total memory resources of the computing center's resource pool; provisionally grant each sorted client an equal share q_j of NM, 1 ≤ j ≤ L_mid.
    (3.3.2) If the set J = {j : q_j > snm_{m_j}} is non-empty, go to step (3.3.3); otherwise output the final memory-allocation policy and use the index mapping, m_i ∈ M, to restore the pre-sorting order, obtaining the memory-allocation policy [pnm_j]; go to step (3.3.4).
    (3.3.3) The memory resources to be redistributed are R = Σ_{j ∈ J} (q_j − snm_{m_j}); set q_j = snm_{m_j} for every j ∈ J and add R / (L_mid − |J|) to each of the remaining shares, where |J| is the number of elements of J; return to step (3.3.2).
    (3.3.4) Divide the memory resources allocated to a client equally among all of that client's tasks: for the task set T_j = {t_z | 1 ≤ z ≤ s_j} of a client c_j, nm_z = pnm_j / s_j, where nm_z is the memory allocation of a single task t_z actually submitted by client c_j, pnm_j is the client's total memory allocation obtained in (3.3.2), and s_j is the number of tasks initiated by c_j.
    (3.4) From the [nt_k] and [nm_k] obtained in (3.2) and (3.3), recompose Q = [(c_k, t_k, nt_k, nm_k, D_k)].
PCT/CN2020/083593 2019-07-12 2020-04-07 A Spark-based stream processing method for multi-center collaborative data computing WO2020233262A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021533418A 2019-07-12 2020-04-07 Sparkに基づくマルチセンターのデータ協調コンピューティングのストリーム処理方法 (A Spark-based stream processing method for multi-center collaborative data computing; granted as JP6990802B1)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910629253.8 2019-07-12
CN201910629253.8A CN110347489B (zh) 2019-07-12 2019-07-12 A Spark-based stream processing method for multi-center collaborative data computing

Publications (1)

Publication Number Publication Date
WO2020233262A1 true WO2020233262A1 (zh) 2020-11-26

Family

ID=68176115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083593 WO2020233262A1 (zh) 2019-07-12 2020-04-07 A Spark-based stream processing method for multi-center collaborative data computing

Country Status (3)

Country Link
JP (1) JP6990802B1 (zh)
CN (1) CN110347489B (zh)
WO (1) WO2020233262A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347489B (zh) * 2019-07-12 2021-08-03 之江实验室 A Spark-based stream processing method for multi-center collaborative data computing
CN110955526B (zh) * 2019-12-16 2022-10-21 湖南大学 A method and system for multi-GPU scheduling in a distributed heterogeneous environment
CN115242877B (zh) * 2022-09-21 2023-01-24 之江实验室 Spark collaborative computing and job method and apparatus for multiple K8s clusters
US11954525B1 (en) 2022-09-21 2024-04-09 Zhejiang Lab Method and apparatus of executing collaborative job for spark faced to multiple K8s clusters

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930373A (zh) * 2016-04-13 2016-09-07 北京思特奇信息技术股份有限公司 A Spark Streaming-based big-data stream processing method and system
CN108037998A (zh) * 2017-12-01 2018-05-15 北京工业大学 A dynamic allocation method of data-receiving channels for the Spark Streaming platform
US20180270164A1 (en) * 2017-03-14 2018-09-20 International Business Machines Corporation Adaptive resource scheduling for data stream processing
CN109684078A (zh) * 2018-12-05 2019-04-26 苏州思必驰信息科技有限公司 Dynamic resource allocation method and system for Spark Streaming
CN110347489A (zh) * 2019-07-12 2019-10-18 之江实验室 A Spark-based stream processing method for multi-center collaborative data computing

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100542139C (zh) * 2006-12-31 2009-09-16 华为技术有限公司 A task-grouping-based resource allocation method and device
CN105335376B (zh) * 2014-06-23 2018-12-07 华为技术有限公司 A stream processing method, apparatus and system
KR101638136B1 (ko) * 2015-05-14 2016-07-08 주식회사 티맥스 소프트 A method for minimizing lock contention between threads when distributing work in a multi-thread architecture, and an apparatus using the same
US10120721B2 (en) * 2015-08-28 2018-11-06 Vmware, Inc. Pluggable engine for application specific schedule control
US9575749B1 (en) * 2015-12-17 2017-02-21 Kersplody Corporation Method and apparatus for execution of distributed workflow processes
CN107193652B (zh) * 2017-04-27 2019-11-12 华中科技大学 Elastic resource scheduling method and system for stream-data processing systems in a container cloud environment
CN107291843A (zh) * 2017-06-01 2017-10-24 南京邮电大学 An improved hierarchical clustering method based on a distributed computing platform
CN107870763A (zh) * 2017-11-27 2018-04-03 深圳市华成峰科技有限公司 Method and apparatus for creating a real-time sorting system for massive data
CN108804211A (zh) * 2018-04-27 2018-11-13 西安华为技术有限公司 Thread scheduling method and apparatus, electronic device and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081936A (zh) * 2022-07-21 2022-09-20 之江实验室 Method and apparatus for scheduling observation tasks of multiple remote-sensing satellites under emergency conditions
CN115081936B (zh) * 2022-07-21 2022-11-18 之江实验室 Method and apparatus for scheduling observation tasks of multiple remote-sensing satellites under emergency conditions

Also Published As

Publication number Publication date
JP2022508354A (ja) 2022-01-19
JP6990802B1 (ja) 2022-01-12
CN110347489B (zh) 2021-08-03
CN110347489A (zh) 2019-10-18

Similar Documents

Publication Publication Date Title
WO2020233262A1 (zh) 2020-11-26 A Spark-based stream processing method for multi-center collaborative data computing
US9171044B2 (en) Method and system for parallelizing database requests
US9485310B1 (en) Multi-core storage processor assigning other cores to process requests of core-affined streams
US10191922B2 (en) Determining live migration speed based on workload and performance characteristics
US9197703B2 (en) System and method to maximize server resource utilization and performance of metadata operations
CN111752965B (zh) A microservice-based real-time database data interaction method and system
WO2021254135A1 (zh) Task execution method and storage device
US8688646B2 (en) Speculative execution in a real-time data environment
CA2533744C (en) Hierarchical management of the dynamic allocation of resources in a multi-node system
US20110145312A1 (en) Server architecture for multi-core systems
WO2019223596A1 (zh) Event processing method, apparatus, device and storage medium
US9715414B2 (en) Scan server for dual-format database
WO2023082560A1 (zh) Task processing method, apparatus, device and medium
JP2005056077A (ja) Database control method
CN112463390A (zh) A distributed task scheduling method and apparatus, terminal device and storage medium
US9959301B2 (en) Distributing and processing streams over one or more networks for on-the-fly schema evolution
CN104112049A (zh) P2P-architecture-based cross-data-center scheduling system and method for MapReduce tasks
CN112882818A (zh) Task dynamic adjustment method, apparatus and device
WO2018133821A1 (en) Memory-aware plan negotiation in query concurrency control
CN114756629A (zh) SQL-based interactive analysis engine and method for multi-source heterogeneous data
WO2024022142A1 (zh) Resource usage method and apparatus
CN113391911A (zh) A big-data resource dynamic scheduling method, apparatus and device
CN108665157A (zh) A method for balanced scheduling of process instances in a cloud workflow system
CN112925807A (zh) Database-oriented request batch-processing method, apparatus, device and storage medium
CN115878664B (zh) A method and system for real-time query matching of massive input data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20809813

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021533418

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20809813

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 230123)
