JP2022508354A - Stream processing method for multi-center data co-computing based on Spark

Info

Publication number: JP2022508354A
Application number: JP2021533418A
Authority: JP
Inventors: ▲勁▼松李; ▲潤▼▲澤▼ 李; 遥 ▲陸▼; ▲ユー▼ 王; 英浩 ▲趙▼
Original assignee: 之江実験室
Priority date: 2019-07-12
Filing date: 2020-04-07
Publication date: 2022-01-19
Anticipated expiration: 2040-04-07
Also published as: JP6990802B1; WO2020233262A1; CN110347489B; CN110347489A

Abstract

PROBLEM TO BE SOLVED: To provide a stream processing method for multi-center data co-computing based on Spark. SOLUTION: A plurality of clients generate users' computing task requests and send them to a computing terminal; the computing terminal parses each request, then generates and executes computing instructions. The invention performs stream processing on the requests and operations of multi-center data computing, which improves program execution performance and resource allocation efficiency. A resource management log and RESTful services are set up to accurately control and record the memory and thread resources occupied and requested by Spark tasks from the multiple centers. Resources are allocated at each step of the stream computation using a policy based on the maximin criterion. The invention addresses the blocking delay caused by large numbers of threads in multi-center data co-computing, reduces the waiting time of individual users, and improves the flexibility and fairness of resource allocation. [Selected drawing] FIG. 1

Description

The present invention relates to the technical field of stream processing, and more particularly to a stream processing method for multi-center data co-computing based on Spark.

Stream processing is a computer programming paradigm, also called data stream programming or interactive programming, that lets computing applications make more efficient use of a limited parallel processing model. Applications of this kind can run on a variety of computing units, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), without explicitly managing memory allocation, synchronization, or communication between units. Spark Streaming is an extension of the core Spark API that offers scalability, high throughput, and fault tolerance for processing real-time streaming data. Its main interfaces include context creation (StreamingContext), stream start (start), stream stop (stop), caching (cache), and checkpointing (Checkpointing).

Multi-center data co-computing is an application scenario that has emerged against the background of big data. Multi-party data centers need to coordinate data resources and data processing requests in order to offer each individual user an easier-to-use and more powerful data processing platform. An individual user may choose to combine their own data resources with those of multiple parties for centralized analysis, or may select several kinds of computing requests and run them concurrently in the multi-center setting.

Conventional multi-center collaborative analysis platforms are often single-center in practice: the multi-party databases are cached on data nodes at one location, and the various analysis requests are processed one by one, which is equivalent to running all concurrent work on a single default stream. This causes blocking delays from large numbers of threads, lengthens the time each batch waits in the queue, makes it hard for new users' computing requests to receive immediate feedback, and makes it hard to keep the data real-time.

The object of the present invention is to address these deficiencies of the prior art by providing a stream processing method for multi-center data co-computing based on Spark. Using a resource management log and Spark stream computing, the invention applies stream processing to multi-center data co-computing. It combines the resource allocation advantages of stream processing with the heterogeneous computing demands of a multi-center setting, so as to improve the fairness of resource allocation and the efficiency of data analysis in multi-center co-computing and to shorten the waiting time of tasks in the computing queue.

The object of the present invention is achieved by the following technical means.
A stream processing method for multi-center data co-computing based on Spark, wherein
the method is carried out in a multi-center data co-computing system comprising a plurality of clients and one computing terminal, the clients being configured to generate users' computing task requests and send them to the computing terminal, and the computing terminal being configured to parse the requests and to generate and execute computing instructions,
the method comprising:
Step (1): RESTful services are built on the clients and the computing terminal, and a computing task queue Q [formula not reproduced] is set up, where L is the length of Q. Any client c_k sends a new computing task request t_k to the computing terminal; the request includes a thread resource request nt_k, a memory resource request nm_k, and the data D_k to be computed for the task.
Step (2): The computing terminal parses the computing task request sent by client c_k to obtain [formula not reproduced].
Step (3): The computing terminal inserts [formula not reproduced] into the computing task queue Q as one element and then starts the scheduling calculation. In the scheduling calculation, the requested resource values of every element in Q are optimized according to the maximin criterion on a per-client basis, and nt_k and nm_k of every element are updated.
Step (4): The length L of the queue Q is computed [formulas not reproduced]. Using L as the loop bound, L streams are created via Spark.StreamingContext (the interface in the Spark framework for creating stream processing tasks), and the resources allocated to each stream are declared via Spark.Conf (the interface in the Spark framework for configuring stream processing tasks). The actual stream tasks are then submitted to Spark in sequence; each stream k loads data D_k and executes computing task t_k on it with nt_k thread resources and nm_k memory resources. If intermediate results and computing task metadata already exist in D_k, the task resumes directly from the corresponding step.
Stream 1: load data D_1 and execute computing task t_1 on it, with nt_1 thread resources and nm_1 memory resources.
Stream 2: load data D_2 and execute computing task t_2 on it, with nt_2 thread resources and nm_2 memory resources.
…
Stream L: load data D_L and execute computing task t_L on it, with nt_L thread resources and nm_L memory resources.
Step (5): For each task being stream-processed [formula not reproduced], StreamingContext.CheckPointing (the interface in the Spark framework for persisting the data of stream processing tasks) is used to persist the data stream at each of four stages of the stream processing: reading data into HDFS, caching preprocessed data, computing, and returning. The intermediate results and computing task metadata are stored in D_l. At the same time, the queue is monitored for updates. When an update to the queue is detected, the stream is stopped via StreamingContext.stop (the interface in the Spark framework for stopping stream processing tasks) and the method returns to step (4). When the computing task of a stream completes, the task processing result is returned to the client corresponding to that stream processing task and the task is removed from the queue Q.
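Steps (1) to (4) can be pictured with a small sketch. The following is an illustration under stated assumptions, not the patented implementation: the record layout, the JSON field names in `parse_request`, and the mapping of nt_k to `spark.executor.cores` and nm_k to gigabytes of `spark.executor.memory` are all hypothetical, and `input.path` is an illustrative key rather than a Spark setting. The sketch only builds per-stream configuration dictionaries and does not call Spark.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TaskRecord:
    """One element of the task queue Q, built from a parsed client request."""
    client: str   # c_k, the submitting client
    task: str     # t_k, the computing task
    threads: int  # nt_k, thread resource request
    memory: int   # nm_k, memory resource request
    data: str     # D_k, location of the data to compute on


def parse_request(payload: Dict[str, Any]) -> TaskRecord:
    """Step (2): turn a request body (hypothetical field names) into a record."""
    return TaskRecord(
        client=str(payload["client"]),
        task=str(payload["task"]),
        threads=int(payload["threads"]),
        memory=int(payload["memory"]),
        data=str(payload["data"]),
    )


def stream_configs(queue: List[TaskRecord]) -> List[Dict[str, str]]:
    """Step (4): one resource declaration per queue element, L in total."""
    return [
        {
            "spark.app.name": record.task,
            # Assumption: threads map to executor cores, memory units to gigabytes.
            "spark.executor.cores": str(record.threads),
            "spark.executor.memory": f"{record.memory}g",
            "input.path": record.data,  # hypothetical key, not a Spark setting
        }
        for record in queue
    ]


# Step (3), insertion only: the request values are those of "task4" in the embodiment.
queue: List[TaskRecord] = []
queue.append(parse_request(
    {"client": "hospital3", "task": "task4", "threads": 16, "memory": 16, "data": "path4"}
))
configs = stream_configs(queue)
```

In a real deployment the values written into each dictionary would be the nt_k and nm_k produced by the scheduling calculation of step (3), not the raw requests used here.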

Further, in step (3), the client-based scheduling calculation proceeds as follows.
Step (3.1): For the queue Q [formula not reproduced], where L is the length of Q: if a client has multiple records, the records are first summed per client, giving a new per-client queue Q_mid [formula not reproduced], where L_mid is the length of Q_mid, s_j is the total number of tasks sent by each client, and nt_j^mid and nm_j^mid are the total thread resources and total memory resources requested by client c_j.
Step (3.2): Thread resources are allocated by the following optimization procedure.
Step (3.2.1): The queue of total thread resource requests of all clients [formula not reproduced] is sorted by size to give [formula not reproduced] and the subscript mapping M = [formula not reproduced]. With NT the total thread resources in the computing resource pool of the computing center, the resources pre-allocated to each nt_j^mid are [formula not reproduced].
Step (3.2.2): If [formula not reproduced] exists, this set is denoted J [formula not reproduced] and the procedure goes to step (3.2.3). Otherwise, the final thread resource allocation policy [formula not reproduced] is output, the subscript mapping is used to obtain the thread resource allocation policy [formula not reproduced] in the original order before sorting, and the procedure goes to step (3.2.4).
Step (3.2.3): The thread resources to be reallocated are [formula not reproduced], where [formula not reproduced] is the number of elements of J. The procedure returns to step (3.2.2).
Step (3.2.4): The thread resources allocated to a client are divided evenly among all tasks of that client. For the tasks corresponding to the same c_j, [formula not reproduced], where [formula not reproduced] is the thread resource allocated to one task t_z actually submitted by user c_j, nt_j^mid is the total thread resource allocated to that user obtained in step (3.2.2), and s_j is the total number of tasks sent by user c_j.
Step (3.3): Memory resources are allocated by the following optimization procedure.
Step (3.3.1): The queue of total memory resource requests of all clients [formula not reproduced] is sorted by size to give [formula not reproduced] and the subscript mapping M = [formula not reproduced]. With NM the total memory resources in the computing resource pool of the computing center, the resources pre-allocated to each nm_j^mid are [formula not reproduced].
Step (3.3.2): If [formula not reproduced] exists, this set is denoted J [formula not reproduced] and the procedure goes to step (3.3.3). Otherwise, the final memory resource allocation policy [formula not reproduced] is output, the subscript mapping is used to obtain the memory resource allocation policy [formula not reproduced] in the original order before sorting, and the procedure goes to step (3.3.4).
Step (3.3.3): The memory resources to be reallocated are [formula not reproduced], where [formula not reproduced] is the number of elements of J. The procedure returns to step (3.3.2).
Step (3.3.4): The memory resources allocated to a client are divided evenly among all tasks of that client. For the tasks corresponding to the same c_j, [formulas not reproduced], where [formula not reproduced] is the memory resource allocated to one task t_z actually submitted by user c_j, nm_j^mid is the total memory resource allocated to that user obtained in step (3.3.2), and s_j is the total number of tasks sent by user c_j.
Step (3.4): From [nt_k] and [nm_k] obtained in steps (3.2) and (3.3), the queue Q [formula not reproduced] is reconstructed.
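The shape of this scheduling calculation (sum per client, share the pool fairly, then divide each client's share evenly among its tasks) can be sketched as follows. Because the patent's own formulas are not reproduced in this text, the sketch substitutes the textbook max-min fair allocation by progressive filling; it is not claimed to reproduce the patent's pre-allocation rule or its intermediate values. The function names are hypothetical, and the split of hospital2's total of 12 into two requests of 6 is an assumption made only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Task = Tuple[str, str, float]  # (client, task, requested amount)


def aggregate_by_client(tasks: List[Task]) -> Dict[str, float]:
    """Step (3.1): sum the requests of each client."""
    totals: Dict[str, float] = defaultdict(float)
    for client, _task, request in tasks:
        totals[client] += request
    return dict(totals)


def max_min_fair(demands: Dict[str, float], capacity: float) -> Dict[str, float]:
    """Textbook max-min fair share of `capacity` among clients with `demands`."""
    allocation = {client: 0.0 for client in demands}
    unsatisfied = set(demands)
    remaining = capacity
    while unsatisfied and remaining > 1e-12:
        share = remaining / len(unsatisfied)
        # Clients whose outstanding demand fits within the equal share are capped.
        capped = {c for c in unsatisfied if demands[c] - allocation[c] <= share}
        if not capped:
            # Nobody can be fully satisfied: everyone left gets the equal share.
            for c in unsatisfied:
                allocation[c] += share
            break
        for c in capped:
            remaining -= demands[c] - allocation[c]
            allocation[c] = demands[c]
        unsatisfied -= capped
    return allocation


def split_evenly(tasks: List[Task], allocation: Dict[str, float]) -> Dict[str, float]:
    """Steps (3.2.4)/(3.3.4): divide a client's share evenly among its tasks."""
    counts: Dict[str, int] = defaultdict(int)
    for client, _task, _request in tasks:
        counts[client] += 1
    return {task: allocation[client] / counts[client] for client, task, _ in tasks}


# Per-client totals [8, 12, 16] and NT = 32 come from the embodiment;
# the individual requests of task2 and task3 are assumed.
tasks: List[Task] = [
    ("hospital1", "task1", 8),
    ("hospital2", "task2", 6),
    ("hospital2", "task3", 6),
    ("hospital3", "task4", 16),
]
per_client = aggregate_by_client(tasks)
allocation = max_min_fair(per_client, capacity=32)
per_task = split_evenly(tasks, allocation)
```

The same three functions would be applied once to thread requests and once to memory requests, mirroring steps (3.2) and (3.3).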

The beneficial effects of the present invention are as follows.
The invention performs stream processing on the requests and operations of multi-center data computing, which improves program execution performance and resource allocation efficiency. A resource management log and RESTful services are set up to accurately control and record the memory and thread resources occupied and requested by Spark tasks from the multiple centers. Resources are allocated at each step of the stream computation using a policy based on the maximin criterion. The invention addresses the blocking delay caused by large numbers of threads in multi-center data co-computing, reduces the waiting time of individual users, and improves the flexibility and fairness of resource allocation.

FIG. 1 is a flowchart of the stream processing method for center co-computing according to the present invention.

The present invention is described in more detail below with reference to the drawing and a specific embodiment.
As shown in FIG. 1, the present invention provides a stream processing method for multi-center data co-computing based on Spark. The method is carried out in a multi-center data co-computing system comprising a plurality of clients and one computing terminal; the clients generate users' computing task requests and send them to the computing terminal, and the computing terminal parses the requests and generates and executes computing instructions.
The method comprises steps (1) to (5) as set out above: from building the RESTful services and the computing task queue Q in step (1), through to step (5), in which the data stream is persisted via StreamingContext.CheckPointing, a stream is stopped via StreamingContext.stop when the queue is updated, and the result of a completed task is returned to its client and the task is removed from the queue Q.

Further, in step (3), the client-based scheduling calculation follows steps (3.1) to (3.4) as set out above.

The following is a specific embodiment in which the stream processing method for multi-center data co-computing based on Spark according to the present invention is applied on a multi-center medical data co-computing platform. The embodiment comprises the following steps.
Step (1): RESTful services are built on the clients (three hospitals) and the computing terminal (a data center), and the computing task queue Q is [formula not reproduced], with L = 3. The third hospital, "hospital3", sends a new computing task request "task4" to the computing terminal. The request includes a thread resource request of 16, a memory request of 16, and the data to be computed for the task, "path4".
Step (2): The computing terminal parses the computing task request sent by client c_i to obtain [formula not reproduced].
Step (3): The computing terminal inserts [formula not reproduced] into the computing task queue Q [formula not reproduced] as one element.

It then starts the scheduling calculation. In the scheduling calculation, the requested resource values of every element in Q are optimized according to the maximin criterion on a per-client basis, nt_k and nm_k of every element are updated, and the queue Q becomes [formula not reproduced].

The scheduling calculation proceeds as follows.
Step (3.1): For the queue Q [formula not reproduced], L is the length of Q and L = 4. Because client "hospital2" has multiple records, the records are first summed per client to obtain Q_mid [formula not reproduced], where L_mid is the length of Q_mid and L_mid = 3.
Step (3.2): Thread resources are allocated by the following optimization procedure.
Step (3.2.1): The queue of total thread resource requests of all clients, [8, 12, 16], is sorted by size to give [8, 12, 16] and the subscript mapping M = [1, 2, 3]. With total thread resources NT = 32 in the computing resource pool of the computing center, the resources pre-allocated to [8, 12, 16] are [10, 10, 12].
Step (3.2.2): Since [formula not reproduced] exists, this set is denoted J [formula not reproduced], and the procedure goes to step (3.2.3).
Step (3.2.3): The thread resources to be reallocated are [formula not reproduced], where [formula not reproduced] is the number of elements of J, [formula not reproduced]. The procedure returns to step (3.2.2).
Step (3.2.2), repeated: Since [formula not reproduced] does not exist, the final thread resource allocation policy [formula not reproduced] is obtained, and the procedure goes to step (3.2.4).
Step (3.2.4): For the tasks of the same client "hospital2", [formula not reproduced].
Step (3.3): Memory resources are allocated by the following optimization procedure.
Step (3.3.1): The queue of total memory resource requests of all clients [formula not reproduced] is sorted by size to give [formula not reproduced] and the subscript mapping M = [formula not reproduced]. With the total memory resources in the computing resource pool of the computing center being [formula not reproduced], the resources pre-allocated to [formula not reproduced] are [formula not reproduced].
Step (3.3.2): Since [formula not reproduced] exists, this set is denoted J [formula not reproduced], and the procedure goes to step (3.3.3).
Step (3.3.3): The memory resources to be reallocated are [formulas not reproduced], where [formula not reproduced] is the number of elements of J. The procedure returns to step (3.3.2).
Step (3.3.2), repeated: [formula not reproduced] is obtained, and the procedure goes to step (3.3.4).
Step (3.3.4): For the tasks of the same client "hospital2", [formulas not reproduced].
Step (3.4): From [nt_k] and [nm_k] obtained in steps (3.2) and (3.3), the queue Q [formula not reproduced] is reconstructed.

Step (4): The length of the computing queue Q is computed [formula not reproduced]. Using 4 as the loop bound, four streams are created via Spark.StreamingContext (the interface in the Spark framework for creating stream processing tasks), and the resources allocated to each stream are declared via Spark.Conf (the interface in the Spark framework for configuring stream processing tasks). The actual stream tasks are then submitted to Spark in sequence.
Stream 1: load data "path1" and execute computing task "task1" on it, with 9 thread resources and 4 memory resources.
Stream 2: load data "path2" and execute computing task "task2" on it, with 9 thread resources and 9 memory resources.
Stream 3: load data "path3" and execute computing task "task3" on it, with 4 thread resources and 9 memory resources.
Stream 4: load data "path4" and execute computing task "task4" on it, with 10 thread resources and 10 memory resources.
Streams 1, 2, and 3 are checked; where intermediate results and computing task metadata exist, the task resumes directly from the corresponding step.
Step (5): For the tasks being stream-processed [formula not reproduced], StreamingContext.CheckPointing (the interface in the Spark framework for persisting the data of stream processing tasks) is used to persist the data stream at each of four stages of the stream processing: reading data into HDFS, caching preprocessed data, computing, and returning. The intermediate results and computing task metadata are stored in path1, path2, path3, and path4. At the same time, the queue is monitored for updates. When an update to the queue is detected, the stream is stopped via StreamingContext.stop (the interface in the Spark framework for stopping stream processing tasks) and the method returns to step (4). When the computing task of a stream completes, the task processing result is returned to the client corresponding to that stream processing task and the task is removed from the queue Q.
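The stop-and-resume behaviour of steps (4) and (5) can be illustrated without Spark. The sketch below is a plain-Python analogy under stated assumptions: an in-memory dictionary stands in for the checkpoint data kept alongside D_k, the stage functions are toy stand-ins for the four stages named in the text, and `run_with_checkpoints` with its `should_stop` callback is a hypothetical helper, not a Spark API.

```python
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_with_checkpoints(
    stages: List[Stage],
    store: Dict[str, Any],
    should_stop: Callable[[str], bool],
) -> bool:
    """Run stages in order, persisting each result; return True when all are done."""
    start = store.get("next_stage", 0)  # task metadata: where to resume
    value = store.get("value")          # intermediate result of the last stage
    for index in range(start, len(stages)):
        name, func = stages[index]
        if should_stop(name):
            return False                # queue changed: stop, state is already saved
        value = func(value)
        store["value"] = value
        store["next_stage"] = index + 1
    return True


calls: List[str] = []


def traced(name: str, func: Callable[[Any], Any]) -> Stage:
    """Wrap a stage so that each execution is recorded in `calls`."""
    def wrapper(value: Any) -> Any:
        calls.append(name)
        return func(value)
    return name, wrapper


stages = [
    traced("read", lambda _: [1, 2, 3]),
    traced("preprocess", lambda xs: [2 * x for x in xs]),
    traced("compute", lambda xs: sum(xs)),
    traced("return", lambda total: f"result={total}"),
]

store: Dict[str, Any] = {}
# First run: a queue update arrives just before the compute stage.
first = run_with_checkpoints(stages, store, lambda name: name == "compute")
# Second run: resumes from the saved stage instead of re-reading the data.
second = run_with_checkpoints(stages, store, lambda name: False)
```

The second call skips the read and preprocess stages because their output was saved before the stop, which is the effect the text describes for streams 1 to 3 after "task4" joins the queue.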

The above is merely an embodiment of the present invention and does not limit its scope of protection. Any modification, equivalent substitution, or improvement made without creative effort and without departing from the spirit and principles of the invention falls within the scope of protection of the invention.

Claims

A stream processing method for multi-center data co-computing based on Spark, wherein
the method is carried out in a multi-center data co-computing system comprising a plurality of clients and one computing terminal, the clients being configured to generate users' computing task requests and send them to the computing terminal, and the computing terminal being configured to parse the requests and to generate and execute computing instructions,
the method comprising:
step (1): building RESTful services on the clients and the computing terminal and setting up a computing task queue Q [formula not reproduced], where L is the length of Q, any client c_k sending a new computing task request t_k to the computing terminal, the request including a thread resource request nt_k, a memory resource request nm_k, and the data D_k to be computed for the task;
step (2): the computing terminal parsing the computing task request sent by client c_k to obtain [formula not reproduced];
step (3): the computing terminal inserting [formula not reproduced] into the computing task queue Q as one element and then starting the scheduling calculation, in which the requested resource values of every element in Q are optimized according to the maximin criterion on a per-client basis and nt_k and nm_k of every element are updated;
step (4): computing the length L of the queue Q [formula not reproduced]; using L as the loop bound, creating L streams via Spark.StreamingContext and declaring the resources allocated to each stream via Spark.Conf; and submitting the actual stream tasks k to Spark in sequence, each loading data D_k and executing computing task t_k, with a number of threads allocated that satisfies the thread resource request nt_k and memory allocated that satisfies the memory request nm_k, wherein, if D_k contains intermediate results and computing task metadata, the task resumes directly from the corresponding step; and
step (5): for each task being stream-processed [formula not reproduced], using StreamingContext.CheckPointing to persist the data stream at each of four stages of the stream processing, namely reading data into HDFS, caching preprocessed data, computing, and returning, and storing the intermediate results and computing task metadata in D_l; at the same time monitoring the queue for updates; when an update to the queue is detected, stopping the stream via StreamingContext.stop and returning to step (4); and, when the computing task of a stream completes, returning the task processing result to the client corresponding to that stream processing task and removing the task from the queue Q.