JP6940325B2

JP6940325B2 - Distributed processing system, distributed processing method, and distributed processing program

Info

Publication number: JP6940325B2
Application number: JP2017155083A
Authority: JP
Inventors: 隆文小池; 宏明郡浦; 貴央小川; 大剛関根
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2017-08-10
Filing date: 2017-08-10
Publication date: 2021-09-29
Anticipated expiration: 2037-08-10
Also published as: JP2019035996A

Description

The present invention relates to a distributed processing system, a distributed processing method, and a distributed processing program.

In recent years, big data technology, which derives new knowledge by analyzing large amounts of data and puts that knowledge to use, has been attracting attention. Various methods exist for analyzing such large amounts of data, including statistical analysis and machine learning, and these are often combined with other methods such as data shaping. When multiple methods are combined, for example, the processing order of the data is determined in advance in association with an analysis flow, and as soon as a prerequisite process completes, the data it produced is used as the input data of the next process, so that the final result is eventually obtained. If that result does not yield suitable knowledge, a new analysis flow is determined by changing the methods or conditions. Because analyzing large amounts of data requires this kind of repeated trial and error, the amount of data processed grows and the time required for analysis is often long.

Therefore, to shorten the processing time, the data is generally not processed by a single computer; instead, a plurality of computers perform the processing in a distributed, parallel manner. For example, the data to be analyzed is stored in a system called a data lake, which uses a distributed file system or the like. A plurality of computers then read the data in this data lake divided into a plurality of sections, shape or analyze the data they read, store the analyzed data in the data lake or the like, and display the contents to the user. Furthermore, since data analysis processing is temporary, virtual machine systems, which are easy to build, are used. That is, by starting virtual machines only when data analysis processing is required and having them execute the analysis processing in a distributed manner, the amount of resources used for the analysis processing can be kept down.

When resources are allocated to virtual machines, a known technique manages the allocation so that no virtual machine is given more than a predetermined amount of resources, in order to keep operation from becoming unstable. For example, Patent Document 1 describes a computer management system that, when any virtual machine has an average resource usage rate exceeding a predetermined policy value, secures resources from other virtual machines that satisfy the policy value so that the policy value is satisfied, and, when the secured resource capacity is larger than the surplus resources of the physical server hosting the virtual machine that receives the resources, issues a change instruction that allocates resources so as not to exceed the resource capacity of the physical server.

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-130413

However, when a plurality of data analyses are executed on virtual machines, the allocation of resources used for processing differs depending on the content of each analysis, and as a result data processing may not be performed efficiently. In the technique described in Patent Document 1, the execution order constraints of the data analyses and the resource capacity constraints of the physical server are held fixed throughout the analysis processing, so the allocation of resources among the data analyses is also fixed. As a result, data processing may become unstable and problems such as processing delays may occur.

The present invention has been made in view of this situation, and an object of the present invention is to provide a distributed processing system, a distributed processing method, and a distributed processing program capable of distributing and processing data stably.

One aspect of the present invention for solving the above problems is a distributed processing system that includes a plurality of virtual machines, is provided with a processor and a memory, and, for predetermined processing comprising a plurality of processing flows each of which processes predetermined data composed of a plurality of partitions partition by partition, can execute the processing flows in parallel by assigning each processing flow to at least one of the plurality of virtual machines. The system comprises: a flow table storage unit that stores the processing order of the plurality of processing flows for the data of the partitions; a processing load calculation unit that calculates, for each partition, the load placed on the virtual machines when a processing flow processes the data of that partition; a flow management unit that determines, based on the current execution state of each processing flow, the processing order of the processing flows, and the load calculated for each processing flow, a combination of the processing flows to be executed in parallel and the virtual machines that execute them; and a machine control unit that causes each virtual machine to execute the parallel processing indicated by the determined combination. The processing load calculation unit calculates, as the load, a predicted time for executing each processing flow. When a plurality of processing flows are to be executed in parallel, the flow management unit determines, based on the calculated predicted times, the virtual machines to be assigned to each of the plurality of processing flows or a priority regarding that assignment. When the processing flows to be executed in parallel include a processing flow for which the predicted time has not been calculated, the flow management unit makes the number of virtual machines executing each of the plurality of processing flows equal.

According to the present invention, data can be distributed and processed stably.

FIG. 1 is a diagram showing an example of the configuration of the distributed processing system 100 according to the present embodiment.
FIG. 2 is a diagram showing an example of the hardware configuration of each information processing device (virtual machine management server 101, virtual machine execution server 102, data lake 104, and user operation terminal 105) in the distributed processing system 100.
FIG. 3 is a diagram illustrating an example of the functions provided in each information processing device.
FIG. 4 is a diagram showing an example of the analysis flow table 300.
FIG. 5 is a diagram showing an example of the processing state management table 400.
FIG. 6 is a diagram showing an example of the VM management table 500.
FIG. 7 is a sequence diagram showing an example of the analysis processing performed by the distributed processing system 100.
FIG. 8 is a flowchart illustrating the details of the flow execution target determination process.
FIG. 9 is a flowchart showing the details of the virtual machine allocation determination process.
FIG. 10 is an example of a screen, displayed by the user operation terminal 105, showing the progress or result of the analysis processing.
FIG. 11 is a diagram showing an example of time-series changes in the hardware resource usage of the virtual machine execution server 102 and the data lake 104 when the analysis processing is executed in a conventional distributed processing system.
FIG. 12 is a diagram showing an example of time-series changes in the hardware resource usage of the virtual machine execution server 102 and the data lake 104 when the analysis processing is executed in the distributed processing system 100 of the present embodiment.

<System configuration>
FIG. 1 is a diagram showing an example of the configuration of the distributed processing system 100 according to the present embodiment. The distributed processing system 100 is an information processing system that performs processing for analyzing changes in predetermined data (hereinafter referred to as target data); this processing is hereinafter referred to as analysis processing.

The target data is, for example, data on changes over time in temperature, pressure, or velocity, and exists in large quantities; it is so-called big data. The target data consists of a plurality of data partitions, for example, data divided into predetermined time slots.

As shown in FIG. 1, the distributed processing system 100 includes: a data lake 104, which is a database that stores the target data; a virtual machine execution server 102 that includes at least one virtual machine 103 (a so-called virtual server) that reads the target data from the data lake 104 and executes the analysis processing; a virtual machine management server 101 that gives instructions regarding the analysis processing to each of the virtual machines 103; and a user operation terminal 105 that is used by an administrator, a user, or the like of the distributed processing system 100 (hereinafter referred to as a user) to give instructions regarding the analysis processing and to display information regarding the analysis processing.

The information processing devices, namely the virtual machine management server 101, the virtual machine execution server 102, the data lake 104, and the user operation terminal 105, are communicably connected to one another by a network 108 consisting of, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, a dedicated line, or the like.

The analysis processing includes a plurality of processing units (hereinafter also referred to as analysis flows) that process the target data. An analysis flow can be any of various processes for processing and analyzing big data, such as various statistical analysis processes or processes related to machine learning.

At least two virtual machines 103 are provided in the distributed processing system 100. By assigning each analysis flow of the analysis processing (more specifically, the processing of the data in each data partition) to at least one of the plurality of virtual machines 103, the distributed processing system 100 can execute the analysis flows in parallel on the plurality of virtual machines 103.

FIG. 2 is a diagram showing an example of the hardware configuration of each information processing device (virtual machine management server 101, virtual machine execution server 102, data lake 104, and user operation terminal 105) in the distributed processing system 100. As shown in the figure, each information processing device includes a processor 51 such as a CPU (Central Processing Unit), a main storage device 52 such as a RAM (Random Access Memory) or a ROM (Read Only Memory), an auxiliary storage device 53 such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), an input device 54 consisting of a keyboard, a mouse, a touch panel, or the like, and an output device 55 consisting of a monitor (display) or the like.

Although not shown in FIG. 1, information processing devices (various servers, sensors, and the like) that measure changes in the target data over time may be connected to the data lake 104, and these devices may transmit their measured values (the target data) to the data lake 104 for storage.

Next, the functions provided by each information processing device will be described.
<Function>
FIG. 3 is a diagram illustrating an example of the functions provided in each information processing device.
First, the virtual machine management server 101 includes a flow table storage unit 211, a processing state management table storage unit 212, a processing load calculation unit 201, a flow management unit 202, a machine management table storage unit 213, and a machine control unit 203.

The flow table storage unit 211 stores the processing order of the plurality of processing units for the data of the partitions.

That is, the flow table storage unit 211 holds an analysis flow table 300. Details of the analysis flow table 300 will be described later.

The processing state management table storage unit 212 stores the current execution state of each analysis flow in a processing state management table 400. Details of the processing state management table 400 will be described later.

The machine management table storage unit 213 stores constraint conditions on the parallel execution of the processing units by the information processing devices (virtual machines 103).

Specifically, the machine management table storage unit 213 stores, as the constraint condition, the maximum number of information processing devices that can execute the processing units in parallel.

That is, the machine management table storage unit 213 holds a VM management table 500. Details of the VM management table 500 will be described later.

The processing load calculation unit 201 calculates, for each partition, the load placed on the information processing devices (virtual machines 103) when a processing unit (analysis flow) processes the data of that partition.

Specifically, the processing load calculation unit 201 calculates, as the load, a predicted time for executing the processing of each processing unit.

The flow management unit 202 determines the combination of the processing units to be executed in parallel and the information processing devices that execute them, based on the current execution state of each processing unit (analysis flow), the processing order of the processing units, and the load that the processing load calculation unit 201 calculated for each processing unit.

Specifically, the flow management unit 202 selects information processing devices that satisfy the constraint condition as the information processing devices (virtual machines 103) that execute the processing units in parallel.

Further, when a plurality of processing units (analysis flows) are to be executed in parallel, the flow management unit 202 determines, based on the predicted times calculated by the processing load calculation unit 201, the information processing devices (virtual machines 103) to be assigned to each of the plurality of processing units, or a priority regarding that assignment.

Further, when the plurality of processing units to be executed in parallel include a processing unit for which the predicted time has not been calculated, the flow management unit 202 makes the number of information processing devices executing each of the plurality of processing units equal.
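The two allocation rules above can be illustrated with a minimal Python sketch. The function name `allocate_vms` and the greedy weighting are assumptions made for illustration only: the text states that the assignment is based on the predicted times but does not give a formula, so the "largest predicted time per VM" rule here is one possible reading, not the method of the embodiment. The equal split for the case where a prediction is missing follows the text directly.

```python
from typing import Dict, Optional


def allocate_vms(predicted_minutes: Dict[str, Optional[float]],
                 total_vms: int) -> Dict[str, int]:
    """Split total_vms among analysis flows that run in parallel.

    predicted_minutes maps a flow name to its predicted time, or None
    when no predicted time has been calculated for that flow yet.
    """
    flows = list(predicted_minutes)
    if not flows:
        return {}
    if total_vms < len(flows):
        raise ValueError("need at least one VM per flow")

    # Rule from the text: if any flow lacks a predicted time, every flow
    # gets the same number of VMs (any remainder is left unassigned).
    if any(t is None for t in predicted_minutes.values()):
        share = total_vms // len(flows)
        return {f: share for f in flows}

    # Assumption (not from the text): start with one VM per flow, then
    # hand each remaining VM to the flow with the largest predicted
    # time per already-assigned VM.
    alloc = {f: 1 for f in flows}
    for _ in range(total_vms - len(flows)):
        neediest = max(flows, key=lambda f: predicted_minutes[f] / alloc[f])
        alloc[neediest] += 1
    return alloc


if __name__ == "__main__":
    print(allocate_vms({"analysis process A": 12, "analysis process B": 4}, 4))
    print(allocate_vms({"analysis process A": 12, "analysis process B": None}, 5))
```

With predicted times of 12 and 4 minutes and four VMs, the greedy rule gives three VMs to the longer flow; when one prediction is missing, both flows receive two VMs each.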

Further, when determining the processing units to be executed in parallel, if a processing unit has a plurality of data partitions that it can process, the flow management unit 202 decides that the processing unit processes only the partition that is predetermined to be processed first.
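As a rough illustration of this selection rule, the following Python sketch picks a single partition for one processing unit. It assumes that the predetermined order is available as a list of partition names and that execution states are plain strings such as "executable"; both representations, and the function name, are hypothetical.

```python
from typing import Dict, List, Optional


def first_executable_partition(partition_order: List[str],
                               states: Dict[str, str]) -> Optional[str]:
    """Return the first partition, in the predetermined order, that is executable.

    partition_order is the (assumed) predetermined processing order;
    states maps a partition name to its current execution state.
    """
    for name in partition_order:
        if states.get(name) == "executable":
            return name
    return None  # nothing can be started for this processing unit


if __name__ == "__main__":
    order = ["section A", "section B", "section C"]
    states = {
        "section A": "execution completed",
        "section B": "executable",
        "section C": "executable",
    }
    # Both B and C could run, but only the earlier one is selected.
    print(first_executable_partition(order, states))
```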

The machine control unit 203 causes each information processing device to execute the parallel processing indicated by the combination determined by the flow management unit 202.

Next, the virtual machine execution server 102 runs the virtual machines 103. Each virtual machine 103 includes a transmission/reception unit 204 and an analysis processing unit 205.

The transmission/reception unit 204 transmits and receives data related to the analysis processing. The analysis processing unit 205 has various statistical analysis functions and executes each analysis flow of the analysis processing by, for example, extracting, analyzing, and storing data.

The data lake 104 includes a data storage unit 206 and a data read/write unit 207. The data storage unit 206 stores the target data and, in response to data read requests and write requests from the virtual machines 103, transmits and receives various data including the target data. The data read/write unit 207 reads and writes various data including the target data.

The target data stored in the data lake 104 is, for example, a set of data items that follow a predetermined format. For example, when the target data is time-series data, it is divided into the plurality of data partitions described above according to time or time slot. An analysis flow performs predetermined analysis processing on the data of each data partition.
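To make the idea of time-slot partitions concrete, here is a small Python sketch that groups timestamped samples into fixed-width slots. The fixed width, the epoch-aligned slot boundaries, and the names `Sample` and `split_into_partitions` are assumptions for illustration; the text only says that the data is divided by time or time slot.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, measured value)

_EPOCH = datetime(1970, 1, 1)


def split_into_partitions(samples: List[Sample],
                          width: timedelta) -> Dict[datetime, List[Sample]]:
    """Group samples into fixed-width time slots, keyed by slot start time."""
    if width <= timedelta(0):
        raise ValueError("width must be positive")
    partitions: Dict[datetime, List[Sample]] = defaultdict(list)
    for ts, value in samples:
        # Integer number of whole slots since the epoch.
        slot_index = (ts - _EPOCH) // width
        slot_start = _EPOCH + slot_index * width
        partitions[slot_start].append((ts, value))
    return dict(partitions)


if __name__ == "__main__":
    data = [
        (datetime(2017, 8, 10, 0, 10), 21.5),
        (datetime(2017, 8, 10, 0, 50), 21.9),
        (datetime(2017, 8, 10, 1, 5), 22.4),
    ]
    for start, rows in sorted(split_into_partitions(data, timedelta(hours=1)).items()):
        print(start, len(rows))
```

Each resulting group corresponds to one data partition that an analysis flow could process independently.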

In the following, the processing, within an analysis flow, of the data of one data partition is referred to as a section flow.

The user operation terminal 105 includes an output unit 208. The output unit 208 outputs the results of the processing that each information processing device executed for the processing units, or information on the amount of data input and output generated by that processing.

Next, each table (database) stored in the distributed processing system 100 will be described.

<Analysis flow table>
FIG. 4 is a diagram showing an example of the analysis flow table 300. As shown in the figure, the analysis flow table 300 consists of one or more records, each having the following items: a flow name 611, in which the identification information of an analysis flow (hereinafter referred to as the flow name) is stored; an input 612, in which information specifying the data (or its type) input to the analysis flow indicated by the flow name 611 is stored; a processing method 613, in which information indicating the type of processing executed by the analysis flow indicated by the flow name 611 (for example, information on the type of statistical analysis or machine learning) is stored; and an output 614, in which information specifying the data (or its type) output from the analysis flow indicated by the flow name 611 is stored.

In the following, the data stored in the input 612 when the analysis flow indicated by the flow name 611 is the first process executed in the analysis processing is referred to as initial input data. The data stored in the output 614 when the analysis flow indicated by the flow name 611 is the last process executed in the analysis processing is referred to as final output data. Data stored in the input 612 or the output 614 in all other cases is referred to as intermediate data.

In the example in the figure, "analysis process A" reads (receives) the target data that has been stored in the data lake 104 from the start, "analysis process B" reads (receives) "intermediate data A" output by "analysis process A", and "analysis process C" reads (receives) "intermediate data B" output by "analysis process B". The data input to "analysis process A" is initial input data, the data output by "analysis process A" and "analysis process B" is intermediate data, and the data output by "analysis process C" is final output data.

In this way, by storing information that specifies the data each analysis flow transmits or receives, the analysis flow table 300 defines the processing order among the analysis flows in the analysis processing. The contents of the analysis flow table 300 are, for example, input in advance by the user before the analysis processing.
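The way the input and output items imply a processing order can be sketched in Python as follows. The record layout mirrors items 611 to 614, but the class `FlowRecord`, the function `processing_order`, and the processing method labels are hypothetical; the text does not say how the order is actually derived from the table, so this dependency resolution is only one plausible approach.

```python
from typing import List, NamedTuple


class FlowRecord(NamedTuple):
    flow_name: str    # item 611
    input_data: str   # item 612
    method: str       # item 613 (labels below are made up)
    output_data: str  # item 614


def processing_order(table: List[FlowRecord]) -> List[str]:
    """Order flows so that each runs after the flow producing its input."""
    producers = {r.output_data for r in table}
    available = set()          # outputs produced by flows already ordered
    ordered: List[str] = []
    pending = list(table)
    while pending:
        # A flow is ready if its input is initial data (no flow produces it)
        # or has already been produced.
        ready = [r for r in pending
                 if r.input_data not in producers or r.input_data in available]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        for r in ready:
            ordered.append(r.flow_name)
            available.add(r.output_data)
        pending = [r for r in pending if r not in ready]
    return ordered


if __name__ == "__main__":
    table = [
        FlowRecord("analysis process C", "intermediate data B", "method-3", "final output data"),
        FlowRecord("analysis process A", "target data", "method-1", "intermediate data A"),
        FlowRecord("analysis process B", "intermediate data A", "method-2", "intermediate data B"),
    ]
    print(processing_order(table))
```

Even with the records listed out of order, the example resolves to analysis process A, then B, then C, matching the example in the figure.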

<Processing state management table>
FIG. 5 is a diagram showing an example of the processing state management table 400. The processing state management table 400 stores the execution state of the section flows of each analysis flow, on the premise of the processing order of the analysis flows defined by the analysis flow table 300.

That is, the processing state management table 400 consists of one or more records, each having the following items: a flow name 621, in which the flow name is stored; a section name 622, in which the identification information (hereinafter referred to as the section name) of a section flow in the analysis flow indicated by the flow name 621 is stored; an execution state 623, in which information indicating the current execution state of the section flow indicated by the section name 622 is stored (for example, whether it is being executed ("executing"), has completed ("execution completed"), has not been executed but can be ("executable"), or cannot be executed because the data required to start it has not yet been generated ("not executable")); a processing time 624, in which the time the execution took (hereinafter referred to as the execution time) is stored when the execution state 623 is "execution completed"; and a predicted time 625, in which the estimated time (predicted time) required to execute the section flow indicated by the section name 622 is stored. The predicted time 625 holds a predicted time when no execution time is stored in the processing time 624 (that is, when execution of the section flow indicated by the section name 622 has not completed).

As will be described later, the predicted time is calculated based on, for example, the execution times of other section flows executed in the past.

In the example in the figure, "section A", a section flow of "analysis process A", has completed, so "12 minutes" is recorded in its processing time 624. For "section C" of "analysis process A", execution of "section A", which performs the same type of statistical analysis, has completed, so the same "12 minutes" is recorded in its predicted time 625. To allow accurate comparison of predicted times between section flows, the predicted time 625 stores the predicted time for the case where a single virtual machine 103 executes the section flow.
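The example above can be mimicked with a short Python sketch that fills in predicted times from completed section flows of the same analysis flow. Using the mean of past execution times is an assumption (with a single completed section, as in the example, it simply copies that value); the class `SectionState` and the function `fill_predictions` are hypothetical names, and the text does not specify the actual estimation formula.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class SectionState:
    flow_name: str                           # item 621
    section_name: str                        # item 622
    state: str                               # item 623
    processing_min: Optional[float] = None   # item 624 (measured, single VM)
    predicted_min: Optional[float] = None    # item 625 (estimated, single VM)


def fill_predictions(rows: List[SectionState]) -> None:
    """Set predicted_min for unfinished sections from finished ones of the same flow."""
    for row in rows:
        if row.processing_min is not None:
            continue  # finished: the measured time exists, no prediction needed
        history = [r.processing_min for r in rows
                   if r.flow_name == row.flow_name and r.processing_min is not None]
        # Assumption: average the past execution times; None if there are none.
        row.predicted_min = mean(history) if history else None


if __name__ == "__main__":
    rows = [
        SectionState("analysis process A", "section A", "execution completed", 12.0),
        SectionState("analysis process A", "section C", "executable"),
        SectionState("analysis process B", "section A", "not executable"),
    ]
    fill_predictions(rows)
    for r in rows:
        print(r.flow_name, r.section_name, r.predicted_min)
```

A flow with no completed section keeps a missing prediction, which is the situation in which the equal-split rule for virtual machines applies.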

The contents of the processing state management table 400 are updated at a predetermined timing or at predetermined time intervals.

<VM management table>
FIG. 6 is a diagram showing an example of the VM management table 500. As shown in the figure, the VM management table 500 consists of one or more records, each having the following items: an execution server name 711, in which the identification information (hereinafter referred to as the execution server name) of a virtual machine execution server 102 is stored; a maximum VM allocatable number 712, in which information is stored indicating the maximum number of virtual machines 103 on the virtual machine execution server 102 indicated by the execution server name 711 to which analysis flows (or their section flows) can be assigned (hereinafter referred to as the maximum number); and an allocated VM number 713, in which information is stored indicating the number of virtual machines 103 currently allocated to the virtual machine execution server 102 indicated by the execution server name 711 (hereinafter referred to as the current number).

The contents of the allocated VM number 713 in the VM management table 500 are updated at a predetermined timing (for example, at predetermined time intervals, when a virtual machine 103 starts, or when a virtual machine 103 stops). The other items are, for example, input by the user before the analysis processing is executed.

In the example in the figure, three virtual machine execution servers 102, "execution server A", "execution server B", and "execution server C", are registered, and the maximum number for every virtual machine execution server 102 is "4".
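A minimal Python sketch of how such a table could be used to respect the maximum number is given below. The maximum of 4 per server comes from the example; the current allocation counts, the class `ServerRecord`, the function `place_vms`, and the first-fit placement order are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple


class ServerRecord(NamedTuple):
    server_name: str     # item 711
    max_vms: int         # item 712
    allocated_vms: int   # item 713


def place_vms(table: List[ServerRecord], requested: int) -> Dict[str, int]:
    """Return how many additional VMs each server takes, never exceeding max_vms."""
    placement: Dict[str, int] = {}
    remaining = requested
    for rec in table:
        free = max(rec.max_vms - rec.allocated_vms, 0)
        take = min(free, remaining)
        if take:
            placement[rec.server_name] = take
            remaining -= take
    if remaining:
        raise ValueError(f"{remaining} VM(s) exceed the registered capacity")
    return placement


if __name__ == "__main__":
    # Allocation counts (3, 4, 0) are made-up values for this example.
    table = [
        ServerRecord("execution server A", 4, 3),
        ServerRecord("execution server B", 4, 4),
        ServerRecord("execution server C", 4, 0),
    ]
    print(place_vms(table, 3))
```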

The functions of each information processing device described above are realized by the hardware of that information processing device, or by the processor 51 of that information processing device reading and executing the programs stored in the main storage device 52 or the auxiliary storage device 53. These programs are stored, for example, in a storage device such as a secondary storage device, a non-volatile semiconductor memory, a hard disk drive, or an SSD, or in a computer-readable non-transitory data storage medium such as an IC card, an SD card, or a DVD.

Next, each process performed by the distributed processing system 100 will be described.
<Analysis processing>
FIG. 7 is a sequence diagram showing an example of the analysis process performed by the distributed processing system 100. This process is started, for example, when information designating the analysis process to be executed is input from the user operation terminal 105 to the virtual machine management server 101.

First, the flow management unit 202 of the virtual machine management server 101 executes a process (hereinafter referred to as the flow execution target determination process) that identifies, for the designated analysis process, the analysis flows to be executed in parallel by the virtual machines 103 (hereinafter referred to as target flows) and their section flows (hereinafter referred to as target section flows) (S301). The details of this process will be described later.

Next, the flow management unit 202 sends the machine control unit 203 an instruction to assign processing to the virtual machines 103, accompanied by information on each target flow and target section flow identified in S301 (S302).

Upon receiving this instruction, the machine control unit 203 executes a process (hereinafter referred to as the virtual machine allocation determination process) that determines which of the virtual machines 103 each target section flow of each target flow is assigned to for parallel execution (S303). The details of this process will be described later. After this process ends, the machine control unit 203 updates the VM management table 500.

The machine control unit 203 sends a start instruction to each virtual machine 103 to which processing was assigned in S303 (S304). The flow management unit 202 then sends each virtual machine 103 started in S304 an instruction to execute the target section flow assigned to it, that is, to execute the parallel processing (S305).

The transmission/reception unit 204 of each virtual machine 103 that has received the instruction from the virtual machine management server 101 instructs the analysis processing unit 205 to execute the target section flow (S306).

The analysis processing unit 205 of each virtual machine 103 instructed in S306 executes the target section flow assigned to it (S307).

For example, one analysis processing unit 205 reads the initial input data of the target flow from the data storage unit 206 of the data lake 104 and executes its target section flow. Another analysis processing unit 205 executes its target section flow based on intermediate data. Yet another analysis processing unit 205 executes its target section flow, outputs the final output data, and sends it to the data storage unit 206 of the data lake 104.

When multiple virtual machines 103 (analysis processing units 205) execute the same target section flow of the same target flow in parallel, the initial input data or intermediate data is, for example, divided and assigned among those analysis processing units 205, and each analysis processing unit 205 executes the target section flow in parallel on the data assigned to it.

When the analysis processing unit 205 of each virtual machine 103 finishes executing the target section flow assigned to it, it sends a notification to that effect to the transmission/reception unit 204 (S308).

When the transmission/reception unit 204 of each virtual machine 103 receives this notification from the analysis processing unit 205, it sends a notification that execution of the target section flow has finished to the flow management unit 202 of the virtual machine management server 101 (S309).

When the flow management unit 202 has received this end notification from all the virtual machines 103 to which it sent the execution instruction in S305, it sends the machine control unit 203 an instruction to release the allocation instructed in S302 (S310). Upon receiving this instruction, the machine control unit 203 stops the operation (execution) of all the virtual machines 103 that were subject to the allocation (S311).

After that, the machine control unit 203 acquires from each virtual machine 103 the current execution state of its analysis flow (section flow), and updates the VM management table 500 based on the acquired states (S312). Specifically, for example, the machine control unit 203 updates the value of the allocated VM count 713 of the VM management table 500 with the number of virtual machines 103 currently executing an analysis flow (section flow).

Next, the flow management unit 202 generates and updates information on the execution state of the analysis process (S313).

Specifically, the flow management unit 202 first calculates the execution time of the target section flow performed by each virtual machine 103. For example, the flow management unit 202 calculates the time from when the processing of S304 was performed to when the processing of S309 was performed as the execution time.

The flow management unit 202 also instructs the processing load calculation unit 201 to calculate, based on the calculated execution time of the target section flow, the predicted time of the section flows that have not yet been executed. Specifically, for example, the processing load calculation unit 201 sets the predicted time of each section flow of an unexecuted analysis flow that performs the same type of statistical analysis as the target section flow whose execution time was calculated to be equal to that execution time. If the target section flow whose execution time was calculated was executed by multiple virtual machines 103, the predicted time based on it is converted into the time a single virtual machine 103 would take to execute the processing.
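The execution-time and predicted-time calculation can be sketched as follows. The description does not say how a time measured on several virtual machines is converted to a single-machine time, so the sketch assumes simple linear scaling (measured time multiplied by the number of virtual machines). The function names are hypothetical as well.

```python
def execution_time(start_time, end_time):
    """Elapsed time between the S304 start instruction and the S309 end notification."""
    return end_time - start_time


def single_vm_predicted_time(measured_time, vm_count):
    """Convert a time measured on vm_count VMs to a single-VM predicted time.

    ASSUMPTION: work scales linearly with the number of VMs; the source
    only states that a conversion is performed, not how.
    """
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return measured_time * vm_count
```

Under this assumption, a section flow that took 30 seconds on 2 virtual machines would be given a predicted time of 60 seconds for a single virtual machine.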

The flow management unit 202 stores the execution times and predicted times calculated as described above in the processing state management table 400. Specifically, the processing state management table storage unit 212 stores the calculated execution time of each target section flow in the processing time 624 of the corresponding record of the processing state management table 400, and stores each predicted time in the predicted time 625 of the corresponding record.

The processing of S301 to S313 above is repeated until all analysis flows have completed. This concludes the analysis process.

Next, the details of the flow execution target determination process and the virtual machine allocation determination process will be described.

<Flow execution target determination processing>
FIG. 8 is a flowchart illustrating the details of the flow execution target determination process. As shown in the figure, the flow management unit 202 first determines whether only one analysis flow is currently executable (S401). Specifically, for example, the flow management unit 202 checks how many records of the processing state management table 400 have "executable" stored in the execution state 623. This allows the number of flows that can be executed simultaneously to be determined without violating the execution order constraints among the analysis flows.

If only one analysis flow is currently executable (S401: YES), the flow management unit 202 stores that analysis flow as the target flow (S402). For example, the flow management unit 202 stores the record of the processing state management table 400 in which that target flow is stored. The processing of S410 is performed after that.

On the other hand, if multiple analysis flows are currently executable (S401: NO), the flow management unit 202 determines whether predicted times have been calculated for all of those analysis flows (S403). Specifically, for example, the flow management unit 202 refers to the predicted time 625 of the record of each of those analysis flows in the processing state management table 400 and checks whether a predicted time is stored.

If predicted times have been calculated for all of those analysis flows (S403: YES), the flow management unit 202 stores, for use by the virtual machine allocation determination process, the fact that the number or priority of the virtual machines 103 executing the analysis flows is to be appropriately weighted. It also stores all analysis flows found executable in S401 (all analysis flows whose predicted times were found calculated in S403) as target flows (S404). The processing of S410 is performed after that.

For example, the flow management unit 202 stores as target flows the records of those analysis flows in the processing state management table 400 for which every execution state 623 is "executable" (that is, no section is "executing").

On the other hand, if there is an analysis flow whose predicted time has not been calculated (S403: NO), the flow management unit 202 stores, for use by the virtual machine allocation determination process, the fact that the number or priority of the virtual machines 103 executing the analysis flows is to be made equal across the analysis flows. It also stores all analysis flows found executable in S401 as target flows (S405). The processing of S410 is performed after that.

In S410, the flow management unit 202 first selects one of the target flows determined in S404 or S405 and determines whether that target flow has multiple executable section flows (S406). Specifically, for example, the flow management unit 202 refers to the execution state 623 of the record of each section of the selected target flow in the processing state management table 400 and checks how many records have "executable" stored.

If the selected target flow has multiple executable section flows (S406: YES), the flow management unit 202 takes only the earliest of them (the one executed first, for example the section flow whose target data is earliest in time) as the target section flow of the selected target flow (S407). On the other hand, if the selected target flow has only one executable section (S406: NO), the flow management unit 202 takes that one section as the target section flow of the selected target flow (S408).

This processing prevents a situation in which some target flows, having more executable section flows than the other target flows, are given more section flows, so that the parallel processing can no longer be leveled across the target flows.

The flow management unit 202 repeats the processing of S406, S407, and S408 above for all target flows. The flow execution target determination process then ends (S409).
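The decision logic of S401 to S408 can be sketched roughly as below. This is an illustration under assumptions, not the implementation in the description: the processing state management table 400 is represented as a list of dictionaries in processing order, the keys `flow`, `section`, `state`, and `predicted_time`, the state value "waiting", and the policy labels are all hypothetical, and the S403 check is approximated by looking at the executable records only.

```python
def determine_targets(table):
    """Return (policy, {target flow: target section flow}).

    table: records in processing order, each a dict with hypothetical keys
    "flow", "section", "state", and "predicted_time" (None if not calculated).
    """
    executable = [r for r in table if r["state"] == "executable"]

    # S406-S408: per target flow, keep only the earliest executable section.
    targets = {}
    for r in executable:
        targets.setdefault(r["flow"], r["section"])

    if not targets:
        return None, {}
    if len(targets) == 1:                      # S401: YES -> S402
        policy = "single"
    elif all(r["predicted_time"] is not None for r in executable):
        policy = "weighted"                    # S403: YES -> S404
    else:
        policy = "equal"                       # S403: NO -> S405
    return policy, targets


sample_table = [
    {"flow": "A", "section": "B", "state": "executable", "predicted_time": 5.0},
    {"flow": "A", "section": "C", "state": "executable", "predicted_time": 5.0},
    {"flow": "B", "section": "A", "state": "executable", "predicted_time": None},
    {"flow": "C", "section": "A", "state": "waiting", "predicted_time": None},
]
```

With `sample_table`, flow A contributes only its earliest executable section B, and the missing predicted time of flow B leads to the equal-allocation policy.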

<Virtual machine allocation determination processing>
FIG. 9 is a flowchart showing the details of the virtual machine allocation determination process. First, the machine control unit 203 determines whether there is any virtual machine 103 to which the target section flows determined in the flow execution target determination process can be assigned (S501). Specifically, for example, the machine control unit 203 makes this determination by referring to the allocated VM count 713 and the maximum assignable VM count 712 of each record in the VM management table 500.

If there is no virtual machine 103 to which a target section flow can be assigned (S501: NO), the machine control unit 203 ends the virtual machine allocation determination process (S507). On the other hand, if there is a virtual machine 103 to which a target section flow can be assigned (S501: YES), the processing of S502 is performed.

In S502, the machine control unit 203 checks whether the number of target flows determined in the flow execution target determination process (hereinafter referred to as the target flow count) is 1 or greater than 1 (2 or more) (S502). If the target flow count is 1 (S502: NO), the machine control unit 203 stores (decides) that the processing of that target flow is to be assigned to all of the assignable virtual machines 103 identified in S501 (S503), and the virtual machine allocation determination process ends (S507).

On the other hand, if the target flow count is 2 or more (S502: YES), the machine control unit 203 executes the processing of S504.

That is, in S504, if the flow execution target determination process stored that the analysis flows are to be executed with weighting (S504: weighting), the machine control unit 203 determines, based on the predicted times of the target section flows of the target flows, which virtual machines 103 to assign to the target section flow of each target flow so that the end times of those target section flows are approximately the same (S505). The virtual machine allocation determination process then ends (S507).

This allocation is performed, for example, as follows: the machine control unit 203 calculates the reciprocal of the predicted time of the target section flow of each target flow, and determines the number of virtual machines 103 to assign to each target section flow in a ratio corresponding to these reciprocals. If a suitable number cannot be calculated (for example, if the number does not come out as a natural number), the machine control unit 203 instead sets a processing priority (for example, the amount of CPU or memory resources allotted) on each virtual machine 103 and assigns the target section flow of each target flow in that way.
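The weighted allocation just described can be sketched as follows, taking the text literally: the weights are the reciprocals of the predicted times, and the fallback to priorities is used whenever the resulting machine counts are not natural numbers. The function name, the return shape, and the use of exact fractions are assumptions made for illustration. How a priority would be mapped to CPU or memory shares is not specified in the description and is left out.

```python
from fractions import Fraction


def allocate_by_weight(predicted_times, available_vms):
    """Split available_vms among flows in the ratio of 1 / predicted time.

    predicted_times: dict mapping flow name -> positive predicted time.
    Returns ("vm_count", {flow: int}) when every share is a natural number,
    otherwise ("priority", {flow: Fraction}) with shares that sum to 1.
    """
    inverses = {name: 1 / Fraction(t) for name, t in predicted_times.items()}
    total = sum(inverses.values())
    shares = {name: available_vms * inv / total for name, inv in inverses.items()}

    if all(s.denominator == 1 and s >= 1 for s in shares.values()):
        return "vm_count", {name: int(s) for name, s in shares.items()}
    # Fallback: no natural-number split exists, so express the ratio as priorities.
    return "priority", {name: inv / total for name, inv in inverses.items()}
```

For predicted times of 10 and 30 and four available virtual machines, the reciprocals 1/10 and 1/30 are in the ratio 3:1, so the split is three machines and one machine. With three available machines no natural-number split exists and the function returns priorities of 3/4 and 1/4 instead.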

On the other hand, if the flow execution target determination process stored that the analysis flows are to be executed with equal allocation (S504: equal allocation), the machine control unit 203 determines which virtual machines 103 to assign to each target flow (target section flow) so that the numbers of virtual machines 103 executing the target section flows of the target flows are approximately equal to one another (S506). The virtual machine allocation determination process then ends (S507).
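Equal allocation (S506) can be sketched as an even split in which any leftover machines go to the first flows, so that counts differ by at most one. The function name and the tie-breaking rule are assumptions. The description only requires the counts to be approximately equal.

```python
def allocate_evenly(flow_names, available_vms):
    """Give each flow roughly the same number of VMs (counts differ by at most 1)."""
    if not flow_names:
        return {}
    base, extra = divmod(available_vms, len(flow_names))
    # ASSUMPTION: leftover VMs are handed to the flows listed first.
    return {
        name: base + (1 if i < extra else 0)
        for i, name in enumerate(flow_names)
    }
```

For example, two target flows and four available virtual machines yield two machines each, while three target flows and four machines yield a 2, 1, 1 split.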

With the virtual machine allocation determination process described above, multiple target section flows can be executed simultaneously in parallel, and the times required to execute those target section flows can be adjusted to be the same as one another. This prevents only some analysis flows from continuing to run for a long time.

The progress or result of the distributed processing executed as described above is displayed on the user operation terminal 105 or the like.

<Display example by user operation terminal 105>
FIG. 10 is an example of a screen, displayed by the user operation terminal 105, showing the progress or result of the analysis process. As shown in the figure, the display screen 1000 displays a table 1010 showing the execution state or execution result of the analysis process. For each virtual machine 103 on the virtual machine execution servers 102, the table 1010 displays in chronological order the current processing status 1011 (or the processing result) of each section flow of each analysis flow. The processing status 1011 also shows the amount of I/O generated by that processing. In addition, the total 1012 of the I/O amounts generated by the analysis process is displayed for each time period (for each data partition).

In this way, the display screen 1000 outputs information on the results of the analysis flows executed by each virtual machine 103, or information on the amount of data input and output generated by those analysis flows. The user can therefore confirm that the distributed processing system 100 is processing the analysis flows sequentially in parallel, and that the I/O load is thereby distributed and the distributed processing is running stably.

As described above, the distributed processing system 100 of the present embodiment stores the processing order of the processing units (analysis flows) for the partition data of the target data, and calculates, for each data partition, the load placed on the information processing devices (virtual machines 103) by each processing unit (analysis flow) processing each piece of partition data of the target data. Then, based on the current execution state of each processing unit, the processing order of the processing units, and the load calculated for each processing unit, it determines the combinations of processing units to be executed in parallel and the information processing devices that execute them, and causes the information processing devices to execute them in parallel. In this way, the distributed processing system 100 of the present embodiment can assign the processing of the analysis flows to the virtual machines 103 in units of data partitions according to the processing load of each data partition of the target data. As a result, the processing of the target data can be distributed across the virtual machines 103 in units of data partitions, and the target data can be processed in a stably distributed manner.

This effect is explained concretely below.

First, FIG. 11 is a diagram showing an example of the time-series change in the usage of the hardware resources of the virtual machine execution server 102 and the data lake 104 when the analysis process is executed in a conventional distributed processing system.

In this analysis process, analysis process B (an analysis flow) is executed using the intermediate data generated by analysis process A (an analysis flow), and analysis process C (an analysis flow) is executed using the intermediate data generated by analysis process B, producing the final output data. That is, analysis process B is executed after analysis process A completes, and analysis process C is executed after analysis process B completes.

In this case, the assignment of processing to the virtual machines 103 changes over time as follows. First, all the virtual machines 103 execute section A of analysis process A (reference numeral 801), and after that processing ends, the processing of the next section (section B of analysis process A) is executed (reference numeral 802). After the processing of all sections of analysis process A has finished, the processing of section A of the next analysis flow, analysis process B, is executed (reference numeral 803), and the processing of all sections of analysis process B is subsequently completed. Then the processing of section A of the next analysis flow, analysis process C, is executed (reference numeral 804), followed by the processing of all sections of analysis process C.

When the processing is executed in this way, the amount of data read from and written to the data lake 104 per unit time during analysis process B (812) is smaller than the corresponding read/write amount during analysis process A (811). The I/O (Input/Output) resources of the data lake therefore have headroom during analysis process B, and I/O resources are in surplus.

On the other hand, the read/write amount to the data lake per unit time during analysis process C (813) is larger than the read/write amount to the data lake 104 per unit time during analysis process B (812). Reads and writes to the data lake 104 during analysis process C may therefore be delayed, lengthening the execution time of the analysis process. In other words, I/O resources are in short supply.

Thus, because a conventional distributed processing system performs processing simply according to the execution order of the analysis flows, the I/O amount varies greatly with the type of processing in each analysis flow, which can cause processing delays and reduce the efficiency of the analysis process. In particular, if the I/O amounts differ greatly between analysis flows, a large, unexpected bottleneck can arise when execution switches from one analysis flow to another, delaying processing significantly and impairing the stability and reliability of the processing.

In contrast, FIG. 12 is a diagram showing an example of the time-series change in the hardware resource usage of the virtual machine execution server 102 and the data lake 104 when the analysis process is executed in the distributed processing system 100 of the present embodiment.

Each time the data processing of a section finishes, the distributed processing system 100 of the present embodiment determines the target flows and target section flows and assigns their processing to the virtual machines 103 in a suitable proportion (for example, so that the processing of the sections finishes at approximately the same time) (S301 and S303 in FIG. 7).

That is, in the first flow execution target determination process, only analysis process A is determined to be a target flow, and only the section flow corresponding to the first data partition is determined to be the target section flow (S407 in FIG. 8). As a result, only section A of analysis process A is executed (reference numeral 901).

In the next flow execution target determination process, the target flows are first determined to be the multiple analysis flows analysis process A and analysis process B. Because the predicted time of analysis process B has not yet been calculated, equal allocation is selected (S405 in FIG. 8). As a result, the processing of section B of analysis process A and the processing of section A of analysis process B are assigned evenly to (the same number of) virtual machines 103 and executed in parallel (reference numerals 902 and 903).

In the flow execution target determination process after that, the target flows are first determined to be analysis process C in addition to analysis process A (processing of section B) and analysis process B (processing of section A) mentioned above, and the target section flow of analysis process C is determined to be section A. As a result, the processing of section B of analysis process A, the processing of section A of analysis process B, and the processing of section A of analysis process C are executed in parallel (reference numeral 904).

In this way, because the processing of the individual sections of the analysis flows is executed in parallel, the processing is distributed not only per analysis flow but also per section. Compared with the conventional distributed processing system described above, the analysis processing is distributed more stably, and the processing load is distributed as well. That is, stabilization is achieved not only by distributing resources but also by distributing processing along the time axis.

Furthermore, in the distributed processing system 100 of the present embodiment, each analysis flow is divided into a plurality of section flows according to the time series (data partitions) of the data to be processed. Consequently, not only the load of arithmetic processing on the CPU and the like but also the load of input/output processing (I/O) is distributed along the time axis.

For example, because only analysis process A is executed after the first flow execution target determination process, the amount of data read from and written to the data lake 104 per unit time (hereinafter, the unit I/O amount) is the amount arising from the execution of analysis process A (reference numeral 911).

After the second flow execution target determination process, the unit I/O amount is roughly midway between the amount when analysis process A is executed alone and the amount when analysis process B is executed alone (reference numeral 912). After the third flow execution target determination process, the unit I/O amount is roughly midway among the amounts when analysis process A, analysis process B, and analysis process C are each executed alone (reference numeral 913).

Thus, according to the distributed processing system 100 of the present embodiment, the amount of data read from and written to the data lake 104 per unit time is leveled at a finer granularity along the time axis than in the ordinary distributed processing system described above. This prevents, for example, reads and writes to the data lake 104 from becoming a bottleneck in the analysis processing, so that the analysis processing can be performed stably.

That is, the distributed processing system 100 of the present embodiment levels not only the processor load but also the I/O load as resources used in the analysis processing, and can thereby stabilize the overall processing load of the distributed processing system 100.

In addition to the above effects, the distributed processing system 100 of the present embodiment stores constraint conditions on the parallel execution of processing units (analysis flows) by the information processing devices (virtual machines 103) in the VM management table 500, and determines virtual machines 103 that satisfy the constraint conditions as the virtual machines 103 that execute the processing units in parallel. Distributed processing, and the data processing that relies on it, can therefore be performed stably within the unavoidable constraints of the virtual machines 103.

For example, by storing, as a constraint condition, the maximum number of virtual machines 103 that can execute analysis flow processing in parallel, stable distributed processing and data processing can be performed in a way that satisfies the hardware or software specifications of the virtual machine execution server 102 and the virtual machines 103.
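The maximum-number constraint described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the names `VmConstraint`, `max_parallel_vms`, and `select_vms` are hypothetical, and the VM management table 500 is reduced to a single value for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmConstraint:
    """Hypothetical stand-in for one constraint held in the VM management table."""
    max_parallel_vms: int  # maximum number of VMs allowed to run analysis flows in parallel


def select_vms(idle_vms: list[str], requested: int,
               constraint: VmConstraint, running: int = 0) -> list[str]:
    """Return at most as many idle VMs as the constraint still allows.

    `running` is the number of VMs already executing analysis flows
    (an assumption of this sketch, not a value named in the source).
    """
    remaining = max(constraint.max_parallel_vms - running, 0)
    count = min(requested, remaining, len(idle_vms))
    return idle_vms[:count]
```

With a maximum of three virtual machines and one already running, a request for four virtual machines is cut down to the two that the constraint still permits.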

In addition, the distributed processing system 100 of the present embodiment calculates, for each processing unit (analysis flow), the predicted time required to execute that processing unit as the load on the information processing devices (virtual machine execution server 102 or virtual machines 103), and determines, based on the predicted times, how the information processing devices are allocated to the sections of the processing units executed in parallel (for example, the processing priority). Because the allocation reflects the actual operating record of the virtual machine execution server 102 or the virtual machines 103, processing can be distributed rationally. As a result, stable distributed processing and data processing can be performed.

In addition, when the processing units (analysis flows) to be executed in parallel include a processing unit for which the predicted time has not been calculated, the distributed processing system 100 of the present embodiment assigns the same number of information processing devices (virtual machines 103) to each of the processing units. Therefore, even when a virtual machine 103 has little operating history and the processing time of an analysis flow is unknown or inaccurate, distributed processing can be performed rationally.
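The two allocation cases described above (allocation based on predicted time, and equal allocation when a predicted time is missing) can be sketched as follows. This is an illustration under stated assumptions: the function name `allocate_vms` is hypothetical, the proportional-to-predicted-time rule is one possible reading of "allocation based on the predicted time" rather than the rule fixed by the embodiment, and the handling of leftover virtual machines when the count does not divide evenly is also an assumption.

```python
from typing import Optional


def allocate_vms(predicted: dict[str, Optional[float]], total_vms: int) -> dict[str, int]:
    """Split `total_vms` among analysis flows.

    `predicted` maps a flow name to its predicted time, or None if it has
    not been calculated yet.
    """
    flows = list(predicted)
    if not flows:
        return {}
    if any(predicted[f] is None for f in flows):
        # A predicted time is missing: fall back to equal allocation.
        weights = {f: 1.0 for f in flows}
    else:
        # Assumption of this sketch: longer predicted time -> more VMs.
        weights = {f: float(predicted[f]) for f in flows}
    total = sum(weights.values())
    if total <= 0:
        # Degenerate case (all predicted times are zero): treat flows equally.
        weights = {f: 1.0 for f in flows}
        total = float(len(flows))
    shares = {f: total_vms * weights[f] / total for f in flows}
    alloc = {f: int(shares[f]) for f in flows}
    # Hand out VMs lost to rounding, largest fractional share first.
    leftover = total_vms - sum(alloc.values())
    by_fraction = sorted(flows, key=lambda f: shares[f] - alloc[f], reverse=True)
    for f in by_fraction[:leftover]:
        alloc[f] += 1
    return alloc
```

For example, `allocate_vms({"A": 30.0, "B": None}, 4)` gives each flow two virtual machines, which corresponds to the equal-allocation case of analysis processes A and B in the second determination round.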

In addition, when determining the processing units (analysis flows) to be executed in parallel, if a processing unit has a plurality of data partitions that can be processed, the distributed processing system 100 of the present embodiment processes only the data of the partition that is earliest in the processing order. Therefore, even when the number of executable section flows differs among the analysis flows, the system avoids a situation in which only some analysis flows are executed disproportionately and the parallel processing can no longer be leveled.
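The earliest-partition rule described above can be sketched in a few lines of Python. The function name `pick_target_sections` and the representation of partitions as integers in processing order are assumptions made for illustration only.

```python
def pick_target_sections(ready: dict[str, list[int]]) -> dict[str, int]:
    """For each analysis flow, keep only the earliest processable partition.

    `ready` maps a flow name to the partition indices that are currently
    processable (lower index = earlier in the processing order).
    Flows with no processable partition are left out of the result.
    """
    return {flow: min(parts) for flow, parts in ready.items() if parts}
```

Even if one flow has several processable partitions and another has only one, each flow contributes exactly one section flow per determination round, which is what keeps a single flow from monopolizing the virtual machines.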

The above description of the embodiment is intended to facilitate understanding of the present invention and does not limit the present invention. The present invention may be modified and improved without departing from its spirit, and the present invention includes equivalents thereof.

For example, in the present embodiment the analysis processing is performed by a plurality of virtual machines 103, but a plurality of physical servers may perform the analysis processing instead. Because the distributed processing system 100 of the present embodiment supports distributed processing by either type of information processing device, hardware resources and software resources can be used efficiently.

For example, in the present embodiment tables (a database) are shown as the data storage format, but the data storage format is not limited to a database and may be of any kind.

In addition, in the present embodiment there are a plurality of virtual machine execution servers 102, but a single virtual machine execution server 102 may be provided instead, with a plurality of virtual machines 103 provided on that virtual machine execution server 102.

100 distributed processing system, 101 virtual machine management server, 102 virtual machine execution server, 103 virtual machine, 104 data lake, 211 flow table storage unit, 201 processing load calculation unit, 202 flow management unit, 203 machine control unit

Claims

A distributed processing system comprising a processor and a memory, the system being configured to include a plurality of virtual machines and being capable of, for a predetermined process made up of a plurality of processing flows each of which processes predetermined data consisting of a plurality of partitions partition by partition, executing the processing flows in parallel by allocating the processing flows to at least one of the plurality of virtual machines, the system comprising:
a flow table storage unit that stores an execution order of the plurality of processing flows for the data of the partitions;
a processing load calculation unit that calculates, for each partition, a load placed on the virtual machines by processing the data of the partition in accordance with the processing flow;
a flow management unit that determines a combination of the processing flows to be executed in parallel and the virtual machines that execute those processing flows, based on a current execution state of each processing flow, the execution order of each processing flow, and the load calculated for each processing flow; and
a machine control unit that causes each virtual machine to execute the parallel processing indicated by the determined combination,
wherein the processing load calculation unit calculates, as the load, a predicted time for executing the processing flow for each of the processing flows, and
the flow management unit:
when there are a plurality of processing flows to be executed in parallel, determines, based on the calculated predicted times, the virtual machines to be allocated to each of the plurality of processing flows or a priority regarding that allocation; and
when the plurality of processing flows to be executed in parallel include a processing flow for which the predicted time has not been calculated, makes the numbers of virtual machines executing the respective processing flows equal to one another.

The distributed processing system according to claim 1, further comprising a machine management table storage unit that stores a constraint condition relating to parallel execution of the processing flows by the virtual machines,
wherein the flow management unit determines virtual machines that satisfy the constraint condition as the virtual machines that execute the processing flows in parallel.

The distributed processing system according to claim 2,
wherein the machine management table storage unit stores, as the constraint condition, a maximum number of virtual machines capable of executing the processing flows in parallel, and
the flow management unit determines a number of virtual machines that satisfies the maximum number indicated by the constraint condition as the virtual machines that execute the processing flows in parallel.

The distributed processing system according to claim 1, wherein, when determining the processing flows to be executed in parallel, if there are a plurality of partitions of the data that can be processed in a processing flow, the flow management unit determines that only the data of the predetermined partition to be processed first is processed in that processing flow.

The distributed processing system according to claim 1, further comprising an output unit that outputs information regarding a result of the processing flows executed by each virtual machine or an amount of data input/output generated by the processing flows.

The distributed processing system according to claim 1, further comprising:
a machine management table storage unit that stores a constraint condition relating to parallel execution of the processing flows by the virtual machines; and
an output unit that outputs information regarding a result of the processing flows executed by each of the virtual machines or an amount of data input/output generated by the processing flows,
wherein the machine management table storage unit stores, as the constraint condition, a maximum number of virtual machines capable of executing the processing flows in parallel, and
the flow management unit:
determines a number of virtual machines that satisfies the maximum number indicated by the constraint condition as the virtual machines that execute the processing flows in parallel; and
when determining the processing flows to be executed in parallel, if there are a plurality of partitions of the data that can be processed in a processing flow, determines that only the data of the predetermined partition to be processed first is processed in that processing flow.

A distributed processing method in a distributed processing system that is configured to include a plurality of virtual machines and is capable of, for a predetermined process made up of a plurality of processing flows each of which processes predetermined data consisting of a plurality of partitions partition by partition, executing the processing flows in parallel by allocating the processing flows to at least one of the plurality of virtual machines,
wherein a machine management server comprising a processor and a memory executes:
a flow table storage process of storing an execution order of the plurality of processing flows for the data of the partitions;
a processing load calculation process of calculating, for each partition, a load placed on the virtual machines by processing the data of the partition in accordance with the processing flow;
a flow management process of determining a combination of the processing flows to be executed in parallel and the virtual machines that execute those processing flows, based on a current execution state of each processing flow, the execution order of each processing flow, and the load calculated for each processing flow; and
a machine control process of causing each virtual machine to execute the parallel processing indicated by the determined combination,
wherein, in the processing load calculation process, a predicted time for executing the processing flow is calculated as the load for each of the processing flows, and
in the flow management process:
when there are a plurality of processing flows to be executed in parallel, the virtual machines to be allocated to each of the plurality of processing flows, or a priority regarding that allocation, is determined based on the calculated predicted times; and
when the plurality of processing flows to be executed in parallel include a processing flow for which the predicted time has not been calculated, the numbers of virtual machines executing the respective processing flows are made equal to one another.

The distributed processing method according to claim 7,
wherein the machine management server further executes a machine management table storage process of storing a constraint condition relating to parallel execution of the processing flows by the virtual machines, and
in the flow management process, virtual machines that satisfy the constraint condition are determined as the virtual machines that execute the processing flows in parallel.

A distributed processing program that causes a distributed processing system comprising a processor and a memory, the system being configured to include a plurality of virtual machines and being capable of, for a predetermined process made up of a plurality of processing flows each of which processes predetermined data consisting of a plurality of partitions partition by partition, executing the processing flows in parallel by allocating the processing flows to at least one of the plurality of virtual machines, to execute:
a flow table storage process of storing an execution order of the plurality of processing flows for the data of the partitions;
a processing load calculation process of calculating, for each partition, a load placed on the virtual machines by processing the data of the partition in accordance with the processing flow;
a flow management process of determining a combination of the processing flows to be executed in parallel and the virtual machines that execute those processing flows, based on a current execution state of each processing flow, the execution order of each processing flow, and the load calculated for each processing flow; and
a machine control process of causing each virtual machine to execute the parallel processing indicated by the determined combination,
wherein, in the processing load calculation process, a predicted time for executing the processing flow is calculated as the load for each of the processing flows, and
in the flow management process:
when there are a plurality of processing flows to be executed in parallel, the virtual machines to be allocated to each of the plurality of processing flows, or a priority regarding that allocation, is determined based on the calculated predicted times; and
when the plurality of processing flows to be executed in parallel include a processing flow for which the predicted time has not been calculated, the numbers of virtual machines executing the respective processing flows are made equal to one another.

The distributed processing program according to claim 9,
wherein the program further causes execution of a machine management table storage process of storing a constraint condition relating to parallel execution of the processing flows by the virtual machines, and
in the flow management process, virtual machines that satisfy the constraint condition are determined as the virtual machines that execute the processing flows in parallel.