WO2013153620A1

WO2013153620A1 - Data processing system and data processing method

Info

Publication number: WO2013153620A1
Application number: PCT/JP2012/059789
Authority: WO
Inventors: 雅輝四ツ谷; 康郎國信; 敬行河野; 吉田　順
Original assignee: 株式会社日立製作所
Priority date: 2012-04-10
Filing date: 2012-04-10
Publication date: 2013-10-17

Abstract

[Problem] To provide a data processing system that can execute a parallel distributed process at a high speed. [Solution] The present invention configures a data processing system (100) provided with: a plurality of first server devices (130) that, in parallel, perform an analysis process comprising an extraction process for extracting columns and column values from data to be analyzed and an aggregation process for aggregating, with a key column as a key, data to be aggregated; and a second server device (140) for performing a process for determining the key column on the basis of the process results of a past analysis process, a process for producing key distribution information indicating the percentage of appearance of each column value in the key column, and a process for determining the plurality of first servers (130) to perform the aggregation process on the basis of the key distribution information. Also, the second server device (140) determines the plurality of first server devices (130) to perform the aggregation process by means of allocating data to be aggregated, which results from dividing the data to be analyzed into each column value of the key column, to the plurality of first server devices (130) in accordance with the percentage of appearance of the column values.

Description

Data processing system and data processing method

The present invention relates to a data processing system and a data processing method, and is suitable for application to a data processing system and a data processing method for parallel and distributed processing of large-scale data.

In recent years, opportunities for companies and individuals to hold large amounts of data have increased, and efforts to increase the value of data by analyzing it are becoming widespread, so the demand for technology that handles large amounts of data is growing. Methods for handling a large amount of data include improving the processing performance of an arithmetic processing device and increasing the number of arithmetic processing devices (parallel distributed processing).

A simple method for improving the processing performance of the arithmetic processing device is to use a high-performance arithmetic processing device equipped with components having high computing and processing speeds. However, this has the problem that the cost is higher than that of a general-purpose arithmetic processing device.

On the other hand, parallel distributed processing is a method of speeding up processing by having a plurality of arithmetic processing devices execute processes in parallel. By performing parallel distributed processing using a plurality of relatively inexpensive general-purpose arithmetic processing devices, it becomes possible to process a large amount of data at high speed and at low cost.

Patent Document 1 discloses a distributed processing framework called MapReduce as one method for realizing parallel distributed processing. MapReduce is a simplified programming model that describes data analysis as two kinds of processing: Map processing, which extracts the data subject to parallel distributed processing from the input data and generates data consisting of pairs of a key and a value, and Reduce processing, which aggregates the data extracted by the Map processing. In the Map processing, data consisting of pairs of a key and a value is extracted from the input data and divided, and intermediate data in which the divided data is bundled in units of keys is then generated. In the Reduce processing, the intermediate data is aggregated by combining the values of intermediate data having the same key. The MapReduce execution engine executes the Map processing on a plurality of computers in parallel, allocates the generated intermediate data to the computers according to the key, and controls execution of the Reduce processing.
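As an illustration only, and not as the implementation of Patent Document 1, the Map processing, the bundling of intermediate data by key, and the Reduce processing described above might be sketched in Python as follows. The function names and the sample records are hypothetical, and the sketch runs on a single machine rather than on a plurality of computers.

```python
from collections import defaultdict


def map_phase(records, extract):
    """Map: turn each input record into one (key, value) pair."""
    return [extract(record) for record in records]


def shuffle(pairs):
    """Bundle the intermediate data in units of keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped, aggregate):
    """Reduce: combine all values that share the same key."""
    return {key: aggregate(values) for key, values in grouped.items()}


# Hypothetical input: (user ID, category) pairs taken from a browsing log.
records = [("00001", "A"), ("00002", "B"), ("00003", "A"), ("00001", "A")]

# Key = category, value = 1, so the Reduce step counts views per category.
pairs = map_phase(records, lambda rec: (rec[1], 1))
counts = reduce_phase(shuffle(pairs), sum)
print(counts)
```

In this sketch the developer supplies only the extraction function and the aggregation function, which corresponds to the advantage of MapReduce noted in the text.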

In this way, MapReduce is suitable for distributed processing with a large-scale parallel configuration because processing can be executed by dynamically allocating processing to a plurality of computers. MapReduce also has the advantage that the developer only needs to define the group extraction method in Map processing and the data aggregation method in Reduce processing.

Patent Document 2 discloses a query processing method for speeding up query processing. In the query processing method disclosed in Patent Document 2, statistical information is calculated by examining the frequency of appearance of keys for all items (columns) of the data to be processed, and data processing is assigned to a plurality of nodes according to the calculated statistical information so that the difference in the amount of data allocated to each node is small.

Patent Document 1: US Pat. No. 7,650,331
Patent Document 2: JP 2004-213680 A

However, in MapReduce disclosed in Patent Document 1, the data extracted by the Map processing is assigned to the nodes that execute the aggregation processing according to the key value. Therefore, if the key distribution is biased, the aggregation processing stalls at the node to which the data corresponding to the biased key value is allocated, and that node becomes a bottleneck.
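The bottleneck described above can be illustrated with a small Python sketch. The assignment rule used here (CRC32 of the key modulo the number of nodes) is an assumption chosen only to make the example deterministic; it is not taken from Patent Document 1. The key values and record counts are likewise hypothetical.

```python
import zlib
from collections import Counter


def assign_by_key(keys, num_nodes):
    """Count records per node when the node is chosen from the key alone.

    Assumed rule for this sketch: CRC32(key) modulo the number of nodes.
    """
    load = Counter()
    for key in keys:
        node = zlib.crc32(key.encode("utf-8")) % num_nodes
        load[node] += 1
    return load


# Hypothetical biased distribution: key "A" accounts for 70% of the records.
keys = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10
load = assign_by_key(keys, num_nodes=4)
busiest = max(load.values())
print(f"busiest node holds {busiest} of {len(keys)} records")
```

Because every record with the same key goes to the same node, the node that receives key "A" holds at least 70 of the 100 records regardless of how many nodes are available.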

In addition, when the query processing method disclosed in Patent Document 2 is applied to MapReduce, it can be expected to allocate data processing to a plurality of nodes so that the difference in allocated data becomes small. However, because it is difficult to define in advance the columns to be used, statistical information is calculated for all columns of the analysis target data, and the large amount of calculation increases the processing load.

Furthermore, in the query processing method disclosed in Patent Document 2, when the analysis viewpoint is changed so that a previously unused column is used as the key, the statistical information for that column has not been retained. The data distribution destination therefore cannot be determined, and overhead may occur.

The present invention has been made in consideration of the above points, and intends to propose a data processing system and a data processing method capable of executing parallel distributed processing at high speed.

In order to solve such a problem, the present invention provides a data processing system comprising: a plurality of first server devices that perform, in parallel in each server device, analysis processing including an extraction process for extracting a column and a column value for each record from the analysis target data and an aggregation process for aggregating the aggregation target data using a key column as a key; and a second server device that performs a process for determining the key column based on processing results of past analysis processing by the plurality of first server devices, a process for creating key distribution information indicating the appearance ratio of each column value in the key column, and a process for determining, based on the key distribution information, the plurality of first server devices that perform the aggregation process. The second server device determines the plurality of first server devices that perform the aggregation process by assigning the aggregation target data, obtained by dividing the analysis target data for each column value of the key column, to the plurality of first server devices in accordance with the appearance ratio of the column values.

Also in order to solve such a problem, the present invention provides a data processing method in a data processing system having a plurality of first server devices that perform, in parallel in each server device, analysis processing including data extraction processing and aggregation processing, and a second server device that performs processing based on processing results of past analysis processing by the plurality of first server devices. The method comprises: a first step in which the plurality of first server devices extract a column and a column value for each record, in block units, from the analysis target data; a second step in which the second server device determines the key column based on the extraction result by the plurality of first server devices, creates key distribution information indicating the appearance ratio of each column value in the key column, and determines, based on the key distribution information, the plurality of first server devices that perform the aggregation processing; and a third step in which the plurality of first server devices aggregate the aggregation target data using the key column determined by the second server device as a key. In the second step, when determining the plurality of first server devices that perform the aggregation processing, the second server device assigns the aggregation target data, obtained by dividing the analysis target data for each column value of the key column, to the plurality of first server devices in accordance with the appearance ratio of the column values.

According to the present invention, parallel distributed processing can be performed at high speed, without performing unnecessary statistical processing and while avoiding the bottleneck caused by a biased key distribution.

A block diagram showing the configuration of the data processing system according to the first embodiment.
An explanatory diagram showing an example of the processing target data on the memory.
A flowchart showing the overall processing procedure of the data processing system shown in FIG.
A flowchart showing the processing procedure of the parallel distributed processing management server device.
A table showing the structure of the data arrangement information management table.
A table showing the structure of the process execution status management table.
A flowchart showing the processing procedure of the parallel distributed processing execution server device.
A flowchart (part 1) showing the processing procedure for creating the key distribution information of a key column.
A table showing the structure of the analysis application execution history management table.
A flowchart (part 2) showing the processing procedure for creating the key distribution information of a key column.
A table showing the structure of the key column candidate management table.
A table showing the structure of the key column threshold management table.
A table showing the structure of the key column management table.
A flowchart (part 3) showing the processing procedure for creating the key distribution information of a key column.
A table showing the structure of the key distribution management table.
A table showing the structure of the aggregation processing start condition management table.
A flowchart showing the processing by the data distribution destination determination unit shown in FIG.
A table showing the structure of the distribution destination candidate server device management table.
A table showing the structure of the distribution data allocation management table.
A table showing the structure of the distribution destination server device management table.
A block diagram showing the configuration of the data processing system according to the second embodiment.
A flowchart showing the processing procedure of the data distribution destination determination unit shown in FIG.
A table showing the structure of the analysis processing efficiency management table.
A table showing the structure of the distribution destination server device management table.

(1) First Embodiment

(1-1) Configuration according to this Embodiment

In FIG. 1, reference numeral 100 denotes, as a whole, a data processing system according to the first embodiment. The data processing system 100 includes a parallel distributed processing management server device 120, parallel distributed processing execution server devices 130 (130A, 130B,..., 130N), and a data distribution control server device 140, which are connected to each other via a network 110. The data processing system 100 is connected to the client device 310 via the network 110. Each component of the data processing system 100 is described below.

(1-1-1) Configuration of Parallel Distributed Processing Management Server Device

The parallel distributed processing management server device 120 includes a network interface 121, a CPU (Central Processing Unit) 122, a main storage device 123, and a secondary storage device 124, which are connected to each other via a bus 125. The parallel distributed processing management server device 120 manages execution of parallel distributed processing by the parallel distributed processing execution server device 130.

The network interface 121 is an interface for the parallel distributed processing management server device 120 to connect to the network 110. The CPU 122 is an arithmetic processing unit that executes a program stored in the main storage device 123.

The main storage device 123 is a storage device such as a RAM (Random Access Memory) that stores a program executed by the CPU 122 and data necessary for executing the program. The main storage device 123 stores a data registration processing unit 1231, a distributed processing management unit 1232, and an OS (Operating System) 1233. The data registration processing unit 1231 and the distributed processing management unit 1232 are, for example, application programs. The OS 1233 is a software program that provides basic functions that can be used by application programs and manages the entire parallel distributed processing management server device 120 by the operation of the CPU 122.

Note that “processing by the data registration processing unit 1231” is actually realized by “the CPU 122 operating, using basic functions provided by the OS 1233, according to the program of the data registration processing unit 1231”. For convenience, this is described as “the data registration processing unit 1231 performs processing”. Similarly, in the description of processing by other programs, the operations of the CPU and the OS may be omitted.

The secondary storage device 124 is a storage device such as a hard disk drive (HDD: Hard Disk Drive) that stores data. The secondary storage device 124 is not limited to a magnetic storage device such as an HDD, and may be another storage device, for example, a semiconductor storage device such as a flash memory. The secondary storage device 124 stores a data arrangement information management table 1241 and a process execution status management table 1242.

The functions of each unit stored in the main storage device 123 and the specific structure of each table stored in the secondary storage device 124 will be described later in the description of the parallel distributed processing according to this embodiment. Likewise, the functions of each unit stored in the main storage devices 133 and 143 of the parallel distributed processing execution server device 130 and the data distribution control server device 140, and the specific structure of each table stored in the secondary storage devices 134 and 144, will be described later in the description of the parallel distributed processing according to this embodiment.

The parallel distributed processing management server device 120, the parallel distributed processing execution server device 130, and the data distribution control server device 140 each have different application programs and tables, but they share the same hardware configuration, in which a network interface, a CPU, a main storage device, and a secondary storage device are connected to each other by a bus. The hardware configuration of the client device 310 is also the same as that of the parallel distributed processing management server device 120. Therefore, in the descriptions below of the configurations of the parallel distributed processing execution server device 130, the data distribution control server device 140, and the client device 310, the description of the parts that are the same as in the parallel distributed processing management server device 120 is omitted.

(1-1-2) Configuration of Parallel Distributed Processing Execution Server Device

The parallel distributed processing execution server devices 130 (130A, 130B,..., 130N) are a plurality of computers each having a network interface 131, a CPU 132, a main storage device 133, and a secondary storage device 134. In the parallel distributed processing execution server devices 130, parallel distributed processing is executed by the plurality of computers performing data extraction processing and aggregation processing in parallel. The aggregation processing by the parallel distributed processing execution server device 130 is performed using, as a key, the key column determined by the data distribution control server device 140.

The main storage device 133 stores a parallel distributed processing execution unit 1331 and an OS 1332. The parallel distributed processing execution unit 1331 is an analysis application program (analysis application) that performs parallel distributed processing by performing extraction processing, which extracts a column and a column value for each record from the data to be analyzed, and aggregation processing, which aggregates the data divided for each column value of the key column using the key column as a key. The analysis applications stored in the parallel distributed processing execution server devices 130A to 130N may be heterogeneous analysis applications that perform different extraction processing, or may be the same type of analysis application that performs the same extraction processing.

The secondary storage device 134 stores processing target data 1341 registered as data on which parallel distributed processing is to be performed. The processing target data 1341 is sequentially read into the main storage device 133 in units of at least one block, and the data that has been read (processing target data 1333 on the memory) is subjected to extraction processing by the analysis application.

FIG. 2 shows, as an example of the processing target data 1333 on the memory, website browsing logs of a plurality of users, one record per line. The processing target data 1333 in FIG. 2 has a table structure with the following items: a block name column 1333A, which describes the name of the block to which the record is allocated; a date/time information column 1333B, which describes the creation time of the record; a user name column 1333C, which describes a user ID that can identify the executor of the record; a content column 1333D, which describes the browsing destination URL (Uniform Resource Locator); and a category name column 1333E, which describes a category name associated in advance with the browsing destination URL. In the processing target data 1333, the records are divided into a plurality of blocks in units of a predetermined number of records or a predetermined time. The block columns 1333F to 1333J describe the records divided into blocks. For example, the top record in FIG. 2 belongs to block 1 and indicates that a URL classified as category A was browsed by the user with user ID 00001 at 10:11 on December 22, 2011.
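As a rough sketch of the record layout and the division into blocks described above, the following Python example models a browsing-log record and groups records into blocks of a fixed number of records. The class and function names, the block naming scheme, the URLs, and all sample records other than the first are hypothetical; division by time is not shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BrowsingRecord:
    timestamp: datetime  # date/time information column 1333B
    user_id: str         # user name column 1333C
    content: str         # content column 1333D (browsing destination URL)
    category: str        # category name column 1333E


def split_into_blocks(records, records_per_block):
    """Group records into consecutive blocks of at most records_per_block."""
    if records_per_block <= 0:
        raise ValueError("records_per_block must be positive")
    blocks = {}
    for start in range(0, len(records), records_per_block):
        name = f"block{start // records_per_block + 1}"  # assumed naming scheme
        blocks[name] = records[start:start + records_per_block]
    return blocks


records = [
    BrowsingRecord(datetime(2011, 12, 22, 10, 11), "00001", "http://example.com/a", "Category A"),
    BrowsingRecord(datetime(2011, 12, 22, 10, 12), "00002", "http://example.com/b", "Category B"),
    BrowsingRecord(datetime(2011, 12, 22, 10, 15), "00003", "http://example.com/a", "Category A"),
    BrowsingRecord(datetime(2011, 12, 22, 10, 20), "00001", "http://example.com/c", "Category C"),
    BrowsingRecord(datetime(2011, 12, 22, 10, 31), "00004", "http://example.com/d", "Category D"),
]
blocks = split_into_blocks(records, records_per_block=2)
for name, block in blocks.items():
    print(name, len(block))
```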

Further, when the parallel distributed processing execution unit 1331 performs extraction processing on the processing target data 1333 on the memory in FIG. 2, for example, date/time information, user ID, content, and category name are extracted as columns, and, for the column [category name], for example, [Category A] to [Category D] are extracted as column values.
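The extraction of columns and column values might look like the following minimal Python sketch, which assumes that each record is available as a dictionary mapping column names to values. The function name, the dictionary keys, and the sample records are hypothetical and do not represent the analysis application itself.

```python
def extract_columns(records):
    """Return {column name: sorted list of the distinct values seen}."""
    values = {}
    for record in records:
        for column, value in record.items():
            values.setdefault(column, set()).add(value)
    return {column: sorted(seen) for column, seen in values.items()}


# Hypothetical records modeled on the browsing log of FIG. 2.
records = [
    {"date_time": "2011-12-22 10:11", "user_id": "00001",
     "content": "http://example.com/a", "category": "Category A"},
    {"date_time": "2011-12-22 10:12", "user_id": "00002",
     "content": "http://example.com/b", "category": "Category B"},
    {"date_time": "2011-12-22 10:15", "user_id": "00003",
     "content": "http://example.com/c", "category": "Category C"},
    {"date_time": "2011-12-22 10:20", "user_id": "00001",
     "content": "http://example.com/d", "category": "Category D"},
]
columns = extract_columns(records)
print(list(columns))
print(columns["category"])
```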

(1-1-3) Configuration of Data Distribution Control Server Device

The data distribution control server device 140 is a computer having a network interface 141, a CPU 142, a main storage device 143, and a secondary storage device 144 that are connected to each other via a bus 145. The data distribution control server device 140 analyzes the extraction processing result of the processing target data 1341 by the parallel distributed processing execution unit 1331 of the parallel distributed processing execution server device 130, determines a key column, which is the column to be used as a key in the aggregation processing in the parallel distributed processing execution server device 130, and creates key distribution information based on each column value (key column value) of the key column in the processing target data 1341. In addition, so that the data to be subjected to aggregation processing (aggregation target data) is distributed to a plurality of parallel distributed processing execution server devices 130 for aggregation, the data distribution control server device 140 determines, based on the key distribution information of the key column, the parallel distributed processing execution server devices 130 to which the aggregation target data is allocated.

The main storage device 143 stores an analysis application execution history management unit 1431, a key column candidate extraction unit 1432, a key distribution calculation unit 1433, a data distribution destination determination unit 1434, and an OS 1435. The analysis application execution history management unit 1431, the key column candidate extraction unit 1432, the key distribution calculation unit 1433, and the data distribution destination determination unit 1434 are, for example, application programs.

The secondary storage device 144 stores an analysis application execution history management table 1441, a key distribution calculation target column management table 1442, a key distribution management table 1443, a distribution destination server device management table 1444, a key column candidate management table 1445, a key column threshold management table 1446, a key distribution calculation target column management table 1447, an aggregation processing start condition management table 1448, a distribution destination candidate server device management table 1449, and a distribution data allocation management table 1450.

(1-1-4) Configuration of Client Device

The client device 310 is a computer having a network interface 311, a CPU 312, a main storage device 313, and a secondary storage device 314 that are connected to each other via a bus 315. In accordance with a user operation, the client device 310 transmits to the data processing system 100 a data registration execution request for requesting registration of processing target data, or a data analysis processing execution request for requesting analysis of processing target data.

The main storage device 313 stores a client processing unit 3131 and an OS 3132. The client processing unit 3131 is an application program that performs processing such as reading and writing of client processing data 3141 and transmission of an execution request to the data processing system 100.

The secondary storage device 314 stores client processing data 3141 that is the source of the processing target data 1341. After a data registration execution request is transmitted from the client device 310 to the data processing system 100, the client processing data 3141 is transmitted following that request and written into the secondary storage device 134 of the parallel distributed processing execution server device 130, whereby the processing target data 1341 is updated.

(1-2) Overall Processing According to this Embodiment

In the data processing system 100 according to this embodiment, the parallel distributed processing management server device 120 manages execution of parallel distributed processing by the parallel distributed processing execution server device 130. The plurality of parallel distributed processing execution server devices 130 realize parallel distributed processing by performing, in parallel in each parallel distributed processing execution server device 130, analysis processing including the extraction processing of the processing target data 1341 and the aggregation processing of the aggregation target data. The aggregation target data is data obtained by dividing the processing target data 1341 for each key column value of the key column determined by the data distribution control server device 140. In addition, the data distribution control server device 140 analyzes the extraction processing result of the processing target data 1341 by the parallel distributed processing execution server device 130 and creates key distribution information based on the key column values of the key column in the processing target data 1341. Further, the data distribution control server device 140 determines, based on the key distribution information of the key column, the parallel distributed processing execution server device 130 to which the aggregation target data is allocated.

Hereinafter, with reference to FIG. 3, the overall processing flow of the parallel distributed processing management server device 120, the parallel distributed processing execution server device 130, and the data distribution control server device 140 will be described.

First, in step S101, when the user performs a predetermined operation on the client device 310 to request analysis of processing target data, the client processing unit 3131 transmits a data analysis processing execution request to the parallel distributed processing management server device 120 and the data distribution control server device 140. On receiving the analysis processing execution request, the parallel distributed processing management server device 120 and the data distribution control server device 140 start processing in parallel.

Note that when the client device 310 transmits a data analysis processing execution request, an aggregation key designated by the user is also transmitted to the data distribution control server device 140. The aggregation key is a key column candidate in the aggregation processing, and at the time of the first analysis by the parallel distributed processing execution unit 1331, the designated aggregation key is acquired as the key column candidate.

In the parallel distributed processing management server device 120 that has received the data analysis processing execution request, the distributed processing management unit 1232 is activated by the CPU 122 and manages the parallel distributed processing to be executed by the parallel distributed processing execution server device 130 (step S102). In FIG. 3, steps S103 to S108 outline what takes place while the parallel distributed processing management server device 120 is managing the parallel distributed processing (step S102).

In step S103, the distributed processing management unit 1232 of the parallel distributed processing management server device 120 activates the parallel distributed processing execution unit 1331 of the parallel distributed processing execution server device 130. The activated parallel distributed processing execution unit 1331 then executes parallel distributed processing (step S104). During the execution of the parallel distributed processing in step S104, a determination is made regarding the key distribution information (step S105), the aggregation target data is distributed to the plurality of parallel distributed processing execution server devices 130 (step S106), and the aggregation processing of the aggregation target data is performed in each parallel distributed processing execution server device 130 (step S107).

Thereafter, when the distributed processing management unit 1232 receives a message indicating the completion of execution of the parallel distributed processing from the parallel distributed processing execution unit 1331 (step S108), the management of the parallel distributed processing is terminated and the processing is completed.

On the other hand, in the data distribution control server device 140 that has received the analysis processing execution request in step S101, key distribution information of the key column is created (step S109). In the process of step S109, a key column that is a calculation target of the key distribution is determined based on the extraction result by the parallel distributed processing execution unit 1331 of the parallel distributed processing execution server device 130, and the key distribution indicating the appearance ratio for each key column value of the key column Information is created. The key column key distribution information created in step S109 is used for the determination in step S105 in the parallel distributed processing execution server device 130.
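A minimal sketch of key distribution information, understood here as the appearance ratio of each key column value, is given below in Python. The function name, the dictionary-based record format, and the sample records are assumptions made for illustration; how the key column itself is selected is not shown.

```python
from collections import Counter


def key_distribution(records, key_column):
    """Return {key column value: appearance ratio} for the given key column."""
    counts = Counter(record[key_column] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Hypothetical extraction results; "category" plays the role of the key column.
records = [
    {"user_id": "00001", "category": "Category A"},
    {"user_id": "00002", "category": "Category A"},
    {"user_id": "00003", "category": "Category B"},
    {"user_id": "00001", "category": "Category C"},
]
dist = key_distribution(records, "category")
print(dist)
```

The ratios sum to 1, so they can be compared directly with the share of the aggregation work that each server device is expected to take on.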

Thereafter, the data distribution control server device 140 creates, based on the key distribution information of the key column, a table for determining the parallel distributed processing execution server device 130 that is the distribution destination of the aggregation target data (step S110). This table corresponds to the distribution destination server device management table 1444, and its specific structure will be described later. The data distribution in step S106 on the parallel distributed processing execution server device 130 side is performed with reference to the table created in step S110.
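One way to picture a table that maps each key column value to a distribution destination is the following Python sketch. It uses a greedy heuristic (largest appearance ratio first, always to the currently least-loaded server device), which is an assumption introduced here for illustration and is not stated in this section as the method of step S110. The function name, the server names, and the ratios are hypothetical.

```python
import heapq


def allocate(distribution, servers):
    """Map each key column value to a server so that the summed ratios balance.

    Assumed heuristic: take key values in descending order of appearance
    ratio and give each one to the server with the smallest load so far.
    """
    heap = [(0.0, name) for name in servers]
    heapq.heapify(heap)
    table = {}
    ordered = sorted(distribution.items(), key=lambda item: (-item[1], item[0]))
    for value, ratio in ordered:
        load, name = heapq.heappop(heap)
        table[value] = name
        heapq.heappush(heap, (load + ratio, name))
    return table


distribution = {
    "Category A": 0.5,
    "Category B": 0.25,
    "Category C": 0.125,
    "Category D": 0.125,
}
table = allocate(distribution, ["server1", "server2"])
print(table)
```

With these sample ratios, the heavily used key value is placed alone on one server device while the three lighter key values share the other, so each server device receives half of the aggregation target data.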

As described above with reference to FIG. 3, in the parallel distributed processing by the data processing system 100, the processing by the parallel distributed processing management server device 120 and the parallel distributed processing execution server device 130 (steps S102 to S108) and the processing by the data distribution control server device 140 (steps S109 to S110) are performed in parallel while referring to each other.

(1-3) Processing by Parallel Distributed Processing Management Server

Next, processing by the parallel distributed processing management server device 120 will be described with reference to FIG. In accordance with the execution request received from the client device 310, the parallel distributed processing management server device 120 either registers the processing target data (steps S203 to S204) or manages the execution of the parallel distributed processing of the processing target data (steps S205 to S208). Note that the processing shown in steps S205 to S208 in FIG. 4 corresponds to the processing shown in step S102 in FIG.

First, when receiving a processing target data registration execution request or an analysis processing execution request from the client device 310 (step S201), the parallel distributed processing management server device 120 checks whether the received message is a processing target data registration execution request (step S202).

When the message received from the client device 310 is a processing target data registration execution request (YES in step S202), the data registration processing unit 1231 stores the client processing data 3141, or a part of the client processing data 3141, that is transmitted from the client device 310 after the processing target data registration execution request, in block units in any of the secondary storage devices 134 of the parallel distributed processing execution server devices 130A to 130N, and thereby updates the processing target data 1341 (step S203).

When the update of the processing target data 1341 is completed in step S203, the data registration processing unit 1231 writes predetermined information about the processing target data 1341 updated in step S203 into the data arrangement information management table 1241, thereby updating the table (step S204).

The data arrangement information management table 1241 is a table for managing in which parallel distributed processing execution server device 130 each block of the processing target data 1341 was stored in step S203. As shown in FIG. 5, the data arrangement information management table 1241 has a data block ID column 1241A, which describes the block name of the data, and a data arrangement server device name column 1241B, which describes the name of the server device in which the data of the block is stored. “Server apparatus 1”, “Server apparatus 2”, and “Server apparatus 3” described in the data arrangement server device name column 1241B in FIG. 5 each correspond to one of the parallel distributed processing execution server devices 130A to 130N.
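The two-column structure of the data arrangement information management table can be sketched in Python as a list of (data block ID, server device name) rows. The round-robin placement used below is purely an assumption for illustration, since the text only states that each block is stored in any of the server devices; the function names and block names are also hypothetical.

```python
from itertools import cycle


def place_blocks(block_ids, server_names):
    """Return rows of (data block ID, data arrangement server device name).

    Assumed policy for this sketch: blocks are placed round-robin.
    """
    if not server_names:
        raise ValueError("at least one server device is required")
    servers = cycle(server_names)
    return [(block_id, next(servers)) for block_id in block_ids]


def blocks_on(table, server_name):
    """Look up which blocks are stored on a given server device."""
    return [block_id for block_id, name in table if name == server_name]


servers = ["Server apparatus 1", "Server apparatus 2", "Server apparatus 3"]
table = place_blocks(["block1", "block2", "block3", "block4", "block5"], servers)
for row in table:
    print(row)
print(blocks_on(table, "Server apparatus 1"))
```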

When the update of the data arrangement information management table 1241 is completed in step S204, the parallel distributed processing management server device 120 ends the processing target data registration process.

On the other hand, if the message received from the client device 310 in step S202 is not a processing target data registration execution request, that is, if it is an analysis processing execution request (NO in step S202), the distributed processing management unit 1232 activates the parallel distributed processing execution unit 1331 of the parallel distributed processing execution server device 130 (step S205).

Next, for the parallel distributed processing (step S207) executed by the parallel distributed processing execution unit 1331 activated in step S205, the distributed processing management unit 1232 manages the execution status of the parallel distributed processing by updating the processing execution status management table 1242 with information indicating the execution status of the processing (step S206).

As shown in FIG. 6, the processing execution status management table 1242 has a structure having a server device name column 1242A, an arrangement data block number column 1242B, a processing completion block number column 1242C, a progress rate column 1242D, a start time column 1242E, and a completion time column 1242F. In the server device name column 1242A, the name of a server device on which parallel distributed processing is performed (any of the parallel distributed processing execution server devices 130A to 130N) is described. The arrangement data block number column 1242B describes the number of blocks of the processing target data 1341 assigned to the server device described in the server device name column 1242A. In the processing completion block number column 1242C, the number of blocks for which processing in the server device has been completed is described. The progress rate column 1242D describes, as the processing progress rate in the server device, the ratio of the value in the processing completion block number column 1242C to the value in the arrangement data block number column 1242B of the server device. In the start time column 1242E and the completion time column 1242F, the start time and completion time of the analysis processing in the server device are described.
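The progress rate in column 1242D can be illustrated with a minimal Python sketch. The class name, field names, and sample block counts below are assumptions made for illustration only; the description defines just the ratio of completed blocks to arranged blocks.

```python
from dataclasses import dataclass


@dataclass
class ExecutionStatus:
    """One row of a table shaped like the processing execution status table 1242."""
    server_name: str        # server device name column 1242A
    placed_blocks: int      # arrangement data block number column 1242B
    completed_blocks: int   # processing completion block number column 1242C

    @property
    def progress_rate(self) -> float:
        # Progress rate column 1242D: completed blocks / arranged blocks.
        # Returning 0.0 for a server with no blocks is an assumption.
        if self.placed_blocks == 0:
            return 0.0
        return self.completed_blocks / self.placed_blocks


# Hypothetical values; the contents of FIG. 6 are not reproduced here.
status = ExecutionStatus("Server apparatus 1", placed_blocks=10, completed_blocks=4)
print(status.server_name, status.progress_rate)
```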

Then, when the distributed processing management unit 1232 receives a message indicating that the execution of the parallel distributed processing is completed from the parallel distributed processing execution unit 1331 (step S208), the series of processing ends.

(1-4) Processing by Parallel Distributed Processing Execution Server

Next, processing by the parallel distributed processing execution server device 130 will be described with reference to FIG. 7. Parallel distributed processing is realized by the parallel distributed processing execution unit 1331 performing analysis processing in parallel in each of the plurality of parallel distributed processing execution server devices 130. Note that the series of processing shown in FIG. 7 corresponds to the processing in step S104 in FIG. 3 or step S207 in FIG.

First, in step S301, the parallel distributed processing execution unit 1331 of the parallel distributed processing execution server device 130 is activated in accordance with an instruction from the distributed processing management unit 1232 of the parallel distributed processing management server device 120. Then, the parallel distributed processing execution unit 1331 starts analysis processing including extraction processing and aggregation processing. In the extraction process, a column and a column value for each record are extracted in block units from the processing target data 1333 on the memory. The result of the extraction process is referred to by the analysis application execution history management unit 1431 of the data distribution control server device 140 described later.

Next, the parallel distributed processing execution unit 1331 refers to the key distribution management table 1443 of the data distribution control server device 140 and confirms whether there is key distribution information of the key column (step S302). The key distribution management table 1443 will be described later with reference to FIGS.

If there is key distribution information of the key column in step S302 (YES in step S302), the parallel distributed processing execution unit 1331 distributes the aggregation target data to the parallel distributed processing execution server devices 130 based on the distribution destination server device management table 1444 (step S303). Here, since the aggregation target data is data obtained by dividing the processing target data 1341 for each column value of the key column, any record in the processing target data 1341 that does not include the key column is excluded from the aggregation target data. Thereafter, the process of step S305 is executed.

If there is no key distribution information of the key column in step S302 (NO in step S302), the parallel distributed processing execution unit 1331 distributes the processing target data 1341 to the parallel distributed processing execution server devices 130 regardless of the bias of the key distribution (step S304). Thereafter, the process of step S305 is executed.

In step S305, aggregation processing using the key column as a key is performed in each parallel distributed processing execution server device 130 to which data has been distributed in step S303 or step S304. For this aggregation processing, commonly used key aggregation processing can be used.
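The branch in steps S302 to S305 can be sketched roughly as follows. This is a minimal illustration, not the method of the embodiment: the function names are hypothetical, the round-robin split across the servers assigned to one key column value is an assumption, the CRC32 hash used when no key distribution information exists is an assumption (the description only says that the bias is not considered), and skipping records without the key column in both branches is a simplification. A simple count stands in for the commonly used key aggregation processing.

```python
import zlib
from collections import Counter, defaultdict


def distribute(records, key_column, servers, destination_table=None):
    """Return {server name: [records]} for the aggregation step.

    destination_table maps a key column value to the list of servers assigned
    to it (in the spirit of table 1444); None means no key distribution info.
    """
    distributed = defaultdict(list)
    next_index = defaultdict(int)
    for record in records:
        if key_column not in record:
            continue  # records without the key column are not aggregated
        value = record[key_column]
        targets = (destination_table or {}).get(value)
        if targets:
            # Step S303: spread one key value over its assigned servers.
            server = targets[next_index[value] % len(targets)]
            next_index[value] += 1
        else:
            # Step S304: distribution that ignores the key distribution bias.
            digest = zlib.crc32(str(value).encode("utf-8"))
            server = servers[digest % len(servers)]
        distributed[server].append(record)
    return dict(distributed)


def aggregate_counts(records, key_column):
    """Step S305 stand-in: count records per key column value."""
    return Counter(record[key_column] for record in records)


servers = ["server1", "server2", "server3"]
records = [{"category": "A"}] * 4 + [{"category": "B"}, {"date": "2012-04-10"}]
table = {"A": ["server1", "server2"], "B": ["server3"]}
with_table = distribute(records, "category", servers, table)
without_table = distribute(records, "category", servers)
```

When one key column value is split across several servers in this way, the partial results per server would still need to be merged afterwards; that merge is outside this sketch.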

Then, in the parallel distributed processing execution server device 130 that has completed the aggregation processing, the parallel distributed processing execution unit 1331 sends a message notifying the completion of the parallel distributed processing execution to the distributed processing management unit 1232 of the parallel distributed processing management server device 120 (step S306), and the process ends.

If it is necessary to output the result of the analysis processing, for example, in step S306, the parallel distributed processing execution unit 1331 transmits the result of the aggregation processing in step S305 to the parallel distributed processing management server device 120. Then, after the distributed processing management unit 1232 of the parallel distributed processing management server device 120 has received the results of the aggregation processing from the parallel distributed processing execution units 1331 of all the parallel distributed processing execution server devices 130, the received results of the aggregation processing are transmitted to the client device 310 and output to an output unit (not shown) of the client device 310.

(1-5) Processing by Data Distribution Control Server

Next, processing by the data distribution control server device 140 will be described. Among the processes performed by the data distribution control server device 140, the process of analyzing the extraction processing result of the processing target data 1341 (or the processing target data 1333 on the memory) in the parallel distributed processing execution server device 130 and creating the key distribution information of the key column in the processing target data 1341 will be described with reference to FIGS. 8, 10, and 14. In addition, the process of determining, based on the key distribution information, the allocation destination of the processing target data 1341 among the plurality of parallel distributed processing execution server devices 130 that perform aggregation processing using the key column as a key will be described with reference to FIG. 17. The processes of FIGS. 8, 10, and 14 correspond to the process shown in step S109 of FIG. 3, and the process of FIG. 17 corresponds to the process shown in step S110 of FIG. 3.

(1-5-1) Update of Analysis Application Execution History Table Based on Extraction Result

FIG. 8 shows the processing, within the process of creating the key distribution information of the key column, in which the analysis application execution history management unit 1431 refers to the result of the extraction processing by the parallel distributed processing execution unit 1331 and updates the analysis application execution history management table 1441.

First, in step S401, when the data distribution control server device 140 receives an analysis processing execution request from the client device 310, the data distribution control server device 140 activates the analysis application execution history management unit 1431. Next, the analysis application execution history management unit 1431 checks whether or not the value of the parallel distributed processing execution counter is 0 (step S402). Here, the parallel distributed processing execution counter is one of the parameters held in the main storage device 143, and counts the number of times the parallel distributed processing execution unit 1331 has performed extraction processing in units of blocks.

If the value of the parallel distributed processing execution counter is 0 in step S402 (YES in step S402), it indicates that the extraction processing by the parallel distributed processing execution unit 1331 is being performed for the first time. At this time, for the block of the processing target data 1333 on the memory for which the extraction process has been completed, the analysis application execution history management unit 1431 acquires the name of the analysis application used for the extraction process, the execution date and time of the analysis process by the parallel distributed processing execution unit 1331, and the column designated as the aggregation key when the client device 310 transmitted the analysis processing execution request (step S403).

Then, the analysis application execution history management unit 1431 uses the column acquired in step S403 as a key column candidate, and updates the analysis application execution history management table 1441 together with the other data acquired in step S403 (step S404). As shown in FIG. 9, the analysis application execution history management table 1441 has a structure having an analysis application name column 1441A in which the analysis application name acquired in step S403 is described, an execution date/time column 1441B in which the execution date and time of analysis processing is described, and a key column candidate name field 1441C in which key column candidates are described.

Here, the analysis application name in the analysis application name column 1441A of the analysis application execution history management table 1441 is described every time extraction processing is performed in block units. Therefore, when analysis processing by the same kind of analysis application is performed on a plurality of blocks, the same analysis application name is described a plurality of times in the analysis application name column 1441A, and when analysis processing is performed on a plurality of blocks by different types of analysis applications, a plurality of different analysis application names are described in the analysis application name column 1441A.

If the value of the parallel distributed processing execution counter is not 0 in step S402 (NO in step S402), the key column candidate has already been acquired during the first analysis process, so it is not necessary to acquire the key column candidate again, and the analysis application execution history management unit 1431 does not update the analysis application execution history management table 1441. Then, when the value of the parallel distributed processing execution counter is not 0 in step S402, or after the processing of step S404, the processing of FIG. 10 is executed. Note that the analysis application execution history management unit 1431 may add 1 to the value of the parallel distributed processing execution counter before performing the processing of FIG. 10.

(1-5-2) Determination of Key Column

FIG. 10 shows the processing, within the process of creating the key distribution information of the key column, in which the key column candidate extraction unit 1432 determines a key column candidate whose appearance degree exceeds a predetermined threshold as the key column and updates the key column management table 1447.

First, in step S501, the key column candidate extraction unit 1432 is activated. Then, the key column candidate extraction unit 1432 refers to the analysis application execution history management table 1441 and counts, for each analysis application, the total number of occurrences of the key column candidates described in the key column candidate name field 1441C (key column candidate appearance total value) (step S502).

Next, the key column candidate extraction unit 1432 refers to the analysis application execution history management table 1441 and the key column candidate appearance total value counted in step S502, calculates the appearance ratio of each key column candidate in the past analysis processing execution history of the analysis application as the key column candidate degree, and updates the key column candidate management table 1445 (step S503). As shown in FIG. 11, the key column candidate management table 1445 has a structure having a key column candidate name field 1445A in which a key column candidate name is described, a key column candidate appearance number field 1445B in which the number of appearances of the key column candidate is described, and a key column candidate degree column 1445C in which the appearance ratio of the key column candidate is described.

For example, when each item of the key column candidate management table 1445 is calculated for the entries from [analysis application 1] to [analysis application 4] in the analysis application execution history management table 1441 shown in FIG. 9, there are three key column candidates described in the key column candidate name field 1441C: [Category name], [Date], and [User name]. The numbers of occurrences of these key column candidates are 3, 1, and 2, respectively, so the key column candidate appearance total value is 6. At this time, the key column candidate degree of [Category name] is 3/6 = 0.50, the key column candidate degree of [Date] is 1/6 = 0.17, and the key column candidate degree of [User name] is 2/6 = 0.33. Therefore, in the key column candidate management table 1445, the key column candidate appearance number [3] and the key column candidate degree [0.50] are described in the entry for the key column candidate [Category name], and the other entries are described in the same manner.
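The calculation in steps S502 and S503 amounts to counting candidate names and dividing by the total. The following Python sketch reproduces the ratios 3/6, 1/6, and 2/6 from the example; the function name and the flat list used to stand in for the key column candidate name field 1441C are assumptions made for illustration.

```python
from collections import Counter


def key_column_candidate_degrees(candidate_history):
    """Return {candidate name: (appearance count, candidate degree)}.

    candidate_history is a list of candidate names, one per history entry.
    """
    counts = Counter(candidate_history)
    total = sum(counts.values())  # key column candidate appearance total value
    return {name: (count, count / total) for name, count in counts.items()}


# Appearance counts consistent with the ratios 3/6, 1/6 and 2/6 in the example.
history = ["Category name"] * 3 + ["Date"] + ["User name"] * 2
candidate_table = key_column_candidate_degrees(history)

for name, (count, degree) in candidate_table.items():
    print(f"{name}: count={count}, degree={degree:.2f}")
```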

Next, the key column candidate extraction unit 1432 refers to the key column candidate management table 1445, and determines whether the key column candidate degree of each key column candidate exceeds the key column threshold described in the key column threshold management table 1446 (step S504). As shown in FIG. 12, the key column threshold management table 1446 has a structure having a key column threshold field 1446A in which a preset threshold is described. In FIG. 12, the key column threshold is set to [0.4].

If it is determined in step S504 that there is a key column candidate degree exceeding the key column threshold (YES in step S504), the key column candidate extraction unit 1432 determines the key column candidate having the key column candidate degree exceeding the key column threshold as the key column, and updates the key column management table 1447 (step S505). As shown in FIG. 13, the key column management table 1447 has a structure having a key column name column 1447 in which the name of the key column determined in step S505 is described. In FIG. 13, since the key column candidate degree of [category name] in the key column candidate management table 1445 of FIG. 11 is [0.50] and exceeds the key column threshold [0.4] of FIG. 12, [category name] is described as the key column.
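The threshold check in steps S504 and S505 can be sketched as a simple filter. The function name is hypothetical; the candidate degrees and the threshold [0.4] are the values quoted in the example, and the strict comparison follows the wording "exceeds".

```python
def select_key_columns(candidate_degrees, threshold):
    """Return the candidates whose degree strictly exceeds the threshold."""
    return [name for name, degree in candidate_degrees.items() if degree > threshold]


degrees = {"Category name": 0.50, "Date": 0.17, "User name": 0.33}
key_columns = select_key_columns(degrees, threshold=0.4)
print(key_columns)  # an empty list would mean no key column is determined yet
```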

When the key column has been determined by the processing in step S505, based on the analysis of the key column candidate degrees that covers the execution history of past analysis processing, the key column candidate extraction unit 1432 completes the update of the key column management table 1447 and transmits an update completion message (step S506). The process shown in step S601 of FIG. 14 is then executed.

If it is determined in step S504 that there is no key column candidate degree exceeding the key column threshold (NO in step S504), the key column candidate extraction unit 1432 assumes that there is currently no key column candidate to be determined as the key column, and the process shown in step S606 of FIG. 14 is executed without determining the key column.

(1-5-3) Calculation of Key Distribution Information

FIG. 14 shows the processing, within the process of creating the key distribution information of the key column, in which the key distribution calculation unit 1433 calculates the appearance ratio (key distribution value) of each key column value in the determined key column and updates the key distribution management table 1443.

First, in step S601, when the data distribution control server device 140 receives an update completion message for the key column management table 1447 from the key column candidate extraction unit 1432, the key distribution calculation unit 1433 is activated. Then, the key distribution calculation unit 1433 refers to the processing target data 1333 on the memory and, for the key column described in the key column management table 1447, counts the total number of records in which the key column appears in the processing target data 1333 on the memory (key column candidate record appearance total value) (step S602).

Next, the key distribution calculation unit 1433 counts the number of appearances of each key column value of the key column in the processing target data 1333 on the memory, calculates the appearance ratio with respect to the key column candidate record appearance total value counted in step S602 as the key distribution value, and updates the key distribution management table 1443 (step S603).

As shown in FIG. 15, the key distribution management table 1443 has a structure having a key column value field 1443A in which a key column value is described, a key appearance number field 1443B in which the number of occurrences of the key column value is described, and a key distribution value column 1443C in which the key distribution value of the key column value is described.

For example, in the key distribution management table 1443 of FIG. 15 in which [category name] is set as the key column, [category A] to [category D] are described as the key column values in the key column value field 1443A, and correspond to each key column value. As the key appearance number, [10], [5], [3], and [2] are described in order from the top of the key appearance number column 1443B. Here, the key column candidate record appearance total value indicating the total number of key appearances is 10 + 5 + 3 + 2 = 20. At this time, the key distribution value in [Category A] is 10/20 = 0.50, the key distribution value in [Category B] is 5/20 = 0.25, and the key distribution value in [Category C] is 3/20 = 0.15, and the key distribution value in [Category D] is 2/20 = 0.10.
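Steps S602 and S603 can be illustrated with the following Python sketch, which reproduces the numbers of the example (10, 5, 3, and 2 occurrences out of 20). The function name and the representation of records as dictionaries are assumptions made for illustration.

```python
from collections import Counter


def key_distribution(records, key_column):
    """Return {key column value: (appearance count, key distribution value)}.

    Records that do not contain the key column are ignored.
    """
    counts = Counter(r[key_column] for r in records if key_column in r)
    total = sum(counts.values())  # key column candidate record appearance total
    return {value: (n, n / total) for value, n in counts.most_common()}


records = (
    [{"category": "Category A"}] * 10
    + [{"category": "Category B"}] * 5
    + [{"category": "Category C"}] * 3
    + [{"category": "Category D"}] * 2
    + [{"date": "2012-04-10"}]  # hypothetical record without the key column
)
distribution = key_distribution(records, "category")

for value, (count, ratio) in distribution.items():
    print(f"{value}: count={count}, distribution={ratio:.2f}")
```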

Next, in step S604, the key distribution calculation unit 1433 determines whether or not a predefined aggregation processing start condition is satisfied. As the predefined aggregation processing start condition, for example, the processing progress rate of analysis processing in the parallel distributed processing execution server device 130, the number of executions of analysis processing, the number of records subjected to analysis processing, or the elapsed time from the start of analysis processing can be set.

As an example, a case where the processing progress rate of analysis processing in the parallel distributed processing execution server device 130 is defined in advance as the start condition of aggregation processing will be described with reference to FIGS. 6 and 16. In the aggregation processing start condition management table 1448 of FIG. 16, a predefined threshold [0.5] is described in the job progress rate threshold field 1448A. At this time, in step S604, the key distribution calculation unit 1433 refers to the processing execution status management table 1242 (FIG. 6) of the parallel distributed processing management server device 120, and determines whether the progress rate in the parallel distributed processing execution server device 130 described in the server device name column 1242A exceeds the threshold [0.5] of the aggregation processing start condition management table 1448. In the case of the processing execution status management table 1242 shown in FIG. 6, since the progress rates of [server apparatus 1] to [server apparatus 3] do not exceed the threshold [0.5], the key distribution calculation unit 1433 determines that the start condition for the aggregation process is not satisfied.
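The progress-rate form of the start condition in step S604 can be sketched as below. The description does not state whether every server or only some server must exceed the threshold, so the `require_all` switch is an assumption, as are the function name and the sample progress rates (the actual values of FIG. 6 are not reproduced here).

```python
def aggregation_may_start(progress_rates, threshold, require_all=True):
    """Check a progress-rate start condition against a threshold.

    progress_rates maps server names to progress rates (column 1242D).
    require_all=True demands that every server exceeds the threshold;
    require_all=False accepts a single server exceeding it.
    """
    if not progress_rates:
        return False
    check = all if require_all else any
    return check(rate > threshold for rate in progress_rates.values())


# Hypothetical progress rates, none of which exceeds the threshold 0.5.
rates = {"server apparatus 1": 0.4, "server apparatus 2": 0.3, "server apparatus 3": 0.2}
print(aggregation_may_start(rates, threshold=0.5))
```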

When it is determined in step S604 that the aggregation processing start condition is satisfied (YES in step S604), the key distribution management table 1443 has been created as the key distribution information of the key column, and the aggregation processing may be started. Therefore, the key distribution calculation unit 1433 transmits an update completion message informing that the update of the key distribution management table 1443 is completed (step S605), and ends the process.

On the other hand, if it is determined in step S604 that the aggregation processing start condition is not satisfied (NO in step S604), or if it is determined in step S504 in FIG. 10 that there is no key column candidate degree exceeding the key column threshold (NO in step S504), it is determined that sufficient analysis results have not yet been collected to create the key distribution information of the key column. At this time, the parallel distributed processing execution server device 130 reads the next block of the processing target data 1333 on the memory in order to continue the analysis processing (step S606), and performs the analysis processing on the read block (step S607). In the analysis processing for the next block in step S607, the processing described so far with reference to FIGS. 7, 8, and 10 is performed again. During steps S606 to S607, the key distribution calculation unit 1433 waits for the analysis processing for the next block to progress. Then, after a predetermined time has elapsed, the key distribution calculation unit 1433 confirms whether the key column is described in the key column management table 1447 (step S608).

If the key column is confirmed in step S608 (YES in step S608), the key distribution calculation unit 1433 returns to the process in step S602 and starts calculating the key distribution value for the newly confirmed key column. If the key column is not confirmed in step S608 (NO in step S608), the key distribution calculation unit 1433 returns to step S604, and determines again whether the aggregation processing start condition is satisfied.

If the key column is determined and the aggregation processing start condition is satisfied through the processes described with reference to FIGS. 8, 10, and 14, the key distribution management table 1443 is created as the key distribution information of the key column. On the other hand, while the key column has not been determined or the aggregation processing start condition is not satisfied, the process of creating the key distribution information of the key column is not completed. In steps S302 to S304 of FIG. 7 described above, the parallel distributed processing execution unit 1331 confirms the presence or absence of key distribution information in the key distribution management table 1443, and distributes the aggregation target data to the parallel distributed processing execution server devices 130 according to the confirmation result.

Further, the data distribution control server device 140 receives the update completion notification of the key distribution management table 1443 in FIG. 14 and executes a process of determining a data distribution destination server device (described later with reference to FIG. 17).

(1-5-4) Processing for Determining Data Distribution Destination Server Device

In the following, the process in which the data distribution destination determination unit 1434 determines, based on the key distribution information of the key column, the parallel distributed processing execution server devices 130 to which the aggregation target data is allocated, so that the aggregation target data can be distributed to a plurality of parallel distributed processing execution server devices 130 for aggregation processing, will be described with reference to FIG. 17. Note that the series of processing shown in FIG. 17 corresponds to the processing in step S110 in FIG. 3.

First, when the data distribution control server device 140 receives an update completion message of the key distribution management table 1443 from the key distribution calculation unit 1433, the data distribution destination determination unit 1434 is activated (step S701).

Then, the data distribution destination determination unit 1434 refers to the processing execution status management table 1242 to extract the parallel distributed processing execution server devices 130 to which the aggregation target data can be distributed (distribution destination candidate server devices), and updates the distribution destination candidate server device management table 1449 with the extraction result (step S702). As shown in FIG. 18, the distribution destination candidate server device management table 1449 has a structure having a distribution destination candidate server device name column 1449A in which distribution destination candidate server devices are described. In step S702, the data distribution destination determination unit 1434 also counts the number of distribution destination candidate server devices described in the distribution destination candidate server device name column 1449A.

Next, based on the number of distribution destination candidate server devices counted in step S702 and the contents of the key distribution management table 1443, the data distribution destination determination unit 1434 calculates the number of server devices to be allocated to the aggregation target data divided for each key column value, and updates the distribution data allocation management table 1450 based on the calculation result (step S703).

As shown in FIG. 19, the distribution data allocation management table 1450 has a structure having a key column value field 1450A, a record number field 1450B, a key distribution value field 1450C, and an assigned server device number field 1450D. The key distribution value field 1450C describes the key distribution values in the key distribution management table 1443 in descending order. The key column value field 1450A and the record number field 1450B describe the key column value and the number of key appearances corresponding to each key distribution value. The assigned server device number field 1450D describes the number of server devices assigned based on the key distribution value in the same row and the number of distribution destination candidate server devices counted in step S702.

Next, according to the number of assigned server devices in the distribution data allocation management table 1450, the data distribution destination determination unit 1434 assigns the distribution destination candidate server devices described in the distribution destination candidate server device name column 1449A as distribution server devices in order from the head. Then, the data distribution destination determination unit 1434 updates the distribution destination server device management table 1444 with the assigned distribution server devices and the corresponding key column values (step S704). As shown in FIG. 20, the distribution destination server device management table 1444 has a structure having a key column value column 1444A in which a key column value is described and a distribution server device column 1444B in which a distribution server device name is described.
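Steps S703 and S704 can be sketched as follows. The description states only that the number of assigned server devices is based on the key distribution value and the number of candidates; the specific rule used here (rounding the product, with at least one server per key column value) is an assumption, as are the function name, the wrap-around when the counts exceed the candidate list, and the eight hypothetical candidate servers.

```python
def assign_servers(key_distribution_values, candidates):
    """Return {key column value: [assigned server names]}.

    key_distribution_values maps key column values to key distribution values;
    candidates is the ordered list of distribution destination candidates.
    """
    if not candidates:
        raise ValueError("at least one candidate server device is required")
    n = len(candidates)
    # Largest key distribution value first, as in field 1450C.
    ordered = sorted(key_distribution_values.items(), key=lambda kv: kv[1], reverse=True)
    assignment = {}
    position = 0
    for value, ratio in ordered:
        count = max(1, round(ratio * n))  # assumed rule for field 1450D
        # Take candidates in order from the head, wrapping around if needed.
        assignment[value] = [candidates[(position + i) % n] for i in range(count)]
        position += count
    return assignment


distribution_values = {"Category A": 0.50, "Category B": 0.25, "Category C": 0.15, "Category D": 0.10}
candidate_servers = [f"server apparatus {i}" for i in range(1, 9)]  # hypothetical
assignment = assign_servers(distribution_values, candidate_servers)

for value, servers in assignment.items():
    print(value, servers)
```

With eight candidates, this rule gives four, two, one, and one server devices to [Category A] through [Category D], so the key column value that appears most often is spread over the most server devices.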

(1-6) Effects According to this Embodiment

In the data processing system 100 according to this embodiment, the data distribution control server device 140 can determine the key column from the key column candidates with increasing accuracy as the analysis processing by the parallel distributed processing execution server device 130 is repeated. The data divided for each column value of the key column is then assigned, based on the key distribution information of the appropriate key column, so that the processing amount is allocated as evenly as possible, and the parallel distributed processing execution server devices 130 perform the aggregation processing on the assigned data. The aggregation processing can therefore be performed while avoiding the bottleneck due to the bias of the key distribution. As a result, the difference in the time required for the analysis processing among the parallel distributed processing execution server devices 130 is reduced, and the parallel distributed processing execution server devices 130 as a whole can be expected to realize parallel distributed processing at high speed. This effect is particularly pronounced when the parallel distributed processing execution server apparatuses 130A to 130N are computers having substantially the same processing performance.

Further, in the data processing system 100 according to the present embodiment, the data distribution control server device 140 refers to the past analysis processing results, extracts key column candidates, and creates statistical information (corresponding to the analysis application execution history management table 1441) only for the narrowed-down key column candidates. It is therefore not necessary to create statistical information for all the columns of the processing target data 1341, unnecessary statistical processing is not performed, and a reduction in processing load can be expected.

Further, in the data processing system 100 according to the present embodiment, the key distribution information is generated by calculating the appearance ratio for each column value of the key column until the progress of the data analysis process satisfies a predetermined condition, so the key distribution information of the key column can be created based on sufficient statistical information. By using the key distribution information of the key column created in this way, the parallel distributed processing execution server devices 130 that are the data distribution destinations in the aggregation processing can be determined accurately.

(2) Second embodiment (2-1) Configuration according to the present embodiment The data processing system according to the second embodiment has the first feature when assigning aggregation target data to a server device in parallel distributed processing. The data processing system 100 according to the embodiment is characterized in that the allocation is performed in consideration of the bias of the key distribution, whereas the allocation is performed in consideration of the processing capability of the server device in addition to the bias of the key distribution. To do.

As shown in FIG. 21, the configuration of the data processing system 200 is the same as that of the data processing system 100 of the first embodiment, except that the data distribution control server device 140 is replaced by a data distribution control server device 240, so description of the common configuration is omitted. In addition, within the data distribution control server device 240, components having the same functions as those of the data distribution control server device 140 are denoted by the same reference numerals, and description thereof is omitted.

The data distribution destination determination unit 2434 is an application program that determines the allocation destination of the aggregation target data by processing different from that of the data distribution destination determination unit 1434. Its operation will be described later.

The analysis processing efficiency management table 2450 is a table used for determining the allocation destination of aggregation target data in the aggregation processing. The distribution destination server device management table 2444 is a table in which the allocation destination of the aggregation target data determined by the data distribution destination determination unit 2434 is described. The structure of the analysis processing efficiency management table 2450 will be described later with reference to FIG. 23, and the structure of the distribution destination server device management table 2444 will also be described later.

(2-2) Processing According to this Embodiment

The processing by the data processing system 200 is the same as the processing by the data processing system 100 of the first embodiment, except for the processing in which the data distribution destination determination unit 2434 determines the distribution destination server devices, so description of the common processing is omitted. The processing performed by the data distribution destination determination unit 2434 is described below.

In FIG. 22, the processing in which the data distribution destination determination unit 2434 is activated, updates the distribution destination candidate server device management table 1449, counts the number of distribution destination candidate server devices, and updates the distribution data allocation management table 1450 (steps S801 to S803) is the same as the processing of steps S701 to S703 described above.

In step S804, the data distribution destination determination unit 2434 updates the analysis processing efficiency management table 2450 with reference to the processing execution status management table 1242.

As shown in FIG. 23, the analysis processing efficiency management table 2450 has a structure including a server device name column 2450A, a processing completion block total number column 2450B, a total processing time column 2450C, a processing efficiency column 2450D, and a processing capacity ratio column 2450E.

In the server device name column 2450A, the name of a parallel distributed processing execution server device 130 that has performed analysis processing in the past is described. The processing completion block total number column 2450B, the total processing time column 2450C, the processing efficiency column 2450D, and the processing capacity ratio column 2450E each describe data for the parallel distributed processing execution server device 130 named in the server device name column 2450A.

The processing completion block total number column 2450B describes the total number of blocks for which analysis processing executed in the past has been completed. This total can be calculated by accumulating the number of processing-completed blocks in the processing execution status management table 1242 each time analysis processing is completed. The total processing time column 2450C describes the total processing time [h] required for the analysis processing executed in the past. The processing time of each block is calculated from the start time and the completion time described in the start time column 1242E and the completion time column 1242F of the processing execution status management table 1242, and the total processing time can be calculated by accumulating this value each time block-based analysis processing is completed.

The processing efficiency column 2450D describes the processing efficiency, that is, the number of blocks processed per unit time, which can be calculated from the total number of processing-completed blocks and the total processing time. The processing capacity ratio column 2450E describes the ratio of processing capacity among the server devices described in the server device name column 2450A. For example, in FIG. 23, the processing capacity ratio is described with the processing efficiency of the server device in the top row taken as the reference value (1.0).
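The relationship between these columns can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `capacity_ratios` and the tuple layout are hypothetical, and the block counts and hours below are made-up numbers, not the values shown in FIG. 23.

```python
def capacity_ratios(history):
    """Compute processing efficiency and capacity ratio per server device.

    history: list of (server name, total completed blocks, total hours),
    in table order. The first row is used as the reference value (1.0).
    """
    if not history:
        return {}
    # Processing efficiency = completed blocks per unit time [blocks/h].
    efficiency = {name: blocks / hours for name, blocks, hours in history}
    reference = efficiency[history[0][0]]
    return {
        name: {"efficiency": eff, "capacity_ratio": eff / reference}
        for name, eff in efficiency.items()
    }


# Illustrative values only.
table = capacity_ratios([
    ("server device 1", 100, 2.0),   # 50 blocks/h, reference row
    ("server device 2", 150, 2.0),   # 75 blocks/h
    ("server device 3", 50, 4.0),    # 12.5 blocks/h
])
```

In this made-up example, server device 2 receives a capacity ratio of 1.5 and server device 3 a ratio of 0.25 relative to server device 1.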

Next, the data distribution destination determination unit 2434 selects the distribution destination candidate server devices described in the distribution destination candidate server device name column 1449A in descending order of processing capacity ratio in the analysis processing efficiency management table 2450, assigns as many of them as the number of server devices to be allocated based on the key distribution value, and updates the distribution destination server device management table 2444 (step S805).

As shown in FIG. 24, the distribution destination server apparatus management table 2444 has a structure having a key column value field 2444A in which a key column value is described and a distribution server apparatus name field 2444B in which a distribution server apparatus name is described.

Here, an example of a specific procedure for allocating the distribution destination server devices will be described. First, suppose that, according to the distribution destination candidate server device management table 1449, the processing target data 1341 is to be allocated to four server devices, [server device 1] to [server device 4]. Next, from the values described in the processing capacity ratio column 2450E of the analysis processing efficiency management table 2450, the processing capacity ratio of each of [server device 1] to [server device 4] is recalculated so that the ratios sum to 1 as a whole. In this example, [server device 1] is 0.25, [server device 2] is 0.15, [server device 3] is 0.10, and [server device 4] is 0.50.
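The recalculation so that the ratios sum to 1 is a simple normalization, sketched below. The raw ratios used here (1.0, 0.6, 0.4, 2.0) are hypothetical values chosen only because they normalize to the shares quoted in the example; the actual entries of column 2450E are not reproduced in this text. The function name `normalize_capacity` is likewise an assumption.

```python
def normalize_capacity(raw_ratios):
    """Scale capacity ratios so that they sum to 1 across the chosen servers."""
    total = sum(raw_ratios.values())
    if total <= 0:
        raise ValueError("capacity ratios must sum to a positive value")
    return {name: ratio / total for name, ratio in raw_ratios.items()}


# Hypothetical raw ratios, with server device 1 as the reference (1.0).
raw = {
    "server device 1": 1.0,
    "server device 2": 0.6,
    "server device 3": 0.4,
    "server device 4": 2.0,
}
shares = normalize_capacity(raw)
```

Under these assumed inputs the shares come out close to 0.25, 0.15, 0.10, and 0.50, matching the example.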

Based on such data, first, [category A], which has the largest number of records in the distribution data allocation management table 1450, is described in the key column value field 2444A in the top row of the distribution destination server device management table 2444. In the distribution server device name field 2444B of the same row, [server device 4], which has the highest processing capacity ratio in FIG. 23, is described. Since the key distribution value of [category A] is 0.50 and the processing capacity ratio of [server device 4] is also 0.50, it can be seen that [server device 4] alone is sufficient as the server device to which the processing of [category A] is assigned. Next, [category B] is described in the key column value field 2444A in the next row of the distribution destination server device management table 2444, and assignment is performed in the same manner as for [category A]. In this way, the server devices to be assigned are determined for all the key column values [category A] to [category D] and described in the distribution destination server device management table 2444.
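One possible reading of this allocation procedure is a greedy matching of key distribution values against capacity shares, sketched below. This is an interpretation, not the procedure of FIG. 22 itself: the function name `assign_servers`, the tolerance `EPS`, and the rule of consuming a server's remaining share before moving to the next one are assumptions. The key distribution values for [category B] to [category D] are hypothetical; only the 0.50 for [category A] and the four capacity shares are taken from the example above.

```python
EPS = 1e-9  # tolerance for floating-point comparisons (assumption)


def assign_servers(key_dist, capacity_share):
    """Greedily map each key column value to one or more server devices.

    Key values are taken in descending order of appearance ratio, servers in
    descending order of capacity share; a value keeps taking servers until
    its appearance ratio is covered by the capacity consumed.
    """
    servers = sorted(capacity_share, key=capacity_share.get, reverse=True)
    remaining = dict(capacity_share)
    assignment = {}
    for value in sorted(key_dist, key=key_dist.get, reverse=True):
        need = key_dist[value]
        chosen = []
        for server in servers:
            if need <= EPS:
                break
            if remaining[server] <= EPS:
                continue  # this server's share is already used up
            used = min(need, remaining[server])
            remaining[server] -= used
            need -= used
            chosen.append(server)
        assignment[value] = chosen
    return assignment


capacity_share = {
    "server device 1": 0.25,
    "server device 2": 0.15,
    "server device 3": 0.10,
    "server device 4": 0.50,
}
# Only category A (0.50) comes from the text; B to D are hypothetical.
key_dist = {"category A": 0.50, "category B": 0.25,
            "category C": 0.15, "category D": 0.10}
assignment = assign_servers(key_dist, capacity_share)
```

With these inputs, [category A] is mapped to [server device 4] alone, as in the text. If a key value's ratio exceeded the largest remaining share, this sketch would spread it over several servers.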

(2-3) Effects of this Embodiment

In the data processing system 200 according to the second embodiment described above, the processing capability of the parallel distributed processing execution server devices 130 (130A to 130N) is taken into account in addition to the bias of the key distribution, and the aggregation target data is allocated to the parallel distributed processing execution server devices 130 on the basis of the key distribution values and the processing capacity ratios. Therefore, in the aggregation processing by the plurality of parallel distributed processing execution server devices 130, differences in the processing time required by the individual devices can be further suppressed, and distribution destination servers that are unlikely to cause a bottleneck can be determined. As a result, even faster parallel distributed processing can be expected.

Further, in the data processing system 200 according to the second embodiment, even when the parallel distributed processing execution server devices 130A to 130N are not the same model, or differ in specifications and processing capability because, for example, they were manufactured at different times, distribution destination servers that are unlikely to cause a bottleneck can be determined in consideration of those differences, and high-speed parallel distributed processing can be expected.

(3) Other Embodiments

In the data processing systems 100 and 200 according to the first and second embodiments described above, the case where the client processing data 3141 that is the source of the processing target data 1341 is stored in the secondary storage device 314 of the client device 310 has been described, but the present invention is not limited to this. For example, the data that is the source of the processing target data 1341 may be accumulated in an external backbone server system (not shown), and the accumulated data may be transmitted from the backbone server system to the data processing system 100 or 200 intermittently or periodically.

In the data processing systems 100 and 200 according to the first and second embodiments described above, the case has been described in which, when the client device 310 transmits a data analysis processing execution request, the aggregation key designated by the user is transmitted to the data distribution control server device 140 or 240. However, the present invention is not limited to this. For example, the aggregation key may be defined in the program of the parallel distributed processing execution unit 1331 of the parallel distributed processing execution server device 130. In such a case, the analysis application execution history management unit 1431 may acquire the aggregation key as a key column candidate from the parallel distributed processing execution unit 1331 during the first analysis process by the parallel distributed processing execution unit 1331.

Furthermore, in the first and second embodiments described above, the case where the data processing systems 100 and 200 execute the analysis processing in parallel has been described, but the present invention is not limited to this. For example, the data processing method according to the present invention may be applied to a data processing system that is configured using a plurality of stream processing platforms, inputs a data group aggregated for each key to the stream processing platforms, and performs time-series processing. In a data processing system configured in this way, a bottleneck caused by the concentration of data input on a specific stream processing platform can be avoided, and high-speed parallel distributed processing of streams can be expected.

In the data processing system 100 or 200 according to the first or second embodiment, the parallel distributed processing execution server devices 130 (130A to 130N) are an example of the plurality of first server devices, each of which performs analysis processing consisting of extraction processing for extracting a column and a column value for each record, in block units, from the data to be analyzed, and aggregation processing for aggregating the data to be aggregated using a key column as a key. In addition, the data distribution control server devices 140 and 240 are examples of the second server device, which performs processing for determining the key column based on the processing results of the extraction processing by the plurality of first server devices, processing for creating key distribution information indicating the appearance ratio of each column value in the key column, and processing for determining, based on the key distribution information, the plurality of first server devices that perform the aggregation processing. The analysis application execution history management unit 1431 is an example of the aggregation key acquisition unit.

The present invention can be applied to a data processing system and a data processing method for parallel and distributed processing of large-scale data.

100, 200 Data processing system
110 Network
120 Parallel distributed processing management server apparatus
121, 131, 141, 311 Network interface
122, 132, 142, 312 CPU
123, 133, 143, 243, 313 Main storage device
124, 134, 144, 244, 314 Secondary storage device
125, 135, 145, 315 Bus
1231 Data registration processing unit
1232 Distributed processing management unit
1233 OS
1241 Data arrangement information management table
1242 Processing execution status management table
130 (130A, 130B, ..., 130N) Parallel distributed processing execution server device
1331 Parallel distributed processing execution unit
1332 OS
1341 Processing target data
140, 240 Data distribution control server device
1431 Analysis application execution history management unit
1432 Key column candidate extraction unit
1433 Key distribution calculation unit
1434, 2434 Data distribution destination determination unit
1435 OS
1441 Analysis application execution history management table
1442 Key distribution calculation target column management table
1443 Key distribution calculation target column management table
1444, 2444 Distribution destination server device management table
1445 Key column candidate management table
1446 Key column threshold management table
1447 Key distribution calculation target column management table
1448 Aggregation processing start condition management table
1449 Distribution destination candidate server device management table
1450 Distribution data allocation management table
2450 Analysis processing efficiency management table
310 Client device
3131 Client processing unit
3132 OS
3141 Client processing data

Claims

1. A data processing system comprising:
a plurality of first server devices, each of which performs an analysis process consisting of an extraction process for extracting a column and a column value for each record from data to be analyzed and an aggregation process for aggregating data to be aggregated using a key column as a key; and
a second server device that performs processing for determining the key column based on processing results of past analysis processing by the plurality of first server devices, processing for creating key distribution information indicating an appearance ratio of each column value in the key column, and processing for determining, based on the key distribution information, the plurality of first server devices that perform the aggregation process,
wherein the second server device determines the plurality of first server devices that perform the aggregation process by allocating the data to be aggregated, obtained by dividing the data to be analyzed for each column value of the key column, to the plurality of first server devices according to the appearance ratio of the column value.
2. The data processing system according to claim 1, wherein the second server device comprises:
an aggregation key acquisition unit that acquires an aggregation key designated at the start of the analysis process as a key column candidate;
a key column candidate extraction unit that calculates, as a key column candidate degree, the ratio in which the key column candidate is designated in the past processing results of the analysis process, and determines the key column candidate as a key column when the key column candidate degree is greater than a predetermined threshold;
a key distribution calculation unit that creates key distribution information indicating an appearance ratio for each column value of the key column; and
a data distribution destination determination unit that calculates, based on the created key distribution information of the key column, the number of first server devices to be assigned for each column value of the key column, and determines, based on the calculated number and the key distribution information of the key column, the plurality of first server devices to which the data to be aggregated is allocated.
3. The data processing system according to claim 2, wherein the key distribution calculation unit calculates an appearance ratio for each column value of the key column until the progress of the analysis process satisfies a predetermined condition, and creates key distribution information indicating the calculated appearance ratio for each column value.
4. The data processing system according to claim 3, wherein the key distribution calculation unit calculates the appearance ratio for each column value of the key column until the predetermined condition, which is based on the number of completed blocks or the elapsed time of the analysis process, is satisfied, and creates key distribution information indicating the calculated appearance ratio for each column value.
5. The data processing system according to claim 2, wherein the data distribution destination determination unit determines the plurality of first server devices to which the data to be aggregated is allocated based on the appearance ratio for each column value of the key column and on the processing efficiency of each server device, calculated from the processing results of the analysis process executed in the past by each of the plurality of first server devices.
6. A data processing method in a data processing system having a plurality of first server devices that perform, in parallel on each server device, analysis processing consisting of data extraction processing and aggregation processing, and a second server device that performs processing based on processing results of past analysis processing by the plurality of first server devices, the method comprising:
a first step in which the plurality of first server devices extract a column and a column value for each record, in block units, from data to be analyzed;
a second step in which the second server device determines a key column based on the extraction results of the plurality of first server devices, creates key distribution information indicating an appearance ratio of each column value in the key column, and determines, based on the key distribution information, the plurality of first server devices that perform the aggregation processing; and
a third step in which the plurality of first server devices aggregate data to be aggregated using the key column determined by the second server device as a key,
wherein, in the second step, when determining the plurality of first server devices that perform the aggregation processing, the second server device allocates the data to be aggregated, obtained by dividing the data to be analyzed for each column value of the key column, to the plurality of first server devices according to the appearance ratio of the column value.
7. The data processing method according to claim 6, wherein, in the second step, the second server device:
acquires an aggregation key designated at the start of the analysis processing as a key column candidate;
calculates, as a key column candidate degree, the ratio in which the key column candidate is designated in the past processing results of the analysis processing by the plurality of first server devices;
determines the key column candidate as a key column when the calculated key column candidate degree is larger than a predetermined threshold;
creates key distribution information indicating the appearance ratio for each column value of the determined key column;
calculates, based on the created key distribution information of the key column, the number of first server devices to be assigned for each column value of the key column; and
determines, based on the calculated number and the key distribution information of the key column, the plurality of first server devices to which the data to be aggregated is allocated.
8. The data processing method according to claim 7, wherein, in the second step, the second server device calculates an appearance ratio for each column value of the key column until the progress of the analysis processing satisfies a predetermined condition, and creates key distribution information indicating the calculated appearance ratio for each column value.
9. The data processing method according to claim 8, wherein, in the second step, the second server device calculates the appearance ratio for each column value of the key column until the predetermined condition, which is based on the number of completed blocks or the elapsed time of the analysis processing, is satisfied, and creates key distribution information indicating the calculated appearance ratio for each column value.
10. The data processing method according to claim 7, wherein, in the second step, the second server device determines the plurality of first server devices to which the data to be aggregated is allocated based on the appearance ratio for each column value of the key column and on the processing efficiency of each server device, calculated from the processing results of the analysis processing executed in the past by each of the plurality of first server devices.