WO2019193810A1

WO2019193810A1 - Analysis sequence control system and analysis sequence control method

Info

Publication number: WO2019193810A1
Application number: PCT/JP2019/001531
Authority: WO
Inventors: 健杉本; 木下　雅文; 宏明郡浦; 侑中田
Original assignee: 株式会社日立製作所
Priority date: 2018-04-06
Filing date: 2019-01-18
Publication date: 2019-10-10
Also published as: JP2019185314A

Abstract

An objective of the present invention is to provide technology that reduces the man-hours spent on coordination among developers who respectively develop a plurality of analysis programs executed by an analysis sequence control system and an analysis sequence control program that calls the analysis programs. A data analysis system 101 creates, in association with a prior analysis program 3041[k-1], a handover filename for identifying a handover file 4042[k-1] that includes the filename of data to be handed over to a successor analysis program 3041[k]. By executing the analysis program 3041[k], the system acquires the data handed over from the prior analysis program 3041[k-1] on the basis of the filename included in the handover file 4042[k-1] associated with the prior analysis program 3041[k-1].

Description

Analysis sequence control system and analysis sequence control method

The present invention relates to an analysis sequence control system and an analysis sequence control method.

An effort to analyze the vast amount of IoT (Internet of Things) data in the world and create value such as cost reduction and new service planning based on the analysis results is attracting widespread attention. Data analysts are required to have science and business knowledge as well as engineering knowledge related to data acquisition and processing.

In other words, in addition to designing a process for analyzing data (analysis process), a data analyst must also design a process for accessing a data storage area such as a file system, a data lake, or a data warehouse (DWH) (data access process) and a process for visualizing the analysis results (visualization process). A software program that performs data access processing, analysis processing, and visualization processing is called an analysis program, and its development is called development of an analysis program.

For the development of analysis programs, a wide variety of software and services, such as open source software (OSS), can be used for data access processing, analysis processing, and analysis result visualization processing. The data analyst combines a plurality of these analysis programs, modifying them as needed, to form one analysis sequence. A software program that performs analysis sequence control processing, which executes each analysis program according to the analysis sequence, is called an analysis sequence control program, and its development is called development of an analysis sequence control program.

As the development scale increases, the development of such analysis programs and analysis sequence control programs is generally divided among a plurality of data analysts (developers).

The data access process, analysis process, and visualization process are closely related to each other: data is passed between processes, and one process may start another. For example, when software or a service for visualization processing is selected first, the data storage destinations that can hold the data to be visualized by that software or service may be limited. Likewise, when the software or service for analysis processing is decided first, the data to be acquired in the data access processing, the content of data processing, and the number of data samples to be acquired must be decided so that input suitable for the analysis processing is obtained. Therefore, when different data analysts are in charge of developing each process, they must coordinate with each other to maintain consistency between the processes. In other words, when development is divided among a plurality of data analysts, it must proceed while the processing content and interface of each program are adjusted to be consistent across the programs.

To obtain better analysis results, the analysis must be repeated many times while the analysis method is changed. When the analysis is repeated, the processing content or interface of each program may have to change because the format of the data to be analyzed changes or the number of data types increases. The processing content and interface of each program must then be redesigned after coordination among the data analysts, and this coordination takes considerable time. As a result, the coordination has become a factor that cuts into the time available for examining analysis algorithms and evaluating analysis results.

As an example, consider the case where three data analysts A, B, and C share the development of two analysis programs and one analysis sequence control program. The first analysis program, for which data analyst A is responsible, extracts data from a database. The second analysis program, for which data analyst B is responsible, analyzes the data extracted by the first analysis program. The analysis sequence control program, for which data analyst C is responsible, sequentially calls the first analysis program and the second analysis program through a call interface. Furthermore, when the analysis sequence control program calls the second analysis program, it informs the second analysis program of the storage location of the data that the first analysis program extracted from the database.

Suppose that the type of data to be analyzed has increased due to changes to the analysis method. In this case, the data analyst A changes the first analysis program so as to take out a new type of data from the database. The data analyst B changes the second analysis program so as to analyze the new type of data extracted by the first analysis program. The data analyst C changes the analysis sequence control program so that the second analysis program is also notified of the storage location of the new type of data extracted from the database by the first analysis program.

The program change procedure is as follows. Data analyst A and data analyst B coordinate and decide the data format of the new type of data. Data analyst A and data analyst C coordinate and decide the storage location of the new type of data. Data analyst B and data analyst C coordinate and decide how to notify the second analysis program of the storage location of the new type of data. When the redesigned programs are completed after this coordination, data analysts A, B, and C simultaneously place them on a server or the like so that they become usable (hereinafter referred to as “deploy”).

Thus, because the interface between the programs changes every time the analysis method is changed, the man-hours that each data analyst spends on adjusting interfaces and the like increase. In addition, the timing at which the completed programs become usable must be synchronized, so a data analyst who completes a program first must wait for the other programs to be completed, and unnecessary waiting time may occur for each data analyst.

Also, since analysis takes a long time, it is required to execute a plurality of analysis programs in parallel in order to reduce the time required for analysis. For this reason, it is necessary to determine an interface in consideration of executing a plurality of analysis programs in parallel, and the problem (item to be adjusted) is further complicated.

For example, methods described in Patent Document 1 and Patent Document 2 are known as techniques for transferring data between analysis programs executed in parallel.

Patent Document 1 discloses a method for the case where batch processing runs on a plurality of servers and a plurality of programs use common files on those servers. The access information of each file and the order of file accesses by the batch processing are analyzed, and the analysis result is used so that part of a subsequent process can be started without waiting for all of the preceding batch processes to end.

Patent Document 2 discloses a method for transferring data on the new system by transmitting data address information from the old system to the new system when the data is transferred from the old system to the new system.

Patent Document 1: Japanese Patent Laid-Open No. 7-114514
Patent Document 2: JP 2000-137604 A

As described above, when multiple data analysts develop an analysis sequence control program and multiple analysis programs, the problem is to reduce the man-hours that each data analyst spends on coordinating interfaces and the like, and to reduce the waiting time that occurs when each completed program is made usable.

However, simply applying the methods disclosed in Patent Documents 1 and 2 cannot solve the problem for the following reasons.

Patent Document 1 discloses a data transfer method for programs operating in parallel. However, in Patent Document 1, the data to be inherited is always fixed. Therefore, when different developers develop the plurality of analysis programs called by the analysis sequence control program, the technique of Patent Document 1 still requires the design of the analysis program itself to be changed whenever the data to be inherited increases or decreases. For this reason, the man-hours required for coordination between the developers are not reduced.

Patent Document 2 discloses a technique that, by taking over address information, prevents a change in data position from having an effect even when a program is updated. However, because the case where programs operate in parallel is not taken into consideration, arranging an interface that allows two programs to run in parallel becomes complicated. In addition, the address information itself must be created every time the program is updated. For this reason, the man-hours required for coordination between the developers are not reduced.

An object of the present invention is to provide a technique that enables a reduction in the man-hours for coordination between developers who respectively develop a plurality of analysis programs executed in an analysis sequence control system and an analysis sequence control program that calls the analysis programs.

An analysis sequence control system according to the present invention comprises a processor that executes a plurality of analysis programs and an analysis sequence control program that calls the plurality of analysis programs according to an analysis sequence, and a storage device used for storing data during execution of the analysis programs and the analysis sequence control program. By executing the analysis sequence control program, the processor, when the analysis programs are executed in parallel, creates, in correspondence with a preceding analysis program, takeover file information for specifying a takeover file that includes position information indicating the position where data to be taken over by the succeeding analysis program is stored. By executing an analysis program, the processor acquires the data taken over from the preceding analysis program based on the position information included in the takeover file specified by the takeover file information created in correspondence with the preceding analysis program, stores data generated by performing predetermined processing on that data in the storage device, and writes position information indicating the position where the generated data is stored to the takeover file specified by the takeover file information created in correspondence with that analysis program.

According to the present invention, when analysis programs are executed in parallel, takeover file information is created in correspondence with the preceding analysis program; this information specifies a takeover file that includes position information indicating where the data to be taken over by the subsequent analysis program is stored. Then, by executing an analysis program, the data taken over from the preceding analysis program is acquired based on the position information included in the takeover file specified by the takeover file information created for the preceding analysis program. Data generated by performing predetermined processing on that data is stored in the storage device, and position information indicating where the generated data is stored is written to the takeover file specified by the takeover file information created for that analysis program. Because the data to be handed over between analysis programs is specified in the takeover file, a change in that data can be absorbed by the takeover file, and the interface by which the analysis sequence control program calls each analysis program does not change. As a result, data handover is easy to implement, and it is possible to reduce the man-hours for coordination between developers who respectively develop the plurality of analysis programs executed in the analysis sequence control system and the analysis sequence control program that calls them.
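The handover mechanism summarized in this paragraph can be illustrated with a minimal Python sketch. It assumes that a takeover file is a JSON document mapping data name keys to file names in a shared directory; the JSON format, the file naming scheme, and the function names `takeover_path`, `run_stage`, and `demo` are hypothetical choices made for illustration and are not specified in the source.

```python
import json
import tempfile
from pathlib import Path


def takeover_path(store: Path, stage: int) -> Path:
    """Hypothetical naming scheme: one takeover file per stage."""
    return store / f"takeover_{stage}.json"


def run_stage(store: Path, stage: int, process) -> None:
    """Run analysis stage `stage` (1-based) using takeover files only."""
    # Read the position information recorded for the preceding stage.
    positions = json.loads(takeover_path(store, stage - 1).read_text())
    inputs = {key: (store / name).read_text() for key, name in positions.items()}

    # Apply the stage's own processing; it returns {data name key: content}.
    outputs = process(inputs)

    # Store the generated data and record where it was stored.
    new_positions = {}
    for key, content in outputs.items():
        name = f"{key}_stage{stage}.txt"
        (store / name).write_text(content)
        new_positions[key] = name
    takeover_path(store, stage).write_text(json.dumps(new_positions))


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        # Takeover file 0 stands in for the initial input of the sequence.
        (store / "power_raw.txt").write_text("1,2,3")
        takeover_path(store, 0).write_text(json.dumps({"power": "power_raw.txt"}))

        # Stage 1 passes "power" through and adds a new data type "thermal".
        run_stage(store, 1, lambda d: {"power": d["power"], "thermal": "7,8"})
        # Stage 2 reads whatever stage 1 recorded; its call signature is unchanged.
        run_stage(store, 2, lambda d: {"summary": ";".join(sorted(d))})

        final = json.loads(takeover_path(store, 2).read_text())
        return {key: (store / name).read_text() for key, name in final.items()}
```

In this sketch, adding the `thermal` data type changes only what stage 1 writes into its takeover file. The way `run_stage` is called for stage 2 stays the same, which is the property the paragraph attributes to the takeover file.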

A block diagram showing an example of the hardware configuration of the data analysis system according to an embodiment of the present invention.
A block diagram showing an example of the functions of the analysis sequence control server.
A block diagram showing an example of the functions of the analysis server.
A diagram showing an example of the functions of the shared data store.
A diagram showing an example of the functions of the development server.
A diagram showing an example of the common setting file.
A diagram showing an example of the takeover file.
A diagram showing an example of the analysis sequence setting file.
A diagram showing an example of the analysis program call format.
A diagram showing an example of the start interface (I/F) of the analysis sequence control program.
A flowchart showing an example of the overall analysis process.
A flowchart showing an example of the analysis program call process performed by the analysis sequence control program.
A flowchart showing an example of the analysis program start process performed by the analysis program start script.
A flowchart showing an example of the takeover file writing process performed by the analysis program start script.
A flowchart showing an example of the start script generation process of the start script generation tool.
A flowchart of changing the data analysis system when the present invention is applied.

Hereinafter, modes for carrying out the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing an example of the configuration of a data analysis system 101 as an analysis sequence control system according to the present embodiment. As shown in the figure, the data analysis system 101 includes an analysis sequence control server 102, an analysis server 103, a shared data store 104, a development server 105, a client device 106, a data storage unit 107, and a network 108 that connects them.

The analysis sequence control server 102 has an analysis sequence control program 2041 for calling an analysis program according to a predetermined analysis sequence. The analysis sequence control program 2041 includes an ID for identifying the analysis program, information indicating the execution format of the analysis program, and the order of calling the analysis program. The analysis sequence control server 102 instructs the analysis server 103 to execute the analysis program 3041 by executing the analysis sequence control program 2041.

The analysis server 103 has an analysis program for performing data access processing, analysis processing, analysis result visualization processing, and the like. The analysis server 103 executes these analysis programs 3041 according to instructions from the analysis sequence control server 102.

The shared data store 104 stores data transferred between the analysis sequence control server 102 and the plurality of analysis servers 103 and data that needs to be shared.

The development server 105 is a server used for developing an analysis sequence control program, an analysis program, or various setting files. The development server 105 deploys programs and setting files developed via the network 108 to the analysis sequence control server 102 and the analysis server 103.

The client device 106 is a device that instructs the analysis sequence control server 102 to execute the analysis sequence control program 2041 via the network 108. As the client device 106, a personal computer, a tablet terminal, or the like is used.

The data storage unit 107 includes, for example, a file system, a data lake, a data warehouse (DWH), and the like. The data storage unit 107 stores and holds data to be analyzed. The analysis target data stored in the data storage unit 107 is taken in from an external server or an external service via the network 108, for example. Alternatively, it is taken in from a storage device such as a CD, USB memory, or externally connected hard disk drive connected to the input / output device of the data storage unit 107.

The network 108 is, for example, a wired or wireless LAN (Local Area Network), a WAN (Wide Area Network), the Internet, an intranet, a dedicated line, and the like.

The analysis sequence control server 102, the analysis server 103, the shared data store 104, and the development server 105 may each be a server device configured by a physical computer, or may be configured by a virtual machine. Further, the analysis sequence control server 102, the analysis server 103, and the shared data store 104 may be configured by a plurality of units for parallel execution. The development server 105 and the client device 106 are not limited to one.

Furthermore, the roles of the servers may be mixed, and some or all of the servers may be included in one server device. For example, in actual development and subsequent operation, the analysis sequence control server 102 and the analysis server 103 may be configured on one physical server device, or they may be configured on separate virtual machines of one physical server device.

FIG. 2 is a diagram illustrating an example of hardware and functions included in the analysis sequence control server 102 of the present embodiment.

The analysis sequence control server 102 includes an input / output circuit interface 201, a processor 202, an input / output device 203, a storage device 204, and an internal communication line (for example, a bus) connecting them.

The input / output circuit interface 201 is an interface for communicating with the network 108.

The processor 202 is an arithmetic device and a control device. The processor 202 executes the analysis sequence control program 2041 stored in the storage device 204. The processor 202 calls the plurality of analysis programs 3041 according to the analysis sequence by executing the analysis sequence control program 2041.

The input / output device 203 is a device for accepting data input, outputting data, or both. For example, the input / output device 203 receives an input from a keyboard, a mouse, or the like, and displays information from the processor 202 on a display.

The storage device 204 includes a volatile storage device (DRAM (Dynamic Random Access Memory), etc.) and a non-volatile storage device (HDD, SSD, etc.). The storage device 204 stores an analysis sequence control program 2041 and an analysis sequence control engine 2042.

The analysis sequence control program 2041 is a software program for performing an analysis sequence control process for executing each analysis program in accordance with the analysis sequence. The analysis sequence control program 2041 is executed by the processor 202 to instruct the analysis server to call the analysis program according to the analysis sequence.

The analysis sequence control engine 2042 is an application framework that executes the analysis sequence control program 2041 and provides functions necessary for analysis sequence control, such as parallel execution of analysis programs and automatic retry when an analysis program fails.

FIG. 3 is a diagram illustrating an example of hardware and functions provided in the analysis server 103 of the present embodiment.

The analysis server 103 includes an input / output circuit interface 301, a processor 302, an input / output device 303, a storage device 304, and an internal communication line (for example, a bus) connecting them.

The input / output circuit interface 301 is an interface for communicating with the network 108.

The processor 302 is an arithmetic device and a control device. The processor 302 executes an analysis program 3041, an analysis program start script 3042, and a remote procedure call reception process 3043 stored in the storage device 304.

The input / output device 303 is a device for receiving data input, outputting data, or both. For example, the input / output device 303 receives an input from a keyboard, a mouse, or the like, and displays information from the processor 302 on a display.

The storage device 304 includes a volatile storage device (DRAM (Dynamic Random Access Memory), etc.) and a non-volatile storage device (HDD, SSD, etc.). The storage device 304 stores a plurality of analysis programs 3041, a plurality of analysis program activation scripts 3042, a remote procedure call reception process 3043, and an analysis library 3044.

The plural types of analysis programs 3041 (3041 [1] to 3041 [n], where n is a natural number equal to or greater than 2) are programs that perform processing such as accessing the data storage unit 107 (data access processing), analyzing data using an algorithm such as machine learning (analysis processing), and visualizing analysis results (analysis result visualization processing), and that output results according to the processing. The analysis programs 3041 [1] to 3041 [n] are called from the analysis sequence control program 2041 and executed in order (sequentially) by the processor 302. Two or more of the analysis programs 3041 [1] to 3041 [n] may also be executed in parallel. Executing only one analysis program 3041 is referred to as 1-parallel. Executing p analysis programs 3041 in parallel (p is a natural number) is referred to as p-parallel.

When the analysis program 3041 is executed in parallel, it is started based on a parallel execution ID list. A parallel execution ID is parallel execution unit identification information that identifies an individual execution unit within the parallel execution, and the parallel execution ID list is list information that enumerates a plurality of parallel execution IDs. As the parallel execution ID, a numerical value (1 to p) may be used, or a data name key value (“power”, “currency”, “thermal”, etc.) of the data name key 40411 included in the common setting file 4041 described later may be used. For example, when predicting the failure timing of a plurality of devices, a different parallel execution ID is assigned to each device, the analysis programs 3041 that analyze the data collected from each device are executed in parallel, and each execution is identified by its parallel execution ID. When the parallel execution ID list contains 10 parallel execution IDs, the analysis program 3041 is executed for each of the 10 IDs, giving 10-parallel execution in total. The data name key value corresponds to data name key information. Details of the parallel execution method will be described later.
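As a rough illustration of starting one execution unit per parallel execution ID, the following Python sketch uses the standard `concurrent.futures` thread pool. The function names `analyze` and `run_in_parallel` are hypothetical, and the use of threads is an assumption made for brevity; the source does not say how the parallel execution is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(parallel_id: str) -> str:
    """Stand-in for one execution unit of an analysis program."""
    return f"analyzed:{parallel_id}"


def run_in_parallel(parallel_ids: list[str]) -> dict[str, str]:
    """Start one execution unit per entry of the parallel execution ID list."""
    with ThreadPoolExecutor(max_workers=len(parallel_ids)) as pool:
        # map() preserves input order, so results line up with the IDs.
        results = list(pool.map(analyze, parallel_ids))
    return dict(zip(parallel_ids, results))


# Three IDs in the list give 3-parallel execution.
results = run_in_parallel(["power", "currency", "thermal"])
```

Here the data name key values from the text serve as parallel execution IDs; numeric IDs 1 to p would work the same way.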

The analysis program start scripts 3042 [1] to 3042 [n] are programs that operate in cooperation with the analysis programs 3041 [1] to 3041 [n]. Each analysis program start script 3042 starts the corresponding analysis program 3041. One analysis program start script exists for each of the analysis programs 3041 [1] to 3041 [n] and performs the pre-processing necessary for starting that analysis program. That is, when there are n types of analysis programs 3041, there are also n types of analysis program start scripts 3042. The analysis program start script 3042 is executed in parallel by the analysis sequence control program 2041 for each parallel execution ID.

The remote procedure call acceptance process 3043 is a process for accepting a remote procedure call for starting the analysis program 3041 from the analysis sequence control program 2041 and starting the analysis program start script 3042. In response to the instruction received from the analysis sequence control program 2041, the analysis program activation script 3042 to be activated is determined, and the analysis program activation script 3042 is activated.

The analysis library 3044 is a library that provides the analysis program 3041 with a function group necessary for analysis. For example, a machine learning library, a library that performs statistical processing, a library that provides access to a database or a data lake, or a library that performs visualization can be used.

The analysis program 3041 may be developed by a different developer for each type. The developers of the analysis sequence control program 2041 and the analysis program 3041 may be different.

FIG. 4 is a diagram illustrating an example of hardware and functions included in the shared data store 104 according to the present embodiment.

The shared data store 104 includes an input / output circuit interface 401, a processor 402, an input / output device 403, a storage device 404, and an internal communication line (for example, a bus) connecting them.

The input / output circuit interface 401 is an interface for communicating with the network 108.

The processor 402 is an arithmetic device and a control device. The processor 402 exchanges the common setting file 4041, the takeover file 4042, the analysis data 4043, the analysis result 4044, and the like stored in the storage device 404 with the analysis sequence control server 102 and the analysis server 103 via the network 108.

The input / output device 403 is a device for accepting data input, outputting data, or both. For example, the input / output device 403 receives input from a keyboard, a mouse, or the like, and displays information from the processor 402 on a display.

The storage device 404 includes a volatile storage device (DRAM (Dynamic Random Access Memory), etc.) and a non-volatile storage device (HDD, SSD, etc.). The storage device 404 stores a common setting file 4041, a takeover file 4042, analysis data 4043, and an analysis result 4044.

The common setting file 4041 is a file in which settings used in common by the analysis sequence control program 2041 and the analysis program 3041 are collected.

The takeover file 4042 is a file that collects information on the data taken over between the analysis programs 3041. For example, a takeover file 4042 exists for each type of analysis program 3041, and when the analysis program 3041 is executed in parallel, one exists for each parallel execution ID. The analysis programs 3041 [1] to 3041 [n], which are executed in order, use the takeover files 4042 [0] to 4042 [n-1]. For example, the takeover file 4042 [k−1] is used to take over information between the preceding analysis program 3041 [k−1] and the subsequent analysis program 3041 [k] (k is a natural number, 1 < k ≦ n) that are executed in order. The analysis program 3041 [1], which is executed first, uses the takeover file 4042 [0], which contains the parameters passed to the activation interface 601 of the analysis sequence control engine 2042. To take over information between the preceding analysis program 3041 [k-1] and the subsequent analysis program 3041 [k] executed with parallel execution ID = s, the takeover file 4042 [k-1] [s] is used.

The analysis data 4043 is, for example, data taken over between the analysis programs 3041: the preceding analysis program 3041 writes the analysis data 4043, and the subsequent analysis program 3041 reads it. Position information indicating where the analysis data 4043 is stored is recorded in the takeover file 4042. The position (storage location) of the analysis data 4043 is given by the data name value (“power_1232.csv”, “currency_1232.csv”, “thermal_1232.csv”, etc.) corresponding to the data name key value (“power”, “currency”, “thermal”, etc.) in the data name 40420 included in the takeover file 4042.
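The lookup from data name key value to storage location can be sketched as follows. The dictionary layout, the `data_name` field name, the `locate` function, and the `/shared` base directory are assumptions made for illustration; only the key values and file names come from the text.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory form of the data name section of a takeover file.
takeover = {
    "data_name": {
        "power": "power_1232.csv",
        "currency": "currency_1232.csv",
        "thermal": "thermal_1232.csv",
    }
}


def locate(takeover_file: dict, key: str, base_dir: str = "/shared") -> str:
    """Return the storage location recorded for a data name key value.

    `base_dir` is an assumed mount point for the shared data store.
    """
    names = takeover_file["data_name"]
    if key not in names:
        raise KeyError(f"no data handed over for key {key!r}")
    return str(PurePosixPath(base_dir) / names[key])
```

For example, `locate(takeover, "power")` yields `/shared/power_1232.csv`. A subsequent analysis program that only knows the key `power` does not need to know the concrete file name in advance.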

Analysis result 4044 is result data analyzed by executing a plurality of types of analysis programs 3041 according to the analysis sequence. The analysis data 4043 and the analysis result 4044 may be stored in the data storage unit 107.

FIG. 5 is a diagram illustrating an example of hardware and functions provided in the development server 105 of the present embodiment.

The development server 105 includes an input / output circuit interface 501, a processor 502, an input / output device 503, a storage device 504, and an internal communication line (for example, a bus) connecting them.

The input / output circuit interface 501 is an interface for communicating with the network 108.

The processor 502 is an arithmetic device and a control device. The processor 502 generates the analysis program start script 3042 by executing the start script generation tool 5041 stored in the storage device 504.

The input / output device 503 is a device for accepting data input, outputting data, or both. For example, the input / output device 503 receives an input from a keyboard, a mouse, or the like, and displays information from the processor 502 on a display.

The storage device 504 includes a volatile storage device (DRAM (Dynamic Random Access Memory), etc.) and a non-volatile storage device (HDD, SSD, etc.). The storage device 504 stores a start script generation tool 5041, an analysis sequence setting file 5042, and an analysis program call format 5043.

The start script generation tool 5041 generates an analysis program start script 3042 based on the analysis sequence setting file 5042 and the analysis program call format 5043.

FIG. 6 is a diagram illustrating an example of elements included in the common setting file 4041.

The common setting file 4041 includes a server address 40410, a data name key 40411, a parameter key 40412, and a parameter default value 40413.

The server address 40410 includes the server addresses required at the time of analysis, for example the address of the shared data store 104 and the address of the data storage unit 107. When the analysis sequence control program 2041 or the analysis program 3041 needs to access the data storage unit 107, for instance, the information in the server address 40410 is used as the access destination address. In the server address 40410, an address value (“192.168.0.1:8282”, “192.168.0.3:8282”, etc.) is set for each server type (“analyze_addr”, “db_addr”, etc.).

The data name key 40411 includes the keys used for identifying the data to be inherited between the analysis programs 3041. In the data name key 40411, a data name key value (“power”, “currency”, “thermal”, etc.) is set for each type of data (“power_file_name”, “currency_file_name”, “thermal_file_name”, etc.). Details regarding the inheritance of data names will be described later.

The parameter key 40412 includes the keys used to refer to parameter values that can be set when the analysis sequence control program 2041 is started. In the parameter key 40412, a parameter key value (such as “time_range” or “start_time”) is set for each parameter type (such as “time_range_parameter” or “start_time_parameter”). In the example shown in FIG. 6, the analysis sequence control program 2041 is activated using parameters whose parameter keys are “start_time” and “time_range”. The parameter key value corresponds to parameter key information. Details of the activation interface of the analysis sequence control program 2041 will be described later.

The parameter default value 40413 holds the default values of the parameters that can be set when the analysis sequence control program 2041 is started. In the parameter default value 40413, a default value (“10”, “2016-01-01T18:00:00.0000Z”, etc.) is set for each parameter key value (“time_range”, “start_time”, etc.) of the parameter key 40412. When these parameters are not included in the activation interface of the analysis sequence control program 2041, the parameter default value 40413 is used.
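As a minimal sketch of how the four elements of the common setting file 4041 might fit together, the following Python snippet models the file as a JSON document and looks up a parameter default. The sample keys and values are taken from the description above; the top-level section names (`server_address`, `data_name_key`, `parameter_key`, `parameter_default_value`) and the helper `default_for` are assumptions made only for illustration.

```python
import json

# Hypothetical JSON layout for the common setting file 4041.
# Only the sample keys and values come from the description;
# the top-level section names are assumptions.
COMMON_SETTING_JSON = """
{
  "server_address": {"analyze_addr": "192.168.0.1:8282", "db_addr": "192.168.0.3:8282"},
  "data_name_key": {"power_file_name": "power", "currency_file_name": "currency",
                    "thermal_file_name": "thermal"},
  "parameter_key": {"time_range_parameter": "time_range",
                    "start_time_parameter": "start_time"},
  "parameter_default_value": {"time_range": "10",
                              "start_time": "2016-01-01T18:00:00.0000Z"}
}
"""


def default_for(common: dict, parameter_type: str) -> str:
    """Return the default value for a parameter type such as 'time_range_parameter'."""
    key_value = common["parameter_key"][parameter_type]   # e.g. "time_range"
    return common["parameter_default_value"][key_value]


common_setting = json.loads(COMMON_SETTING_JSON)
print(default_for(common_setting, "time_range_parameter"))
```

Keeping the parameter type and the parameter key value as separate levels means a program only needs to know the type name, while the key actually written into other files can be changed in one place.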

FIG. 7 is a diagram for explaining an example of elements included in the takeover file 4042.

The number of takeover files 4042 corresponds to the number of types of analysis programs and the number of parallel executions of each analysis program. In other words, n takeover files 4042 (4042 [0] to 4042 [n-1]) are provided for the analysis programs 3041 [1] to 3041 [n]. Further, when the analysis program 3041 [k] is executed with a parallelism of p, p takeover files 4042 (4042 [k-1] [1] to 4042 [k-1] [p]) are provided, one for each execution unit of the parallel execution of the analysis program 3041 [k].

When the analysis sequence control program 2041 sequentially calls a plurality of types of analysis programs 3041 [1] to 3041 [n], the takeover file 4042 holds, for each parallel execution unit, information on the data to be taken over between the preceding analysis program 3041 [k-1], which is executed first, and the subsequent analysis program 3041 [k], which is executed next. For example, for the preceding analysis program 3041 [k-1] and the subsequent analysis program 3041 [k] executed with parallel execution ID = s, the takeover file 4042 [k-1] [s] is generated when the preceding analysis program 3041 [k-1] is executed, and the subsequent analysis program 3041 [k] takes over the data from the preceding analysis program 3041 [k-1] based on the takeover file 4042 [k-1] [s].

The takeover file 4042 includes a data name 40420 and a parameter value 40421.

The data name 40420 holds the data names taken over from the preceding analysis program 3041 [k-1]. Each data name is written in a key-value format: the key on the left side is a data name key value (“power”, “currency”, “thermal”, etc.) of the data name key 40411 of the common setting file 4041, and the value on the right side is a data name value ("power_1232.csv", "currency_1232.csv", "thermal_1232.csv", etc.). The data name value is location information indicating where the data is stored. A file name, the address and table name of a database server, the address and table name of a data lake server, or any other information sufficient to obtain the data to be used may serve as the data name value. The analysis program 3041 can access the data to be processed using the data name value on the right side. Further, because the data name key 40411 is included in the common setting file 4041, all the analysis sequence control programs 2041 and analysis programs 3041 can pass specific data based on the common data name key 40411. For example, when the data name key value in the takeover file 4042 is “power”, it is derived from the common setting file 4041 that the data stored in “power_1232.csv”, the data name value corresponding to that data name key value, is of the data type “power_file_name”. In the present embodiment, a file name is set as the data name value, but the same method can be applied when a data name value other than a file name is used.

The parameter value 40421 stores the parameter values actually set when the analysis sequence control program 2041 is started. In the parameter value 40421, the key on the left side is a parameter key value (“time_range”, “start_time”, etc.) of the parameter key 40412 of the common setting file 4041, and the value on the right side is a parameter value (“5”, “2017-06-10T12:00:00.000Z”, etc.). Further, because the parameter key 40412 is included in the common setting file 4041, all the analysis sequence control programs 2041 and analysis programs 3041 can pass specific data based on the common parameter key 40412. For example, when the parameter key value in the takeover file 4042 is “start_time”, it is derived from the common setting file 4041 that the parameter value “2017-06-10T12:00:00.000Z” corresponding to that parameter key value is of the parameter type “start_time_parameter”. The analysis program 3041 performs analysis using this parameter value 40421 in preference to the parameter default value 40413 included in the common setting file 4041.
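The derivation of a data type from a data name key value can be illustrated with a short Python sketch. It assumes the data name key 40411 and the data name 40420 are available as plain dictionaries; the variable and function names are hypothetical, and only the sample keys and file names come from the description.

```python
# Hypothetical in-memory forms of the data name key 40411 (common setting
# file 4041) and the data name 40420 (takeover file 4042).
data_name_key = {
    "power_file_name": "power",
    "currency_file_name": "currency",
    "thermal_file_name": "thermal",
}
takeover_data_name = {
    "power": "power_1232.csv",
    "thermal": "thermal_1232.csv",
}


def data_type_of(key_value: str) -> str:
    """Derive the data type (e.g. 'power_file_name') from a data name key value."""
    for data_type, value in data_name_key.items():
        if value == key_value:
            return data_type
    raise KeyError(key_value)


def describe(data_name: dict) -> dict:
    """Map each data type to the location (data name value) of its data."""
    return {data_type_of(key): location for key, location in data_name.items()}


typed = describe(takeover_data_name)
```

Here `typed["power_file_name"]` gives the location of the power data without the caller having to know how the file name was produced.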

The analysis sequence setting file 5042 shown in FIG. 8 and the analysis program call format 5043 shown in FIG. 9 are setting files used when the start script generation tool 5041 automatically generates the analysis program start script 3042.

FIG. 8 is a diagram for explaining an example of the analysis sequence setting file 5042.

The analysis sequence setting file 5042 includes an analysis sequence ID 50421 used for specifying an analysis sequence and a function list 50422, and includes a function ID 50423 and a function call address 50424 for each function. The analysis sequence control program 2041 executes the analysis sequence specified by the analysis sequence ID 50421 that matches the analysis sequence ID passed as a parameter at the time of activation. The function ID 50423 is an ID assigned to the analysis program 3041 and is used to find the corresponding analysis program call format 5043. The function call address 50424 is the address used when the analysis program 3041 is started. When this address is accessed, the remote procedure call acceptance process 3043 calls the analysis program start script 3042 that starts the analysis program 3041. A server type of the server address 40410 of the common setting file 4041 may be designated in the function call address 50424, in which case it is converted into the corresponding address value before the access.

FIG. 9 is a diagram showing an example of the analysis program call format 5043.

The analysis program call format 5043 includes a function ID 50431, a library name 50432, a script name 50433, a parallel execution method 50434, and a list of parameters 50435. Each parameter 50435 includes a name field 50436 and a type field 50437.

The function ID 50431 corresponds to the function ID 50423 included in the analysis sequence setting file 5042. The analysis sequence control program 2041 executes the analysis programs 3041 in the order of the function IDs 50423 included in the analysis sequence setting file 5042. The library name 50432 is the analysis tool used by the analysis program 3041. Script interpreters such as “bash” and “python” and dedicated analysis tools can be used. The script name 50433 is the execution file name of the analysis program 3041. If the library is “bash” or “python”, the file name of the script is used. The parallel execution method 50434 indicates whether the analysis program 3041 is (1) executed in parallel for each parallel execution ID or (2) started once with a list as its argument. In the latter case, it also indicates whether (2-1) a list of takeover file names or (2-2) a parallel execution ID list is used as the argument. The list of parameters 50435 is a list of the parameters to be passed to the analysis program 3041. For each parameter, a parameter used when starting the analysis program 3041 is generated. Each parameter has a name field 50436 and a type field 50437. The name field 50436 represents the association with the data name key 40411 of the common setting file 4041. As will be described later, the key value of the entry of the data name key 40411 of the common setting file 4041 that matches the value of the name field 50436 is extracted, and a program is generated that uses that value as a key to look up the data name 40420 of the takeover file 4042. The type field 50437 indicates an attribute of the parameter. Specifically, it indicates whether the parameter is a data name for read access, a data name for write access, or a parameter passed from the activation interface 601 or the common setting file 4041. The processing performed in the generated program changes depending on the type field 50437. The library name 50432 and the script name 50433 correspond to the analysis library name and the analysis script name. The analysis program call format information includes the library name 50432, the script name 50433, the parallel execution method 50434, and the list of parameters 50435. The name field 50436 and the type field 50437 correspond to a name field and a type field.

FIG. 10 shows an example of the activation interface 601 of the analysis sequence control engine 2042.

The analysis sequence control engine 2042 is activated using, for example, a REST (REpresentational State Transfer) interface. The activation interface 601 includes the address of the analysis sequence control engine 2042 and, as parameters, an analysis sequence ID and a parallel execution ID list that designates the parallel execution units. In addition, a parameter group for the analysis sequence control program 2041 is included. The analysis sequence control engine 2042 activates the analysis sequence control program 2041 based on the analysis sequence ID included in the activation interface 601.

In the example of FIG. 10, the activation interface 601 sends the analysis sequence ID (analysisid=DataAnalysis) and the parallel execution ID list (distlist=/tmp/distdir/) to the analysis sequence control engine 2042 of the analysis sequence control server 102 indicated by the address (http://www.hoge.co.jp/analysis).

The parameter group for the analysis sequence control program 2041 is transferred, for example, in the JSON (JavaScript Object Notation) format (JavaScript is a registered trademark). When a parameter described in the parameter key 40412 of the common setting file 4041 is included, the value passed through the activation interface 601 is used in preference to the parameter default value 40413 of the common setting file 4041. A parameter that is not specified in the parameter key 40412 of the common setting file 4041 can also be passed through the activation interface 601. In this case, because the parameter default value 40413 of the common setting file 4041 has no entry for it, it is necessary either to set a default value in the analysis program 3041 or to always set the parameter in the activation interface 601.
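The precedence rule can be sketched in Python as follows. The description gives the address, `analysisid`, and `distlist` values but not the exact request layout, so carrying the two IDs as query parameters and the parameter group as a JSON body is an assumption, and the extra parameter `threshold` is invented for illustration.

```python
import json
from urllib.parse import parse_qs, urlsplit

# Assumed request shape: IDs as query parameters, parameter group as JSON body.
url = "http://www.hoge.co.jp/analysis?analysisid=DataAnalysis&distlist=/tmp/distdir/"
body = '{"time_range": "5", "threshold": "0.8"}'  # "threshold" is a made-up extra parameter

# Parameter default value 40413 from the common setting file 4041.
defaults = {"time_range": "10", "start_time": "2016-01-01T18:00:00.0000Z"}

query = parse_qs(urlsplit(url).query)
analysis_sequence_id = query["analysisid"][0]
parallel_id_list = query["distlist"][0]

# Values passed through the activation interface win over the defaults.
# Keys without a default (here "threshold") are passed through unchanged,
# so the analysis program must supply its own default if they are omitted.
effective = {**defaults, **json.loads(body)}
```

With this input, `time_range` resolves to the passed value and `start_time` falls back to the default.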

FIG. 11 shows an example of a flowchart of the overall analysis flow of the data analysis system 101. In the data analysis system 101, the actual actors are the processor 202 of the analysis sequence control server 102 and the processor 302 of the analysis server 103, which execute the respective programs. In the following description, however, each program is described as the subject of the operation for convenience.

First, the activation interface 601 of the analysis sequence control engine 2042 is executed from the client device 106 (S1100). The analysis sequence control engine 2042 selects and executes an analysis sequence control program 2041 having an analysis sequence specified based on the analysis sequence ID included in the activation interface 601 (S1101).

Next, the analysis sequence control program 2041 instructs the analysis server 103 to execute the analysis programs 3041 [1] to 3041 [n] according to the analysis sequence corresponding to the analysis sequence ID (S1102). Each analysis program 3041 may be executed with a parallelism of two or more or with a parallelism of one.

The remote procedure call acceptance process 3043 receives the instruction to execute the analysis program 3041 and activates the analysis program start script 3042 (S1103). Then, the analysis program start script 3042 starts the analysis program 3041 (S1104), and the analysis program 3041 executes the analysis (S1105). After the analysis is completed, control returns to the analysis sequence control program 2041. The analysis sequence control program 2041 confirms that execution of the preceding analysis program 3041 [k-1] has ended and executes the subsequent analysis program 3041 [k] (S1106). When execution of all the analysis programs 3041 [1] to 3041 [n] is completed, the client device 106 is notified of the analysis result (S1107). The client device 106 displays the analysis result (S1108).

FIG. 12 is a flowchart showing an example of analysis program call processing in the analysis sequence control program 2041. This explains the details of the processing in S1102 of FIG.

First, the analysis sequence control program 2041 decomposes the parallel execution ID list into individual parallel execution IDs (S1200). Then, the file names of the takeover files 4042 are generated (S1201). A file name of the takeover file 4042 is generated for each parallel execution ID regardless of the number of parallel executions of the analysis program 3041. The file name generation logic of the takeover file 4042 is arbitrary as long as it does not defeat the object of the present invention. For example, a parallel execution ID and the name of the analysis program 3041 may be combined, and a prefix and a postfix may be added to obtain the file name. That is, when the analysis program 3041 [k] is executed in parallel as the analysis programs 3041 [k] [1] to 3041 [k] [p], the processor 202, by executing the analysis sequence control program 2041, creates the file names that specify the takeover files 4042 [k-1] [1] to 4042 [k-1] [p], each of which stores the data name value of the data name 40420 specifying the data to be taken over. The data name value of the data name 40420 corresponds to data information, and the file name of the takeover file 4042 corresponds to takeover file information.
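One possible naming rule of the kind mentioned above can be sketched in Python as follows. The prefix `takeover_`, the postfix `.json`, and the function names are assumptions; the description leaves the generation logic open.

```python
def takeover_file_name(program_name: str, parallel_id: str,
                       prefix: str = "takeover_", postfix: str = ".json") -> str:
    """Combine the analysis program name and a parallel execution ID,
    then add a prefix and a postfix (hypothetical rule)."""
    return f"{prefix}{program_name}_{parallel_id}{postfix}"


def takeover_file_names(program_name: str, parallel_ids: list) -> dict:
    """Generate one takeover file name per parallel execution ID (S1200, S1201)."""
    return {pid: takeover_file_name(program_name, pid) for pid in parallel_ids}


names = takeover_file_names("analysis1", ["1", "2", "3"])
```

Because the parallel execution ID is part of each name, the names stay distinct across parallel execution units of the same program.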

Next, a call interface for the analysis program 3041 is created. The interface differs between the case where the analysis program 3041 is executed in parallel for each parallel execution ID and the case where it is started once with a list as its argument.

When executing in parallel for each parallel execution ID (Yes in S1202), a call interface is generated that includes the file name of the takeover file 4042 [k-1] [s], which corresponds to parallel execution ID = s and was generated when the preceding analysis program 3041 [k-1] was executed, and the file name of the takeover file 4042 [k] [s], which is generated when the subsequent analysis program 3041 [k] is executed (S1203). That is, when the processor 202 executes the analysis sequence control program 2041, the interface that calls the analysis program 3041 [k] executed with parallel execution ID = s includes, as arguments, the file name of the takeover file 4042 [k-1] [s] corresponding to the analysis program 3041 [k-1] preceding the analysis program 3041 [k] and the file name of the takeover file 4042 [k] [s] corresponding to the analysis program 3041 [k] itself. By doing so, the analysis sequence control program 2041 can easily specify the takeover file information through the interface for calling the analysis program 3041. When the first analysis program 3041 [1] is executed, no preceding analysis program 3041 exists. Therefore, the contents of the parameters passed through the activation interface 601 of the analysis sequence control engine 2042 are generated as the takeover file 4042 [0], and this file is handled as the takeover file 4042 used by the analysis program 3041 [1]. As described above, when the analysis programs 3041 are started in parallel, the analysis sequence control program 2041 passes, as an argument, the file name of a separate takeover file 4042 for each parallel execution ID of the analysis program 3041. Each takeover file 4042 includes a data name value that identifies the data to be processed. Therefore, if the analysis program 3041 is built so as to extract the data indicated by the data name value included in the takeover file 4042 from the shared data store 104 or the data storage unit 107 and process it, analysis processing can be performed in parallel for each parallel execution ID using the common analysis program 3041.

When the analysis program 3041 is started once with a list as its argument, there are a case of passing a list of the takeover files 4042 and a case of passing the parallel execution ID list.

In the case of passing the list of the takeover files 4042 (Yes in S1204), a call interface is generated that includes the list of the takeover files 4042 [k-1] generated when the preceding analysis program 3041 [k-1] was executed and the list of the takeover files 4042 [k] generated when the subsequent analysis program 3041 [k] is executed (S1205). The takeover files 4042 [k-1] and the takeover files 4042 [k] cover all the parallel execution IDs. That is, when the processor 202 executes the analysis sequence control program 2041, the interface that calls the analysis program 3041 [k] includes information indicating the list of the takeover files 4042 [k-1] corresponding to the analysis program 3041 [k-1] preceding the analysis program 3041 [k] and information indicating the list of the takeover files 4042 [k] corresponding to the analysis program 3041 [k] itself. By doing so, the analysis sequence control program 2041 can easily specify the takeover file information through the interface for calling the analysis program 3041. For example, the takeover files 4042 may be placed together in a specific directory, and the directory may be passed as the information indicating the list of the takeover files 4042. As in the case of parallel execution for each parallel execution ID, no preceding analysis program 3041 exists when the first analysis program 3041 [1] is executed, so the parameters passed through the activation interface 601 of the analysis sequence control engine 2042 are generated as the takeover file 4042 [0], and this file is handled as the takeover file 4042 used by the analysis program 3041 [1].

When the parallel execution ID list is passed (No in S1204), a call interface that passes the parallel execution ID list included in the activation interface 601 of the analysis sequence control engine 2042 to the analysis program 3041 is generated (S1206).

Finally, the analysis program start script 3042 for starting the analysis program 3041 is called using the call interface created in each case (S1207).

When the execution of one analysis program 3041 is completed (S1208), if there is another analysis program 3041 to be executed (No in S1208), the analysis program 3041 to be executed next is specified and the same method is used. After creating the interface, the analysis program 3041 is executed. When all the analysis programs 3041 have been executed (Yes in S1208), the processing of this flowchart is ended.
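The three ways of building the call interface in FIG. 12 can be summarized in a small Python sketch. The `Mode` enumeration, the argument names `prev` and `own`, and the naming rule in `name_for` are all hypothetical; the sketch only mirrors the branching at S1202 and S1204.

```python
from enum import Enum


class Mode(Enum):
    PER_ID = "per_id"                # Yes in S1202
    TAKEOVER_LIST = "takeover_list"  # Yes in S1204
    ID_LIST = "id_list"              # No in S1204


def name_for(stage: int, parallel_id: str) -> str:
    # Hypothetical naming rule; stage 0 holds the activation parameters.
    return f"takeover_{stage}_{parallel_id}.json"


def build_calls(stage: int, mode: Mode, parallel_ids: list) -> list:
    """Return the argument sets used to call the start script for stage k."""
    if mode is Mode.PER_ID:
        # One call per parallel execution ID, carrying the preceding and
        # own takeover file names (S1203).
        return [{"prev": name_for(stage - 1, p), "own": name_for(stage, p)}
                for p in parallel_ids]
    if mode is Mode.TAKEOVER_LIST:
        # A single call carrying both takeover file lists (S1205).
        return [{"prev": [name_for(stage - 1, p) for p in parallel_ids],
                 "own": [name_for(stage, p) for p in parallel_ids]}]
    # A single call carrying the parallel execution ID list itself (S1206).
    return [{"parallel_ids": list(parallel_ids)}]


per_id_calls = build_calls(2, Mode.PER_ID, ["a", "b"])
```

For stage 1, `name_for(0, ...)` would refer to the takeover file 4042 [0] created from the activation parameters, so the first analysis program is handled by the same code path.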

FIG. 13 shows details of the operation of the analysis program start script 3042 in the analysis sequence control program 2041.

The analysis program start script 3042 first reads the common setting file 4041. Subsequent operations differ depending on the parallel execution method of the analysis program 3041.

First, the case where parallel execution is performed for each parallel execution ID (S1301) will be described. In this case, based on the arguments of the call interface of the analysis program start script 3042 for executing the analysis program 3041 [k], the takeover file 4042 [k-1] [s] corresponding to the preceding analysis program 3041 [k-1] with parallel execution ID = s is read (S1302). Then, using the contents of the read takeover file 4042 [k-1] [s] and the data name key value of the data name key 40411 of the common setting file 4041, the read file name is specified as the position information indicating the position (storage location) of the data used by the analysis program 3041 (S1303). This works because all the analysis programs 3041 associate specific data with the data name key 40411 of the common setting file 4041. When a certain analysis program 3041 wants to read the data corresponding to a data name key 40411, the data name value (file name) corresponding to the data name key value of that data name key 40411 is read from the takeover file 4042, so the subsequent analysis program 3041 [k] can grasp the position (storage location) of the data to be processed from the read data name value. The same applies when a certain analysis program 3041 wants to write data corresponding to a data name key 40411. In this way, data can be appropriately handed over between the preceding analysis program 3041 [k-1] and the subsequent analysis program 3041 [k]. Next, a write file name is generated as position information indicating the position (storage location) of data such as the analysis data 4033 and the analysis result 4044 (S1304). The file name generation logic is arbitrary as long as it does not defeat the object of the present invention. For example, a prefix and a postfix may be added to the file name of the takeover file 4042. Further, parameters are extracted from the takeover file 4042 (S1305). Then, in accordance with the interface specification of the analysis program 3041, the analysis program 3041 is activated with the read file name, the write file name, and the extracted parameters as arguments (S1306). Finally, a takeover file 4042 is generated and written for the subsequent analysis program 3041 to be executed next in accordance with the analysis sequence (S1307). The newly generated takeover file 4042 uses the file name of the takeover file 4042 for the current analysis program 3041, which is included in the call interface of the analysis program start script 3042 that started the currently executing analysis program 3041. In other words, by executing the analysis program 3041 [k], the processor 302 acquires the taken-over data specified by the data name value (file name) stored in the takeover file 4042 [k-1] [s], which is identified by the file name of the takeover file 4042 [k-1] [s] corresponding to the preceding analysis program 3041 [k-1], and writes the data name value (file name) specifying the data generated by predetermined processing using the taken-over data to the takeover file 4042 [k] [s], which is identified by the file name of the takeover file 4042 [k] [s] corresponding to the analysis program 3041 [k]. Because a takeover file name that differs for each parallel execution ID is set in the call interface, a takeover file 4042 with a different file name is generated for each parallel execution ID. Thus, the takeover file 4042 is not visible from the analysis program 3041, so in the present invention the developer of the analysis program 3041 does not need to be aware of the surrounding analysis program execution environment.
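The per-ID flow of the start script (S1302 to S1307) can be sketched in Python as follows. This is only an illustration under several assumptions: takeover files are stored as JSON with sections named `parameter_value` and `data_name`, the write file name is the takeover file name with the prefix `out_` and the postfix `.csv`, and the analysis program is replaced by a stand-in callable `analyze`.

```python
import json
import os
import tempfile


def run_stage(prev_path, own_path, read_key, write_key, analyze):
    """Per-ID flow of the start script, with `analyze` as a stand-in program."""
    with open(prev_path) as f:                       # S1302: read takeover file [k-1][s]
        prev = json.load(f)
    read_name = prev["data_name"][read_key]          # S1303: resolve the read file name
    # S1304: write file name = prefix + takeover file name + postfix (assumed rule)
    write_name = f"out_{os.path.basename(own_path)}.csv"
    params = prev["parameter_value"]                 # S1305: extract parameters
    analyze(read_name, write_name, params)           # S1306: start the analysis program
    new = {"parameter_value": params,
           "data_name": {**prev["data_name"], write_key: write_name}}
    with open(own_path, "w") as f:                   # S1307: write takeover file [k][s]
        json.dump(new, f)
    return new


calls = []
with tempfile.TemporaryDirectory() as d:
    prev_path = os.path.join(d, "takeover_1_s.json")
    own_path = os.path.join(d, "takeover_2_s.json")
    with open(prev_path, "w") as f:
        json.dump({"parameter_value": {"time_range": "5"},
                   "data_name": {"power": "power_1232.csv"}}, f)
    result = run_stage(prev_path, own_path, "power", "thermal",
                       lambda r, w, p: calls.append((r, w, p)))
```

The stand-in `analyze` receives only file names and parameters, which reflects the point that the analysis program itself never handles the takeover file.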

Next, the case where the list of the takeover files 4042 is included in the call interface (S1308) will be described. First, both the list of the takeover files 4042 of the preceding analysis program 3041 [k-1] and the list of the takeover files 4042 of the subsequent analysis program 3041 [k] to be executed this time are decomposed (S1309). Each list may include a plurality of file names of the takeover files 4042 or only one file name. Then, as in the case of parallel execution for each parallel execution ID, all the takeover files 4042 of the preceding analysis program 3041 [k-1] are read (S1310), and the files to be read are specified based on the data name key value of the data name key 40411 of the common setting file 4041 (S1311). In addition, a write file name is generated for each parallel execution ID as position information indicating the position (storage location) of the analysis data 4033 and the analysis result 4044 (S1312). Parameters are also extracted from the takeover files 4042 (S1313). Then, in accordance with the interface specification of the analysis program 3041, the analysis program 3041 is activated with the read file name list, the write file name list, and the extracted parameters as arguments (S1314). Finally, as in the case of parallel execution for each parallel execution ID, a takeover file 4042 is generated and written for each parallel execution ID for the subsequent analysis program 3041 to be executed next in accordance with the analysis sequence (S1315). Also in this case, the takeover file 4042 is not visible from the analysis program 3041, so the developer of the analysis program 3041 does not need to be aware of the surrounding analysis program execution environment.

Finally, when the parallel execution ID list is included in the call interface, the analysis program 3041 is started with the parallel execution ID list as an argument (S1316). The analysis program 3041 interprets the list of parallel execution IDs and executes accordingly. In this method, the analysis program 3041 needs to be aware of the parallel execution IDs, and the developer of the analysis program 3041 needs to be aware of the existence of the takeover file 4042. Such a method of passing a list of parallel execution IDs can be used in special cases, for example, when the list of parallel execution IDs is rewritten dynamically. This can be realized, for example, by representing the list of parallel execution IDs as the file names in a specific directory and having the analysis program 3041 change the number of files in that directory. Note that the case where the parallel execution ID list is included in the call interface may be omitted.

FIG. 14 shows an example of the takeover file writing method.

In the takeover file, the parameters are written first (S1400), then the read file names (S1401), and finally the write file names (S1402). The file names and parameter values actually used are written with the data name key values of the data name key 40411 and the parameter key values of the parameter key 40412 included in the common setting file 4041 as keys.

FIG. 15 shows an example of the operation flow of the startup script generation tool 5041.

The start script generation tool 5041 is a program that generates an analysis program start script 3042 based on the analysis sequence setting file 5042 and the analysis program call format 5043. One analysis sequence setting file 5042 exists for the analysis sequence control program 2041, and one analysis program call format 5043 exists for each type of analysis program 3041 called from the analysis sequence control program 2041. That is, the analysis program call format 5043 and the analysis program 3041 have a one-to-one correspondence. The analysis sequence setting file 5042 is created by the developer of the analysis sequence control program 2041, and the analysis program call format 5043 is naturally created by the developer of the corresponding analysis program 3041.

First, the startup script generation tool 5041 reads the analysis sequence setting file 5042 (S1500). Next, a list is created in which the function IDs 50423 included in the analysis sequence setting file 5042 are arranged in the order of description (S1501). Then, the following processing is performed in order from the top to the end of the list of function ID 50423. When one function ID 50423 to be processed is determined (S1502), the corresponding analysis program call format 5043 is read (S1503). The analysis program call format 5043 exists for each type of the analysis program 3041 called from the analysis sequence control program 2041, and the analysis program call format 5043 including the function ID 50431 that matches the function ID 50423 needs to exist. If the analysis program call format 5043 corresponding to the function ID 50423 is not found, an error is displayed and the process is terminated.
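Steps S1500 to S1503, including the error case, can be sketched in Python as follows. The dictionary field names (`function_list`, `function_id`, `function_call_address`, `library_name`, `script_name`) and the sample IDs `f1` and `f2` are assumptions chosen to mirror the element names of FIG. 8 and FIG. 9.

```python
def order_call_formats(sequence: dict, call_formats: list) -> list:
    """Pair each function ID of the sequence, in order of description,
    with its analysis program call format (S1501 to S1503)."""
    by_id = {fmt["function_id"]: fmt for fmt in call_formats}
    ordered = []
    for function in sequence["function_list"]:
        function_id = function["function_id"]
        if function_id not in by_id:
            # No matching call format: report an error and stop.
            raise LookupError(
                f"no analysis program call format for function ID {function_id!r}")
        ordered.append(by_id[function_id])
    return ordered


# Hypothetical contents of an analysis sequence setting file 5042
# and two analysis program call formats 5043.
sequence = {
    "analysis_sequence_id": "DataAnalysis",
    "function_list": [
        {"function_id": "f1", "function_call_address": "analyze_addr"},
        {"function_id": "f2", "function_call_address": "analyze_addr"},
    ],
}
call_formats = [
    {"function_id": "f2", "library_name": "python", "script_name": "second.py"},
    {"function_id": "f1", "library_name": "bash", "script_name": "first.sh"},
]

ordered = order_call_formats(sequence, call_formats)
```

The order of the result follows the sequence setting file rather than the order in which the call formats happen to be supplied.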

Thereafter, the program of the analysis program start script 3042 is generated step by step.

First, a common setting file 4041 reading program is generated (S1504). This may always be the same program as long as the format of the common setting file 4041 is fixed. For example, if the common setting file 4041 is in the JSON format as shown in FIG. 6, a program for reading the JSON format file may be generated.

Next, a list of call parameters for the analysis program 3041 is generated from the analysis program call format 5043 (S1505). Then, a program for generating parameters is created for each parameter in the list.

When the analysis program 3041 [k] is executed in parallel for each parallel execution ID (Yes in S1506), a program for reading the takeover file 4042 [k-1] [s] generated by the preceding analysis program 3041 [k-1] with parallel execution ID = s is generated (S1507). Then, a program for generating the parameters of the analysis program 3041 is generated (S1508). This is generated as follows, based on the name field 50436 and the type field 50437 of each parameter. First, when the type field 50437 is read, the takeover file 4042 [k-1] [s] generated by the preceding analysis program 3041 [k-1] contains the file name, that is, the data name value indicating the position (storage location) of the taken-over data. Therefore, a program is generated that extracts the value of the key of the data name key 40411 of the common setting file 4041 that matches the value of the name field 50436 of the analysis program call format 5043 and, using the extracted value as a key, extracts the file name from the takeover file 4042 [k-1] [s]. When the type field 50437 is write, a program for generating a write file name is generated. Finally, when the type field 50437 is a parameter (params), the takeover file 4042 [k-1] [s] generated by the preceding analysis program 3041 contains the parameter value. Therefore, a program is generated that extracts the value of the key of the parameter key 40412 of the common setting file 4041 that matches the value of the name field 50436 of the analysis program call format 5043 and, using the extracted value as a key, extracts the value of the parameter to be used from the takeover file 4042 [k-1] [s].
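The dependence of the generated code on the type field 50437 can be illustrated with a small Python generator that emits one line of start-script source per parameter. It assumes that the name field matches a key of the data name key 40411 or the parameter key 40412 (for example `power_file_name`), that the generated script has a dictionary named `takeover` and a helper named `make_write_file_name` in scope, and that the section names are as shown; none of these names appear in the description.

```python
def emit_parameter_line(param: dict, common: dict) -> str:
    """Emit one line of start-script source for a parameter of the call format."""
    name, kind = param["name"], param["type"]
    if kind == "read":
        key = common["data_name_key"][name]      # e.g. "power_file_name" -> "power"
        return f'{name} = takeover["data_name"]["{key}"]'
    if kind == "write":
        return f'{name} = make_write_file_name("{name}")'
    if kind == "params":
        key = common["parameter_key"][name]      # e.g. "time_range_parameter" -> "time_range"
        return f'{name} = takeover["parameter_value"]["{key}"]'
    raise ValueError(f"unknown type field: {kind}")


common = {
    "data_name_key": {"power_file_name": "power", "thermal_file_name": "thermal"},
    "parameter_key": {"time_range_parameter": "time_range"},
}
parameters = [
    {"name": "power_file_name", "type": "read"},
    {"name": "thermal_file_name", "type": "write"},
    {"name": "time_range_parameter", "type": "params"},
]

lines = [emit_parameter_line(p, common) for p in parameters]
```

Resolving the key at generation time means the emitted script contains only literal lookups into the takeover file and does not need to consult the call format again at run time.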

When the analysis program 3041 is called with the list of the takeover files 4042 [k-1] as an argument (No in S1506 and Yes in S1509), first, a program for decomposing the list of the takeover files 4042 [k-1] into the individual takeover files 4042 [k-1] is generated (S1510). Then, for each takeover file 4042 [k-1], as in the case of parallel execution for each parallel execution ID, a program is generated that reads the takeover file 4042 [k-1] and extracts the read, write, and call interface parameters according to the value of the type field 50437 of each parameter (S1511), and a program for generating the parameters of the analysis program 3041 is generated (S1512).

When the analysis program is executed based on the parallel execution ID list (No in S1509), a program for executing the analysis program 3041 is generated using the parallel execution ID list as a parameter (S1513).

After the program generation process for all parameters has been completed (S1514), the startup parameters of the analysis program can be generated, so a program for executing the analysis program 3041 is generated by combining them with the library name and script name included in the analysis program call format 5043 (S1515).

Finally, a program for writing the takeover file is generated (S1516). This program writes all the data read from the takeover file of the preceding analysis program 3041 as is. In addition, for each newly created write parameter, it looks up the entry of the common setting file 4041 whose data name key 40411 matches the parameter name, extracts its value, and writes an entry to the takeover file 4042 with the extracted value as the key and the file name as the value.
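The write step can be illustrated with a short sketch. It assumes, hypothetically, that the takeover file is a flat key-value mapping and that the common setting file exposes its data name keys under a `data_names` entry; the function name `build_takeover` is likewise an assumption for illustration.

```python
def build_takeover(previous, new_writes, common_settings):
    """Return the contents of the takeover file for the current stage.

    previous: contents of the preceding takeover file (dict)
    new_writes: {write parameter name: file name just written}
    """
    # Carry over everything read from the preceding takeover file unchanged.
    result = dict(previous)
    for name, file_name in new_writes.items():
        # Map the parameter name to its data name key value.
        key = common_settings["data_names"][name]
        result[key] = file_name
    return result


settings = {"data_names": {"score": "score_data"}}
before = {"sensor_data": "/data/sensor_s1.csv"}
after = build_takeover(before, {"score": "/data/score_s1.csv"}, settings)
```

Because earlier entries are carried forward, a later stage can still locate data produced several stages before it.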

The program generated so far is written out as the analysis program start script 3042 (S1517). Any scripting language may be used for this program; for example, it may be generated as a shell script or a Python script.
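As a rough illustration of emitting a start script as text, the sketch below renders a shell script from a library name, a script name, and resolved arguments. The way the library name and script name are combined into a path, the `--name value` argument style, and the function name `render_start_script` are assumptions made for this example only.

```python
import shlex


def render_start_script(library, script, args):
    """Render a shell start script that launches one analysis program."""
    # Assumed layout: the script lives in a directory named after the library.
    command = ["python", f"{library}/{script}"]
    for name, value in args.items():
        command += [f"--{name}", str(value)]
    # shlex.join quotes any argument containing spaces or shell metacharacters.
    return "#!/bin/sh\nset -e\n" + shlex.join(command) + "\n"


script_text = render_start_script(
    "analysis_lib", "detect.py", {"sensor": "/data/in file.csv"}
)
```

Quoting through `shlex.join` keeps file names with spaces intact when the generated script is later executed.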

These processes are repeated until execution of all the function IDs 50423 is completed (S1518).

As described above, the analysis program start script 3042 can be generated automatically. From the viewpoint of the developer of the analysis program 3041, it is no longer necessary to implement the analysis program start script 3042, which depends on the implementation environment of the present invention, so the developer can concentrate on creating the analysis program 3041 without worrying about the environment. This makes it possible to further shorten the period required before the analysis program 3041 can be executed.

FIG. 16 shows an example of a system modification sequence for the case where the system is developed with this mechanism by a total of three people, namely the developer of the analysis sequence control program 2041 and the two developers of the analysis programs 3041[1] and 3041[2], and it becomes necessary to pass one new type of data between the analysis program 3041[1] and the analysis program 3041[2].

First, the developer of the preceding analysis program 3041[1] determines the structure of the newly created data (hereinafter referred to as "this data") (S1600). Then, the design of the analysis program 3041[1] is changed so that this data is used as a write parameter, the analysis program call format 5043 corresponding to the analysis program 3041[1] is changed (S1601), and the startup script generation tool 5041 is executed (S1602). As a result, an analysis program start script 3042 for the analysis program 3041[1] is created. Then, the analysis program 3041[1] is implemented (S1603) and deployed to the execution environment (S1604).

The developer of the subsequent analysis program 3041[2] first agrees with the developer of the preceding analysis program 3041[1] on the data format of this data. Next, the analysis program 3041[2] is redesigned so that this data is used as a read parameter, and the analysis program call format 5043 corresponding to the analysis program 3041[2] is also changed (S1605). Then, the startup script generation tool 5041 is executed (S1606). As a result, an analysis program start script 3042 for the analysis program 3041[2] is generated. Then, the analysis program 3041[2] is implemented (S1607) and deployed to the execution environment (S1608). Note that the analysis program 3041[1] must be deployed (S1604) before the analysis program 3041[2] is deployed (S1608).

As can be seen from this sequence, the analysis sequence control program 2041 need not be changed. This is because all data exchange between the analysis programs 3041 is performed through the takeover file 4042: even if the data exchanged between the analysis programs 3041 changes, the name of the takeover file 4042 does not change, and the effect of the change is confined to the contents of the takeover file 4042. Also, the analysis sequence control program 2041 and all the analysis programs 3041 previously had to be deployed at the same time. According to the present invention, however, it is sufficient to observe a simple deployment order among the analysis programs 3041.

From the above, it is possible to shorten the time that multiple developers need in order to agree on interfaces and coordinate program deployment when the analysis method changes, which was the issue in developing the analysis programs and the analysis sequence control program. This makes it possible to focus more on the creation of the analysis programs.

In addition, according to the present invention, the analysis program designer would otherwise need to carry out the design related to the takeover file 4042. By using the startup script generation tool 5041, however, the additional work required of the analysis program designer can be limited to creating the analysis program call format 5043.

As described above, according to the data analysis system 101 of this embodiment, when the analysis programs 3041 are executed in parallel, takeover file information is created in correspondence with the preceding analysis program 3041[k-1]. This information specifies the takeover file 4042[k-1], which includes the data name value indicating the location where the data to be passed to the subsequent analysis program 3041[k] is stored. Then, by executing the analysis program 3041[k], the data inherited from the preceding analysis program 3041[k-1] is acquired based on the data name value included in the takeover file 4042[k-1] specified by the file name created for the preceding analysis program 3041[k-1]. The data generated by performing predetermined processing on the acquired data is stored in the shared data store 104 or the data storage unit 107 serving as a storage device, and the data name value indicating the position where the generated data is stored is written to the takeover file 4042[k] specified by the file name created for the analysis program 3041[k]. As a result, the data to be passed between the analysis programs 3041 is specified in the takeover file 4042, so even if that data changes, the change is absorbed by the takeover file 4042 and the interface by which the analysis sequence control program 2041 calls each analysis program 3041 does not change. Data transfer is therefore easy to implement, and the man-hours spent on coordination among the developers of the plurality of analysis programs 3041 executed in the data analysis system 101 and of the analysis sequence control program 2041 that calls them can be reduced.

The call interface by which the analysis sequence control program 2041 calls the analysis program 3041 includes either (a) the takeover file 4042[k-1] corresponding to the analysis program 3041[k-1] preceding the analysis program 3041[k] and the takeover file 4042[k] corresponding to the analysis program 3041[k], or (b) a list of the takeover files 4042[k-1] corresponding to the preceding analysis program 3041[k-1] and a list of the takeover files 4042[k] corresponding to the analysis program 3041[k]. As a result, the analysis sequence control program 2041 can easily specify the information of the takeover file 4042 on the call interface for calling the analysis program 3041.
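The two forms of the call interface can be sketched as below. The sketch assumes, purely for illustration, that takeover file names are derived from the stage index k and the parallel execution ID s alone; the naming pattern and the function names `takeover_path`, `build_call`, and `build_list_call` are hypothetical.

```python
def takeover_path(stage, parallel_id):
    """Hypothetical naming rule: one takeover file per stage and parallel ID."""
    return f"takeover_{stage}_{parallel_id}.json"


def build_call(stage, parallel_id):
    """Form (a): a single input takeover file and a single output takeover file."""
    return {
        "takeover_in": takeover_path(stage - 1, parallel_id),
        "takeover_out": takeover_path(stage, parallel_id),
    }


def build_list_call(stage, parallel_ids):
    """Form (b): lists of input and output takeover files."""
    return {
        "takeover_in": [takeover_path(stage - 1, s) for s in parallel_ids],
        "takeover_out": [takeover_path(stage, s) for s in parallel_ids],
    }


single_call = build_call(2, "s1")
list_call = build_list_call(2, ["s1", "s2"])
```

Neither function depends on what data the stages exchange, which is why a change in the exchanged data leaves the caller untouched in this sketch.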

Also, the common setting file 4041 including the data name key value corresponding to the data type is stored in the storage device 404 of the shared data store 104. In the takeover file 4042[k-1], the data name key value corresponding to the type of data that the analysis program 3041[k-1] corresponding to the takeover file 4042[k-1] passes to the subsequent analysis program 3041[k] is associated with the data name value indicating the position where the data is stored. By executing the analysis program 3041[k], the processor 302 of the analysis server 103 uses the data name key value, based on the takeover file 4042[k-1] corresponding to the preceding analysis program 3041[k-1] and the common setting file 4041, to acquire the type of the data inherited from the preceding analysis program 3041[k-1] and the data name value indicating the position where the data is stored. It then writes, to the takeover file 4042[k] corresponding to the analysis program 3041[k], the data name key value corresponding to the type of the generated data to be passed to the subsequent analysis program 3041[k+1] in association with the data name value indicating the position where that data is stored. As a result, the common setting file 4041 associates the data type with the data name key value, and the takeover file 4042 associates the data name key value with the data name value. Information specifying the data to be passed between the analysis programs 3041 can therefore be managed in a common and collective manner based on the common setting file 4041 and the takeover file 4042.

Also, the common setting file 4041, which stores default values of parameters that can be set as arguments of the analysis program 3041, is stored in the storage device 404 of the shared data store 104. The call interface by which the analysis sequence control program 2041 calls the analysis program 3041 can include values of parameters that can be set as arguments of the analysis program 3041. The analysis program 3041 uses the parameter value set in the call interface if one is set, and uses the default value otherwise. As a result, a parameter that can be set in the analysis program 3041 may either be given a default value in advance in the common setting file 4041 or be specified through the interface that calls the analysis program 3041, which improves the convenience of data analysis design and operation.
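The fallback from call-interface values to defaults amounts to a simple merge. The sketch below assumes that defaults and call values are plain mappings and that an unset value is represented by `None`; both are assumptions for illustration, as is the function name `effective_parameters`.

```python
def effective_parameters(defaults, call_values):
    """Merge call-interface values over defaults from the common setting file."""
    merged = dict(defaults)  # start from the defaults
    for name, value in call_values.items():
        if value is not None:  # None is treated here as "not set"
            merged[name] = value
    return merged


defaults = {"window": 60, "threshold": 0.5}
example_params = effective_parameters(defaults, {"threshold": 0.9, "window": None})
```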

Also, the common setting file 4041 includes parameter key values corresponding to parameter types. The takeover file 4042[k-1] can include the parameter key value corresponding to the type of parameter that the analysis program 3041[k-1] corresponding to the takeover file 4042[k-1] passes to the subsequent analysis program 3041[k], together with the value of that parameter. By executing the analysis program 3041[k], the processor 302 of the analysis server 103 uses the parameter key value, based on the takeover file 4042[k-1] corresponding to the preceding analysis program 3041[k-1] and the common setting file 4041, to acquire the type and the value of the parameter inherited from the preceding analysis program 3041[k-1]. It then writes, to the takeover file 4042[k] corresponding to the analysis program 3041[k], the parameter key value corresponding to the type of parameter to be passed to the subsequent analysis program 3041[k+1] in association with the value of that parameter. Because parameter values to be given to the analysis programs 3041 can thus also be passed between the analysis programs 3041, the convenience of data analysis design and operation is further enhanced.

Further, the call interface by which the analysis sequence control program 2041 calls the analysis program 3041 can include a parallel execution ID list in which parallel execution IDs are listed. A specific directory is given as the parallel execution ID list, and the names of the files under that directory are used as the parallel execution IDs. As a result, when a parallel execution ID list is used instead of the takeover file 4042, the parallel execution IDs on the list can easily be rewritten dynamically, because the file names under the specific directory serve as the parallel execution IDs. This improves the convenience of data analysis design and operation.
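Deriving parallel execution IDs from a directory can be sketched as follows. The function name `parallel_ids_from_directory` is hypothetical, and the choices to ignore subdirectories and to sort the result are assumptions of this example rather than requirements stated in the embodiment.

```python
import os
import tempfile


def parallel_ids_from_directory(directory):
    """Use the names of the regular files in a directory as parallel execution IDs."""
    return sorted(
        entry
        for entry in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, entry))
    )


# Demonstration with a temporary directory standing in for the ID directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("line_b", "line_a"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "subdir"))  # not a file, so not an ID here
    demo_ids = parallel_ids_from_directory(tmp)
```

Adding or removing a file in the directory changes the ID list on the next run without editing any list by hand.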

In addition, the system has a startup script generation tool 5041 for generating an analysis program start script 3042[k] that, in cooperation with the analysis program 3041[k], executes three processes: reading, from the takeover file 4042[k-1] corresponding to the analysis program 3041[k-1] preceding the analysis program 3041[k], the data name value indicating the position where the data to be inherited is stored; calling the analysis program 3041[k]; and writing, to the takeover file 4042[k] corresponding to the analysis program 3041[k], the data name value indicating the position where the data generated by the execution of the analysis program 3041[k] is stored. Because the analysis program start script 3042 created by the startup script generation tool 5041 reads and writes the takeover file 4042, the analysis programs 3041 do not need to perform this processing themselves. As a result, the developer of each analysis program 3041 is spared the trouble of creating a program for reading and writing the takeover file 4042 and incorporating it into the analysis program 3041.
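The three processes performed by a start script (read, call, write) can be illustrated end to end. This is a sketch under assumptions: the takeover files are stored as JSON, the analysis program is represented by a Python callable taking an input path and an output path, and all names (`run_stage`, `double_values`, the keys `raw_data` and `doubled_data`) are hypothetical.

```python
import json
import os
import tempfile


def run_stage(analyze, prev_takeover_path, own_takeover_path,
              read_key, write_key, out_path):
    # 1. Read the storage location of the inherited data.
    with open(prev_takeover_path) as f:
        takeover = json.load(f)
    in_path = takeover[read_key]
    # 2. Call the analysis program with plain file paths only.
    analyze(in_path, out_path)
    # 3. Record where the generated data was stored.
    takeover[write_key] = out_path
    with open(own_takeover_path, "w") as f:
        json.dump(takeover, f)


def double_values(in_path, out_path):
    """Stand-in analysis program: doubles each integer in the input file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(f"{int(line) * 2}\n")


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.txt")
    with open(raw, "w") as f:
        f.write("1\n2\n")
    prev = os.path.join(tmp, "takeover_0.json")
    with open(prev, "w") as f:
        json.dump({"raw_data": raw}, f)
    own = os.path.join(tmp, "takeover_1.json")
    out = os.path.join(tmp, "doubled.txt")
    run_stage(double_values, prev, own, "raw_data", "doubled_data", out)
    with open(own) as f:
        demo_takeover = json.load(f)
    with open(out) as f:
        demo_output = f.read()
```

Here `double_values` contains no takeover-file handling at all, which mirrors the division of labor between the start script and the analysis program.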

Further, the storage device 504 of the development server 105 stores an analysis sequence setting file 5042, in which information relating to the calling order of the analysis programs 3041 is recorded, and an analysis program call format 5043, which includes a library name, a script name, and the parameters given when calling the program. The startup script generation tool 5041 is given the analysis sequence setting file 5042 and the analysis program call format 5043. It identifies the analysis program 3041 specified by the script name in the library specified by the library name, and generates the analysis program start script 3042 that gives the parameters and starts the analysis program 3041 in the calling order of the analysis sequence setting file 5042. In this way, the developer of the analysis sequence control program 2041 creates the analysis sequence setting file 5042, and the developer of the analysis program 3041 creates the analysis program call format 5043, so the development work is divided and development efficiency can be improved.

Also, the common setting file 4041 including the data name key value corresponding to the data type is stored in the storage device 404 of the shared data store 104. The analysis program call format 5043 has a name field indicating the name of a parameter set as an argument when calling the analysis program 3041, and a data name key value is set in the name field. Based on the common setting file 4041 and the analysis program call format 5043, the startup script generation tool 5041 uses the data name key value to generate a data name value (file name) that is consistent between the analysis program 3041 that delivers the data and the analysis program 3041 that receives the data. The data name key value stored in the common setting file 4041 is information used in common by the delivering and receiving analysis programs 3041. Therefore, if a data name value (file name) is assigned to the data in association with the data name key value, the file names are automatically kept consistent among the analysis programs 3041.

Other implementations of the present disclosure will become apparent to those skilled in the art from consideration of the specification and embodiments of the present disclosure disclosed herein. Various aspects and / or components of the described embodiments may be used alone or in any combination. The specification and specific examples are merely exemplary and the scope and spirit of the disclosure is set forth in the appended claims.

DESCRIPTION OF SYMBOLS 101 ... Data analysis system, 102 ... Analysis sequence control server, 103 ... Analysis server, 104 ... Shared data store, 105 ... Development server, 106 ... Client device, 107 ... Data storage part, 108 ... Network, 202, 302 ... Processor, 601 ... Start interface, 2041 ... Analysis sequence control program, 2042 ... Analysis sequence control engine, 3041 ... Analysis program, 3042 ... Analysis program start script, 3043 ... Remote procedure call acceptance process, 4041 ... Common setting file, 4042 ... Takeover file, 4043 ... Analysis data, 4044 ... Analysis result, 5041 ... Startup script generation tool, 5042 ... Analysis sequence setting file, 5043 ... Analysis program call format

Claims

A processor that executes a plurality of analysis programs and an analysis sequence control program that calls the plurality of analysis programs according to an analysis sequence;
A storage device used for storing data in the execution of the analysis program and the analysis sequence control program,
The processor is
By executing the analysis sequence control program, when the analysis programs are executed in parallel, creates, corresponding to the preceding analysis program, takeover file information that specifies a takeover file including position information indicating a position where data to be passed to the subsequent analysis program is stored,
By executing the analysis program, acquires the data to be taken over from the preceding analysis program based on the position information included in the takeover file specified by the takeover file information created corresponding to the preceding analysis program, stores in the storage device the data generated by performing predetermined processing on the acquired data, and writes the position information indicating the position where the generated data is stored to the takeover file specified by the takeover file information created corresponding to the analysis program;
Analysis sequence control system.
The analysis sequence control system according to claim 1,
In the call interface for the analysis sequence control program to call the analysis program,
A takeover file corresponding to the analysis program preceding the analysis program and a takeover file corresponding to the analysis program itself, or
Information indicating a list of transfer files corresponding to the analysis program in the previous stage of the analysis program and information indicating a list of transfer files corresponding to the analysis program itself;
Included,
Analysis sequence control system.
The analysis sequence control system according to claim 1,
A common setting file including data name key information corresponding to the type of data is stored in the storage device,
The takeover file includes data name key information corresponding to the type of data that the analysis program corresponding to the takeover file takes over to the subsequent analysis program and position information indicating the position where the data is stored,
The processor executes the analysis program,
Based on the takeover file corresponding to the preceding analysis program and the common setting file, uses the data name key information to acquire the type of data to be taken over from the preceding analysis program and the position information indicating the position where the data is stored, and
Writes, in the takeover file corresponding to the analysis program, the data name key information corresponding to the type of the generated data to be passed to the subsequent analysis program in association with the position information indicating the position where the data is stored.
Analysis sequence control system.
The analysis sequence control system according to claim 1,
A common setting file storing default values of parameters that can be set as arguments of the analysis program is stored in the storage device,
The call interface for the analysis sequence control program to call the analysis program can include values of parameters that can be set as arguments of the analysis program,
The analysis program uses the value of the parameter if the value of the parameter is set as an argument of the analysis program in the call interface, and uses the default value if the value of the parameter is not set.
Analysis sequence control system.
The analysis sequence control system according to claim 4,
The common setting file further includes parameter key information corresponding to a parameter type,
The takeover file can include parameter key information corresponding to the type of parameter that the analysis program corresponding to the takeover file passes to the subsequent analysis program and the parameter value of the parameter,
The processor executes the analysis program,
Based on the takeover file corresponding to the previous analysis program and the common setting file, using the parameter key information, obtain the parameter type and parameter value of the parameter to be taken over from the previous analysis program, and
The parameter key information corresponding to the parameter type to be passed to the subsequent analysis program and the parameter value of the parameter are written in association with each other in the takeover file corresponding to the analysis program.
Analysis sequence control system.
The analysis sequence control system according to claim 2,
The call interface for the analysis sequence control program to call the analysis program can include list information that lists parallel execution unit identification information,
A file name under a specific directory is used as the parallel execution unit identification information.
Analysis sequence control system.
The analysis sequence control system according to claim 1,
Further comprising a startup script generation tool for generating an analysis program start script that, in cooperation with the analysis program, executes three processes: reading, from the takeover file corresponding to the analysis program preceding the analysis program, position information indicating the position where the data is stored; calling the analysis program; and writing, to the takeover file corresponding to the analysis program, position information indicating the position where the data generated by executing the analysis program is stored;
Analysis sequence control system.
The analysis sequence control system according to claim 7,
The storage device stores an analysis sequence setting file in which information relating to the calling order of the analysis programs is recorded, and analysis program call format information including an analysis library name, an analysis script name, and parameters to be provided when the analysis program is called,
The startup script generation tool is provided with an analysis sequence setting file and analysis program call format information,
The start script generation tool identifies an analysis program specified by the analysis script name from the analysis library specified by the analysis library name, and gives the parameters to start the analysis program in the calling order. Generate scripts,
Analysis sequence control system.
The analysis sequence control system according to claim 8,
A common setting file including data name key information corresponding to the type of data is stored in the storage device,
The analysis program call format information has a name field indicating the name of a parameter set as an argument when calling the analysis program,
In the name field, the data name key information is set,
The startup script generation tool, based on the common setting file and the analysis program call format information, uses the data name key information to generate a data name value for the data that is consistent between the analysis program that delivers the data and the analysis program that receives the data,
Analysis sequence control system.
An analysis sequence control method for calling a plurality of analysis programs according to an analysis sequence,
When the analysis programs are executed in parallel, creating, corresponding to the preceding analysis program, takeover file information for specifying a takeover file including position information indicating the position where data to be passed to the subsequent analysis program is stored,
In the execution of the analysis program, acquiring the data to be taken over from the preceding analysis program based on the position information included in the takeover file specified by the takeover file information created corresponding to the preceding analysis program, storing in a storage device the data generated by performing predetermined processing on the acquired data, and writing the position information indicating the position where the generated data is stored to the takeover file specified by the takeover file information created corresponding to the analysis program;
Analysis sequence control method.