WO2019123732A1

WO2019123732A1 - Analysis support method, analysis support server, and storage medium

Info

Publication number: WO2019123732A1
Application number: PCT/JP2018/033417
Authority: WO
Inventors: 俊彦樫山
Original assignee: 株式会社日立製作所
Priority date: 2017-12-18
Filing date: 2018-09-10
Publication date: 2019-06-27
Also published as: KR102309094B1; JP6842405B2; KR20200020932A; JP2019109676A

Abstract

This analysis support method, in which a computer having a processor and a memory evaluates data to be analyzed, includes: a first step in which the computer reads a first data catalog that stores the definition of a column of the data to be analyzed and a second data catalog that defines a column of input data of an analysis software that executes an analysis process; a second step in which the computer calculates, as mapping accuracy, the similarity between a column of the first data catalog and a column of the second data catalog; and a third step in which the computer calculates the difficulty in analyzing the data to be analyzed with the analysis software on the basis of the mapping accuracy of the column of the second data catalog used in the analysis software.

Description

Analysis support method, analysis support server and storage medium

Incorporation by reference

This application claims priority to Japanese Patent Application No. 2017-241859, filed on December 18, 2017, the contents of which are incorporated herein by reference.

The present invention relates to a technique for proposing software for analyzing data.

To reduce the time required to analyze data, it is desirable to reuse software, such as applications and queries, that was used in past data analysis. One known technique applies schema matching in order to reuse such software (for example, Non-Patent Document 1). Non-Patent Document 1 discloses a technique for calculating the degree of similarity between the components of schemas that were analyzed in the past and those of the schema of data to be newly analyzed.

Further, Patent Document 1 discloses a technique for identifying elements of data necessary to use software used in past data analysis.

Patent Document 2, for example, is known as a technique by which a computer recommends software to a user. Patent Document 2 discloses a technology in which a server recommends an application according to the application's power consumption.

Patent Document 1: U.S. Patent No. 9110967
Patent Document 2: JP 2012-63917 A

The conventional examples described above make it possible to identify the similarity between data schemas and the relationships between data components. However, the person in charge of analysis must still judge, from past results and the like, which software to use for the new data to be analyzed, based on the schema similarity and the relationships between components. That is, in the conventional examples, the person in charge of analysis may have to choose the software by trial and error, so the number of analysis man-hours cannot be reduced.

Further, in the conventional examples, when there are many tables of new analysis target data, it has not been possible to indicate which table the analysis should start from. That is, the conventional examples cannot distinguish data that requires man-hours (or labor) for analysis preprocessing, such as data cleansing, from data that does not, so analysis cannot be carried out with few man-hours.

The present invention has been made in view of the above problems, and aims to reduce the man-hours required to analyze data.

The present invention is an analysis support method in which a computer having a processor and a memory evaluates data to be analyzed. The method includes: a first step in which the computer reads a first data catalog that stores the definitions of the columns of the data to be analyzed and a second data catalog that defines the columns of the input data of analysis software that executes an analysis process; a second step in which the computer calculates, as the mapping accuracy, the similarity between the columns of the first data catalog and the columns of the second data catalog; and a third step in which the computer calculates the difficulty of analyzing the data to be analyzed with the analysis software, based on the mapping accuracy of the columns of the second data catalog that the analysis software uses.

Therefore, according to the present invention, analysis software to be applied to the analysis target data can be proposed based on the difficulty of converting the analysis target data into input data (a common data model), and the man-hours and labor required for analysis can be reduced.

FIG. 1 is a block diagram showing an example of a data analysis support system according to the first embodiment of the present invention. The remaining drawings of the first embodiment show, in order: a block diagram of an example of the analysis support server; a block diagram of an example of the functional elements of the analysis support program; an example of the analysis catalog; an example of the data source catalog; an example of the common data model catalog; an example of the required column management table; an example of the column mapping accuracy management table; an example of the analysis difficulty level management table; a flowchart of an example of the processing performed by the analysis support program; and an example of the analysis recommendation result confirmation screen.

The drawings of the second embodiment show, in order: a block diagram of an example of the data analysis support system; an example of the processing performed in the production planning period conversion of the ETL processing unit; a block diagram of an example of the functional elements of the analysis support program; an example of the analysis catalog; an example of the ETL catalog; an example of the ETL column mapping accuracy management table; an example of the data quality management table; an example of the analysis difficulty level management table; a flowchart of an example of the processing performed by the analysis support program; a flowchart of an example of the difficulty calculation processing; an example of the basis of the analysis difficulty; a flowchart of an example of the correction processing of the standard working time; a flowchart of an example of the recommendation processing; a flowchart of an example of the result confirmation screen processing; and an example of the result confirmation screen.

The drawings of the third embodiment show, in order: a block diagram of an example of the data analysis support system; a block diagram of an example of the functional elements of the analysis support program; an example of equipment alert data in event log format; an example of the alert code master; an example of an equipment alert in table format; an example of the past performance confirmation screen; an example of the alternative candidate presentation screen; a flowchart of an example of the processing of the column mapping accuracy calculation unit; and a flowchart of an example of the processing performed on the result confirmation screen.

Hereinafter, embodiments of the present invention will be described based on the attached drawings.

FIG. 1 is a block diagram showing an example of a data analysis support system according to the first embodiment of the present invention. The data analysis support system includes a data collection server 410 of factory A that collects data from production equipment, a data collection server 430 of factory B that collects data from production equipment, a data lake server 2 that stores the data of the data collection servers 410 and 430 of factories A and B as data sources together with data related to analysis, an analysis server group 300 that analyzes input data (converted data) converted according to the common data model catalog 33 of the data lake server 2, and an analysis support server 1 that proposes the analysis software (application or query) of the analysis server group 300 suited to analyzing a data source.

The data lake server 2 is connected to the analysis support server 1 via the network 51, to the analysis server group 300 via the network 52, and to the data collection servers 410 and 430 via the network 53.

The data collection server 410 of factory A stores data of the parts list 421, the manufacturing results 422, the equipment alert 423, the process and equipment master 424, the production plan 425, and the inventory results 426. The data collection server 430 of factory B stores data of the parts list 441, the manufacturing results 442, the production plan 443, and the equipment alert 444.

The parts lists 421 and 441 list the parts that constitute a product. The manufacturing results 422 and 442 store information on product manufacturing results. The production plans 425 and 443 store production schedules for parts and products, and the like. The equipment alerts 423 and 444 store alarms and errors from production equipment. The process and equipment master 424 stores information on production processes and on manufacturing equipment. The inventory results 426 store the inventory status of manufactured products.

The storage 20 of the data lake server 2 stores an analysis catalog 31 that collects the analysis software (applications and queries) used in past data analysis, a data source catalog 32 in which the columns of the data to be analyzed are set, and a common data model catalog 33 that defines the shared data model (common data model 60). The analysis catalog 31, the data source catalog 32, and the common data model catalog 33 are set in advance.

In addition, the storage 20 of the data lake server 2 stores, as the common data model 60, a parts list 61 that defines information on the parts constituting the products of factory A, a production plan 62 for the products of factory A, manufacturing results 63 for the products of factory A, and an equipment alert 64 from the production equipment of factory A.

The parts list 61 is a common data model in which the definition of the parts list 421 of factory A is preset. The production plan 62 is a common data model in which the definition of the production plan 425 of factory A is preset. The manufacturing results 63 are a common data model in which the definition of the manufacturing results 422 of factory A is preset. The equipment alert 64 is a common data model in which the definition of the equipment alert 423 of factory A is preset.

These common data models 60 include definitions that provide a database of general entities in management operations, such as production plans, manufacturing equipment, and equipment alerts. The analysis support server 1 reads the tables of the data collection servers 410 and 430 serving as data sources, converts their columns according to the common data model catalog 33, and then causes the analysis server group 300 to execute the analysis processing. The conversion of a data source table into a table that follows the definition of the common data model 60 may instead be performed by the analysis server group 300.
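The column conversion described above can be pictured with a minimal sketch. It is illustrative only: the function name `to_common_model`, the mapping dictionary, and the source column names are assumptions made for this example and do not come from the common data model catalog 33.

```python
from typing import Dict, List

# Hypothetical mapping: data source column name -> common data model column name.
COLUMN_MAP: Dict[str, str] = {
    "machine_no": "equipment ID",
    "timestamp": "date and time",
    "run_minutes": "operating time",
}


def to_common_model(rows: List[dict], column_map: Dict[str, str]) -> List[dict]:
    """Rename source columns to common data model columns.

    Columns without a mapping are dropped (an assumption of this sketch).
    """
    converted = []
    for row in rows:
        converted.append(
            {column_map[name]: value for name, value in row.items() if name in column_map}
        )
    return converted


# One made-up source row; "memo" has no counterpart in the common model.
source_rows = [
    {"machine_no": "M-01", "timestamp": "2018-09-10T08:00", "run_minutes": 45, "memo": "n/a"}
]
converted_rows = to_common_model(source_rows, COLUMN_MAP)
```

The harder part in practice is deciding what goes into the mapping dictionary, which is what the mapping accuracy and difficulty described later are meant to quantify.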

The first embodiment shows an example in which the analysis software (analysis processes) and the common data model 60 that were used when data was analyzed with the data collection server 410 of factory A as the data source are applied to data collected by the data collection server 430 of factory B.

The analysis server group 300 includes an analysis query issuance server 301 that analyzes converted data (input data) with analysis queries, a defect factor analysis server 302 that extracts defect factors by analyzing converted data converted according to the definition of the equipment alert 64, a production simulator server 303 that runs production simulations on converted data converted according to the definitions of the production plan 62, the parts list 61, and the like, and an asset sharing server 304 that allows production equipment to be shared among the factories.

That is, the analysis query issuance server 301 issues an analysis query to the database storing the input data to carry out the analysis. Further, the defect factor analysis server 302 carries out analysis by an analysis application. The production simulator server 303 carries out production simulation by simulation software (application).

The analysis server group 300 is not limited to these servers; it may include any computer that performs analysis, simulation, or evaluation using data of the data lake server 2 and the data collection servers 410 and 430. Likewise, the analysis software is not limited to the applications and queries described above; applications and queries suited to the analysis may be adopted.

In addition to the above data, the storage 20 stores the required column management table 34, the column mapping accuracy management table 35, the analysis difficulty level management table 36, and the recommendation result file 37, as shown in the figure.

The converted data is obtained by converting the columns of the table data of the data collection servers 410 and 430 of factories A and B into the definition of the common data model 60 according to the common data model catalog 33. It is stored in the storage 20 of the data lake server 2 and may also be stored in the analysis server group 300.

The analysis support server 1 runs an analysis support program 10 that receives the data source catalog 32 containing new analysis target data, calculates the difficulty of converting the analysis target data into converted data that conforms to the common data model 60 of the data lake server 2, and evaluates the analysis target data based on that difficulty. In the first embodiment, as an example of the evaluation performed by the analysis support program 10, the program proposes the analysis software and the analysis order best suited to analyzing the analysis target data. The analysis support server 1 also holds the catalog data 40 and the management table 50 used by the analysis support program 10.

The degree of difficulty in the first embodiment is an index indicating the amount of work of mapping that assigns a column of data to be analyzed to a column corresponding to the common data model 60, as described later. The analysis software of the analysis server group 300 can execute the analysis process with the converted data corresponding to the mapping of the columns of the common data model 60 as input data. For this reason, the operation of assigning the columns of the data source to the columns of the common data model 60 is the preprocessing of the analysis processing.

In the first embodiment, the analysis support server 1 calculates the amount of work required for this preprocessing as the difficulty of analysis, which makes it possible to judge how much work is involved in reusing analysis software from past analyses. In this way, for the analysis of a huge amount of data, the analysis support server 1 can propose which of the analysis processes used in past analyses to start with, and which analysis processes are possible.

FIG. 2 is a block diagram showing an example of the analysis support server 1. The analysis support server 1 is a computer that includes a CPU 3, a memory 4, a storage 5, a network interface (NI/F in the drawing) 6, and an interface 7 that connects a display 8, a keyboard 91, and a mouse 92.

The analysis support program 10 is loaded into the memory 4 and executed by the CPU 3. The storage 5 stores catalog data 40 and a management table 50.

FIG. 3 is a block diagram showing an example of the functional elements of the analysis support program 10. The analysis support program 10 includes, as functional elements, a required column calculation unit 11, a column mapping accuracy calculation unit 12, an analysis difficulty level calculation unit 13, and an analysis recommendation unit 15.

The required column calculation unit 11 reads the analysis catalog 31 of the data lake server 2 and generates or updates the required column management table 34. That is, from the relationships among the analysis software (application or query) used in past analyses, which is listed in the analysis catalog 31, the common data model 60 used by that analysis software, and the data source from which the common data model 60 was derived, which is listed in the data source catalog 32, the required column calculation unit 11 calculates the table names and column names needed for each analysis and stores them in the required column management table 34.

A well-known technique can be used for the extraction, performed by the required column calculation unit 11, of the relationships between the columns of the data source and the columns of the common data model 60; for example, a method such as the data lineage disclosed in Patent Document 1 may be applied. Alternatively, the required column management table 34 may be created in advance by the administrator or a user of the analysis support server 1 from the common data model 60 to be analyzed and the analysis software of the analysis server group 300.

The column mapping accuracy calculation unit 12 reads the data source catalog 32 including new analysis target data and the common data model catalog 33, and generates or updates the column mapping accuracy management table 35.

The column mapping accuracy calculation unit 12 calculates, for each column of the analysis target data, the similarity of its table and column to those of the common data model 60 as the mapping accuracy, and stores it in the column mapping accuracy management table 35. The similarity is calculated from the table names, column names, column values, column value ranges, column data formats, and the like of the analysis target data and the common data model 60.

The mapping accuracy indicates, column by column, the similarity between the table name and column name of the data source and the table name and column name of the common data model 60. A well-known technique, such as the schema matching and mapping disclosed in Non-Patent Document 1, may be applied to calculate the column-by-column similarity, so the calculation is not described in detail in this embodiment. In the present embodiment, a mapping accuracy closer to 1 indicates a higher similarity between a column of the analysis target data and a column of the common data model 60.
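Because the similarity calculation itself is left to known schema matching techniques, the following is only a stand-in sketch of what a name-based score in the range 0 to 1 could look like. It uses the standard library's `difflib.SequenceMatcher` rather than the method of Non-Patent Document 1, and the function names and the table/column weighting are assumptions made for this example.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two names (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mapping_accuracy(src_table: str, src_column: str,
                     dst_table: str, dst_column: str,
                     table_weight: float = 0.3) -> float:
    """Blend table-name and column-name similarity into one score.

    The 0.3 / 0.7 weighting is an assumption of this sketch, not part of
    the source document.
    """
    table_score = name_similarity(src_table, dst_table)
    column_score = name_similarity(src_column, dst_column)
    return table_weight * table_score + (1.0 - table_weight) * column_score


# Identical names score (approximately) 1; unrelated names score lower.
score_same = mapping_accuracy("production plan", "equipment ID",
                              "production plan", "equipment ID")
score_diff = mapping_accuracy("production plan", "equipment ID",
                              "parts list", "quantity")
```

A fuller implementation would also compare column values, value ranges, and data formats, as the text notes; those inputs are omitted here to keep the sketch short.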

The analysis difficulty level calculation unit 13 reads the column mapping accuracy management table 35 and the required column management table 34, calculates, for each analysis process, the difficulty of processing the new analysis target data with the analysis software used in the past, and stores the result in the analysis difficulty level management table 36. The analysis difficulty level calculation unit 13 includes an analysis difficulty level recalculation unit 14 that recalculates the difficulty each time the mapping accuracy is updated.

The difficulty in this embodiment indicates that the amount of work (time or effort) for the preprocessing of the analysis work (column mapping processing) decreases as the value approaches 1 and increases as the value approaches 0. Specifically, when the difficulty is close to 1, each column of the analysis target data can easily be assigned to a column of the common data model 60, and little time or effort is needed for the column mapping performed as preprocessing of the analysis target data.

On the other hand, when the difficulty is close to 0, it is not easy to assign each column of the analysis target data to a column of the common data model 60, and more time or effort is needed for the preprocessing (column mapping processing) of the analysis target data.

The analysis recommendation unit 15 outputs to the display 8 a result confirmation screen 81 that lists the recommended analysis software based on the difficulty of the analysis target data stored in the analysis difficulty level management table 36.
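One plausible way to turn the stored difficulty values into a recommendation list is to sort them so that the analyses needing the least preprocessing come first. The text does not specify the ordering rule, so the sort below is an assumption, as are the analysis names and values.

```python
from typing import List, Tuple

# (analysis name, difficulty); names and values are made up for illustration.
difficulties: List[Tuple[str, float]] = [
    ("analysis A", 0.42),
    ("analysis B", 0.69),
    ("analysis C", 0.10),
]


def rank_for_recommendation(entries: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order analyses from least to most preprocessing work.

    In this document a difficulty near 1 means little mapping work,
    so the entries are sorted by difficulty in descending order.
    """
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


ranked = rank_for_recommendation(difficulties)
```

Note the inverted scale: despite the name, a higher difficulty value marks an easier analysis, so a descending sort puts the easiest candidates at the top.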

Furthermore, the analysis recommendation unit 15 includes an analysis difficulty level basis display unit 16, which displays the basis on which the difficulty was calculated (the relationships between the analysis target data and the columns of the common data model 60), and a mapping determination unit 17, which adjusts the mapping accuracy. The analysis recommendation unit 15 stores in the recommendation result file 37 the recommendation result for the analysis software (analysis name) that processes the analysis target data. The analysis recommendation unit 15 also writes the adjusted mapping accuracy to the column mapping accuracy management table 35, so that the judgment of the user of the analysis support server 1 is reflected in that table.

With the above configuration, analysis software to be applied to the analysis target data can be proposed based on the difficulty of converting the analysis target data into converted data that conforms to the common data model 60, and the man-hours and labor required for analysis can be reduced.

The analysis target data for which the analysis support program 10 of the present embodiment calculates the difficulty is not limited to tables; data such as spreadsheets can also be used.

The functional units of the analysis support program 10, namely the required column calculation unit 11, the column mapping accuracy calculation unit 12, the analysis difficulty level calculation unit 13, and the analysis recommendation unit 15, are loaded into the memory 4 as programs.

The CPU 3 operates as a functional unit that provides a predetermined function by performing processing according to the program of each functional unit. For example, the CPU 3 functions as the analysis difficulty level calculation unit 13 by processing according to the analysis difficulty level calculation program. The same is true for other programs. Furthermore, the CPU 3 also operates as a functional unit that provides each function of a plurality of processes executed by each program. A computer and a computer system are devices and systems including these functional units.

FIG. 4 is a diagram showing an example of the analysis catalog 31. The analysis catalog 31 stores an overview of the analyses performed in the past.

The analysis catalog 31 includes, in one entry, an analysis ID 311, an analysis name 312, an application/query 313, an importance 314, a past effect 315, a standard duration 316, a required skill 317, a due date 318, and a completion flag 319.

The analysis ID 311 stores an identifier of the analysis software that performs the analysis. The analysis name 312 stores the name of the analysis. The application/query 313 stores the type of the analysis software. In this embodiment, the analysis software is either an application or a query. For an application, the name of the defect factor analysis application executed by the defect factor analysis server 302 of the analysis server group 300 is stored. For a query, the name of the query issued by the analysis query issuance server 301 is stored.

The importance 314 stores the importance of the analysis software. The importance 314 stores “H” (high), “M” (medium), or “L” (low). The past effect 315 stores the effect given by the analysis result of the analysis software. The past effect 315 stores either “H” (high), “M” (medium), or “L” (low).

The standard duration 316 stores the standard duration required for the analysis. The required skill 317 stores the skills that the person in charge of analysis needs in order to use the analysis software. This embodiment shows an example in which the name of the analysis software, the name of a software language, the field to be analyzed, and the like are stored as the required skill 317.

The due date 318 stores a due date for presenting the result of the analysis. The completion flag 319 stores information identifying whether or not the analysis has been completed.

FIG. 5 is a diagram showing an example of the data source catalog 32. The data source catalog 32 is a table in which information on the columns of the data sources to be analyzed is set in advance. The data source catalog 32 of the first embodiment covers the tables of the manufacturing results 442, the equipment alert 444, and the production plan 443 of the data collection server 430 of factory B.

The data source catalog 32 includes, in one entry, a column ID 321, an input data source name 322, a table name 323, a column name 324, a data type 325, a unit 326, a data range 327, and other attributes 328.

The column ID 321 stores an identifier for specifying a column of the data source. The input data source name 322 stores the name of the computer that provided the data source.

The table name 323 stores the name of the table of the data source. The column name 324 stores the name of the column. The data type 325 stores the format of data. The unit 326 stores the unit of data of the column. The data range 327 stores the range of values of the column. The other attribute 328 stores the attribute of the column.

The data source catalog 32 is generated in advance based on information from the data collection servers 410 and 430.

FIG. 6 is a diagram showing an example of the common data model catalog 33. The common data model catalog 33 is a table that stores information on each column of the common data model 60 (the parts list 61 through the equipment alert 64).

The common data model catalog 33 includes a column ID 331, a table name 332, a column name 333, a data type 334, a unit 335, a data range 336, and other attributes 337 in one entry.

The column ID 331 stores an identifier for specifying a table and a column of the common data model 60. The table name 332 stores the name of the table of the common data model 60 to which the column belongs.

The data type 334 stores the data format of the column. The unit 335 stores the unit of data of the column. The data range 336 stores the range of values of the column. The other attribute 337 stores the attribute of the column.

FIG. 7 is a diagram showing an example of the required column management table 34. The required column management table 34 is a table for specifying the columns of the common data model 60 that are used by the analysis software run on the analysis server group 300.

The required column management table 34 includes an analysis ID 341, an analysis name 342, a table name 343, a column name 344, and a required flag 345 in one entry.

The analysis ID 341 stores an identifier for specifying the analysis software implemented in the analysis server group 300. The analysis name 342 stores the name given to the analysis or the name of the analysis software. The table name 343 stores the table name of the common data model 60 used in the analysis.

The column name 344 stores the name of a column that stores converted data. The required flag 345 stores whether the column is a mandatory item or an optional item of the analysis processing of the analysis server group 300: "Yes" is stored for a mandatory item and "No" for an optional item.
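The entry layout of the required column management table 34 can be sketched as a small record type together with the filter that keeps only mandatory columns. The field names and the helper `required_columns` are assumptions made for this example, and the optional "remarks" row is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RequiredColumn:
    analysis_id: int    # analysis ID 341
    analysis_name: str  # analysis name 342
    table_name: str     # table name 343
    column_name: str    # column name 344
    required: bool      # required flag 345 ("Yes" -> True, "No" -> False)


def required_columns(table: List[RequiredColumn], analysis_id: int) -> List[RequiredColumn]:
    """Return only the mandatory columns of one analysis."""
    return [row for row in table if row.analysis_id == analysis_id and row.required]


NAME = "total operation time actual value for each facility"
rows = [
    RequiredColumn(1, NAME, "manufacturing results", "equipment ID", True),
    RequiredColumn(1, NAME, "manufacturing results", "date and time", True),
    RequiredColumn(1, NAME, "manufacturing results", "operating time", True),
    # Hypothetical optional column, included only to show the filter at work.
    RequiredColumn(1, NAME, "manufacturing results", "remarks", False),
]
mandatory = [row.column_name for row in required_columns(rows, 1)]
```

Storing the flag as a boolean rather than the strings "Yes" and "No" is a convenience of the sketch; the table in the document stores the strings.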

FIG. 8 is a diagram showing an example of the column mapping accuracy management table 35. The column mapping accuracy management table 35 stores the mapping accuracy (similarity), calculated by the analysis support program 10, between the columns of the data source and the columns of the common data model 60. The example of FIG. 8 shows the mapping accuracy 357 that the analysis support program 10 calculated against the common data model catalog 33, taking the data of the data collection server 430 of factory B (manufacturing results 442, production plan 443, and equipment alert 444) as new analysis target data.

The column mapping accuracy management table 35 includes, in one entry, a mapping ID 351, a data source name 352, a mapping source table name 353, a mapping source column name 354, a mapping destination table name 355, a mapping destination column name 356, and a mapping accuracy 357.

The mapping ID 351 stores an identifier for identifying the mapping accuracy. The data source name 352 stores the name of the data source that holds the analysis target data.

The mapping source table name 353 stores the name of the table of the analysis target data on the data source side. The mapping source column name 354 stores the name of the column of the analysis target data on the data source side.

The mapping destination table name 355 stores the name of the table after conversion into the definition of the common data model 60. The mapping destination column name 356 stores the name of the column after conversion into the definition of the common data model 60. The mapping accuracy 357 stores the similarity between the column given by the mapping source column name 354 and the column given by the mapping destination column name 356.

By referring to the mapping accuracy 357, the analysis support server 1 can obtain the schema-level similarity between the column given by the mapping source column name 354 and the column given by the mapping destination column name 356.
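A lookup against the column mapping accuracy management table 35 might be sketched as follows. The record fields mirror the entry layout described above, but the helper `best_accuracy`, the choice to take the highest score when several source columns map to the same destination, and the source column names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnMapping:
    mapping_id: int   # mapping ID 351
    data_source: str  # data source name 352
    src_table: str    # mapping source table name 353
    src_column: str   # mapping source column name 354
    dst_table: str    # mapping destination table name 355
    dst_column: str   # mapping destination column name 356
    accuracy: float   # mapping accuracy 357, in [0, 1]


def best_accuracy(table: List[ColumnMapping], dst_table: str, dst_column: str) -> Optional[float]:
    """Highest mapping accuracy recorded for a destination column, or None if absent."""
    scores = [m.accuracy for m in table
              if m.dst_table == dst_table and m.dst_column == dst_column]
    return max(scores) if scores else None


# Hypothetical entries: two candidate source columns for the same destination column.
mappings = [
    ColumnMapping(1, "factory B", "manufacturing results", "machine_no",
                  "manufacturing results", "equipment ID", 0.9),
    ColumnMapping(2, "factory B", "manufacturing results", "line_no",
                  "manufacturing results", "equipment ID", 0.4),
]
best = best_accuracy(mappings, "manufacturing results", "equipment ID")
```

Returning `None` for a destination column with no entry keeps "no candidate found" distinct from "a candidate with accuracy 0".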

FIG. 9 is a diagram showing an example of the analysis difficulty level management table 36. As shown in FIG. The analysis difficulty level management table 36 is a table generated by the analysis difficulty level calculation unit 13 of the analysis support program 10 for new analysis target data.

The analysis difficulty level management table 36 includes an analysis ID 361, an analysis name 362, an application / query 363, and a difficulty level 364 in one entry. The analysis ID 361 stores an identifier for specifying analysis software to be analyzed. The analysis name 362 stores the name of the analysis.

The application / query 363 stores the type of analysis software that is to be analyzed by the analysis server group 300. The difficulty level 364 stores the level of difficulty of each analysis software calculated by the analysis support program 10.

To calculate the difficulty level 364, the analysis support program 10 selects the table name 343 and the column name 344 for each analysis ID 341 of the required column management table 34, and acquires the mapping accuracy 357 of the entries of the column mapping accuracy management table 35 whose map destination table name 355 and map destination column name 356 match them. The analysis support program 10 selects only the entries of the column name 344 whose mandatory flag 345 is “Yes” and excludes the entries whose flag is “No”.

Then, when a plurality of column names 344 exist for one analysis ID 341 of the required column management table 34, the product of the mapping accuracies 357 of the map destination column names 356 corresponding to those column names 344 is calculated as the difficulty level 364.

For example, in the case of the analysis software with analysis ID 361 = "1" and analysis name 362 = "total operation time actual value for each facility", the analysis support program 10 selects, from the entries of the required column management table 34 with analysis ID 341 = "1", table name 343 = "Manufacturing results" and column names 344 = "equipment ID", "date and time", and "operating time".

Next, from the entries of the column mapping accuracy management table 35 with map source table name 353 = "Manufacturing results", the analysis support program 10 acquires mapping accuracy 357 = 0.9 for map destination column name 356 = "equipment ID", mapping accuracy 357 = 0.85 for map destination column name 356 = "date and time", and mapping accuracy 357 = 0.9 for map destination column name 356 = "operating time".

Then, the analysis support program 10 multiplies the mapping accuracies of the three acquired columns and calculates the difficulty level 364 as 0.9 × 0.85 × 0.9 = 0.6885 ≈ 0.69.
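The calculation in this worked example can be sketched as follows. This is a minimal illustration, not the implementation of the analysis support program 10: the in-memory lists and dictionaries stand in for the required column management table 34 and the column mapping accuracy management table 35, and the function name `difficulty_level`, the field names, and the optional "operator" column are assumptions introduced only for this sketch.

```python
from math import prod

# Hypothetical stand-in for the required column management table 34.
required_columns = [
    {"analysis_id": 1, "table": "Manufacturing results", "column": "equipment ID", "mandatory": True},
    {"analysis_id": 1, "table": "Manufacturing results", "column": "date and time", "mandatory": True},
    {"analysis_id": 1, "table": "Manufacturing results", "column": "operating time", "mandatory": True},
    # An optional column (mandatory flag "No") is excluded from the product.
    {"analysis_id": 1, "table": "Manufacturing results", "column": "operator", "mandatory": False},
]

# Hypothetical stand-in for the column mapping accuracy management table 35,
# keyed by (table name, map destination column name).
mapping_accuracy = {
    ("Manufacturing results", "equipment ID"): 0.9,
    ("Manufacturing results", "date and time"): 0.85,
    ("Manufacturing results", "operating time"): 0.9,
}


def difficulty_level(analysis_id, required_columns, mapping_accuracy):
    """Multiply the mapping accuracies of the mandatory columns of one analysis."""
    accuracies = [
        mapping_accuracy[(row["table"], row["column"])]
        for row in required_columns
        if row["analysis_id"] == analysis_id and row["mandatory"]
    ]
    return prod(accuracies)


result = difficulty_level(1, required_columns, mapping_accuracy)
print(round(result, 2))  # 0.69
```

Because the difficulty level is a product of values between 0 and 1, a single poorly matched mandatory column lowers the value for the whole analysis.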

FIG. 10 is a flowchart showing an example of the analysis support program 10 executed by the analysis support server 1. In the following description, although the analysis support program 10 is the subject of the processing, the analysis support server 1 or the CPU 3 may be the subject of the processing. This process is started when a data source catalog 32 including new analysis target data is received.

First, in the analysis support program 10, the required column calculation unit 11 reads the analysis catalog 31, calculates the columns of the common data model 60 that are input to the analysis software of the analysis server group 300 that executes the analysis, and writes them to the required column management table 34 (S1).

The required column calculation unit 11 acquires the table names and column names of the common data model 60 used by the analysis server group 300 in the past, and stores them in the table name 343 and the column name 344 of the required column management table 34. The required column calculation unit 11 determines whether each column is essential for the analysis or optional based on a query or a log of the analysis server group 300, and sets the mandatory flag 345 accordingly.

Next, the column mapping accuracy calculation unit 12 of the analysis support program 10 reads the data source catalog 32 and the common data model catalog 33, calculates the mapping accuracy 357 between the columns of the data sources (the tables of the data collection servers 410 and 430) and the columns of the common data model 60 (the columns of the common data model catalog 33), and writes it to the column mapping accuracy management table 35 (S2).

The column mapping accuracy calculation unit 12 calculates the similarity for each column by schema matching and mapping between the data schema of the tables on the side of the data collection servers 410 and 430 and the data schema of the common data model 60, as described above, and stores it in the mapping accuracy 357.
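As a rough illustration of how a per-column similarity could be produced, the sketch below compares column names with Python's standard `difflib.SequenceMatcher`. This is only a stand-in chosen for the example: the description does not specify the schema matching algorithm, so the string-similarity measure, the function names `name_similarity` and `best_matches`, and the sample column lists are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two column names (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(source_columns, model_columns):
    """For each data source column, pick the most similar common data model column."""
    rows = []
    for src in source_columns:
        dest = max(model_columns, key=lambda m: name_similarity(src, m))
        rows.append({
            "map_source_column": src,
            "map_destination_column": dest,
            "mapping_accuracy": name_similarity(src, dest),
        })
    return rows


# Hypothetical column lists for a data source table and a common data model table.
rows = best_matches(
    ["equipment ID", "date time"],
    ["equipment ID", "date and time", "operating time"],
)
for row in rows:
    print(row)
```

A practical schema matcher would typically also consider data types, value distributions, and table context; name similarity alone is used here only to keep the sketch short.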

Next, the analysis difficulty level calculation unit 13 reads the required column management table 34 and the column mapping accuracy management table 35, acquires the mapping accuracy 357 of the required columns for each analysis ID 341, calculates the difficulty level 364 as described above, and writes it to the analysis difficulty level management table 36 (S3).

Next, the analysis support program 10 causes the analysis / recommendation unit 15 to read the analysis difficulty level management table 36, sort the analysis IDs so that those with a larger value of the difficulty level 364 rank higher, and display them on the result confirmation screen 81 shown in FIG. 11 (S4).

A larger value of the difficulty level 364 indicates that the conversion and mapping from the data source to the common data model 60, which is the preprocessing of the analysis, is easier. The analysis / recommendation unit 15 therefore lists the analyses in the analysis list 811 of the result confirmation screen 81 in order of increasing preprocessing effort.
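The ordering performed in step S4 can be sketched as a simple descending sort. The rows below are a hypothetical stand-in for the analysis difficulty level management table 36; only the value 0.69 for analysis ID 1 comes from the worked example, and the other values and the function name `rank_analyses` are invented for illustration.

```python
# Hypothetical stand-in for the analysis difficulty level management table 36.
difficulty_table = [
    {"analysis_id": 1, "difficulty": 0.69},
    {"analysis_id": 2, "difficulty": 0.40},  # invented value
    {"analysis_id": 4, "difficulty": 0.86},  # invented value
]


def rank_analyses(rows):
    """Sort analyses so that the largest difficulty value (easiest preprocessing) comes first."""
    return sorted(rows, key=lambda row: row["difficulty"], reverse=True)


ranked = rank_analyses(difficulty_table)
print([row["analysis_id"] for row in ranked])  # [4, 1, 2]
```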

FIG. 11 is a diagram showing an example of the result confirmation screen 81 output by the analysis / recommendation unit 15 of the analysis support program 10. In the result confirmation screen 81, the upper part in the drawing is a display area of the analysis list 811, and the lower part in the drawing is a display area of the analysis difficulty level basis 812. The analysis difficulty level basis 812 is displayed after one line of the analysis list 811 is selected.

The analysis list 811 is a list in which each row includes an analysis ID, an analysis name, an application or query type, an importance, past effects, a difficulty level, and a completion flag, and it displays the contents of each analysis. The items of the analysis list 811 other than the difficulty level (364) are taken from the analysis catalog 31.

When the user of the analysis support server 1 operates the mouse 92 and clicks on the line of analysis ID = 4, the mapping accuracies between the common data model 60 and the data source ("0.9" and so on in the illustrated example) are displayed in the display area of the analysis difficulty level basis 812.

Further, below the mapping accuracies in the display area of the analysis difficulty level basis 812, a confirmation button 813 for confirming the correspondence between the columns of the common data model 60 and the columns of the data source is displayed. If the correspondence between a column of the common data model 60 and a column of the data source is valid, the user of the analysis support server 1 operates the confirmation button 813 to change the mapping accuracy of that column to 1.0 (100%).

In step S5 of FIG. 10, the analysis / recommendation unit 15 of the analysis support program 10 determines whether an end operation of the result confirmation screen 81 has been received. When the user of the analysis support server 1 performs an operation of closing the window on the result confirmation screen 81, the process ends. The analysis and recommendation unit 15 stores the recommendation result of analysis in the recommendation result file 37 when the process is ended. On the other hand, when there is no end operation, the process proceeds to step S6.

In step S6, the analysis / recommendation unit 15 determines whether the mouse 92 operated by the user on the result confirmation screen 81 has selected a row of the analysis list 811. If the line of the analysis list 811 is selected, the process proceeds to step S7. If not, the process returns to step S5 and waits for the operation of the mouse 92 or the keyboard 91.

In step S7, the analysis / recommendation unit 15 acquires from the column mapping accuracy management table 35 the mapping accuracies between the columns of the common data model 60 and the columns of the data source that serve as the basis for calculating the difficulty level, and displays them in the analysis difficulty level basis 812 of the result confirmation screen 81.

Next, in step S8, the analysis / recommendation unit 15 determines whether the confirmation button 813 in the analysis difficulty level basis 812 has been clicked with the mouse 92 operated by the user. If the confirmation button 813 has been operated, the process proceeds to step S9. If not, the process returns to step S3 and the above process is repeated.

In step S9, the analysis / recommendation unit 15 sets to 1.0 the mapping accuracy 357 between the column of the data source and the column of the common data model 60 selected with the confirmation button 813, and updates the corresponding entry of the column mapping accuracy management table 35.

After updating the column mapping accuracy management table 35, the analysis and recommendation unit 15 returns to step S3 to recalculate the difficulty 364 and repeat the above process.
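The feedback in steps S9 and S3 can be sketched as follows: a confirmed mapping is set to 1.0 and the difficulty level is recomputed as the product of the accuracies. The dictionary is a hypothetical stand-in for the entries of the column mapping accuracy management table 35 used by one analysis, and `confirm_mapping` is a name introduced only for this sketch.

```python
from math import prod


def confirm_mapping(mapping_accuracy, key):
    """Return a copy of the table in which the confirmed mapping has accuracy 1.0."""
    updated = dict(mapping_accuracy)
    updated[key] = 1.0
    return updated


# Hypothetical entries, keyed by (table name, map destination column name).
accuracy = {
    ("Manufacturing results", "equipment ID"): 0.9,
    ("Manufacturing results", "date and time"): 0.85,
    ("Manufacturing results", "operating time"): 0.9,
}

before = prod(accuracy.values())
confirmed = confirm_mapping(accuracy, ("Manufacturing results", "date and time"))
after = prod(confirmed.values())

print(round(before, 2), round(after, 2))  # 0.69 0.81
```

Confirming a mapping can only raise the difficulty value, so the analysis moves up (or stays in place) in the sorted analysis list after recalculation.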

By the above processing, when new analysis target data is set in the data source catalog 32, the analysis support server 1 calculates the mapping accuracy and the difficulty level, so that the user can see on the result confirmation screen 81 which analyses can be performed easily.

This makes it possible to propose an analysis that can be easily implemented and analysis software to be applied to the analysis target data based on the mapping difficulty level when converting the analysis target data to the common data model 60. Therefore, it is possible to reduce the time and effort required for analysis.

In addition, if the mapping in the column mapping accuracy management table 35 for the new analysis target data is valid, the user can operate the confirmation button 813 on the result confirmation screen 81 so that a mapping accuracy 357 of 1.0 (100%) is fed back to the column mapping accuracy management table 35.

Also, after feedback to the column mapping accuracy management table 35, the analysis support server 1 can recalculate the mapping accuracy 357 and the difficulty 364 to display a new analysis list 811.

As described above, the display of the analysis list 811 and the analysis difficulty level basis 812 allows the user of the analysis support server 1 to grasp, as the difficulty level, the effort required for the preprocessing of the analysis. In addition, the user can grasp how many columns require significant effort in the conversion from the data source to the common data model 60.

As described above, according to the first embodiment, the degree of difficulty can be calculated as an index indicating the magnitude of the amount of work of column mapping that is the preprocessing of the analysis processing, and the analysis target data that is the data source can be evaluated.

In this way, for a large amount of data and a variety of tables, the analysis support server 1 can propose, from among the analysis software used in past analyses, which analysis processing to start with and which analysis processing can be realized. In addition, by reusing analysis software used in the past, computer resources can be used effectively and the lead time of analysis processing can be significantly reduced.

Although the example in which the common data model 60 and the respective tables are stored in the data lake server 2 has been described in the first embodiment, these data may be stored in the analysis support server 1.

FIGS. 12 to 26 show Embodiment 2 of the present invention. The second embodiment shows an example in which column mapping accuracy and difficulty are calculated in consideration of an ETL (Extract, Transform, Load) catalog and data source quality, and analyses are proposed in order of work efficiency.

In the second embodiment, the analysis project management server 305, the ETL catalog 38, the ETL column mapping accuracy management table 39, the data quality management table 41, the skill set record 42, and the ETL processing unit 70 are added to the configuration of the first embodiment. The other configuration is the same as that of the first embodiment.

FIG. 12A is a block diagram showing an example of a data analysis support system. In the second embodiment, the ETL processing unit 70 is added to the data lake server 2, the analysis project management server 305 is added, and data held in the storage 20 of the data lake server 2 is added. The ETL processing unit 70 includes an equipment alert unit conversion 71, a production planning period conversion 72, and a production planning equipment name division 73, which the analysis support program 10 reads out as needed and causes the analysis support server 1 to execute.

FIG. 13 is a block diagram showing an example of functional elements of the analysis support program 10. In the second embodiment, the ETL catalog 38, the ETL column mapping accuracy management table 39, the data quality management table 41, and the skill set record 42 are added to the storage 20 of the first embodiment.

In addition, the analysis support program 10 adds the ETL column mapping accuracy calculation unit 121 to the column mapping accuracy calculation unit 12, and adds the data quality analysis unit 131 and the data quality analysis difficulty correction unit 132 to the analysis difficulty calculation unit 13. Then, the analysis scheduling unit 151 is added to the analysis and recommendation unit 15, and the cooperation interface 18 is added.

The collaboration interface 18 outputs the contents of the analysis project to the analysis project management server 305 in the form of a spreadsheet. The analysis project management server 305 receives the analysis project in the spreadsheet format by the analysis task fetch unit 306, and manages the analysis project.

In the second embodiment, the column mapping accuracy calculation unit 12 calculates the column mapping accuracy management table 35 from the data source catalog 32 and the common data model catalog 33 as in the first embodiment, and in addition the ETL column mapping accuracy calculation unit 121 generates the ETL column mapping accuracy management table 39 from the data source catalog 32 and the ETL catalog 38.

Then, for the column of the data source, one of the column mapping accuracy management table 35 and the ETL column mapping accuracy management table 39 having the higher mapping accuracy is used to calculate the difficulty level.

Further, in the analysis difficulty level calculation unit 13, the data quality analysis unit 131 reads the analysis target data of the data source catalog 32, analyzes the quality of the analysis target data, and generates the data quality management table 41.

The data quality analysis difficulty correction unit 132 corrects the mapping accuracy based on the quality of the analysis target data. The other configuration is the same as that of the first embodiment.

FIG. 14 shows an example of the analysis catalog 31. FIG. 14 differs from FIG. 4 of the first embodiment in that importance 314, past effect 315, standard required time 316, required skill 317, and due date 318 are set. Each of the columns from importance 314 to due date 318 can be set by the user or administrator of the analysis support server 1.

FIG. 15 shows an example of the ETL catalog 38. The ETL catalog 38 is a table in which the definitions of the elements of the ETL processing unit 70 are set in advance. The second embodiment shows an example in which the equipment alert unit conversion 71, the production planning period conversion 72, and the production planning equipment name division 73 are the elements. In the ETL catalog 38, a definition of the data to be extracted, a definition of the data conversion, and a definition of the common data model 60 that stores the converted data are set in advance for the analysis target data.

The ETL catalog 38 includes an ETL ID 381, an ETL name 382, an input table name 383, an input column name 384, an output table name 385, and an output column name 386 in one entry.

The ETL ID 381 stores an identifier for identifying an ETL. The ETL name 382 stores the name of the ETL (each element of the ETL processing unit 70). The input table name 383 stores the names of the tables of the data collection servers 410 and 430 serving as data sources. The input column name 384 stores the names of the columns in the tables of the data collection servers 410 and 430.

The output table name 385 stores the name of the common data model 60 table. In the output column name 386, the names of the columns in the table of the common data model 60 are stored.

The illustrated example shows that the two values of input column name 384 = "start time" in the entry with ETL ID 381 = "3" (ETL name 382 = "production planning period conversion") and input column name 384 = "end time" in the entry with ETL ID 381 = "4" are converted into one value of output column name 386 = “date and time” of the production plan 62 of the common data model 60. The specific content of the conversion is set in each element of the ETL processing unit 70.

The ETL catalog 38 associates one or more input column names 384 with output column names 386 and defines the conversion of values and data formats, so that columns that cannot be assigned by simple mapping can still be used as a data source once they are converted.

FIG. 12B shows an example of processing performed by ETL name 382 = “production planning period conversion” = production planning period conversion 72 as an example of the ETL processing unit 70. The analysis support server 1 reads and executes the production planning period conversion 72. The production planning period conversion 72 reads “start time” and “end time” designated by the input column name 384 from the production plan 443 designated by the input table 383 (S721). In this example, the production plan 443 of plant B is used as a data source to be newly added.

The production planning period conversion 72 executes a predetermined conversion on the read data (S722). In this example, the output column name 386 = “date and time” is calculated as “end time” − “start time”. Then, the production planning period conversion 72 stores the converted data in the production plan 62 of the common data model 60 specified by the output table name 385 (S723). The equipment alert unit conversion 71 and the production planning equipment name division 73 of the ETL processing unit 70 similarly convert the data source and store the result in the common data model 60.
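Steps S721 to S723 can be sketched as a small extract-transform-load function. The sketch assumes that the "start time" and "end time" columns hold ISO 8601 timestamp strings and that tables are plain lists of dictionaries; the function name, the sample rows, and the "equipment name" column are assumptions made for illustration and are not taken from the description.

```python
from datetime import datetime


def production_planning_period_conversion(source_rows):
    """Convert 'start time' and 'end time' into one 'date and time' value per row."""
    converted = []
    for row in source_rows:
        # S721: read the columns designated by the input column names.
        start = datetime.fromisoformat(row["start time"])
        end = datetime.fromisoformat(row["end time"])
        # S722: the output column is calculated as end time minus start time.
        converted.append({
            "equipment name": row["equipment name"],
            "date and time": end - start,
        })
    # S723: the caller stores the converted rows in the output table.
    return converted


# Hypothetical rows of the production plan table of the data source.
production_plan_source = [
    {"equipment name": "press-1", "start time": "2018-09-10T08:00:00", "end time": "2018-09-10T16:00:00"},
    {"equipment name": "press-2", "start time": "2018-09-10T09:30:00", "end time": "2018-09-10T12:00:00"},
]

production_plan_model = production_planning_period_conversion(production_plan_source)
for row in production_plan_model:
    print(row["equipment name"], row["date and time"])
```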

FIG. 16 is a diagram showing an example of the ETL column mapping accuracy management table 39. The ETL column mapping accuracy management table 39 is a table generated by the ETL column mapping accuracy calculation unit 121.

The ETL column mapping accuracy management table 39 includes an ETL mapping ID 391, a map source table name 392, a map source column name 393, a map destination ETL name 394, a map destination column name 395, and a mapping accuracy 396 in one entry.

The ETL mapping ID 391 stores an identifier for identifying a mapping accuracy entry. The map source table name 392 stores the table names of the data collection servers 410 and 430 serving as data sources. The map source column name 393 stores the column names in the tables of the data collection servers 410 and 430 serving as data sources.

In the map destination ETL name 394, a name corresponding to the ETL name 382 of the ETL catalog 38 is stored. In the mapping destination column name 395, a name corresponding to the output column name 386 of the ETL catalog 38 is stored. In the mapping accuracy 396, the mapping accuracy of the ETL calculated by the ETL column mapping accuracy calculation unit 121 is stored.

The illustrated example shows that, in the entry with ETL mapping ID 391 = “1”, for the facility alert 444 of the data collection server 430 stored in the map source table name 392, the value of map source column name 393 = “date and time” can be converted into time units and mapped to map destination column name 395 = “date and time (time unit)” of the equipment alert unit conversion 71 of the ETL processing unit 70, and that the mapping accuracy 396 of this mapping is 0.9.

FIG. 17 shows an example of the data quality management table 41. The data quality management table 41 is a table generated by the data quality analysis unit 131 of the analysis difficulty level calculation unit 13 with reference to the data source catalog 32 including analysis target data. The data quality management table 41 stores data quality for each data source column.

The data quality management table 41 includes a column ID 411, an input data source name 412, a table name 413, a column name 414, a number of nulls 415, an overlap 416, an outlier 417, a character number deviation 418, and an overall score 419 in one entry.

The column ID 411 stores an identifier for specifying a column of the data source. The input data source name 412 stores a name specifying a data collection server as a data source. In the table name 413, the name of the table of the data collection server as a data source is stored. The column name 414 stores the names of the columns included in the table serving as the data source.

The number of nulls 415 stores the ratio of records including null values in the column. The overlap 416 stores the ratio of records whose values overlap in the column. The outlier 417 stores the ratio of records whose values exceed a predetermined threshold in the column. The character number deviation 418 stores the ratio of records in which the number of characters deviates in the column. The overall score 419 stores a score calculated as the quality of the data source based on the values of the number of nulls 415, the overlap 416, the outlier 417, and the character number deviation 418.

In the second embodiment, an example is shown in which the overall score 419 representing the quality of the data is calculated as
overall score 419 = 1 − (number of nulls 415 + overlap 416 + outlier 417 + character number deviation 418).
Note that the calculation method of the overall score 419 is not limited to this, and the value of each of the fields from the number of nulls 415 to the character number deviation 418, which indicate the quality of the data source in the data quality management table 41, may be used.
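The score formula of this example can be sketched as below. The helper `null_ratio`, the sample column values, and the ratios passed for the overlap, outlier, and character number deviation fields are assumptions chosen for illustration; only the formula itself follows the description.

```python
def null_ratio(values):
    """Ratio of records in a column whose value is null (None)."""
    return sum(value is None for value in values) / len(values)


def overall_score(nulls, overlap, outlier, char_deviation):
    """Overall score = 1 - (nulls + overlap + outlier + character number deviation)."""
    return 1.0 - (nulls + overlap + outlier + char_deviation)


# Hypothetical column with one null value among ten records.
column_values = ["A01", None, "A03", "A04", "A05", "A06", "A07", "A08", "A09", "A10"]

score = overall_score(
    nulls=null_ratio(column_values),  # 0.1
    overlap=0.0,                      # assumed ratio
    outlier=0.1,                      # assumed ratio
    char_deviation=0.0,               # assumed ratio
)
print(round(score, 2))  # 0.8
```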

In the second embodiment, the closer the value of the overall score 419 is to 1.0, the higher the quality of the data, which can be analyzed as it is; the closer the value is to 0, the lower the quality of the data, which requires preprocessing such as cleansing of the data source.

That is, the quality indicated by the overall score 419 is an index of the amount of processing (time or effort) required for cleansing the data source. In the second embodiment, cleansing means, for example, detecting duplicates, errors, notation inconsistencies, and the like in the data source, and performing deletion, correction, normalization, and the like.

The second embodiment shows an example in which the preprocessing performed when analyzing the analysis target data includes two processes: mapping the columns of the analysis target data to the columns of the common data model 60 (column mapping processing), and cleansing the content of the analysis target data. The column mapping processing includes converting the value of the map source column name 354 into the value of the map destination column name 356 based on the ETL catalog 38.

When the analysis support program 10 receives the data source catalog 32, the data quality analysis unit 131 of the analysis difficulty level calculation unit 13 generates the data quality management table 41 before the column mapping accuracy calculation unit 12 calculates the mapping accuracy.

Then, as described later, the mapping accuracy (357, 396) is corrected based on the overall score 419 of the data quality management table 41.

FIG. 18 is a diagram showing an example of the analysis difficulty level management table 36. The analysis difficulty level management table 36 is a table generated by the analysis difficulty level calculation unit 13 of the analysis support program 10 for new analysis target data. The analysis difficulty level management table 36 of the second embodiment is obtained by adding the standard required time 365 and the corrected required time 366 to the analysis difficulty level management table 36 shown in FIG. 9 of the first embodiment. The rest of the configuration is the same as that of the first embodiment.

The analysis difficulty level management table 36 includes an analysis ID 361, an analysis name 362, an application / query 363, a difficulty level 364, a standard required time 365, and a corrected required time 366 in one entry.

The standard required time 365 stores the standard time required to complete the analysis. In the second embodiment, an example is shown in which the analysis difficulty level calculation unit 13 sets the standard required time 316 preset for each ID 311 of the analysis catalog 31 in the standard required time 365. The corrected required time 366 stores a value obtained by the data quality analysis difficulty correction unit 132 correcting the standard required time 365 according to the difficulty level 364.

Although not illustrated, the skill set record 42 of FIG. 13 stores, set in advance, the number of persons who perform analysis work and the skills of each person. The skill of each person is stored as a value corresponding to the required skill 317 in the analysis catalog 31.

FIG. 19 is a flowchart showing an example of processing performed by the analysis support program 10. This process is started after receiving the data source catalog 32, as in the first embodiment. In FIG. 19, it is assumed that the required column management table 34 has already been generated. Further, as described above, the data quality management table 41 has already been generated by the data quality analysis unit 131 of the analysis difficulty level calculation unit 13.

The column mapping accuracy calculation unit 12 of the analysis support program 10 reads the data source catalog 32 and the common data model catalog 33, calculates the mapping accuracy 357, and writes it to the column mapping accuracy management table 35 (S11). This process is the same as step S2 shown in FIG. 10 of the first embodiment: the mapping accuracy between the data source columns and the columns of the common data model 60 is calculated, and the column mapping accuracy management table 35 shown in FIG. 8 is generated.

Next, in the analysis support program 10, the ETL column mapping accuracy calculation unit 121 reads the data source catalog 32 and the ETL catalog 38, calculates the mapping accuracy, and writes it in the ETL column mapping accuracy management table 39 (S12).

The ETL column mapping accuracy calculation unit 121 obtains the table name 323 and the column name 324 from the data source catalog 32, searches the input table name 383 and the input column name 384 of the ETL catalog 38, and acquires the ETL name 382 and the output column name 386 of the matching entry.

Then, the ETL column mapping accuracy calculation unit 121 calculates the mapping accuracy of the input column name 384 and the output column name 386. After generating a new entry in the ETL column mapping accuracy management table 39, the ETL column mapping accuracy calculating unit 121 assigns a unique ETL mapping ID 391.

The ETL column mapping accuracy calculation unit 121 stores the calculated mapping accuracy in the mapping accuracy 396, the input table name 383 in the map source table name 392, the input column name 384 in the map source column name 393, the ETL name 382 in the map destination ETL name 394, and the output column name 386 in the map destination column name 395, thereby generating the ETL column mapping accuracy management table 39.

The ETL column mapping accuracy calculation unit 121 executes the above processing for all entries in the data source catalog 32. As a result, a column of a data source that cannot be used by simple mapping can be converted into the unit or data format of the map destination column name 395. In the ETL catalog 38, a definition can be set to consolidate one or more data source columns into one map destination column name 356, or to divide one data source column into a plurality of map destination column names 356.

Next, the analysis difficulty level calculation unit 13 of the analysis support program 10 calculates, for each analysis in the analysis catalog 31, the difficulty level in the case of performing analysis using the data of the data source catalog 32 (S13).

The analysis difficulty level calculation unit 13 selects the larger of the mapping accuracy 357 of the column mapping accuracy management table 35 and the mapping accuracy 396 of the ETL column mapping accuracy management table 39. If there is no entry in which the map source column name 354 of the column mapping accuracy management table 35 corresponds to the input column name 384 of the ETL catalog 38, the analysis difficulty level calculation unit 13 selects the value of the column mapping accuracy management table 35.

Then, the data quality analysis difficulty correction unit 132 of the analysis difficulty level calculation unit 13 corrects the selected mapping accuracy with the overall score 419 of the data quality management table 41, and then calculates, for each analysis ID, the difficulty level of the analysis processing for the analysis target data.

FIG. 20 is a flowchart showing an example of the process of calculating the degree of difficulty performed in step S13. First, in step S31, the analysis difficulty level calculation unit 13 reads the data quality management table 41, and acquires the total score 419 for each column name 414.

Next, in step S32, the analysis difficulty calculation unit 13 reads the column mapping accuracy management table 35 and the ETL column mapping accuracy management table 39, and compares the data source and the ETL columns.

That is, if the map source table name 353 and the map source column name 354 of the column mapping accuracy management table 35 match the map source table name 392 and the map source column name 393 of the ETL column mapping accuracy management table 39, the analysis difficulty level calculation unit 13 selects the larger of the mapping accuracy 357 and the mapping accuracy 396 of the ETL column mapping accuracy management table 39 as the mapping accuracy of that map source column name.
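The selection in step S32 can be sketched as a per-column maximum over the two tables, falling back to the column mapping accuracy when no ETL entry exists for the column. The dictionaries are hypothetical stand-ins for the column mapping accuracy management table 35 and the ETL column mapping accuracy management table 39, keyed by (map source table name, map source column name); the sample values and the function name are invented for illustration.

```python
def select_mapping_accuracy(column_accuracy, etl_accuracy):
    """Pick, for each source column, the larger of the direct and ETL mapping accuracies."""
    selected = {}
    for key, accuracy in column_accuracy.items():
        etl_value = etl_accuracy.get(key)
        # Without a matching ETL entry, the column mapping accuracy is used as is.
        selected[key] = accuracy if etl_value is None else max(accuracy, etl_value)
    return selected


# Hypothetical entries; the values are invented.
column_accuracy = {
    ("facility alert", "equipment ID"): 0.95,
    ("facility alert", "date and time"): 0.6,
}
etl_accuracy = {
    ("facility alert", "date and time"): 0.9,
}

selected = select_mapping_accuracy(column_accuracy, etl_accuracy)
print(selected)
```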

Next, in step S33, the analysis difficulty level calculation unit 13 acquires, for each analysis ID 311, the mapping accuracy of each column selected in step S32, corrects the mapping accuracy with the overall score 419 of each column name 414 acquired in step S31, and calculates the difficulty level.

Assuming that the number of columns included in the analysis ID 311 is n, the selected mapping accuracy of the i-th column is Si, and the overall score of the data quality management table 41 for that column is Ti, the difficulty D is expressed as
D = (S1 × T1) × (S2 × T2) × ... × (Sn × Tn).

Because the selected mapping accuracy S is corrected by multiplying it by the overall score T of data quality, the lower the data quality, the lower the value of the difficulty D, which indicates that more time and effort is required to preprocess (cleanse) the data source.
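The quality-corrected difficulty D can be sketched as a product over (S, T) pairs. The function name and the sample values are assumptions made for illustration; only the formula follows the description.

```python
from math import prod


def corrected_difficulty(columns):
    """Compute D = (S1 x T1) x ... x (Sn x Tn) from (mapping accuracy, overall score) pairs."""
    return prod(s * t for s, t in columns)


# Hypothetical (S, T) pairs for the required columns of one analysis.
high_quality = [(0.95, 1.0), (0.9, 0.9)]
low_quality = [(0.95, 1.0), (0.9, 0.5)]

d_high = corrected_difficulty(high_quality)
d_low = corrected_difficulty(low_quality)
print(round(d_high, 4), round(d_low, 4))  # 0.7695 0.4275
```

With the same mapping accuracies, the column with the lower overall score yields the smaller D, matching the statement that lower data quality means more preprocessing effort.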

Next, in step S34, the data quality analysis difficulty correction unit 132 of the analysis difficulty level calculation unit 13 corrects the value of the standard required time 316 of the analysis catalog 31 based on the difficulty level calculated in step S33.

Next, in step S35, the analysis difficulty level calculation unit 13 generates an analysis difficulty level management table 36. That is, the analysis difficulty level calculation unit 13 adds a new entry to the analysis difficulty level management table 36, and stores the analysis ID 311, the analysis name 312, and the application / query 313 of the analysis catalog 31 in the analysis ID 361, the analysis name 362, and the application / query 363, respectively.

Then, the analysis difficulty level calculation unit 13 stores the difficulty level calculated in step S33 in the difficulty level 364, stores the standard required time 316 of the analysis catalog 31 in the standard required time 365, stores the standard required time corrected in step S34 in the required time after correction 366, and ends the process.

FIG. 21 is a view showing an example of a display area of the analysis difficulty level basis 812 for explaining calculation of the difficulty level. FIG. 21 illustrates an example in which the difficulty level 364 of the analysis target data is calculated for the “total number of alerts per facility” analysis ID 311 = “4”.

The column mapping accuracy calculation unit 12 acquires the column name 344 = "equipment ID" and "date and time" from the equipment alert 64 of the common data model 60 from the necessary column management table 34. Further, the column mapping accuracy calculation unit 12 acquires the column name 324 = "equipment ID" and "date time" from the table name 323 = "equipment alert" from the data source catalog 32.

The column mapping accuracy calculation unit 12 calculates the mapping accuracy between the common data model 60 and the data source and, as shown in FIG. 8, obtains a mapping accuracy of 0.95 for “equipment ID” and a mapping accuracy of 0.9 for “date and time”.

The ETL column mapping accuracy calculation unit 121 selects “equipment alert date and time conversion”, which includes “date and time” in the input column name 384, from the ETL catalog 38, acquires the output column name 386 = “date and time (time unit)”, and calculates a mapping accuracy of 0.9.

The column mapping accuracy calculating unit 12 acquires the larger one of the mapping accuracy by the ETL and the mapping accuracy by the common data model 60. As a result, the selected mapping accuracy is “equipment ID” = 0.95 and “date and time” = 0.9.

Next, in the analysis difficulty level calculation unit 13, the data quality analysis difficulty level correction unit 132 reads the total score 419 from the data quality management table 41 and obtains “equipment ID” = 0.98 and “date time” = 1.0.

The data quality analysis difficulty correction unit 132 corrects the mapping accuracy with the overall score 419 to calculate the difficulty 364. That is, the degree of difficulty = (0.95 × 0.98) × (1.0 × 0.9) = 0.8379.
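The worked example above can be reproduced with a short calculation. The following Python fragment is a minimal sketch, assuming the selected mapping accuracies and the total scores are supplied as two sequences in the same column order; the function name `difficulty` is hypothetical.

```python
import math
from typing import Sequence


def difficulty(
    mapping_accuracies: Sequence[float],
    quality_scores: Sequence[float],
) -> float:
    """Multiply (S_i x T_i) over all columns used by one analysis."""
    if len(mapping_accuracies) != len(quality_scores):
        raise ValueError("one quality score is needed per mapped column")
    return math.prod(s * t for s, t in zip(mapping_accuracies, quality_scores))


# Columns in order: "equipment ID", "date and time" (values from FIG. 21).
d = difficulty([0.95, 0.9], [0.98, 1.0])
print(round(d, 4))  # 0.8379
```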

FIG. 22 is a flowchart showing an example of the standard required time correction process. This process is performed by the analysis difficulty level calculation unit 13 in step S34.

In step S41, the analysis difficulty level calculation unit 13 reads the column mapping accuracy management table 35, and if the difficulty level exceeds 0.8, the process proceeds to step S47, and the standard required time 316 is stored as it is in the corrected required time 366.

In step S42, if the analysis difficulty level calculation unit 13 determines that the difficulty level is 0.8 or less and 0.6 or more, the process proceeds to step S46, the correction coefficient is set to 1.2, and the value obtained by multiplying the standard required time 316 by 1.2 is stored in the corrected required time 366.

In step S43, if the analysis difficulty level calculation unit 13 determines that the difficulty level is less than 0.6 and 0.4 or more, the process proceeds to step S45, the correction coefficient is set to 1.5, and the value obtained by multiplying the standard required time 316 by 1.5 is stored in the corrected required time 366.

In step S44, since the difficulty level is less than 0.4, the analysis difficulty level calculation unit 13 sets the correction coefficient to 2 and stores a value obtained by multiplying the standard required time 316 by 2 in the corrected required time 366.
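The threshold logic of steps S41 to S47 can be summarized in a few lines. The sketch below is illustrative only; the function names are hypothetical, and the required time is treated as a plain number in whatever unit the analysis catalog 31 uses.

```python
def correction_coefficient(difficulty: float) -> float:
    """Map a difficulty value to the correction coefficient of steps S41 to S47."""
    if difficulty > 0.8:
        return 1.0   # standard required time is used as it is
    if difficulty >= 0.6:
        return 1.2
    if difficulty >= 0.4:
        return 1.5
    return 2.0


def corrected_required_time(standard_required_time: float, difficulty: float) -> float:
    """Multiply the standard required time by the coefficient for the difficulty."""
    return standard_required_time * correction_coefficient(difficulty)
```

For the difficulty of 0.8379 obtained in the example of FIG. 21, the coefficient is 1.0 and the standard required time is left unchanged.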

By the above processing, the higher of the mapping accuracy between the common data model 60 and the data source and the mapping accuracy between the ETL catalog 38 and the data source is selected, and the product of the selected mapping accuracies, each corrected by the overall score 419 of data quality, is calculated as the degree of difficulty 364 of performing the analysis of the analysis ID on the data source.

As a result, the higher the overall data quality score 419 is, the higher the value of the difficulty 364 is, and the effort required for preprocessing (cleansing) of the data source is reduced. Conversely, the lower the overall data quality score 419 is, the smaller the value of the difficulty 364 is, and the effort required to preprocess the data source increases.

Further, a correction coefficient is set for the standard required time 316 according to the difficulty level 364, and the correction coefficient becomes larger as the value of the difficulty level 364 becomes lower. As a result, as the value of the difficulty 364 decreases, the time or effort required for preprocessing such as data cleansing increases, and the required time is also corrected to increase.

Next, in step S14 of FIG. 19, the analysis / recommendation unit 15 of the analysis support program 10 sorts the analysis difficulty level management table 36 in descending order of the degree of difficulty, and then, taking the delivery date 318 into account as described later, selects the analysis processes as recommendation targets in order.

In step S15, the analysis and recommendation unit 15 displays the analysis process (analysis name) selected in step S14 on the display 8 as a result confirmation screen 81. In step S16, the analysis and recommendation unit 15 determines whether the mapping from the data source to the common data model has been determined on the result confirmation screen 81. If the determination button 813 is clicked and the mapping from the data source to the common data model is determined, the process proceeds to step S17. If the mapping is not determined, the process proceeds to step S18.

In step S17, the mapping determination unit 17 of the analysis and recommendation unit 15 updates the column mapping accuracy management table 35 by setting the mapping accuracy 357 corresponding to the mapping for which the determination button 813 is clicked to 1.0. Thereafter, the process returns to step S13, and recalculation of the difficulty level 364 is performed.
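The effect of confirming a mapping in step S17 and recalculating in step S13 can be illustrated as follows. This is a sketch under stated assumptions: the per-column dictionaries and the function names `difficulty` and `confirm_mapping` are hypothetical, and the numeric values are those of the example of FIG. 21.

```python
import math
from typing import Dict


def difficulty(accuracy: Dict[str, float], quality: Dict[str, float]) -> float:
    """Product of (mapping accuracy x data quality score) over the columns."""
    return math.prod(accuracy[col] * quality[col] for col in accuracy)


def confirm_mapping(accuracy: Dict[str, float], column: str) -> Dict[str, float]:
    """Return a copy in which the confirmed column has mapping accuracy 1.0."""
    updated = dict(accuracy)
    updated[column] = 1.0
    return updated


accuracy = {"equipment ID": 0.95, "date time": 0.9}
quality = {"equipment ID": 0.98, "date time": 1.0}

before = difficulty(accuracy, quality)
after = difficulty(confirm_mapping(accuracy, "equipment ID"), quality)
```

Confirming the “equipment ID” mapping raises the difficulty value from about 0.8379 to about 0.882, because the factor 0.95 is replaced by 1.0.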

In step S18, when the analysis / recommendation unit 15 detects the end of the display of the result confirmation screen 81, the process ends. If not, the process returns to step S16 and the operation of the confirmation button 813 is accepted.

As a result of the above-described process, the result confirmation screen 81 displays the analyses in descending order of the value of the difficulty 364. That is, since an analysis requiring less time and labor for preprocessing is displayed higher, the number of man-hours required for data analysis can be reduced by carrying out the analyses from the top.

FIG. 23 is a flowchart showing an example of the recommendation process performed by the analysis and recommendation unit 15. This process is performed in step S14 of FIG. 19. In step S51, the analysis and recommendation unit 15 sorts the entries of the analysis difficulty level management table 36 in descending order of the value of the difficulty level 364.

Next, in step S52, the analysis scheduling unit 151 of the analysis and recommendation unit 15 refers to the analysis catalog 31, the skill set record 42, and the analysis difficulty level management table 36, and assigns personnel and analysis software to each analysis by forward scheduling.

The analysis scheduling unit 151 acquires the analysis ID 361 in descending order of the value of the difficulty 364, and acquires the necessary skill 317 and the due date 318 from the analysis catalog 31. The analysis scheduling unit 151 acquires the post-correction required time 366 corresponding to the analysis ID 361 from the analysis difficulty level management table 36.

The analysis scheduling unit 151 selects personnel who satisfy the necessary skill 317 from the skill set record 42, and performs forward scheduling so as to satisfy the corrected required time 366 and the due date 318. A known or well-known technique may be applied to the forward scheduling.

Next, in step S53, the analysis scheduling unit 151 refers to the scheduling result and determines whether all the analyses in the analysis difficulty level management table 36 complete within the due date 318. If all the analyses are within the delivery date 318, the process is ended, and if there is an analysis exceeding the delivery date 318, the process proceeds to step S54.

In step S54, the analysis scheduling unit 151 determines whether the number of scheduling recalculations (the number of trials) has reached a predetermined threshold. If the number of recalculations is equal to or greater than the predetermined threshold, the process proceeds to step S55, and the analysis scheduling unit 151 outputs an error message indicating that the delivery date will be delayed.

On the other hand, if the number of recalculations is less than the threshold, the process proceeds to step S56, the analysis scheduling unit 151 raises the rank of the analysis ID 361 exceeding the due date 318 by one to change the recommendation rank, returns to step S52, and repeats the above process.

By the above process, the analysis process of the analysis difficulty level management table 36 is scheduled so that the value of the difficulty level 364 is in the descending order and the delivery date 318 is satisfied.
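The loop of steps S51 to S56 can be sketched as below. This is a deliberately simplified illustration and not the scheduling of the embodiment: it assumes a single person who executes the analyses back to back, omits the matching of the necessary skill 317 against the skill set record 42, expresses the due date as an offset from the start of the schedule, and promotes only the first late analysis per retry. The names `Analysis`, `first_late_index`, `schedule`, and `max_retries` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Analysis:
    analysis_id: str
    difficulty: float      # difficulty 364 (higher means less preprocessing)
    required_time: float   # post-correction required time 366
    due: float             # due date 318 as an offset from the schedule start


def first_late_index(order: List[Analysis]) -> Optional[int]:
    """Forward-schedule the analyses back to back; return the first late one."""
    clock = 0.0
    for index, analysis in enumerate(order):
        clock += analysis.required_time
        if clock > analysis.due:
            return index
    return None


def schedule(analyses: List[Analysis], max_retries: int = 10) -> List[Analysis]:
    # Step S51: sort in descending order of the difficulty value.
    order = sorted(analyses, key=lambda a: a.difficulty, reverse=True)
    retries = 0
    while True:
        late = first_late_index(order)  # steps S52 and S53
        if late is None:
            return order
        # Steps S54 and S55: report an error once the threshold is reached.
        # An analysis already ranked first cannot be promoted (assumption).
        if retries >= max_retries or late == 0:
            raise RuntimeError(
                f"analysis {order[late].analysis_id} exceeds its due date"
            )
        # Step S56: raise the rank of the late analysis by one and retry.
        order[late - 1], order[late] = order[late], order[late - 1]
        retries += 1
```

For example, with three hypothetical analyses A (difficulty 0.9, required time 4, due 10), B (0.8, 4, 10), and C (0.5, 2, 6), the initial order A, B, C makes C late, and one promotion yields the order A, C, B, which satisfies every due date.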

FIG. 25 is a view showing an example of the result confirmation screen 81 generated by the analysis and recommendation unit 15. In the result confirmation screen 81, the upper part in the drawing is a display area of the analysis list 811, and the lower part in the drawing is a display area of the analysis difficulty level basis 812. The analysis difficulty level basis 812 is displayed after one line of the analysis list 811 is selected.

The analysis list 811 is a list in which each line contains a check box, an analysis ID, an analysis name, an application or query type, a difficulty, a required time after correction, an end schedule, and a completion flag, and displays the contents of the analyses. The end schedule is determined based on the result of scheduling, and the other items are set to the values of the analysis difficulty level management table 36 or the values of the analysis catalog 31.

At the upper right of the analysis list 811, an export button 815 and a reschedule button 816 are arranged. When a check box is selected and then the export button 815 is clicked, the analysis content of the line for which the check box is selected is output in a predetermined file format (for example, CSV format) via the cooperation interface 18.
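The export of checked lines in CSV format can be sketched with the Python standard library. The field names, the function name `export_selected`, and the representation of a line as a dictionary are assumptions made for illustration; the cooperation interface 18 is not modeled, and the text is simply returned as a string.

```python
import csv
import io
from typing import Dict, Iterable, List, Set

# Hypothetical field names for the items shown in one line of the analysis list.
FIELDS: List[str] = [
    "analysis_id",
    "analysis_name",
    "application_or_query",
    "difficulty",
    "corrected_required_time",
    "end_schedule",
    "completed",
]


def export_selected(rows: Iterable[Dict[str, object]], checked_ids: Set[str]) -> str:
    """Return CSV text containing only the lines whose check box is selected."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["analysis_id"] in checked_ids:
            writer.writerow(row)
    return buffer.getvalue()
```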

In addition, by selecting a check box and clicking a reschedule button 816, it is possible to perform scheduling again for the selected row.

In addition to the configuration of the first embodiment, an ETL catalog name 814 is added to the analysis difficulty level basis 812. When the analysis difficulty level calculation unit 13 selects the mapping accuracy of the ETL catalog 38, the ETL catalog name 814 is displayed.

At the bottom of the illustrated analysis difficulty level basis 812, an example is shown in which the data quality score of each column of the data source is displayed. The data quality score takes a value in the range of 0 to 1, and the closer it is to 1, the fewer duplicates or missing values the data contains. The larger the value of the data quality score, the less effort is required for preprocessing before the analysis.

FIG. 24 is a flowchart showing an example of the result confirmation screen process performed by the analysis and recommendation unit 15. This process is performed in step S15 of FIG. 19.

In step S61, the analysis / recommendation unit 15 reads the analysis difficulty level management table 36, generates the result confirmation screen 81, and displays the contents of the analyses in the scheduled order.

In the analysis list 811, each line contains a check box, an analysis ID, an analysis name, an application or query, a difficulty level, a corrected required time, an end schedule (delivery date 318), and a completion flag, and the contents of the analyses are displayed.

In step S62, the analysis and recommendation unit 15 determines whether the user of the analysis support server 1 operates the mouse 92 to select one row. If a row is selected, the process proceeds to step S63. If not, the process proceeds to step S64.

In step S63, the analysis / recommendation unit 15 acquires the mapping accuracy of the row selected in the analysis list 811 and the information of the map source and the map destination from the column mapping accuracy management table 35 or the ETL column mapping accuracy management table 39, and outputs them to the display area of the analysis difficulty level basis 812.

In step S64, the analysis and recommendation unit 15 determines whether the user of the analysis support server 1 operates the mouse 92 to select the export button 815. If the export button 815 is selected, the process proceeds to step S65. If not, the process proceeds to step S66.

In step S65, the analysis and recommendation unit 15 outputs the content of the analysis selected by the check box of the analysis list 811 in a predetermined file format.

In step S66, the analysis and recommendation unit 15 determines whether the user of the analysis support server 1 operates the mouse 92 to select the reschedule button 816 or not. If the re-scheduling button 816 is selected, the process proceeds to step S67. If not, the process proceeds to step S68.

In step S67, the analysis scheduling unit 151 of the analysis and recommendation unit 15 performs scheduling again for the content of the analysis selected by the check box of the analysis list 811. Thereafter, the process returns to step S61, and the contents of the analysis list 811 are updated.

In step S68, the analysis and recommendation unit 15 determines whether the user of the analysis support server 1 operates the mouse 92 to select the confirmation button 813. If the confirmation button 813 is selected, the process proceeds to step S69, and if not, the process proceeds to step S70.

In step S69, the process returns to step S13 of FIG. 19.

In step S70, the analysis and recommendation unit 15 determines whether the user of the analysis support server 1 operates the mouse 92 and selects the close box of the result confirmation screen 81. If the close box is selected, the process ends. If not, the process returns to step S61 to repeat the above process.

By the above processing, it is possible to display the analysis difficulty level basis 812 on the result confirmation screen 81 and to perform rescheduling, updating of the mapping accuracy, recalculation of the difficulty level, and the like.

As described above, in the second embodiment, the column mapping accuracy and the difficulty level can be calculated in consideration of the ETL catalog 38 and the quality of the data source, and the analysis software can be proposed in the order of good work efficiency.

FIGS. 26 and 27 show an example of the third embodiment. The third embodiment shows an example in which, in addition to the configuration of the second embodiment, data collection servers 450 and 460 having an event log as a data source are added, an event log-table conversion unit 122 is added to the column mapping accuracy calculation unit 12 of the analysis support program 10, and the alert code master 43 is added to the storage 20. The other configuration is the same as that of the second embodiment.

FIG. 26 is a block diagram showing an example of a data analysis support system. The data collection server 450 of the area A and the data collection server 460 of the area B collect traffic related data. The data collection server 450 of the area A collects vehicle data 451, operation data 452, track maintenance data 453, facility maintenance results 454, weather data 455, and facility alert 456, and provides them to the analysis server group 300 as a data source.

Similarly, the data collection server 460 of the area B collects the vehicle data 461, the operation data 462, the equipment maintenance result 463, and the equipment alert 464 and provides the data to the analysis server group 300 as a data source.

In the storage 20 of the data lake server 2, vehicle data 61A, operation data 62A, maintenance data 63A, and equipment alert 64A are preset in the common data model 60.

FIG. 27 is a block diagram showing an example of functional elements of the analysis support program. An event log-table converter 122, which converts an event log into a table format based on the data source catalog 32 and the alert code master 43, is added to the column mapping accuracy calculation unit 12 of the analysis support program 10. The other configuration is the same as that of the second embodiment.

FIG. 28 is a diagram showing an example of the facility alerts 456 and 464 in the event log format. The facility alerts 456 and 464 are composed of data including date, time, importance, alert ID, facility name, vehicle number, and message on one line.

FIG. 29 shows an example of the alert code master 43. The alert code master 43 includes an alert ID 431 and a message 432 in one entry. The message 432 includes date, time, importance, alert ID, equipment name, vehicle number, and message.

FIG. 30 is a diagram showing an example of the equipment alert 456T converted into the table format. The facility alert 456T is a result of converting the facility alert 456 in the event log format into a table format by the event log-table converter 122 of the analysis support program 10.

The facility alert 456T includes the date and time 4561, the degree of importance 4562, the alert ID 4563, the facility name 4564, the vehicle number 4565, and the message 4566 in one entry.
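The conversion from one event log line into one table entry can be sketched as follows. The description does not specify the log syntax, so this fragment assumes, purely for illustration, that the seven items are separated by whitespace, that the facility name contains no spaces, and that the date and time use the format shown. The function name `log_line_to_row`, the key names, and the sample line are hypothetical, and the lookup in the alert code master 43 is omitted.

```python
from datetime import datetime
from typing import Dict


def log_line_to_row(line: str) -> Dict[str, object]:
    """Split one event log line into the columns of the table-format alert."""
    # Assumed layout: date time importance alert_id facility vehicle message...
    date, time, importance, alert_id, facility, vehicle, message = line.split(
        maxsplit=6
    )
    return {
        # Date and time are merged into the single "date and time" column.
        "date_time": datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        "importance": importance,
        "alert_id": alert_id,
        "facility_name": facility,
        "vehicle_number": vehicle,
        "message": message,
    }


row = log_line_to_row("2017-12-18 09:30:00 high A001 door 1201 door sensor fault")
```

A line with fewer than seven items raises `ValueError` in this sketch; how malformed lines are actually handled is not described in the embodiment.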

Because the event log-table converter 122 converts data in the event log format into a table format, the equipment alert 64A of the common data model 60 can be used.

FIG. 31 is a view showing an example of the past result confirmation screen 83 generated by the analysis and recommendation unit 15. The analysis / recommendation unit 15 outputs the past results confirmation screen 83 when a predetermined operation (for example, a double click or the like) is performed in the display area of the analysis difficulty level basis 812 of the result confirmation screen 81.

The past performance confirmation screen 83 includes a window 84 for displaying the column mapping of the currently selected analysis, and a window 85 for displaying past performance. In the past results confirmation screen 83, a past results relationship display button 831, a previous results button 834, and a next results button 833 are arranged.

When the past performance relationship display button 831 is clicked, the analysis and recommendation unit 15 displays the recommendation results displayed in the past for the analysis ID of the window 84. The analysis and recommendation unit 15 refers to the recommendation result file 37, acquires the recommendation result of the analysis ID of the window 84, and generates the window 85.

When the previous result button 834 is clicked, the analysis and recommendation unit 15 goes back through the recommendation results displayed in the past for the analysis ID of the window 84. When the next result button 833 is clicked, the analysis and recommendation unit 15 moves from an older recommendation result toward a more recent one for the analysis ID of the window 84.

Another candidate button 832 is arranged near the display position of the mapping accuracy of the window 84. When the other candidate button 832 is clicked, the analysis and recommendation unit 15 outputs the other candidate presentation screen 86 shown in FIG. 32. FIG. 32 is a view showing an example of the other candidate presentation screen 86 generated by the analysis and recommendation unit 15.

The other candidate presentation screen 86 displays the contents of the column mapping accuracy management table 35 and the ETL column mapping accuracy management table 39 together with each column mapping accuracy, and a combination of column mappings can be selected by clicking the select button.

FIG. 33 is a flowchart showing an example of processing of the event log-table converter 122. This process is executed when generating the column mapping accuracy management table 35.

First, in step S81, the event log-table conversion unit 122 reads the alert code master 43, then reads the facility alert 456 in the event log format, and converts it into the facility alert 456T in the table format.

In step S82, the column mapping accuracy calculation unit 12 reads the data source catalog 32 and the common data model catalog 33, calculates the column mapping accuracy as described above, and generates the column mapping accuracy management table 35.

FIG. 34 is a flowchart showing an example of processing of the result confirmation screen 81 generated by the analysis and recommendation unit 15. This process is obtained by adding steps S101 to S104 to the flowchart of FIG. 24 of the second embodiment, and the other configuration is the same as that of FIG. 24.

Steps S61 to S67 are the same as in the second embodiment. If it is determined in step S66 that the reschedule button 816 is not selected, the process proceeds to step S101.

In step S101, the analysis / recommendation unit 15 determines whether a request for display of past results has been received. As described above, the request for display of past results is received when a double click or the like is performed in the display area of the analysis difficulty level basis 812. When receiving the request for displaying the past results, the analysis and recommendation unit 15 proceeds to step S102 and displays the past results confirmation screen 83.

In step S103, the analysis and recommendation unit 15 determines whether or not the other candidate button 832 is selected on the past record confirmation screen 83. When the other candidate button 832 is selected, the process proceeds to step S104, and the analysis and recommendation unit 15 outputs the other candidate presentation screen 86. If the other candidate button 832 is not selected, the process proceeds to step S68, and the same process as that of the second embodiment is repeated.

As described above, in the third embodiment, data in the event log format can also be handled in the same manner as the table format of the first and second embodiments, and analysis software can be recommended according to the degree of difficulty of analysis. In addition, past recommendation results and other candidates can be referred to from the result confirmation screen 81, so that planning of the analysis processing can proceed smoothly.

The present invention is not limited to the embodiments described above, but includes various modifications. For example, the embodiments described above are described in detail in order to illustrate the present invention in an easy-to-understand manner, and are not necessarily limited to those having all the configurations described. Also, part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. In addition, addition, deletion, or replacement of other configurations may be applied singly or in combination with some of the configurations of the respective embodiments.

Further, each of the configurations, functions, processing units, processing means, and the like described above may be realized by hardware, for example, by designing part or all of them with an integrated circuit. In addition, each configuration, function, and the like described above may be realized by software by a processor interpreting and executing a program that realizes each function. Information such as programs, tables, and files for realizing each function can be placed in a memory, a hard disk, a recording device such as a solid state drive (SSD), or a recording medium such as an IC card, an SD card, or a DVD.

Further, control lines and information lines indicate what is considered to be necessary for the description, and not all control lines and information lines in the product are necessarily shown. In practice, almost all configurations may be considered to be mutually connected.

Claims

An analysis support method in which a computer having a processor and a memory evaluates data to be analyzed, the method comprising:
A first step in which the computer reads a first data catalog storing the definition of the columns of the data to be analyzed and a second data catalog defining the columns of the input data of analysis software that executes an analysis process;
A second step in which the computer calculates the similarity between the columns of the first data catalog and the columns of the second data catalog as a mapping accuracy; and
A third step in which the computer calculates the degree of difficulty of analyzing the data to be analyzed by the analysis software based on the mapping accuracy of the columns of the second data catalog used in the analysis software.
The analysis support method according to claim 1, wherein
The computer further includes a fourth step of outputting information of the analysis software corresponding to the difficulty level,
The third step is
The degree of difficulty is calculated for each of the analysis software with reference to an analysis catalog storing information of one or more analysis software,
The fourth step is
An analysis support method comprising: sorting the calculated difficulty levels in a predetermined order; and outputting information of the analysis software corresponding to the difficulty levels.
The analysis support method according to claim 2,
The third step is
An analysis support method characterized in that required column management information for identifying the columns of the input data used by the analysis software is referred to, the columns used by the analysis software are acquired, and the difficulty level is calculated from the mapping accuracy corresponding to the acquired columns.
The analysis support method according to claim 2,
The second step is
Reading a third data catalog storing a definition for converting a column of the data to be analyzed into a column of the input data, together with the first data catalog, and calculating the similarity between the columns of the first data catalog and the columns of the third data catalog as an ETL column mapping accuracy,
The third step is
An analysis support method comprising: calculating the difficulty level by selecting the larger one of the mapping accuracy and the ETL column mapping accuracy.
The analysis support method according to claim 1, wherein
The third step is
An analysis support method comprising: calculating an index indicating quality of the analysis target data; correcting the mapping accuracy with the index; and calculating the difficulty level.
The analysis support method according to claim 2,
The analysis catalog is
Includes the time required for processing by each analysis software and the due date of the analysis processing,
The fourth step is
An analysis support method comprising: scheduling for each analysis software to satisfy the delivery date from the required time with reference to the analysis catalog.
The analysis support method according to claim 6, wherein
The fourth step is
An analysis support method comprising: correcting the required time based on the degree of difficulty; and performing the scheduling based on the corrected required time.
An analysis support server that has a processor and a memory and evaluates data to be analyzed,
A column mapping accuracy calculation unit that reads a first data catalog storing the definition of the columns of the data to be analyzed and a second data catalog defining the columns of the input data of analysis software that executes an analysis process, and calculates the similarity between the columns of the first data catalog and the columns of the second data catalog as a mapping accuracy;
A degree of difficulty calculation unit that calculates the degree of difficulty of analyzing the data to be analyzed by the analysis software based on the mapping accuracy of the columns of the second data catalog used by the analysis software;
An analysis support server characterized by having.
The analysis support server according to claim 8, wherein
An analysis catalog that stores information of one or more analysis software, and
And a recommendation unit that outputs information of the analysis software corresponding to the difficulty level,
The difficulty level calculation unit
The degree of difficulty is calculated for each analysis software of the analysis catalog,
The recommendation unit
An analysis support server characterized by sorting the calculated difficulty levels in a predetermined order and outputting information of the analysis software corresponding to the difficulty levels.
The analysis support server according to claim 9, wherein
It further has necessary column management information for specifying a column of the input data used by the analysis software,
The difficulty level calculation unit
An analysis support server, which acquires a column used by the analysis software with reference to the necessary column management information, and calculates the degree of difficulty from the mapping accuracy corresponding to the acquired column.
The analysis support server according to claim 9, wherein
It further has a third data catalog storing definitions for converting columns of the analysis target data into columns of the input data, and
An ETL column mapping accuracy calculation unit that reads the first data catalog and the third data catalog and calculates the similarity between the columns of the first data catalog and the columns of the third data catalog as an ETL column mapping accuracy,
The difficulty level calculation unit
An analysis support server, which selects the larger one of the mapping accuracy and the ETL column mapping accuracy to calculate the degree of difficulty.
The analysis support server according to claim 8, wherein
The difficulty level calculation unit
An analysis support server characterized by calculating an index indicating the quality of the analysis target data, correcting the mapping accuracy with the index, and calculating the difficulty level.
The analysis support server according to claim 9, wherein
The analysis catalog is
Includes the time required for processing by each analysis software and the due date of the analysis processing,
The recommendation unit
An analysis support server characterized by performing scheduling for each analysis software to satisfy the delivery date from the required time with reference to the analysis catalog.
The analysis support server according to claim 13, wherein
The recommendation unit
An analysis support server characterized by correcting the required time based on the degree of difficulty and performing the scheduling based on the corrected required time.
A non-transitory computer readable storage medium storing a program for causing a computer having a processor and a memory to evaluate data to be analyzed, the program causing the computer to execute:
A first step of reading a first data catalog storing the definition of the columns of the data to be analyzed and a second data catalog defining the columns of the input data of analysis software that executes an analysis process;
A second step of calculating the similarity between the columns of the first data catalog and the columns of the second data catalog as a mapping accuracy; and
A third step of calculating the degree of difficulty of analyzing the data to be analyzed by the analysis software based on the mapping accuracy of the columns of the second data catalog used in the analysis software.