WO2016002062A1

WO2016002062A1 - Information processing device and information processing system

Info

Publication number: WO2016002062A1
Application number: PCT/JP2014/067856
Authority: WO
Inventors: 文也工藤; 知明秋富
Original assignee: 株式会社日立製作所
Priority date: 2014-07-04
Filing date: 2014-07-04
Publication date: 2016-01-07
Also published as: JP6174802B2; JPWO2016002062A1

Abstract

In order to further facilitate extraction of the inclusion relation between columns of tabulated data, the present invention performs: a step for calculating the amounts of information of a first column and a second column included in an input table; a step for comparing the amount of information of the first column and the amount of information of the second column; a step for calculating a first conditional amount of information, which is the amount of information of the first column conditioned on the second column; and a step for determining the inclusion relation between the first column and the second column on the basis of the first conditional amount of information.

Description

Information processing apparatus and information processing system

The present invention relates to an information processing apparatus and an information processing system. More specifically, the present invention relates to an information processing apparatus and an information processing system that support analysis of data in a table format.

In recent years, systems that analyze the major factors affecting business performance by utilizing the large volumes of business data accumulated by companies have been actively developed. An analyst examines how the data relates to the objective by narrowing down a large amount of data, which includes various kinds of information, under limiting conditions. Here, the granularity at which the conditions are narrowed down is important. Take the analysis of sales improvement at a certain store as an example. When examining customer transitions over time, the analysis results vary greatly depending on the time granularity chosen, such as every minute, every hour, or every six hours. In this way, the analyst processes the data under various conditions and granularities, or analyzes it using the relationships between them. However, as the size of the data to be analyzed has increased, it has become difficult for analysts to process such data and discover relationships manually. Development of a system that supports such analysis is therefore required.

Patent Document 1: JP 2003-22277

Prior to the invention of the present application, the inventors of the present application examined the extraction of the granularity relationship between columns in the collected data, particularly regarding the granularity of data described in the background art.

FIG. 1 shows a specific example of a table treated as an analysis target and a specific example of a granularity relationship between columns. In the table 001, each row of the table representing one sample is called a record, and each column of the table such as “customer ID” 002, “age” 003, and “entry time” 004 is called a column.

Here, the columns of a table may differ in the granularity of the records they store. For example, “product classification” 005 is a superordinate concept of “product name” 006 in the lexical sense, and therefore includes “product name” 006. Further, “age” 003 includes “customer ID” 002 in terms of information amount, because every customer has exactly one age value while a plurality of customers may share the same age.

In this way, the finer-grained “product name” 006 and “customer ID” 002 are called child columns, and “product classification” 005 and “age” 003 are called parent columns. That is, specific examples of column pairs having an inclusion relationship in the table 001 are “product name” 006 and “product classification” 005, and “customer ID” 002 and “age” 003. On the other hand, an example of columns having no inclusion relationship is “product classification” 005 and “temperature” 008. If an inclusion relationship can be found in advance, that information can be used for analysis.

As an example of a technique for extracting an inclusion relationship, Patent Document 1 describes a technique in which information on a concept hierarchy, such as synonyms of a predetermined word or words that are its superordinate concepts, is registered in advance in a thesaurus dictionary, thereby providing a search method that takes the concept hierarchy between words into account at search time.

With the technique described in Patent Document 1, the conceptual relationship between words can be known for words whose synonyms and superordinate words are stored in the thesaurus dictionary. However, “age”, for example, is not registered as a superordinate concept of “customer ID” in an ordinary thesaurus dictionary. Furthermore, column names in the target table can be assigned freely by the creator of the table. The name of the product name column therefore varies with the creator, for example “Label”, “Product Name”, “pro_name”, or “Product name”, and as a result the word may well not be registered in the thesaurus dictionary. In these cases, the method of searching for a superordinate word using the technique described in Patent Document 1 cannot be applied.

Based on the above, an object of the present invention is to provide a technique that makes it easier to extract inclusion relations between columns for table format data.

A representative example of means for solving the problems according to the present invention is an information processing method that takes as input a table including a first column and a second column, and comprises: a first step of calculating a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column; a second step of comparing the magnitude relationship between the first information amount and the second information amount; a third step of calculating a first conditional information amount, which is the information amount of the first column with respect to the second column; and a fourth step of determining an inclusion relationship between the first column and the second column based on the first conditional information amount.

Alternatively, an information processing system is characterized by comprising: a storage unit that stores an input table including a first column and a second column; an information amount calculation processing unit that calculates a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column, and compares the magnitude relationship between the first information amount and the second information amount; and an inclusion relation calculation processing unit that calculates a first conditional information amount, which is the information amount of the first column with respect to the second column, and determines an inclusion relationship between the first column and the second column based on the first conditional information amount.

According to the present invention, it is easier to extract inclusion relations between columns in table format data.

FIG. 1 is a schematic diagram explaining the inclusion relationship between columns.
FIG. 2 is a block diagram showing the configuration of the entire system.
FIG. 3 is a processing flow diagram of the inclusion relation calculation and extraction unit.
FIG. 4 is a processing flow diagram of the information amount calculation.
FIG. 5 is a processing flow diagram of the conditional information amount calculation.
FIG. 6 is a processing flow diagram in which every column is treated as a child column.
FIG. 7 is a diagram showing an input table.
FIG. 8 is a diagram showing an inclusion relationship information table.
FIG. 9 is a diagram showing a parent flag table.
FIG. 10 is a schematic diagram showing a specific example of the inclusion relationship calculation process.
FIG. 11 is a processing flow diagram for automatically generating columns.
FIG. 12 is a diagram showing a specific example of the information on all columns.
FIG. 13 is a diagram showing the type classification.
FIG. 14 is a processing flow diagram of the time table generation process.
FIG. 15 is a processing flow diagram of the automatic table generation process.
FIG. 16 is a diagram showing specific examples of a time table and an automatic generation table.
FIG. 17 is a diagram showing an inclusion relationship information table.
FIG. 18 is a diagram showing an output table.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a block diagram showing the hardware configuration of the entire system of the present invention. An information processing system 100 according to the present embodiment includes an information processing apparatus 100 including a central processing unit 101 and a storage device 102, an input device 103, and an output device 104.

The central processing unit 101 is a processor that executes a program stored in the storage device 102, and includes an information amount calculation processing unit 105, an inclusion relation calculation processing unit 106, and the like. The storage device 102 is a large-capacity non-volatile storage device such as a magnetic storage device or a flash memory, and stores an input table 107, an output table 108, an inclusion relationship information table 109, and the like. The input device 103 is a user interface such as a keyboard and a mouse, and the output device 104 is a user interface such as a display device and a printer. Here, the information processing device 100, the input device 103, and the output device 104 are connected via a network, but the connection is not limited to this form. The system may be constructed physically on one computer, or on logical partitions configured on one or a plurality of computers. The information amount calculation processing unit 105 calculates the information amount of each column, as will be described later with reference to FIG. The inclusion relation calculation processing unit 106 calculates the conditional information amount of the parent column with respect to the child column, as will be described later with reference to FIG.

Next, a method by which the information processing system 100 extracts the inclusion relationship between columns for table format data will be described with reference to FIG. FIG. 3 is a flowchart showing processing in the central processing unit 101. In the flow of FIG. 3, the input data are an input table 110 and child column information 203. The input table 110 is a table that stores, for example, information on store sales and customers, as shown in the upper part of FIG. 7, and each column of the table (customer ID, age, store entry time, ...) contains data. Hereinafter, each vertical column of this table is referred to as a “column”. Each row of the table is called a record and represents one sample. The “child column” indicated by the child column information 203 is one column arbitrarily selected from the input table 110, such as “product classification” in the lower part of FIG. FIG. 3 describes the flow when the child column is fixed to “product classification”; a flow in which every column is treated as a child column will be described with reference to FIG.

First, in step 301, one column is selected from the columns included in the input table 110 as a parent column candidate, and the parent column information 302 is generated. An example of the parent column information 302 is shown in the lower part of FIG. Parent column candidates are selected in order from the input table 110, excluding the column of the child column information 203 and any column for which a flag is set in the parent flag table 309. Here, the parent flag table 309 is a table of flags marking column combinations whose inclusion relationship is already known. By using this table, combinations that do not need to be examined as parent columns can be excluded. A specific example of the parent flag table 309 is shown in FIG. In the parent flag table 309, for example, “product name” and “product classification” are flagged as having an inclusion relationship.

Next, in step 303, the column information amount H(X) of the parent column information 302 and the column information amount H(Y) of the child column information 203 are calculated. Here, the larger the amount of information, the greater the number of unique records in the column tends to be. The number of unique records is the number of distinct values in the column, with duplicates counted only once.

Next, in step 304, the information amounts are compared, and Yes is determined when 0 < H(parent column) < H(child column) is satisfied. A column holding the same value in every record has an information amount of 0 and therefore always satisfies H(parent column) < H(child column), but such a column is meaningless as a parent. The condition 0 < H(parent column) is added to exclude such columns.

If it is determined as Yes in step 304, in step 305, the conditional information amount H (X | Y) of the parent column with respect to the child column is calculated for the parent column information 302 and the child column information 203.

Next, in step 306, inclusion determination of the parent column by the child column is performed. This step is a step of determining an inclusion relationship from the viewpoint of the amount of information between columns. If H (X | Y) is 0, it is determined Yes. In consideration of the case where noise is included in the data in the column, when the value is sufficiently close to 0 and less than a predetermined threshold value α (for example, 0.1), it is determined as Yes.

In step 307, the determination result is registered in the inclusion relation information table 109. If it is determined Yes in step 306, the column name and the corresponding parent column name are registered in the inclusion relation information table 109.

Next, the parent flag table 309 is updated at step 308, and the inclusion relation information table 109 is updated at step 310.

Here, the details of step 303 in FIG. 3 will be described with reference to FIG. First, in step 401, parent column information 302 and child column information 203 are input, and parent column information and child column information from which records including Null have been excluded are output. Since a record of the parent column information 302 and the corresponding record of the child column information 203 are handled as a pair, both are excluded if either one is Null. A specific example is shown in FIG. For example, from the input table 110, the “product classification” column is selected as the parent column information 302, and the “product name” column is selected as the child column information 203. When step 401 is executed on these, the fifth record of the “product name” column contains “slipper” but the fifth record of the “product classification” column is Null, so the fifth record is deleted and the parent column information 701 and the child column information 702 are output. Thereafter, the variable Z is initialized (step 402), and the unique records in each column are extracted and stored in X (step 403).
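The pairwise Null exclusion of step 401 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `drop_null_pairs` is hypothetical, columns are assumed to be plain Python lists of equal length, and Null is assumed to be represented by `None`.

```python
def drop_null_pairs(parent, child):
    """Drop every record in which either paired value is None (Null).

    Parent and child values at the same position are treated as one pair,
    so a Null on either side removes the record from both columns.
    """
    kept = [(p, c) for p, c in zip(parent, child)
            if p is not None and c is not None]
    return [p for p, _ in kept], [c for _, c in kept]


# Shortened illustrative data: the second record has a Null classification.
parent = ["stationery", None, "food"]
child = ["pen", "slipper", "tea"]
clean_parent, clean_child = drop_null_pairs(parent, child)
print(clean_parent, clean_child)
```

Here the “slipper” record is removed from both columns, mirroring the deletion of the fifth record in the specific example.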

Next, in step 404, the amount of information for each record x in X is calculated according to equation (1) below.

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) \tag{1}$$

Here,
x: each unique record in the column
X: the set of records x
p(x): the probability that a record in the column is x
H(X): the amount of information in the column

This calculation is performed for all x in the parent column in step 405. In steps 406 and 408 to 410, the amount of information is calculated in the same way for the child column, whereby the information amount is obtained for both the parent column and the child column (step 407).

The specific calculation is as follows. First, the amount of information is calculated for the parent column information 701. In the parent column information 701, the number of records in the “product classification” column is nine. Of these, three records are “stationery”, so the probability that x = “stationery” is p(x) = 3/9. Similarly, p(x) = 4/9 for x = “food” and p(x) = 2/9 for x = “kitchen”, so the amount of information in the X = “product classification” column is calculated as H(X) = 1.53 from equation (2).

$$H(X) = -\left(\tfrac{3}{9}\log_2\tfrac{3}{9} + \tfrac{4}{9}\log_2\tfrac{4}{9} + \tfrac{2}{9}\log_2\tfrac{2}{9}\right) \approx 1.53 \tag{2}$$

Here,
x: each unique record in the “product classification” column
X: the set of records x
p(x): the probability that a record in the “product classification” column is x
H(X): the amount of information in the “product classification” column

Similarly, when the amount of information is calculated for the child column information 702, H(Y) = 2.95 is obtained from equation (3).

$$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) \approx 2.95 \tag{3}$$

Here,
y: each unique record in the “product name” column
Y: the set of records y
p(y): the probability that a record in the “product name” column is y
H(Y): the amount of information in the “product name” column

Therefore, since 0 < H(X) < H(Y), Yes is determined in the subsequent step 304. That is, in this specific example, the “product name” column has a larger amount of information than the “product classification” column. In other words, the “product name” column has more unique records than the “product classification” column. The number of unique records is the number of distinct values in the column, with duplicates counted only once.
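The information amounts of the specific example can be reproduced with a short sketch. It assumes that the information amount is the Shannon entropy with a base-2 logarithm, which matches the values 1.53 and 2.95 given in the text. The function name `information_amount` is hypothetical, and the product names other than “pen”, “pencil”, “eraser”, and “tea” are invented so that the nine records have the stated frequencies.

```python
from collections import Counter
from math import log2


def information_amount(column):
    """Shannon entropy in bits of the values stored in a column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Nine records: 3 stationery, 4 food, 2 kitchen; "tea" occurs twice.
# "bread", "rice", "pan" and "knife" are invented placeholder names.
product_classification = ["stationery"] * 3 + ["food"] * 4 + ["kitchen"] * 2
product_name = ["pen", "pencil", "eraser",
                "tea", "tea", "bread", "rice",
                "pan", "knife"]

h_parent = information_amount(product_classification)
h_child = information_amount(product_name)
print(round(h_parent, 2), round(h_child, 2))  # prints 1.53 2.95
```

With these values the condition 0 < H(X) < H(Y) of step 304 holds, so the pair proceeds to the conditional information amount calculation.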

Next, the details of step 305 in FIG. 3 will be described with reference to FIG. As in FIG. 4, the parent column information and child column information excluding records including Null are output (step 401), Z is initialized (step 501), and the unique records in each column are extracted and stored in X and Y (step 502).

Next, in steps 503 to 505, the conditional information amount 506 of the parent column with respect to the child column is calculated over the records x and y in X and Y. The calculation of the conditional information amount in step 503 is performed by the following equation (4).

$$H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(y)} \tag{4}$$

Here,
x: each unique record in the parent column
y: each unique record in the child column
X: the set of records x
Y: the set of records y
p(y): the probability that the record in the child column is y
p(x, y): the probability that the record in the parent column is x and the record in the child column is y
H(X | Y): the amount of information remaining in the record of the parent column when the record of the child column is determined

Pairs (x, y) that do not occur in the data have p(x, y) = 0 and contribute nothing to the sum.

The description will continue using the child column information 702. Hereinafter, the meaning of each symbol is as follows.
x: each unique record in the “product classification” column
y: each unique record in the “product name” column
X: the set of records x
Y: the set of records y
p(y): the probability that the record in the “product name” column is y
p(x, y): the probability that the record in the “product classification” column is x and the record in the “product name” column is y
H(X | Y): the amount of information remaining in the record of the “product classification” column when the record of the “product name” column is determined

First, if x = “stationery” and y = “pen”, the number of records in the “product name” column is 9, the number of records in which the “product name” column is “pen” is 1, and the number of records in which the “product name” column is “pen” and the “product classification” column is “stationery” is 1, so p(y) = 1/9 and p(x, y) = 1/9. Similarly, when x = “food” and y = “tea”, p(y) = 2/9 and p(x, y) = 2/9.

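With the above procedure, using p(y) and p(x, y) for all records, the conditional information amount for X = all records in the “product classification” column and Y = all records in the “product name” column is calculated as H(X | Y) = 0 from equation (5).

$$H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(y)} = 0 \tag{5}$$

Every pair that occurs satisfies p(x, y) = p(y), so each logarithm is log₂ 1 = 0.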

Here, in step 306, Yes is determined when H(X | Y) < α is satisfied, where α is a threshold in the range of approximately 0 ≤ α < 0.1. In the specific example, the condition is satisfied, so Yes is determined. In other words, the information in the “product classification” column is judged to be contained in the “product name” column. For comparison, interchanging X and Y and calculating the conditional information amount H(Y | X) of the “product name” column with respect to the “product classification” column gives equation (6).

$$H(Y \mid X) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)} \approx 1.42 \tag{6}$$

From equation (6), H(Y | X) = 1.42. Since H(Y | X) > α, this value is determined as No in the inclusion determination of step 306. In other words, the information in the “product name” column is judged not to be contained in the “product classification” column. That is, the “product classification” column is a parent column of the “product name” column, but the reverse is not true. It should be noted that the above calculation is for explanation only, and the calculation of equation (6) is not essential for the implementation of the present invention.
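The two conditional information amounts discussed above can be checked with the following sketch. It assumes the standard conditional entropy in bits, computed from paired records. The function name `conditional_information_amount` is hypothetical, and the product names other than “pen”, “pencil”, “eraser”, and “tea” are invented to match the stated frequencies.

```python
from collections import Counter
from math import log2


def conditional_information_amount(x_col, y_col):
    """H(X|Y) in bits for two columns whose records are paired by position."""
    n = len(x_col)
    joint = Counter(zip(x_col, y_col))   # counts of each (x, y) pair
    y_counts = Counter(y_col)            # counts of each y
    # p(x, y) = c / n and p(x, y) / p(y) = c / y_counts[y]
    return -sum((c / n) * log2(c / y_counts[y]) for (_, y), c in joint.items())


product_classification = ["stationery"] * 3 + ["food"] * 4 + ["kitchen"] * 2
product_name = ["pen", "pencil", "eraser",
                "tea", "tea", "bread", "rice",   # "bread", "rice": invented
                "pan", "knife"]                   # "pan", "knife": invented

h_x_given_y = conditional_information_amount(product_classification, product_name)
h_y_given_x = conditional_information_amount(product_name, product_classification)
print(abs(h_x_given_y), round(h_y_given_x, 2))
```

As a cross-check, the chain rule H(Y | X) = H(Y) + H(X | Y) - H(X) gives 2.95 + 0 - 1.53 = 1.42 for the values stated in the text.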

In this way, step 306 determines whether, once the record of one column is fixed, the record of the other column tends to be uniquely determined. In the specific example of the input table 110, when “product name” is “pen”, “product classification” is always “stationery”, and when “product name” is “tea”, “product classification” is always “food”, so the “product classification” column is a parent column of the “product name” column. On the other hand, when “product classification” is “stationery”, “product name” is one of “pen”, “pencil”, and “eraser” and is not uniquely determined, so “product name” is not a parent of “product classification”. As described above, the column of the parent column information 302 and the column of the child column information 203 for which Yes is determined in the information amount determination of step 304 and Yes is determined in the inclusion determination of step 306 are columns having an inclusion relationship.

This completes the extraction of the inclusion relationship when the child column is fixed to “product classification”. Next, a flow in which all columns are child columns will be described with reference to FIG.
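The flow of FIG. 3 for a fixed child column can be summarized in a short sketch. It is an illustration under stated assumptions rather than the embodiment itself: the table is assumed to be a dictionary mapping column names to lists of equal length, the information amounts are assumed to be base-2 entropies, Null handling and the parent flag table are omitted, and the names `entropy`, `cond_entropy`, and `find_parents` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(col):
    n = len(col)
    return -sum(c / n * log2(c / n) for c in Counter(col).values())


def cond_entropy(x_col, y_col):
    n = len(x_col)
    y_counts = Counter(y_col)
    return -sum(c / n * log2(c / y_counts[y])
                for (_, y), c in Counter(zip(x_col, y_col)).items())


def find_parents(table, child, alpha=0.1):
    """Return the names of columns judged to be parents of `child`."""
    parents = []
    h_child = entropy(table[child])
    for name, col in table.items():
        if name == child:
            continue
        if not 0 < entropy(col) < h_child:           # step 304
            continue
        if cond_entropy(col, table[child]) < alpha:  # steps 305 and 306
            parents.append(name)
    return parents


# Illustrative table; values other than those named in the text are invented.
table = {
    "customer_id": list(range(1, 10)),
    "product_name": ["pen", "pencil", "eraser", "tea", "tea",
                     "bread", "rice", "pan", "knife"],
    "product_classification": ["stationery"] * 3 + ["food"] * 4 + ["kitchen"] * 2,
    "store": ["A"] * 9,  # constant column, information amount 0
}
print(find_parents(table, "product_name"))
```

In this sketch the constant “store” column is rejected by the 0 < H(parent column) condition, and “customer_id” is rejected because its information amount is not smaller than that of the child column.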

First, in step 202, any one column in the input table 110 is selected as a child column, and child column information 203 is output. Next, in step 204, the series of flows described with reference to FIGS. 3 to 5 are executed to extract inclusion relation information between columns and update the inclusion relation information table 109. In step 205, the process of step 204 is executed for all columns.

When it is confirmed in step 205 that the processing of step 204 has been executed for all columns, the nearest degree calculation processing is executed in step 206 and the inclusion relation information table 109 is updated as necessary. As described above, the inclusion relation can be calculated automatically for all the columns included in the input table 110. Here, the nearest degree 603 is a value indicating the closeness of the inclusion relationship between the columns stored in the column name 601 and the parent column name 602. Among records having the same column name 601, the columns stored in the parent column name 602 are numbered in descending order of their number of unique records. A specific example of the nearest degree 603 will be described with reference to FIG.
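One possible reading of the nearest degree calculation of step 206 is sketched below. It assumes that the parent columns of each child have already been extracted, that numbering starts at 1, and that the parent with the most unique records receives the smallest number. The function name `nearest_degrees`, the row layout, and the “department” column are hypothetical.

```python
def nearest_degrees(table, child_to_parents):
    """Number each child's parents in descending order of unique record count.

    Returns rows of (column name, parent column name, nearest degree).
    """
    rows = []
    for child, parents in child_to_parents.items():
        ranked = sorted(parents, key=lambda p: len(set(table[p])), reverse=True)
        for degree, parent in enumerate(ranked, start=1):
            rows.append((child, parent, degree))
    return rows


# Illustrative data: "department" is an invented, coarser parent column.
table = {
    "product_name": ["pen", "pencil", "tea", "bread"],
    "product_classification": ["stationery", "stationery", "food", "food"],
    "department": ["goods", "goods", "goods", "goods"],
}
child_to_parents = {"product_name": ["department", "product_classification"]}
rows = nearest_degrees(table, child_to_parents)
print(rows)
```

Under these assumptions the finer-grained “product_classification” receives nearest degree 1 and the coarser “department” receives 2.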

The table in this embodiment may be regarded as the same concept as a table in a general database. However, the present invention is not limited to database tables: the table may be held in a memory area of a program, or may be replaced by data in any format, such as a text file format or a CSV file format.

FIG. 8 is a specific example of the inclusion relation information table 109. The column name 601 stores the column name of the child column whose inclusion relation is calculated in the input table 110 of FIG. In the parent column name 602, when a parent column exists for the column of the column name 601, the column name of the parent column is stored. For example, in step 307 of FIG. 3, “product classification” that is a column of parent column information 302 and “product name” that is a column of child column information 203 are registered in the record 604.

FIG. 9 is a specific example of the parent flag table 309. The parent flag table 309 is a square matrix whose rows and columns have the same elements. The shaded area of the square matrix is not used. The initial state of the parent flag table 309 is an empty table. When, in step 307, the “product name” column and the “product classification” column of the input table 110 are registered as being in an inclusion relationship, flag 1 is stored in step 308 at the location of the parent flag table 309 corresponding to “product name” and “product classification”. Similarly, a flag is stored at the location corresponding to each combination of columns determined to be in an inclusion relationship.
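A parent flag table of this kind might be held in memory as follows. This is only a sketch: the class name `ParentFlagTable` and its methods are hypothetical, and treating the unused shaded area as the diagonal and lower triangle of the matrix is an assumption, since the text does not say which cells are shaded.

```python
class ParentFlagTable:
    """Square flag matrix indexed by column name on both axes."""

    def __init__(self, columns):
        self.index = {name: i for i, name in enumerate(columns)}
        n = len(columns)
        self.flags = [[0] * n for _ in range(n)]  # initially no flags set

    def _cell(self, a, b):
        # Assumption: only the upper triangle (i < j) is used.
        i, j = sorted((self.index[a], self.index[b]))
        return i, j

    def set_flag(self, a, b):
        i, j = self._cell(a, b)
        self.flags[i][j] = 1

    def is_flagged(self, a, b):
        i, j = self._cell(a, b)
        return self.flags[i][j] == 1


flag_table = ParentFlagTable(["customer_id", "age", "product_name",
                              "product_classification"])
flag_table.set_flag("product_name", "product_classification")
print(flag_table.is_flagged("product_classification", "product_name"))
```

Storing each pair in a single cell means a flagged combination is found regardless of the order in which the two column names are given, which is what step 301 needs when it skips candidates that are already flagged.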

As described above, the information processing method according to the present embodiment takes as input the table 110 including the first column (parent column) and the second column (child column), and comprises: a step 303 of obtaining a first information amount H(X), which is the information amount of the first column, and a second information amount H(Y), which is the information amount of the second column; a step 304 of comparing the first information amount and the second information amount; a step 305 of obtaining a first conditional information amount H(X | Y), which is the information amount of the first column with respect to the second column; and a step 306 of determining the inclusion relationship between the first column and the second column based on the first conditional information amount. In addition, the information processing system according to the present embodiment is characterized by having: a storage unit 102 that stores an input table including a first column and a second column; an information amount calculation processing unit 105 that calculates a first information amount, which is the information amount of the first column, and a second information amount, which is the information amount of the second column, and compares the magnitude relationship between the first information amount and the second information amount; and an inclusion relation calculation processing unit 106 that calculates a first conditional information amount, which is the information amount of the first column with respect to the second column, and determines the inclusion relationship between the first column and the second column based on the first conditional information amount.

Due to these features, the information processing method and the information processing system according to the present embodiment can more easily extract the inclusion relationship between columns in table format data. In particular, since this information processing method extracts inclusion relations based on the information amount of the data, a thesaurus dictionary is unnecessary, and the method can be carried out without problems even when the creator has assigned arbitrary column names.

FIG. 11 is a flowchart of the information processing system 100 in the second embodiment, which corresponds to FIG. 6 in the first embodiment. The difference from the first embodiment is that a function for automatically generating a parent column is added by changing the operation depending on the type of the child column input by the child column information 203. As a result, the parent column can be generated even when the parent column of the child column does not exist in the input table 110.

In FIG. 11, in addition to the input table 110, the information table 801 for all columns is input by the user. The information table 801 for all columns stores information on all the columns of the input table 110. A specific example of the information table 801 for all columns is shown in FIG. One column is selected from the input table 110 in step 202, and child column information 203 is output. Using the child column information 203 and the information table 801 for all columns as input, step 802 determines the column type of the child column information 203 to be one of three types: “time type”, “numeric type”, and “string type”. Then step 805, step 803, or step 204 is executed according to the determined type. In step 802, the type is determined based on the column type name of the child column information 203 stored in the information table 801 for all columns. FIG. 13 shows the type classification used for determining the type.

If the type is determined to be the time type in step 802, step 805 is executed and a time table 807 is output. A detailed flow of the time table generation process of step 805 is shown in FIG. 14, and a specific example of the time table 807 is shown in FIG.

If the type is determined to be the numeric type in step 802, step 803 is executed, and a frequency distribution is calculated over the records in the column of the child column information 203. Using the obtained result, step 804 determines whether the column is to be handled as a numeric type or as a character string type, that is, whether the column elements in the child column information 203 are numerical values treated as IDs or numerical values treated as values. Specifically, for example, No is determined when the frequency distribution follows a uniform distribution, and Yes is determined when it follows a normal distribution or some other distribution. If Yes is determined in step 804, step 806 is executed and the automatic generation table 808 is output. FIG. 15 shows details of the automatic generation table generation process of step 806, and FIG. 16 shows a specific example of the automatic generation table.

If the type is determined to be the character string type in step 802, step 204 is executed. The details of step 204 are the same as those described with reference to FIGS.

Thereafter, the input table 110 and the time table 807 or the automatic generation table 808 are input, and the tables are joined in step 809. When the input table 110 is a table in a general relational database, this join is performed using an inner join, which is a join query; for other formats such as a text format, an equivalent process is performed. The key column for the join is the column indicated by the child column information. In step 810, the inclusion relation information is stored in the inclusion relation information table 812. A specific example of the inclusion relation information table 812 is shown in FIG.
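For data that is not held in a relational database, a process equivalent to the inner join of step 809 could look like the following sketch. Rows are assumed to be dictionaries, the function name `inner_join` is hypothetical, and the sample values, including the range labels in “ReCalc_temperature”, are invented for illustration.

```python
def inner_join(left_rows, right_rows, key):
    """Join two lists of dict rows on `key`, keeping only matching rows."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**right, **left})  # generated columns are appended
    return joined


# Input table rows and an automatically generated table keyed on the child column.
input_rows = [
    {"customer_id": 1, "temperature": 18},
    {"customer_id": 2, "temperature": 25},
    {"customer_id": 3, "temperature": 40},  # no match in the generated table
]
generated_rows = [
    {"temperature": 18, "ReCalc_temperature": "10-20"},
    {"temperature": 25, "ReCalc_temperature": "20-30"},
]
output_rows = inner_join(input_rows, generated_rows, "temperature")
print(output_rows)
```

Because the join is an inner join, the third input row, whose key has no counterpart in the generated table, does not appear in the output.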

If Yes is determined in step 205, the output table 108 and the inclusion relation information table 812 are output, and the inclusion relation information table 812 is updated in step 206 as necessary. A specific example of the output table 108 is shown in FIG. In the flow described here, the type of the selected child column is determined first; however, it is also possible to execute step 204 first and, only if no parent column is found, determine the type and execute step 805 or step 806.

FIG. 12 shows a specific example of the information table 801 for all columns. The information table 801 for all columns stores column names, type names, and threshold values. The column name is the name of each column in the input table 110. The type name is a type name used in a general relational database, such as int, float, double, decimal, bit, boolean, char, string, time, date, or datetime. The threshold value is a numerical value given by the user or the system.

FIG. 13 shows a type classification table 902. In step 802, the determination is made according to the type classification table 902. The types other than the types listed here are determined by defining the type classification in the same manner.

FIG. 14 shows a detailed flowchart of step 805. In this process, columns of various granularities are generated for a child column of the time type. In step 1001, the child column information 203 is input, and the start date and time and the end date and time are obtained as the minimum and maximum values of the records in the column. Next, in step 1002, a time table 807 is generated covering the range from the start date and time to the end date and time. A specific example of the time table 807 is shown in FIG.
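Steps 1001 and 1002 can be sketched as follows. The sketch assumes that the child column holds timestamps with one-minute resolution and that each coarser time is obtained by rounding down from midnight; the function names, the key column name "entry time", and the one-minute base step are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Column names follow the example granularities in the description.
GRANULARITIES = {
    "time every 10 minutes": timedelta(minutes=10),
    "time every hour": timedelta(hours=1),
    "time every 6 hours": timedelta(hours=6),
}


def floor_time(ts, step):
    """Round `ts` down to a multiple of `step`, counted from midnight of the same day."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // step) * step


def build_time_table(timestamps, key="entry time", base=timedelta(minutes=1)):
    """Step 1001: find the start and end; step 1002: generate rows covering that range."""
    start, end = min(timestamps), max(timestamps)
    rows = []
    current = floor_time(start, base)
    while current <= end:
        row = {key: current}
        for name, step in GRANULARITIES.items():
            row[name] = floor_time(current, step)
        rows.append(row)
        current += base
    return rows


# Illustrative data only.
table = build_time_table([datetime(2014, 7, 4, 9, 3), datetime(2014, 7, 4, 9, 25)])
```

Because every generated column is a function of the key column, each of them is a parent candidate for the child column after the join.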

FIG. 15 shows a detailed flowchart of step 806. In this process, columns of various granularities are generated for the child column. In step 1101, the child column information 203 is input, and the maximum and minimum values of the records in the column are acquired. In addition, the frequency distribution information of the records in the column is acquired. Next, based on the child column information obtained in step 1101, the automatic generation table 808 is generated in step 1102. In step 1102, two kinds of table can be generated: a table whose values are obtained by dividing the range between the minimum and maximum values by the threshold value acquired from the information table 801 for all columns, and a table based on the frequency distribution of the child column. A specific example of the automatic generation table 808 is shown in FIG.
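The first kind of table, obtained by dividing the range between the minimum and maximum values, can be sketched as equal-width binning. The sketch assumes that the threshold value is the number of intervals and that the generated column stores an interval index; both the function name and the stored representation are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index of its equal-width interval between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All records are identical, so there is only one interval.
        return [0 for _ in values]
    # The maximum value falls into the last interval rather than a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Illustrative data only: temperatures divided into two intervals.
recalc_temperature = equal_width_bins([18.0, 21.5, 25.0, 30.0], 2)
```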

FIG. 16 is a specific example of the time table 807 and the automatic generation table 808. The time table 807 is generated based on the start date and time and the end date and time obtained in step 1001. The generated time table contains time columns of various increments, such as a column whose records are times in 10-minute increments and a column whose records are times in 1-hour increments. Similarly, the automatic generation table 808 is generated based on the maximum value and the minimum value obtained in step 1101. Here, the “ReCalc_temperature” column 1506 is derived from the “temperature” column of the input table 110. In the information table 801 for all columns, the record whose column name is “temperature” has the threshold value “2”, so the column stores the values obtained by dividing the range between the maximum and minimum values of the “temperature” records into two. Besides dividing the range between the maximum and minimum values by the threshold value, it is also possible to generate a column divided according to the threshold value so that the frequencies are equal, based on the frequency distribution information extracted by the frequency distribution information extraction 1102.
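The frequency-based alternative can be sketched as equal-frequency binning. The sketch assumes that the threshold value gives the number of intervals and that cut points are taken from the sorted records, so that identical values always receive the same interval index; the function name and the way cut points are chosen are assumptions.

```python
import bisect


def equal_frequency_bins(values, n_bins):
    """Assign interval indices so that each interval holds roughly the same number of records."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points are the values found at the 1/n_bins, 2/n_bins, ... positions.
    cuts = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    # Equal values share an index, so the result is a function of the child column.
    return [bisect.bisect_right(cuts, v) for v in values]


# Illustrative data only: a skewed column split into two equally populated intervals.
labels = equal_frequency_bins([1, 2, 3, 4, 100, 200], 2)
```

For skewed data such as the sample above, dividing the range into equal widths would place nearly all records in one interval, whereas the frequency-based division keeps the intervals balanced.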

FIG. 17 is a specific example of the inclusion relation information table 812. The column name 601 stores the column names of all the columns in the input table 110. The parent column name 602 stores the column name of the parent column when a parent column exists for the column of the column name 601. The proximity 603 (latest degree) indicates how close the inclusion relationship is between the columns stored in the column name 601 and the parent column name 602. When a column of the column name 601 has several parent columns, the parent columns are numbered in descending order of their number of unique records. In the record 1201, “ReCalc_temperature” generated by the automatic generation table generation processing unit 806 is registered as a parent column of “temperature”. In the records 1208 and 1209, the inter-column inclusion relation information extraction processing unit 204 extracts and registers “age” as a parent column of “customer ID” and “product classification” as a parent column of “product name”. Since each of these records has only one parent column for its child column, 1 is stored in the proximity.

On the other hand, in the records 1202, 1203, and 1204, “time every 10 minutes”, “time every hour”, and “time every 6 hours” are registered as parent columns of “entry time” because the time table 807 has been generated. Since “entry time” has three parent columns, the proximity 603 is assigned in descending order of the number of unique records of each parent column. The number of unique records is largest for “time every 10 minutes”, followed by “time every hour” and then “time every 6 hours”, so the ranks are stored in this order. In addition, “time every hour” and “time every 6 hours” are parent columns of “time every 10 minutes”, and “time every 6 hours” is a parent column of “time every hour”; these relations are registered in the records 1205, 1206, and 1207, respectively. When there is no parent column, as in the record 1210, nothing is stored in the parent column name or the proximity. The inclusion relation information table 812 is obtained as described above.
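The ranking stored in the field 603 can be sketched as follows: the parent columns of one child column are ordered by descending number of unique records, and the position in that order is stored. The function name, the dictionary keys, and the sample rows are hypothetical.

```python
def rank_parents(table, child, parents):
    """Build inclusion relation records for `child`, ranking its parent columns.

    Rank 1 goes to the parent with the most unique records, that is, the
    parent whose granularity is closest to that of the child column.
    """
    unique_counts = {p: len({row[p] for row in table}) for p in parents}
    ordered = sorted(parents, key=lambda p: unique_counts[p], reverse=True)
    return [
        {"column name": child, "parent column name": parent, "rank": rank}
        for rank, parent in enumerate(ordered, start=1)
    ]


# Illustrative data only.
rows = [
    {"entry time": "09:03", "time every 10 minutes": "09:00",
     "time every hour": "09:00", "time every 6 hours": "06:00"},
    {"entry time": "09:25", "time every 10 minutes": "09:20",
     "time every hour": "09:00", "time every 6 hours": "06:00"},
    {"entry time": "10:10", "time every 10 minutes": "10:10",
     "time every hour": "10:00", "time every 6 hours": "06:00"},
]
records = rank_parents(
    rows, "entry time",
    ["time every 6 hours", "time every hour", "time every 10 minutes"],
)
```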

FIG. 18 is a specific example of the output table 108. The time table 807 and the automatic generation table 808 generated in step 805 and step 806 are joined with the input table 110 in step 809, and the output table 108 is output. In the output table 108, a “time every 10 minutes” column 1302, a “time every hour” column 1303, and a “time every 6 hours” column 1304 are generated in step 805 and joined as parent columns of the “entry time” column 1301. Further, the “ReCalc_temperature” column 1306 is generated by the automatic generation table generation processing unit 806 and joined as a parent column of the “temperature” column 1305. Inclusion relationship information between columns, including the artificially generated columns, is stored in the inclusion relationship information table 812. As described above, by using the output table 108 and the inclusion relation information table 812 output by the invention according to this embodiment, inclusion relations can be extracted for columns of all types in the input table.

110: Information processing system, 101: Central processing unit, 102: Storage device, 103: Input device, 104: Output device, 105: Information amount calculation processing unit, 106: Inclusion relation calculation processing unit, 107: Input table, 108: Output table, 109: Inclusion relationship information table, 110: Input table, 202, 204, 205, 206, 301, 303, 304, 305, 306, 307, 308, 401, 402, 403, 404, 405, 406, 501, 502, 503, 504, 505, 802, 803, 804, 805, 806, 809, 810, 1001, 1002, 1101, 1102: Step, 203, 702: Child column information, 302, 701: Parent column information, 309: Parent flag table, 407: Information amount, 506: Conditional information amount, 601: Column name, 602: Parent column name, 603: Proximity, 604, 1201 to 1210: Record, 801: Information table for all columns, 807: Time table, 808: Automatic generation table, 812: Inclusion relation information table, 902: Type classification table, 1301 to 1306: Column.

Claims

An information processing method for an input table that includes a first column and a second column, the method comprising:
a first step of calculating a first information amount that is an information amount of the first column and a second information amount that is an information amount of the second column;
a second step of comparing a magnitude relationship between the first information amount and the second information amount;
a third step of calculating a first conditional information amount, which is an information amount of the first column conditioned on the second column; and
a fourth step of determining an inclusion relationship between the first column and the second column based on the first conditional information amount.
The information processing method according to claim 1,
wherein, in the second step, the third step is executed when the first information amount is larger than 0 and smaller than the second information amount.
The information processing method according to claim 1,
wherein, in the fourth step, when the first conditional information amount is 0 or more and less than a predetermined threshold value, it is determined that the first column includes the second column.
The information processing method according to claim 1,
further comprising a fifth step of determining whether a type of a record included in the second column is a time type, a numeric type, or a character string type.
The information processing method according to claim 4,
further comprising a sixth step of generating a plurality of columns in which the step size of the records included in the second column is changed, when it is determined in the fifth step that the type of the second column is a time type.
The information processing method according to claim 4,
further comprising a seventh step of determining, when it is determined in the fifth step that the type of the second column is a numeric type, whether a record included in the second column is handled as a numeric type or a character string type.
The information processing method according to claim 6,
further comprising an eighth step of generating a plurality of columns by dividing the record range of the column by a predetermined threshold, when it is determined in the sixth step that the second column is handled as a numeric type.
The information processing method according to claim 3,
wherein the input table further includes a third column determined to include the second column,
the method further comprising a ninth step of storing, in an inclusion relation table that stores the inclusion relations of the first, second, and third columns, a degree indicating the rank, in descending order, of the number of unique records included in the first and third columns.
The information processing method according to claim 1,
further comprising a tenth step of deleting the record when the first column or the second column includes a null record.
An information processing system comprising:
a storage unit that stores an input table including a first column and a second column;
an information amount calculation processing unit that calculates a first information amount that is an information amount of the first column and a second information amount that is an information amount of the second column, and compares a magnitude relationship between the first information amount and the second information amount; and
an inclusion relation calculation processing unit that calculates a first conditional information amount, which is an information amount of the first column conditioned on the second column, and determines an inclusion relationship between the first column and the second column based on the first conditional information amount.