WO2015045318A1 - Information processing system, information processing method, and recording medium storing program
- Publication number
- WO2015045318A1 (PCT/JP2014/004706)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- attribute
- function
- attributes
- analysis engine
- information processing
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Definitions
- the present invention relates to a technique for supporting data mining.
- Data mining is a technology for finding useful, previously unknown knowledge in a large amount of information.
- A well-known example is the analysis of sales data owned by a major supermarket chain.
- Analysis of the sales data found that “customers who purchase diapers tend to purchase beer at the same time”.
- The supermarket chain can use this knowledge to improve sales, for example by making sure that diapers and beer do not go out of stock at the same time.
- The first stage (process) is the “preprocessing stage”.
- In this stage, the attributes (features) to be input to a device or the like that operates according to the data mining algorithm are processed and converted into new attributes.
- the second stage is the “analysis process stage”.
- an attribute is input to a device or the like that operates according to the data mining algorithm, and an analysis result that is an output of the device or the like that operates according to the data mining algorithm is obtained.
- the third stage is the “post-processing stage”.
- the analysis result is converted into an easy-to-read graph, a control signal for inputting to another device, or the like.
- the “pre-processing stage” needs to be appropriately performed.
- Designing the procedure by which the “preprocessing stage” should be carried out depends on the knowledge of a technician skilled in analysis technology (a data scientist).
- The design of the preprocessing stage is not sufficiently supported by information processing technology and still depends heavily on manual trial and error by skilled engineers.
- Non-Patent Document 1 discloses an example of software for realizing data mining.
- Non-Patent Document 1 provides a function for supporting selection of an attribute suitable for realizing a desired task (analysis process). This function is also referred to as “feature selection”.
- Suppose an operator performs data mining using the software disclosed in Non-Patent Document 1. In this case, the operator cannot always obtain an accurate analysis result. This is because the software disclosed in Non-Patent Document 1 merely selects, from attributes prepared in advance, an attribute for obtaining an accurate analysis result. In other words, the software can output only a solution selected from the attributes prepared in advance. For this reason, if those attributes do not include one that yields an accurate analysis result, the operator cannot obtain an accurate analysis result.
- the present invention has an object to provide an information processing system and the like that contributes to improvement in accuracy of analysis processing.
- In a first aspect of the present invention, an information processing system includes: attribute generation means that, for a function defining an operation that takes a plurality of operands, selects a combination of attributes to serve as the plurality of operands from a plurality of input attributes and applies the function to the combination of attributes, thereby generating a new attribute that is the result of applying the function to the combination of attributes; and a verification unit that inputs the new attribute to an analysis engine, which executes analysis processing based on attributes, and determines whether information output from the analysis engine satisfies a predetermined requirement.
- In another aspect, a computer that can access function storage means, which stores a function defining an operation that takes a plurality of operands, acquires the function from the function storage means, selects a combination of attributes to serve as the plurality of operands from among a plurality of input attributes, and applies the function to the combination of attributes, thereby generating a new attribute that is the result of applying the function to the combination of attributes.
- In a further aspect, a program causes a computer that can access function storage means, which stores a function defining an operation that takes a plurality of operands, to execute: a process of acquiring the function from the function storage means; and a process of selecting a combination of attributes to serve as the plurality of operands from a plurality of input attributes and applying the function to the combination of attributes, thereby generating a new attribute that is the result of applying the function to the combination of attributes.
- the object of the present invention is also achieved by a computer-readable storage medium storing the above program.
- FIG. 1 is a block diagram illustrating the configuration of an information processing system 1000 according to the first embodiment of the present invention.
- FIG. 2 is a diagram showing an example of a data set according to the first embodiment of the present invention.
- FIG. 3 is a diagram illustrating an example of data stored in the function storage unit 110 according to the first embodiment of the present invention.
- FIG. 4 is a diagram illustrating the details of the attribute generation unit 120 according to the first embodiment of the present invention.
- FIG. 5 is a diagram for explaining the details of the test unit 130 according to the first embodiment of the present invention.
- FIG. 6 is a diagram illustrating details of the test unit 130 according to the first embodiment of the present invention.
- FIG. 7 is a diagram illustrating details of the test unit 130 according to the first embodiment of the present invention.
- FIG. 8 is a flowchart for explaining the operation of the information processing system 1000 according to the first embodiment of the present invention.
- FIG. 9 is a block diagram illustrating the configuration of an information processing system 1001 according to the second embodiment of the present invention.
- FIG. 10 is a diagram showing an example of a data set according to the second embodiment of the present invention.
- FIG. 11 is a diagram illustrating an example of data stored in the function storage unit 111 according to the second embodiment of the present invention.
- FIG. 12 is a diagram illustrating details of the attribute generation unit 121 according to the second embodiment of the present invention.
- FIG. 13 is a diagram for explaining the details of the verification unit 131 according to the second embodiment of the present invention.
- FIG. 14 is a block diagram illustrating the configuration of an information processing system 1002 according to the third embodiment of the present invention.
- FIG. 15 is a diagram illustrating an example of a hardware configuration capable of realizing the information processing system according to each embodiment of the present invention.
- Data set is data input to the information processing system 1000.
- a “data set” includes one or more attributes.
- Attribute can be rephrased as “variable”.
- A “function” defines processing that creates a new attribute from a certain attribute.
- the “function” is applied to the attribute included in the data set. That is, when a “function” is applied to a certain attribute, a process defined by the function is executed for the certain attribute, and as a result, a new attribute is generated.
- “function” defines an operation to be applied to an attribute.
- the function defines a process of transforming one attribute to another attribute.
- the “function” may be a mapping applied to the attribute included in the data set.
- a function represents the above-described operation associated with the function.
- a function represents the above-described process associated with the function.
- the process defined by “function” is, for example, a unary operation. “Function” defines operations such as trigonometric functions (sin (X), cos (X), tan (X)), natural logarithm, absolute value, or sign inversion.
- The “function” may define an operation that includes a parameter n, such as log_n X or X^n.
- Alternatively, the process defined by a “function” may be a multi-operand (multinomial) operation.
- A multi-operand operation is an operation having a plurality of operands.
- “Function” defines, for example, arithmetic operations (addition, subtraction, multiplication, etc.) of attribute X and attribute Y.
- A “function” may also define, for example, a logical operation (logical product (AND), logical sum (OR), or exclusive OR (XOR)) applied to the bit value of attribute X and the bit value of attribute Y.
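The unary and multi-operand functions described above can be sketched as a small registry. This is only an illustrative sketch: the `Function` type, the registry keys, and `apply_function` are hypothetical names that do not appear in the source, and attributes are assumed to be plain Python lists of equal length.

```python
import math
from typing import Callable, NamedTuple


class Function(NamedTuple):
    name: str
    arity: int                 # number of operands (attributes) the operation takes
    op: Callable[..., float]   # operation applied to one row of operand values


# Hypothetical registry; the entries mirror the kinds of operations named in the text.
FUNCTIONS = {
    f.name: f
    for f in [
        Function("sin", 1, math.sin),
        Function("abs", 1, abs),
        Function("negate", 1, lambda x: -x),
        Function("add", 2, lambda x, y: x + y),
        Function("multiply", 2, lambda x, y: x * y),
        Function("xor", 2, lambda x, y: int(x) ^ int(y)),
    ]
}


def apply_function(fn: Function, *columns: list) -> list:
    """Apply fn row by row to the given attribute columns and return a new attribute."""
    if len(columns) != fn.arity:
        raise ValueError(f"{fn.name} takes {fn.arity} operand(s), got {len(columns)}")
    return [fn.op(*row) for row in zip(*columns)]
```

For instance, `apply_function(FUNCTIONS["multiply"], x, y)` would produce the row-wise product of attributes `x` and `y` as a new attribute.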
- the process defined by the “function” may be “data-dependent process” in which the process is determined according to the data.
- One example of data-dependent processing is standardization (normalization) processing.
- For example, the data mining device generates a new attribute called “standardized height” by applying a function that defines standardization processing to the attribute “height”.
- In this case, the data mining device cannot standardize the data for each person included in the attribute individually. For example, suppose the data mining device has so far received only the first record, “name: N, height: 174”, out of information for 100 people. The data mining device then cannot calculate the new attribute “standardized height” for that first person. This is because, until the information for all 100 people is available, the data mining device cannot know the mean and the standard deviation of the 100 “height” values, and therefore cannot determine the function for standardization.
- Other examples of data-dependent processing include histogram generation, clustering, and principal component analysis.
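The data dependence of standardization can be illustrated with a short sketch. It assumes the usual z-score definition with the population standard deviation, which the source does not spell out; `standardize` is a hypothetical helper, and apart from 174 the height values are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return (v - mean) / std for every value; the whole column is required."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Happens, for example, when only a single record has been received.
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(v - mu) / sigma for v in values]


# Only 174 appears in the text; the other heights are illustrative placeholders.
heights = [174.0, 160.0, 182.0]
standardized_height = standardize(heights)
```

Calling `standardize([174.0])` raises an error, which reflects the point made above: the mean and standard deviation are unknown until the full column is available.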
- The “analysis engine” executes analysis processing based on attributes. That is, the analysis engine accepts an attribute as an input, performs analysis based on the attribute, and outputs the analysis result.
- The analysis engine can also be described as an analysis algorithm executed by the data mining apparatus.
- Examples include analysis engines that perform processing such as regression analysis, factor analysis, covariance structure analysis, principal factor analysis, discriminant analysis, kernel analysis, cluster analysis, or anomaly detection. “Specifying the type of analysis engine” means accepting a specification of one of these types of analysis engine.
- the “analysis engine” may refer to, for example, a main body (for example, an apparatus) that performs the above-described analysis processing, or a program that controls the processor to execute the analysis processing.
- the constraint condition is a requirement to be satisfied by information output from the analysis engine.
- the constraint condition is a requirement that the analysis result output from the analysis engine should satisfy.
- For example, when the type of analysis engine is single regression analysis, one specific example of the constraint condition is “the chi-square value is 0.9 or more”.
- In the following description, writing information to a storage device, sending the information to an external device, and presenting the information to the operator by screen display, sound, or the like are collectively described as “outputting information”.
- the first embodiment is a specific example of the present invention when single regression analysis is designated as the type of analysis engine.
- FIG. 1 is a block diagram illustrating an overview of an information processing system 1000 according to the first embodiment.
- the information processing system 1000 includes a function storage unit 110, an attribute generation unit 120, a test unit 130, and an output unit 140.
- the function storage unit 110 can store one or a plurality of functions.
- the function storage unit 110 stores at least one function that defines an operation (multinomial operation) that takes a plurality of operands.
- the function storage unit 110 may be mounted inside the information processing system 1000 or may be mounted on an external device (not shown) that can be accessed by the information processing system 1000.
- Attribute generation unit 120 acquires a target data set.
- the attribute generation unit 120 may accept an input of a data set from an operator, or may read the data set from a storage unit (not shown).
- the attribute generation unit 120 may receive a data set from a device (not shown) provided outside the information processing system 1000.
- the attribute generation unit 120 acquires a function from the function storage unit 110.
- the attribute generation unit 120 applies the acquired function to the attributes included in the data set.
- the attribute generation unit 120 generates a new attribute that is a result of applying the function to the attribute.
- The attribute generation unit 120 acquires a function that defines a multi-operand (multinomial) operation.
- A function that defines a multi-operand operation takes at least two attributes as inputs.
- The attribute generation unit 120 selects, from the plurality of attribute data included in the data set, a combination of attribute data to serve as the inputs (operands) of the operation defined by the function.
- the attribute generation unit 120 generates a new attribute that is a result of applying the function by applying the function to the selected combination of attribute data.
- The test unit 130 acquires the specification of the type of analysis engine and the specification of constraint conditions from, for example, an operator.
- the test unit 130 acquires “single regression analysis” as the type of analysis engine. In addition, the test unit 130 acquires designation of an attribute that is an objective variable that is a target predicted by the function among a plurality of attributes included in the data set.
- the test unit 130 inputs a new attribute generated by the attribute generation unit 120 to the single regression analysis engine (not shown) as an explanatory variable.
- the test unit 130 acquires a regression equation output from the single regression analysis engine.
- the test unit 130 determines whether the regression equation satisfies a constraint condition.
- the output unit 140 outputs, for example, a regression equation that satisfies the requirements.
- FIG. 2 is a diagram for explaining an example of a data set input to the information processing system 1000 shown in FIG.
- The data set includes, for example, information in which an identifier (ID), a height value, a weight value, an abdominal circumference value, and an annual beer consumption value are associated with one another. “Height”, “weight”, “waist circumference”, and “annual consumption of beer” shown in FIG. 2 each correspond to an “attribute”.
- the data set shown in FIG. 2 is a data set prepared for explanation, and is not a measurement value obtained from a subject.
- FIG. 3 is a diagram illustrating an example of data stored in the function storage unit 110 illustrated in FIG. As illustrated in FIG. 3, the function storage unit 110 stores a plurality of functions.
- The process defined by the function whose function ID (identifier) is “function 1” is X, where X represents the identity map.
- the process defined by the function whose function ID is “function 2” is a process of calculating the product value of the value of the first attribute and the value of the second attribute.
- a function is represented by the function ID of the function. For example, “function 2” represents a function whose function ID is “function 2”.
- the operator 900 inputs a data set to the attribute generation unit 120.
- a plurality of attributes are included in the data set.
- the operator 900 may further input designation of an attribute that is a target variable to the attribute generation unit 120.
- the attribute generation unit 120 acquires a target data set.
- the attribute generation unit 120 may further acquire designation of an attribute that is a target variable.
- the attribute generation unit 120 may read the data set from a storage device (not shown).
- the attribute generation unit 120 may receive a data set from an apparatus (not shown) that is not included in the information processing system 1000 and that can communicate with the information processing system 1000.
- the attribute generation unit 120 acquires designation of an attribute “annual consumption of beer” as an attribute that is a target variable.
- the attribute generation unit 120 reads the function 2 (that is, the product value calculation) from the function storage unit 110.
- The attribute generation unit 120 selects the attributes to be input to the function from the attributes other than the objective variable (that is, “height”, “weight”, and “abdominal circumference”) among the plurality of attributes included in the data set.
- In the following, the attributes selected as inputs to the function are denoted “n” and “m”.
- The attribute generation unit 120 executes the following operations (1) and (2) for each of the selected attribute combinations (in this case, three combinations).
- (1) The attribute generation unit 120 inputs the selected combination of attributes to function 2 as operands.
- (2) The attribute generation unit 120 obtains the result of applying function 2 to the selected combination of attributes and sets the result as a new attribute.
- As a result, the attribute generation unit 120 newly generates the following three attributes: height × weight, height × waist circumference, and abdominal circumference × weight.
- Note that the attribute generation unit 120 is not necessarily required to generate all three of the new attributes described above.
- FIG. 4 is a diagram for explaining one specific example of a newly generated attribute.
- The attribute “height × waist circumference” illustrated in FIG. 4 is a new attribute generated as a result of the attribute generation unit 120 applying function 2 to the combination of the attribute “height” and the attribute “abdominal circumference”.
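The attribute generation described above (choosing pairs of attributes other than the objective variable and applying the product function to each pair) might look like the following sketch. The function name `generate_attributes`, the dictionary-of-lists data layout, and all numeric values are assumptions for illustration; they are not the values of FIG. 2.

```python
from itertools import combinations


def generate_attributes(dataset, target, fn, symbol="*"):
    """Apply the two-operand function fn to every pair of non-target attributes."""
    operands = [name for name in dataset if name != target]
    generated = {}
    for n, m in combinations(operands, 2):
        # Row-wise application of fn yields one new attribute per pair (n, m).
        generated[f"{n} {symbol} {m}"] = [
            fn(x, y) for x, y in zip(dataset[n], dataset[m])
        ]
    return generated


# Placeholder values, invented for this sketch.
dataset = {
    "height": [170.0, 160.0],
    "weight": [65.0, 50.0],
    "waist circumference": [80.0, 70.0],
    "annual consumption of beer": [100.0, 40.0],
}
new_attributes = generate_attributes(
    dataset, "annual consumption of beer", lambda x, y: x * y
)
```

With three candidate attributes, the sketch produces three new attributes, one per unordered pair.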
- Next, the test unit 130 will be described in detail with reference to FIGS. 1, 5, 6, and 7.
- The following description is only one specific example of the operation of the test unit 130, and the operation of the test unit 130 is not to be interpreted as limited to it.
- In the following, it is assumed that the test unit 130 acquires “single regression analysis” as the type of the analysis engine, “annual consumption of beer” as the attribute that is the objective variable, and the condition “the chi-square value is 0.9 or more” as the constraint condition.
- Y is an objective variable.
- X is an explanatory variable.
- a and b are constants.
- the test unit 130 analyzes how much the attribute (explanatory variable) output from the attribute generation unit 120 can explain the annual consumption (objective variable) of beer.
- The test unit 130 acquires the attributes (“height”, “weight”, and “waist circumference”) from the attribute generation unit 120. Further, the test unit 130 acquires the attributes (height × weight, height × waist circumference, and abdominal circumference × weight) generated by the attribute generation unit 120.
- The test unit 130 selects one attribute from the plurality of acquired attributes. For example, it is assumed that the test unit 130 selects the attribute “height”.
- For each acquired attribute, the test unit 130 executes a process of inputting the attribute to the analysis engine (in the above example, the single regression analysis engine) and obtaining the analysis result output by the analysis engine (that is, the regression equation and the chi-square value), and a process of determining whether the analysis result (that is, the chi-square value) satisfies the constraint condition.
- FIG. 7 is a diagram for explaining the results of the processing performed by the test unit 130 for the six types of attributes acquired by the test unit 130. As shown in FIG. 7, the only explanatory variable that satisfies the constraint condition “chi-square value is 0.9 or more” is “height × waist circumference”.
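The per-attribute test can be sketched with an ordinary least-squares fit. This is a sketch under stated assumptions: the regression equation is taken to have the usual form Y = aX + b, and the coefficient of determination (R²) is used as a stand-in goodness-of-fit score, because the source does not define how its “chi-square value” is computed. `fit_simple_regression` and `satisfies_constraint` are hypothetical names.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("explanatory or objective variable is constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    # R^2 is a stand-in score, not the "chi-square value" of the source.
    return a, b, 1.0 - ss_res / ss_tot


def satisfies_constraint(score, threshold=0.9):
    """Constraint check of the form 'score is threshold or more'."""
    return score >= threshold


# Synthetic data lying exactly on y = 2x + 1, used only to exercise the code.
a, b, r2 = fit_simple_regression([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

In the flow described above, this fit would be repeated once per candidate explanatory variable, keeping only those whose score passes the constraint check.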
- the output unit 140 outputs, for example, a regression equation that satisfies the requirements.
- The output unit 140 may operate as described below. For example, suppose that the analysis result obtained by inputting attribute A, defined below, satisfies the constraint condition. Attribute A: the product of the value of attribute B and the value of attribute C.
- attribute B is a height value
- attribute C is a weight value
- the output unit 140 may output information that “a preprocessing for calculating the product of the attribute value of height and the attribute value of weight should be executed”.
- Alternatively, the output unit 140 may output information that “an analysis result that satisfies the constraint condition is obtained when the attribute ‘the product of the value of the height attribute and the value of the weight attribute’ is input to the designated analysis engine”.
- the output unit 140 may output information “product of the attribute value of height and the attribute value of weight”.
- the output unit 140 may output the information together with the type of the designated analysis engine and the file name of the data set.
- FIG. 8 is a flowchart for explaining the operation of the information processing system 1000 according to the first embodiment.
- the attribute generation unit 120 acquires one function from the function storage unit 110 (step S101).
- the attribute generation unit 120 selects a combination of attributes that are operands in the operation defined by the function from among a plurality of attributes included in the data set (step S102).
- the attribute generation unit 120 inputs the selected combination of attributes to the function, and calculates a value output according to the function as a new attribute (step S103).
- the operation shown in step S103 is to apply a function to the selected combination of attributes and generate a new attribute that is a result of applying the function to the selected combination of attributes.
- the attribute generation unit 120 generates a new attribute for all combinations of attributes that can be operands in the function (step S104).
- the test unit 130 selects a specific attribute from a plurality of new attributes (step S105).
- the test unit 130 analyzes how much the specified objective variable can be explained based on a specific attribute (explanatory variable). As a result, the test unit 130 obtains an analysis result (that is, a regression equation and a chi-square value) (step S106).
- The test unit 130 repeats the operation shown in step S106 for all the attributes generated by the attribute generation unit 120 (step S107).
- The test unit 130 verifies whether an analysis result satisfying the constraint condition has been obtained (step S108). Note that the operation shown in step S108 may be executed within the repetition from step S105 to step S107.
- When an analysis result satisfying the constraint condition is obtained (YES in step S108), the output unit 140 outputs the analysis result satisfying the constraint condition (step S109). When no analysis result satisfying the constraint condition is obtained (NO in step S108), the output unit 140 does not output such an analysis result.
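The flow of steps S101 to S109 can be summarized in a short sketch in which the analysis engine and the constraint condition are passed in as callables. Everything here is illustrative: `run_pipeline` is a hypothetical name, and the toy engine below (mean absolute difference from the objective variable) is a stand-in chosen to keep the example small, not the single regression engine of the embodiment.

```python
from itertools import combinations


def run_pipeline(dataset, target, fn, engine, constraint):
    """Generate pairwise attributes, analyze each one, and keep the passing results."""
    # S101: fn is the function acquired from the function storage unit.
    operands = [name for name in dataset if name != target]
    # S102-S104: apply fn to every combination of operand attributes.
    generated = {
        f"{n} * {m}": [fn(x, y) for x, y in zip(dataset[n], dataset[m])]
        for n, m in combinations(operands, 2)
    }
    # S105-S107: input each new attribute to the analysis engine.
    results = {
        name: engine(values, dataset[target]) for name, values in generated.items()
    }
    # S108-S109: output only the results that satisfy the constraint condition.
    return {name: result for name, result in results.items() if constraint(result)}


# Toy data in which y equals a * c exactly (invented for this sketch).
data = {"a": [1, 2, 3], "b": [2, 2, 2], "c": [3, 1, 2], "y": [3, 2, 6]}


def toy_engine(xs, ys):
    """Stand-in engine: mean absolute difference between attribute and target."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


accepted = run_pipeline(data, "y", lambda x, y: x * y, toy_engine, lambda e: e == 0)
```

Passing the engine and the constraint as parameters mirrors the way the operator specifies the analysis engine type and the constraint condition separately from the data set.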
- the reason is that the attribute generation unit 120 according to the first embodiment calculates a function for the attribute and generates a new attribute.
- the information processing system 1000 can “increase the number of attributes that are candidates for explanatory variables”. In other words, it can be said that “the number of attribute candidates for verifying the hypothesis can be increased”. With such an action, there is an increased possibility that an explanatory variable that sufficiently explains the objective variable will be selected, and the effect of improving the accuracy of data mining is realized.
- For example, the information processing system 1000 creates new attributes (that is, height × weight, weight × abdominal circumference, and height × abdominal circumference) based on the three types of attributes included in the data set and the function stored in the function storage unit 110.
- By increasing the number of attributes that are candidates for explanatory variables, the information processing system 1000 increases the possibility of selecting an attribute that sufficiently explains the objective variable, and can thus improve the accuracy of data mining.
- the information processing system 1000 according to the first embodiment can output a preprocessing procedure to be performed on the attribute in order to improve the accuracy of data mining.
- the reason is that, when the output unit 140 according to the first embodiment obtains an analysis result that satisfies the constraint conditions, the attribute input to the analysis engine to obtain the analysis result is output.
- the output unit 140 outputs information indicating what processing should be performed on the attributes included in the data set in order to obtain an analysis result that satisfies the constraint conditions.
- the information processing system 1000 according to the first embodiment can reduce the man-hours of an analysis engineer who performs data analysis.
- the reason is that the attribute generation unit 120 of the information processing system 1000 according to the first embodiment generates a new attribute based on a plurality of attributes.
- The test unit 130 of the information processing system 1000 selects an attribute that satisfies a predetermined criterion from the generated new attributes. That is, for example, the test unit 130 inputs the generated new attribute to an analysis engine that performs analysis processing based on the input attribute. The test unit 130 then selects the attribute that was input to the analysis engine when the information output from the analysis engine satisfies the predetermined requirement (that is, the constraint condition).
- The predetermined requirement is, for example, that the correlation with the objective variable is higher than a predetermined criterion. That is, if an analysis engineer inputs a plurality of attributes to the information processing system 1000, the information processing system 1000 can automatically or semi-automatically generate attributes having a high correlation with the objective variable.
- For example, the analysis engineer can obtain such an attribute even without knowing in advance that “the annual consumption of beer by an individual” is strongly correlated with “the product of the height value and the waist circumference value”. The reason is that the information processing system 1000 generates the new attribute “the product of the height value and the abdominal circumference value” based on the attribute “height” and the attribute “abdominal circumference”. In other words, if the analysis engineer inputs the attribute “height” and the attribute “abdominal circumference” to the information processing system 1000, the information processing system 1000 automatically or semi-automatically generates for the user the attribute “the product of the height value and the waist circumference value”, which has a high correlation with the objective variable.
- As a result, an analysis engineer who performs data analysis can find that there is a strong correlation between the objective variable and the newly generated attribute.
- For example, an analysis engineer who performs data analysis can find that there is a strong correlation between “the annual consumption of beer by an individual” and “the product of the height value and the waist circumference value”.
- the output unit 140 outputs a newly generated attribute and information indicating that an analysis result that satisfies the constraint condition is obtained by inputting the attribute.
- For example, the output unit 140 outputs information that “when ‘the product of the attribute value of height and the attribute value of weight’ is input to the designated analysis engine, an analysis result that satisfies the constraint condition is obtained”.
- the information processing system 1000 can be used for the purpose of assisting the analysis engineer to find an explanatory variable having a strong correlation with the objective variable.
- Z is an objective variable.
- X is a first explanatory variable.
- Y is a second explanatory variable.
- a, b, and c are constants.
- the test unit 130 repeats the operation of step S106 shown in FIG. 8 for 15 combinations of explanatory variables.
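The figure of 15 is consistent with choosing two explanatory variables from the six attributes of the first embodiment, since C(6, 2) = 15. The sketch below checks this count; it assumes that unordered pairs of distinct attributes are meant, which the text does not state explicitly, and the attribute labels follow the first embodiment.

```python
from itertools import combinations

# The six candidate explanatory variables of the first embodiment.
attributes = [
    "height",
    "weight",
    "waist circumference",
    "height * weight",
    "height * waist circumference",
    "waist circumference * weight",
]

# Every unordered pair (X, Y) of distinct attributes is one candidate
# combination of first and second explanatory variables.
pairs = list(combinations(attributes, 2))
pair_count = len(pairs)  # 6 * 5 / 2 = 15
```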
- The test unit 130 may accept curve regression analysis as the type of analysis engine.
- the test unit 130 accepts designation of the type of curve, for example, an exponential function or a Gaussian function.
- the second embodiment is a specific example of the present invention when discriminant analysis is designated as the type of analysis engine.
- FIG. 9 is a block diagram showing the configuration of the information processing system 1001 according to the second embodiment. As illustrated in FIG. 9, the information processing system 1001 according to the second embodiment may include the following configuration.
- A function storage unit 111 is provided instead of the function storage unit 110 according to the first embodiment.
- An attribute generation unit 121 is provided instead of the attribute generation unit 120.
- A test unit 131 is provided instead of the test unit 130.
- the first embodiment and the second embodiment differ in the data set to be handled and the type of analysis engine to be specified.
- FIG. 10 is a diagram for explaining an example of a data set input to the information processing system 1001 shown in FIG.
- the data set shown in FIG. 10 can be paraphrased as multivariate data.
- the data set includes information for associating attribute 1 to attribute 4 with each of a plurality of identifiers.
- The data set shown in FIG. 10 is, for example, data representing questionnaire responses from a plurality of people.
- Each attribute is an answer to a question item included in the questionnaire.
- the contents of attribute 1 to attribute 4 are shown below. Specifically, the question item and the value represented by the answer are shown for each attribute.
- Attribute 1: Which do you like, dogs or cats? (dog is represented as 0, cat as 1). Attribute 2: What is your age? (40 or older is represented as 0, under 40 as 1). Attribute 3: What is your gender? (male is represented as 0, female as 1). Attribute 4: Which do you like, sushi or tempura? (sushi is represented as 0, tempura as 1).
- FIG. 11 is a diagram illustrating an example of information stored in the function storage unit 111 illustrated in FIG. As shown in FIG. 11, the function storage unit 111 stores functions 1 to 4.
- Function 1 defines the identity map X.
- Function 2 defines a logical product (AND) operation on the values of two attributes.
- Function 3 defines a logical sum (OR) operation on the values of two attributes.
- Function 4 defines an exclusive OR (XOR) for the values of the two attributes.
- FIG. 12 is a diagram depicting one specific example relating to a new attribute generated by the attribute generation unit 121.
- the attribute generation unit 121 selects one function from a plurality of functions stored in the function storage unit 111.
- the attribute generation unit 121 selects a combination of attributes from a plurality of attributes included in the input data set. For example, it is assumed that the attribute generation unit 121 selects “logical sum (OR)” as a function, and additionally selects attribute 1 and attribute 2 as attributes.
- FIG. 12 shows the new attribute generated by the attribute generation unit 121 as a result.
- the attribute generation unit 121 generates new attributes for all combinations that are operands for the function among a plurality of attribute combinations included in the data set, for example.
- the attribute generation unit 121 does not necessarily have to generate new attributes for all combinations.
- “Discriminant analysis” is designated to the test unit 131 as the type of analysis engine. Furthermore, it is assumed that attribute 4 (that is, “Which do you like, sushi or tempura?”) is designated as the objective variable.
- The test unit 131 receives the condition “the match rate is 95% or more” as a constraint condition (that is, a requirement that information output from the analysis engine should satisfy).
- The “match rate” is an index indicating how closely the value of the selected attribute matches the value of the attribute designated as the prediction target.
- Based on the new attributes generated by the attribute generation unit 121, the test unit 131 analyzes whether they can sufficiently explain “Which do you like, sushi or tempura?”.
- the test unit 131 acquires a new attribute generated by the attribute generation unit 121.
- the test unit 131 selects one attribute from the acquired plurality of attributes. For example, it is assumed that the test unit 131 selects the attribute “attribute 3”.
- The test unit 131 calculates the match rate between the value of the selected attribute and the value of the attribute designated as the prediction target.
- The number of persons for which the match rate is calculated may be specified in advance.
- The test unit 131 calculates the match rate with the value of the objective variable “Which do you like, sushi or tempura?” for all the acquired attributes.
- FIG. 13 is a diagram for explaining a result of processing performed by the test unit 131 for the attribute generated by the attribute generation unit 121.
- The match rate between the value obtained by performing exclusive OR (XOR) on attribute 1 and attribute 3 and the value of attribute 4 is 100%, which satisfies the constraint condition. This means that the preference between “sushi” and “tempura” can be explained based on the value of the exclusive OR (XOR) of “attribute 1” and “attribute 3” in the questionnaire result.
- the reason is that the attribute generation unit 121 according to the second embodiment applies a function to the attribute to generate a new attribute.
- the information processing system 1000 has an effect of “increasing the number of attributes that are candidates for explanatory variables”. This can be paraphrased as “increasing the number of attribute candidates for verifying the hypothesis”. With such an action, there is an increased possibility that an explanatory variable that sufficiently explains the objective variable will be selected, and the effect of improving the accuracy of data mining is realized.
- the information processing system 1001 according to the second embodiment can output a preprocessing procedure to be performed on the attribute in order to improve the accuracy of data mining. This is because the output unit 140 according to the second embodiment outputs the attribute input to the analysis engine in order to obtain the analysis result when the analysis result satisfying the constraint condition is obtained. Alternatively, the output unit 140 outputs information indicating what processing should be performed on the attributes included in the data set in order to obtain an analysis result that satisfies the constraint conditions.
- FIG. 14 is a block diagram illustrating a configuration of an information processing system 1002 according to the third embodiment. As illustrated in FIG. 14, the information processing system 1002 includes an attribute generation unit 122 and a test unit 132.
- With respect to a function that defines an operation taking a plurality of operands, the attribute generation unit 122 selects, from the plurality of input attributes, a combination of attributes to serve as the plurality of operands. By applying the function to that combination, it generates a new attribute that is the result of applying the function to the attribute combination.
- The test unit 132 inputs the new attribute to an analysis engine that executes analysis processing based on attributes, and determines whether information output from the analysis engine satisfies a predetermined requirement.
- According to the third embodiment, it is possible to provide the information processing system 1002 that contributes to improving the accuracy of analysis processing.
- FIG. 15 is a diagram illustrating a hardware configuration of a computer capable of realizing the information processing system 1000 according to the first embodiment.
- the computer shown in FIG. 15 includes a CPU (Central Processing Unit) 1, a memory 2, a storage device 3, and a communication interface (I / F) 4.
- the computer shown in FIG. 15 may further include an input device 5 or an output device 6.
- the functions of the information processing system 1000 are realized by, for example, the CPU 1 executing a computer program (software program, hereinafter simply referred to as “program”) read into the memory 2. In execution, the CPU 1 appropriately controls the communication interface 4, the input device 5, and the output device 6.
- The present invention described using the above embodiments as examples is also constituted by a non-volatile storage medium 8, such as a compact disc, in which such a program is stored.
- the program stored in the storage medium 8 is read by the drive device 7, for example.
- the communication executed by the information processing system 1000 is realized by the application program controlling the communication interface 4 using, for example, a function provided by an OS (Operating System).
- the input device 5 is, for example, a keyboard, a mouse, or a touch panel.
- the output device 6 is a display, for example.
- the information processing system 1000 may be configured such that two or more physically separated devices are connected to be communicable by wire, wireless, or a combination thereof.
- the example of the hardware configuration shown in FIG. 15 is applicable to the other embodiments described above.
- the information processing system according to each embodiment of the present invention may be a dedicated device. Note that the hardware configuration of the information processing system and each functional block according to each embodiment of the present invention is not limited to the above-described configuration.
- the analysis engine that executes the analysis processing is not necessarily installed in the same apparatus as the information processing system 1000.
- the analysis engine only needs to be mounted on a device that can be accessed from the information processing system 1000.
- the above-described modified examples can be applied to other embodiments.
- the present invention has been described by taking as an example the case where single regression analysis, multiple regression analysis, and discriminant analysis are designated as the types of analysis engines.
- the present invention is not limited to the above-described embodiments, and can be implemented in various modes.
- the present invention can also be applied to data mining using an analysis engine other than the types exemplified in the above embodiments.
- each block diagram is a configuration shown for convenience of explanation.
- the present invention described by taking each embodiment as an example is not limited to the configuration shown in each block diagram in the implementation.
- the present invention described using the above-described embodiment as an example can be used for a tool that supports data mining, for example.
Abstract
Description
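A “data set” is data input to the information processing system 1000. A “data set” includes one or more attributes. An “attribute” can also be called a “variate”.
A “function” defines the processing that constructs a new attribute from an existing attribute. A “function” is applied to an attribute included in the data set. That is, when a “function” is applied to an attribute, the processing defined by the function is executed on that attribute, and a new attribute is generated as a result.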
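The relationship between a function and the attributes it is applied to can be illustrated with a minimal Python sketch. The representation of a data set as a dictionary of columns, the helper name `apply_function`, and the sample values are assumptions made for illustration only; the source does not prescribe any of them.

```python
from typing import Callable, Dict, List, Sequence

# Assumed representation: attribute name -> column of values (one per record).
Dataset = Dict[str, List[float]]


def apply_function(func: Callable[..., float],
                   dataset: Dataset,
                   operands: Sequence[str]) -> List[float]:
    """Apply `func` record by record to the named operand attributes."""
    columns = [dataset[name] for name in operands]
    return [func(*values) for values in zip(*columns)]


# Hypothetical sample values, not taken from the source.
dataset = {"height": [170.0, 160.0], "waist": [80.0, 70.0]}

# A function defining the product of two attribute values yields a new attribute.
new_attribute = apply_function(lambda x, y: x * y, dataset, ["height", "waist"])
```

Keeping the function separate from the choice of operands mirrors the description: the same stored function can be applied to any attribute combination with a matching number of operands.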
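An “analysis engine” is analysis processing based on attributes. That is, the analysis engine accepts attributes as input, executes an analysis based on those attributes, and outputs the analysis result. An analysis engine is also called, for example, an analysis algorithm executed by a data mining apparatus. The analysis engine is, for example, an analysis engine that executes processing such as regression analysis, factor analysis, covariance structure analysis, principal component analysis, discriminant analysis, kernel analysis, cluster analysis, or anomaly detection. “Designation of the type of analysis engine” means accepting a designation of such a type of analysis engine. The term “analysis engine” may also refer to, for example, an entity (such as an apparatus) that executes the above analysis processing, or a program that controls a processor to execute the analysis processing.
A constraint condition is a requirement that the information output by the analysis engine should satisfy. In other words, a constraint condition is a requirement that the analysis result output by the analysis engine should satisfy. When the type of analysis engine is single regression analysis, one specific example of a constraint condition is “the chi-square value is 0.9 or more”.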
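One way to picture a constraint condition is as a predicate over the analysis result. The following Python sketch assumes that an analysis result is a dictionary of named metrics and that a constraint is a lower bound on one metric; the class name `Constraint`, the key `"chi_square"`, and the sample results are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Constraint:
    """A requirement of the form 'metric is at least minimum' (assumed shape)."""
    metric: str
    minimum: float

    def is_satisfied_by(self, analysis_result: Dict[str, float]) -> bool:
        # A result that does not report the metric cannot satisfy the constraint.
        value = analysis_result.get(self.metric)
        return value is not None and value >= self.minimum


# The example constraint from the text: "the chi-square value is 0.9 or more".
constraint = Constraint(metric="chi_square", minimum=0.9)

# Hypothetical analysis results, not taken from the source.
passing = constraint.is_satisfied_by({"chi_square": 0.93})
failing = constraint.is_satisfied_by({"chi_square": 0.42})
```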
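Hereinafter, reading information from a storage device, receiving information from an external device, accepting input of information from an operator, and the like are collectively described as “acquiring information”.
Hereinafter, writing information to a storage device, transmitting information to an external device, presenting information to an operator in a form such as a screen display or sound, and the like are collectively described as “outputting information”.
The first embodiment is one specific example of the present invention in which single regression analysis is designated as the type of analysis engine.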
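Height and weight,
Height and waist circumference,
Weight and waist circumference.
・Height × waist circumference,
・Waist circumference × weight.
Attribute A: the value of the product of the value of attribute B and the value of attribute C.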
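Generating product attributes for every pair of input attributes, as in the height, weight, and waist circumference combinations listed above, could look like the following Python sketch. The function name `generate_product_attributes`, the naming scheme for new attributes, and the sample values are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List


def generate_product_attributes(dataset: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Create one new attribute per unordered pair of attributes: their product."""
    new_attributes = {}
    # Iterating a dict yields its keys in insertion order.
    for left, right in combinations(dataset, 2):
        name = f"{left}*{right}"
        new_attributes[name] = [a * b for a, b in zip(dataset[left], dataset[right])]
    return new_attributes


# Hypothetical sample values, not taken from the source.
dataset = {
    "height": [170.0, 160.0],
    "weight": [65.0, 50.0],
    "waist": [80.0, 70.0],
}
products = generate_product_attributes(dataset)
```

With three input attributes, this produces three product attributes, one for each of the pairs named in the text.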
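The test unit 130 may accept designation of multiple regression analysis as the type of analysis engine. For example, assume that the test unit 130 accepts designation of the multiple regression analysis Z = aX + bY + c. Here, Z is the objective variable, X is the first explanatory variable, and Y is the second explanatory variable. a, b, and c are constants.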
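The source does not state how the constants a, b, and c are estimated. As a sketch under the assumption that ordinary least squares is used, the following pure-Python code fits Z = aX + bY + c by solving the normal equations; the function names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(matrix: Sequence[Sequence[float]],
                        vector: Sequence[float]) -> List[float]:
    """Solve matrix @ x = vector by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [list(row) + [rhs] for row, rhs in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple_regression(x: Sequence[float],
                            y: Sequence[float],
                            z: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares estimate of (a, b, c) in Z = aX + bY + c."""
    rows = [(xi, yi, 1.0) for xi, yi in zip(x, y)]
    # Normal equations: (A^T A) beta = A^T z, where each row of A is (x, y, 1).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atz = [sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(3)]
    a, b, c = solve_linear_system(ata, atz)
    return a, b, c


# Hypothetical data constructed so that z = 2x + 3y + 1 holds exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
z = [9.0, 8.0, 19.0, 18.0]
a, b, c = fit_multiple_regression(x, y, z)
```

In practice a numerical library would normally replace the hand-written solver; it is written out here only to keep the sketch free of external dependencies.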
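The second embodiment is one specific example of the present invention in which discriminant analysis is designated as the type of analysis engine.
Attribute 2: What is your age? (40 or older is represented as 0, under 40 as 1),
Attribute 3: What is your gender? (male is represented as 0, female as 1),
Attribute 4: Which do you like, sushi or tempura? (sushi is represented as 0, tempura as 1).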
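The second embodiment's procedure, applying the logical functions (AND, OR, XOR) to pairs of binary attributes and checking the match rate against the objective attribute, can be sketched in Python as follows. The questionnaire answers below are hypothetical values constructed so that attribute 4 equals the XOR of attribute 1 and attribute 3; they are not the data of FIG. 10. The function names are also assumptions, and the unary identity function is omitted for brevity.

```python
from itertools import combinations
from operator import and_, or_, xor
from typing import Dict, List, Tuple

# The two-operand logical functions held by the function storage unit.
BINARY_FUNCTIONS = {"AND": and_, "OR": or_, "XOR": xor}


def match_rate(candidate: List[int], target: List[int]) -> float:
    """Fraction of records whose candidate value equals the target value."""
    matches = sum(1 for c, t in zip(candidate, target) if c == t)
    return matches / len(target)


def find_explanatory_attributes(attributes: Dict[str, List[int]],
                                target: List[int],
                                threshold: float = 0.95
                                ) -> List[Tuple[str, str, str, float]]:
    """Return (function, operand, operand, rate) for every new attribute
    whose match rate with the target meets the threshold."""
    hits = []
    for (name_a, col_a), (name_b, col_b) in combinations(attributes.items(), 2):
        for func_name, func in BINARY_FUNCTIONS.items():
            new_attribute = [func(a, b) for a, b in zip(col_a, col_b)]
            rate = match_rate(new_attribute, target)
            if rate >= threshold:
                hits.append((func_name, name_a, name_b, rate))
    return hits


# Hypothetical answers for six respondents (assumed, for illustration only).
attributes = {
    "attribute1": [0, 1, 0, 1, 1, 0],
    "attribute2": [0, 0, 1, 1, 0, 1],
    "attribute3": [0, 0, 1, 1, 1, 0],
}
attribute4 = [0, 1, 1, 0, 0, 0]  # objective variable

hits = find_explanatory_attributes(attributes, attribute4)
```

Returning the function name together with its operands corresponds to the output described for the output unit 140: the preprocessing that produced an attribute satisfying the constraint.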
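FIG. 14 is a block diagram illustrating the configuration of the information processing system 1002 according to the third embodiment. As shown in FIG. 14, the information processing system 1002 includes an attribute generation unit 122 and a test unit 132.
FIG. 15 is a diagram showing the hardware configuration of a computer that can realize the information processing system 1000 according to the first embodiment. The computer shown in FIG. 15 includes a CPU (Central Processing Unit) 1, a memory 2, a storage device 3, and a communication interface (I/F) 4. The computer shown in FIG. 15 may further include an input device 5 or an output device 6. The functions of the information processing system 1000 are realized by, for example, the CPU 1 executing a computer program (a software program, hereinafter simply referred to as a “program”) read into the memory 2. During execution, the CPU 1 controls the communication interface 4, the input device 5, and the output device 6 as appropriate.
The analysis engine that executes the analysis processing does not necessarily have to be implemented in the same apparatus as the information processing system 1000. The analysis engine only needs to be implemented in an apparatus that can be accessed from the information processing system 1000. The modifications described above are also applicable to the other embodiments.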
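2 Memory
3 Storage device
4 Communication interface
5 Input device
6 Output device
7 Drive device
8 Storage medium
110 Function storage unit
111 Function storage unit
120 Attribute generation unit
121 Attribute generation unit
122 Attribute generation unit
130 Test unit
131 Test unit
132 Test unit
140 Output unit
900 Operator
1000 Information processing system
1001 Information processing system
1002 Information processing system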
Claims (10)
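- An information processing system comprising:
attribute generation means for, with respect to a function defining an operation that takes a plurality of operands, selecting, from among a plurality of input attributes, a combination of attributes to serve as the plurality of operands, and applying the function to the combination of attributes, thereby generating a new attribute that is the result of applying the function to the combination of attributes; and
test means for inputting the new attribute to an analysis engine that executes analysis processing based on attributes, and determining whether information output by the analysis engine satisfies a predetermined requirement.
- The information processing system according to claim 1, wherein the test means accepts a selection of an analysis engine, accepts input of a requirement to be satisfied by information output by the analysis engine, and inputs the new attribute to the selected analysis engine.
- The information processing system according to claim 1 or 2, wherein the attribute generation means selects a plurality of the combinations of attributes from the plurality of attributes, and
executes processing of generating a plurality of the new attributes by applying the function to each of the plurality of combinations of attributes, and
the test means executes, for each of the plurality of new attributes:
processing of inputting a specific one of the plurality of new attributes to the selected analysis engine;
processing of acquiring the information output by the analysis engine; and
processing of determining whether the acquired information satisfies the requirement.
- The information processing system according to any one of claims 1 to 3, further comprising first output means for outputting, among the information output by the analysis engine, information that satisfies the requirement.
- The information processing system according to any one of claims 1 to 3, further comprising second output means for outputting, when the information output by the analysis engine satisfies the requirement, either the attribute that was input to the analysis engine to obtain that information, or the function applied by the attribute generation means to generate that attribute together with the combination of attributes to which the function was applied.
- The information processing system according to any one of claims 1 to 5, wherein the function defines a binary operation.
- The information processing system according to any one of claims 1 to 6, wherein the function defines an arithmetic operation or a logical operation on the attributes.
- The information processing system according to any one of claims 1 to 7, wherein, when regression analysis is selected as the analysis engine, the test means further accepts designation of one of the attributes as an objective variable and accepts designation of the number of explanatory variables as the requirement.
- An information processing method in which a computer having access to function storage means that stores a function defining an operation that takes a plurality of operands:
acquires the function from the function storage means, selects, from among a plurality of input attributes, a combination of attributes to serve as the plurality of operands, and applies the function to the combination of attributes, thereby generating a new attribute that is the result of applying the function to the combination of attributes; and
inputs the new attribute to an analysis engine that executes analysis processing based on attributes, and determines whether information output by the analysis engine satisfies a predetermined requirement.
- A computer-readable recording medium storing a program that causes a computer having access to function storage means that stores a function defining an operation that takes a plurality of operands to execute:
processing of acquiring the function from the function storage means;
processing of selecting, from among a plurality of input attributes, a combination of attributes to serve as the plurality of operands, and applying the function to the combination of attributes, thereby generating a new attribute that is the result of applying the function to the combination of attributes; and
processing of inputting the new attribute to an analysis engine that executes analysis processing based on attributes, and determining whether information output by the analysis engine satisfies a predetermined requirement.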
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015538885A JP6662637B2 (ja) | 2013-09-27 | 2014-09-11 | Information processing system, information processing method, and recording medium storing a program |
US15/024,802 US20160232213A1 (en) | 2013-09-27 | 2014-09-11 | Information Processing System, Information Processing Method, and Recording Medium with Program Stored Thereon |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361883672P | 2013-09-27 | 2013-09-27 | |
US61/883,672 | 2013-09-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015045318A1 (ja) | 2015-04-02 |
Family
ID=52742491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/004706 WO2015045318A1 (ja) | Information processing system, information processing method, and recording medium storing a program | 2013-09-27 | 2014-09-11 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20160232213A1 (ja) |
JP (1) | JP6662637B2 (ja) |
WO (1) | WO2015045318A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017090475A1 (ja) * | 2015-11-25 | 2017-06-01 | NEC Corporation | Information processing system, function creation method, and function creation program |
US11514062B2 (en) | 2017-10-05 | 2022-11-29 | Dotdata, Inc. | Feature value generation device, feature value generation method, and feature value generation program |
US11727203B2 (en) | 2017-03-30 | 2023-08-15 | Dotdata, Inc. | Information processing system, feature description method and feature description program |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7049210B2 (ja) * | 2018-08-07 | 2022-04-06 | Keyence Corporation | Data analysis device and data analysis method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
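JP2005063353A (ja) * | 2003-08-20 | 2005-03-10 | Nippon Telegr & Teleph Corp <Ntt> | Data analysis device for verifying the effectiveness of explanatory variables, program for causing a computer to execute the data analysis, and recording medium for the program |
JP2006048429A (ja) * | 2004-08-05 | 2006-02-16 | Nec Corp | Analysis engine exchange type system and data analysis program |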
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1680741B1 (en) * | 2003-11-04 | 2012-09-05 | Kimberly-Clark Worldwide, Inc. | Testing tool for complex component based software systems |
US7904279B2 (en) * | 2004-04-02 | 2011-03-08 | Test Advantage, Inc. | Methods and apparatus for data analysis |
US20080313208A1 (en) * | 2007-06-14 | 2008-12-18 | International Business Machines Corporation | Apparatus, system, and method for automated context-sensitive message organization |
US20090112519A1 (en) * | 2007-10-31 | 2009-04-30 | United Technologies Corporation | Foreign object/domestic object damage assessment |
CN102792240B (zh) * | 2009-11-16 | 2016-06-01 | Nrg系统股份有限公司 | 用于基于条件的维护的数据获取系统 |
US8522083B1 (en) * | 2010-08-22 | 2013-08-27 | Panaya Ltd. | Method and system for semiautomatic execution of functioning test scenario |
- 2014-09-11 US US15/024,802 patent/US20160232213A1/en not_active Abandoned
- 2014-09-11 WO PCT/JP2014/004706 patent/WO2015045318A1/ja active Application Filing
- 2014-09-11 JP JP2015538885A patent/JP6662637B2/ja active Active
Non-Patent Citations (1)
Title |
---|
HIROSHI SASAKI ET AL.: "Analysis and Modeling of Multiprogramming Performance of Chip Multiprocessors focusing on Resource Contentions", IPSJ SIG NOTES, vol. 2007, no. 79, 3 August 2007 (2007-08-03), pages 85 - 90 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017090475A1 (ja) * | 2015-11-25 | 2017-06-01 | NEC Corporation | Information processing system, function creation method, and function creation program |
EP3382572A4 (en) * | 2015-11-25 | 2019-07-31 | Nec Corporation | INFORMATION PROCESSING SYSTEM, FUNCTION CREATING METHOD, AND PROGRAM |
US10885011B2 (en) | 2015-11-25 | 2021-01-05 | Dotdata, Inc. | Information processing system, descriptor creation method, and descriptor creation program |
US11727203B2 (en) | 2017-03-30 | 2023-08-15 | Dotdata, Inc. | Information processing system, feature description method and feature description program |
US11514062B2 (en) | 2017-10-05 | 2022-11-29 | Dotdata, Inc. | Feature value generation device, feature value generation method, and feature value generation program |
Also Published As
Publication number | Publication date |
---|---|
JP6662637B2 (ja) | 2020-03-11 |
US20160232213A1 (en) | 2016-08-11 |
JPWO2015045318A1 (ja) | 2017-03-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 14846913 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2015538885 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 15024802 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 14846913 Country of ref document: EP Kind code of ref document: A1 |