WO2024053370A1

WO2024053370A1 - Information processing device, information processing method, and program

Info

Publication number: WO2024053370A1
Application number: PCT/JP2023/029935
Authority: WO
Inventors: 健人中田; 智佳子浅井; 慎吾高松
Original assignee: ソニーグループ株式会社
Priority date: 2022-09-06
Filing date: 2023-08-21
Publication date: 2024-03-14

Abstract

The present disclosure relates to an information processing device, an information processing method, and a program which make it possible to efficiently find and extract, from time-series data, feature quantities effective in creating a machine learning model. The present disclosure involves: generating metadata from flow data including at least time-series data; estimating a method for generating feature quantities from series data constituting the flow data, on the basis of the generated metadata; and generating feature quantities from the series data using the estimated generation method. The present disclosure can be applied to technology for generating feature quantities necessary to train machine learning models.

Description

Information processing device, information processing method, and program

The present disclosure relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that make it possible to efficiently search for and extract, from time-series data, feature amounts effective in creating a machine learning model.

In the Internet of Things (IoT), data sets consisting of multiple time-series data are increasingly being accumulated.

On the other hand, building machine learning models and causal models using such data requires a high level of expertise, so there are expectations for tools that allow even people with limited expertise to build models.

Accordingly, a technique has been proposed that, when generating a machine learning model that predicts the occurrence of a target event from time-series data, keeps the amount of input data for machine learning from becoming enormous and determines the reference date of the time-series data for negative examples (see Patent Document 1).

JP 2021-189833 Publication

However, in the technique of Patent Document 1, problem settings are limited, and the task of generating feature amounts as preprocessing is extremely complicated.

The present disclosure has been made in view of this situation, and in particular, it is intended to enable efficient searching and extraction of feature quantities effective in creating a machine learning model from time-series data.

An information processing device and a program according to one aspect of the present disclosure include: a metadata generation unit that generates metadata of flow data including at least time-series data; an estimation unit that estimates, based on the metadata, a method for generating feature amounts from series data constituting the flow data; and a feature amount generation unit that generates feature amounts from the series data using the generation method estimated by the estimation unit.

An information processing method according to one aspect of the present disclosure includes the steps of: generating metadata of flow data including at least time-series data; estimating, based on the metadata, a method for generating feature amounts from series data constituting the flow data; and generating feature amounts from the series data using the estimated generation method.

In one aspect of the present disclosure, metadata of flow data including at least time-series data is generated, a method for generating feature amounts from the series data constituting the flow data is estimated based on the metadata, and feature amounts are generated from the series data using the estimated generation method.

FIG. 1 is a diagram illustrating flow data of the present disclosure.
FIG. 2 is a diagram illustrating examples of session units, time units, attribute data, and time-series data in flow data.
FIG. 3 is a hardware block diagram illustrating a configuration example of an information processing device according to the present disclosure.
FIG. 4 is a functional block diagram illustrating functions realized by the UI control unit, data processing unit, and machine learning model generation unit in FIG. 3.
FIG. 5 is a diagram illustrating a configuration example of attribute data and time-series data in flow data.
FIG. 6 is a diagram illustrating an example of a display image of a UI that prompts setting of a column for each session, a column for each time, and a prediction target column in flow data.
FIG. 7 is a diagram illustrating an example of a melt format as an output format.
FIG. 8 is a diagram illustrating an example of a pivot format as an output format.
FIG. 9 is a diagram illustrating an example of an output format when the ball speed column is set as the prediction target based on flow data related to a pitching log of a predetermined baseball batter.
FIG. 10 is a diagram illustrating an example of an output format when the result column is set as the prediction target based on flow data related to a pitching log of a predetermined baseball batter.
FIG. 11 is a diagram illustrating a method for generating feature amounts of time-series data.
FIG. 12 is a diagram illustrating an example of setting a window related to generation of feature amounts.
FIG. 13 is a diagram illustrating another example of setting a window related to generation of feature amounts.
FIG. 14 is a diagram illustrating selection of series data from which feature amounts are generated.
FIG. 15 is a diagram illustrating an example of generation of intra-session feature amount data.
FIG. 16 is a diagram illustrating an example of generation of inter-session feature amount data.
FIG. 17 is a diagram illustrating examples of a session ID, time unit, attribute data, time-series data, intra-session feature amount, inter-session feature amount, and session set ID in the flow data of the present disclosure.
FIG. 18 is a diagram illustrating an example of presentation of feature amount data.
FIG. 19 is a flowchart illustrating feature amount data generation processing.
FIG. 20 is a flowchart illustrating generation source selection processing.
FIG. 21 is a flowchart illustrating intra-session feature amount data generation processing.
FIG. 22 is a diagram illustrating a modification example of clustering sessions.
FIG. 23 is a diagram illustrating a modification example of clustering sessions.
FIG. 24 is a diagram illustrating a configuration example of a general-purpose computer.

Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Note that, in this specification and the drawings, components having substantially the same functional configurations are designated by the same reference numerals and redundant explanation will be omitted.

Hereinafter, a mode for implementing the present technology will be described. The explanation will be given in the following order.
1. Summary of this disclosure
2. Preferred embodiment
3. Example of execution by software

<<1. Summary of this disclosure >>
<Flow data>
In particular, the present disclosure makes it possible to efficiently search and extract feature amounts effective for creating a machine learning model from time-series data.

In this specification, a data set consisting of a plurality of time-series data is referred to as flow data, and a technique for efficiently searching and extracting feature quantities effective for creating a machine learning model from flow data will be described.

Therefore, first, the terms used in this specification will be defined.

Flow data is a data set that must include one or more time-series data and may optionally include one or more attribute data. That is, flow data always includes at least one piece of time-series data, whereas it may include no attribute data or a plurality of attribute data.

Here, time-series data is data that changes over time, and attribute data is data that does not change over time.

For example, in the case of vital signals obtained from sensors attached to patients in a hospital, the heart rate, the respiratory rate per unit time, and the operation log of a measuring device are time-series data, while each patient's gender, weight, and the like are attribute data.

When these time-series data and attribute data are accumulated for each patient, flow data is constructed with one patient as a set unit.

Furthermore, when measuring the operating status of a robot arm used in a factory with a sensor, the sensor data that can be obtained from the robot arm becomes time series data, and the number of failures for each individual becomes attribute data.

When these time-series data and attribute data are accumulated for each robot arm, flow data with one robot arm as a collection unit is constructed.

Further, when the pitching history of a baseball game is accumulated, the speed of a pitched ball in the at-bat becomes time-series data, and the information about the pitcher and batter becomes attribute data.

When these time-series data and attribute data are accumulated for each turn at bat, flow data is constructed with one turn at bat as a set unit.

That is, as shown in FIG. 1, the flow data is composed of time-series data consisting of data Dt1, Dt2, and so on, which are measured in time series at the timings indicated by circles on the time axis indicated by an arrow, and attribute data including data Da1, such as the gender and weight of the person being measured, and data Da2, such as the name and setting values of the measuring device.

FIG. 1 shows an example for a hospital in which data Dt1 and Dt2 consisting of a patient's vital signals are the time-series data, and data Da1 on the patient's gender and weight and data Da2 on the device name and setting values of the measuring device are the attribute data.

Further, as shown by the data Dt1 constituting the time-series data, the time intervals between the individual samples indicated by circles may be uneven, as shown by the intervals T1 and T2, or, although not shown, may be even.

Further, when flow data consisting of time series data and attribute data constitutes one set for each patient, each measurement device, each set value, etc., this set unit is referred to as a session. In FIG. 1, it is shown that a collection of flow data configured under predetermined conditions is a session SS.
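
The definitions above can be sketched as a small data structure. This is only an illustrative sketch, not the representation used in the disclosure: the class name `Session`, its field names, and the sample values are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Session:
    """One session: the set unit of flow data (for example, one patient)."""
    session_id: str
    # Each series is a list of (time, value) pairs; intervals may be uneven.
    time_series: dict[str, list[tuple[float, float]]]
    # Attribute data does not change over time and is optional.
    attributes: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Flow data always includes at least one piece of time-series data.
        if not self.time_series:
            raise ValueError("a session needs at least one time series")


# Hypothetical sample: one patient with unevenly spaced heart-rate samples.
patient = Session(
    session_id="patient-001",
    time_series={"heart_rate": [(0.0, 72.0), (1.5, 75.0), (4.0, 71.0)]},
    attributes={"gender": "F", "weight_kg": 58.0},
)
flow_data = [patient]  # flow data is a collection of sessions
```

Storing an explicit time value with every sample, rather than assuming a fixed sampling rate, is one way to accommodate the uneven intervals T1 and T2 described above.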

Then, various prediction targets are predicted based on the flow data consisting of a plurality of sessions SS.

FIG. 2 summarizes examples of session units, time units, attribute data, and time-series data for cases where hospital vital logs, factory robot logs, and baseball pitching logs each constitute flow data.

That is, when the flow data is composed of hospital vital logs, an example of the session unit is a patient, an example of the time unit is a date and time, an example of attribute data is the patient's gender, and an example of time-series data is a heartbeat signal.

In addition, when the flow data is a factory robot log, an example of the session unit is a robot, an example of the time unit is a date and time, an example of attribute data is the number of robot failures, and an example of time-series data is a torque sensor signal.

Further, when the flow data is a baseball pitching log, an example of the session unit is a turn at bat, an example of the time unit is the pitch number within the turn at bat, an example of attribute data is whether the pitcher throws left- or right-handed, and an example of time-series data is the ball speed.

Flow data thus takes various forms and is likely to be generated in large quantities as IoT becomes more widespread.

When predictions are made on flow data using a machine learning model, feature amounts for the model must be created from the flow data. However, creating feature amounts that contribute to prediction accuracy (feature engineering) has been a time-consuming and labor-intensive process.

More specifically, although general users understand the target series and the time information they want to predict, they often do not know how to process the data in order to build a machine learning model for the task they want to perform.

In addition, although some tools for generating feature amounts for machine learning models have been proposed, they place restrictions on the data and the prediction target, such as requiring time-series data at equal intervals or supporting only the prediction of future values of time-series data. In many cases, they cannot cover all the predictions that the user wants to make.

Furthermore, because flow data is huge and often spans multiple time series, there is a limit to how well users can understand the relationships between series, and creating feature amounts based on those relationships becomes difficult or cumbersome.

On the other hand, if feature quantities are created by brute force without any prior knowledge, unnecessary feature quantities will be created and unnecessary calculation costs will be incurred.

Therefore, the present disclosure allows the user to easily generate feature amounts that are effective for a wide range of tasks simply by inputting minimal settings for the flow data.

More specifically, in the present disclosure, when the user specifies a column indicating time, a column indicating the session unit, and a prediction target column in the flow data, feature amounts can be generated within a realistic amount of time that are effective for predicting future values of time-series data, predicting whether a specific event will occur in time-series data, predicting data that is not time-series (that does not change with time), and the like.

<<2. Preferred embodiment >>
<Example of configuration of information processing device of the present disclosure>
Next, with reference to FIG. 3, a configuration example of the information processing apparatus of the present disclosure will be described.

The information processing device 31 includes a control unit 51, an input unit 52, an output unit 53, a storage unit 54, a communication unit 55, a drive 56, and a removable storage medium 57, which are connected to one another via a bus 58 and can send and receive data and programs.

The control unit 51 is composed of a processor and a memory, and controls the entire operation of the information processing device 31. The control unit 51 also includes a UI control unit 61, a data processing unit 62, and a machine learning model generation unit 63.

When the UI control unit 61 receives input of flow data, it generates a UI (User Interface) that prompts the user to input, as task settings, a column indicating time, a column indicating a session unit, and a column to be predicted, and presents the UI by controlling the display unit 71 and the audio output unit 72 of the output unit 53.

Then, the UI control unit 61 receives the task settings that the user inputs by operating the input unit 52 in response to the UI, and outputs them to the data processing unit 62 together with the input flow data.

The UI control unit 61 also controls the display unit 71 and the audio output unit 72 of the output unit 53 to present information on the feature amounts generated by the data processing unit 62 to the user.

The data processing unit 62 acquires the flow data and task settings supplied from the UI control unit 61, generates, as feature amount data, feature amounts that are effective in generating a machine learning model (hereinafter also referred to as effective feature amounts), and outputs the feature amount data to the UI control unit 61 and the machine learning model generation unit 63.

The machine learning model generation unit 63 generates a machine learning model based on feature amount data consisting of effective feature amounts supplied from the data processing unit 62.

Note that details of the functions realized by the UI control unit 61 and the data processing unit 62 will be described later with reference to the functional block diagram of FIG. 4.

The input unit 52 is composed of input devices such as a keyboard, a mouse, and a touch panel through which the user inputs operation commands, and supplies various input signals to the control unit 51.

The output unit 53 is controlled by the control unit 51 and includes a display unit 71 and an audio output unit 72. The output unit 53 displays images of the operation screen and processing results on the display unit 71, which is a display device such as an LCD (Liquid Crystal Display) or an organic EL (Electro Luminescence) display. The output unit 53 also controls the audio output unit 72, which consists of an audio output device, to reproduce various voices, music, sound effects, and the like.

The storage unit 54 is composed of an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a semiconductor memory, and is controlled by the control unit 51 to write or read various data and programs.

The communication unit 55 is controlled by the control unit 51, implements wired or wireless communication such as LAN (Local Area Network) or Bluetooth (registered trademark), and sends and receives various data and programs to and from various devices via a network as necessary.

The drive 56 reads and writes data from and to a removable storage medium 57 such as a magnetic disk (including a flexible disk), an optical disk (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disk (including an MD (Mini Disc)), or a semiconductor memory.

<Functions realized by the UI control unit and data processing unit>
Next, functions realized by the UI control section 61 and the data processing section 62 will be described with reference to the functional block diagram of FIG. 4.

The UI control unit 61 includes a flow data input unit 101, a task setting unit 102, and a generated feature amount visualization unit 103.

The flow data input unit 101 accepts operation input from the input unit 52 and input of flow data from at least one of the storage unit 54, the communication unit 55, and the removable storage medium 57 via the drive 56, and outputs the flow data to the data processing unit 62 and the generated feature amount visualization unit 103.

When the task setting unit 102 acquires the column estimation results for the flow data supplied from the data processing unit 62, it generates a UI that shows the column estimation results, adds to that UI information prompting the user to input, as task settings, a column indicating time, a column indicating the session unit, and a column to be predicted, and presents the UI via the display unit 71 and the audio output unit 72 of the output unit 53. The task setting unit 102 may further prompt the user to input the prediction frequency and prediction time of the prediction target column as task settings.

When the user, prompted by this UI, operates the input unit 52, the task setting unit 102 outputs to the data processing unit 62 the task settings, which include the column indicating time, the column indicating the session unit, the prediction target column, and, if necessary, the prediction frequency and prediction time of the prediction target column.

Note that the task settings will be described in detail later with reference to FIGS. 5 and 6.

When the generated feature amount visualization unit 103 acquires the flow data supplied from the flow data input unit 101 and the feature amount data consisting of the effective feature amounts supplied from the data processing unit 62, it visualizes them as a UI and presents the UI by controlling the display unit 71 and the audio output unit 72 of the output unit 53.

Note that an example of presentation of feature data by the generated feature visualization unit 103 will be described in detail later with reference to FIG. 13.

The data processing unit 62 includes a column estimation unit 121, an output format determination unit 122, a generation source selection unit 123, an intra-session feature amount generation unit 124, a feature amount selection unit 125, an inter-session feature amount generation unit 126, a combining unit 127, a feature amount data storage 128, and a loop determination unit 129.

The column estimation unit 121 analyzes the data format of the flow data supplied from the UI control unit 61, estimates which columns can serve as the column indicating time, the column indicating the session unit, and so on, and outputs the column estimation results to the UI control unit 61.
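
The text does not specify how the column estimation unit 121 identifies candidate columns. As a rough illustration only, the following sketch assumes two simple heuristics that are not taken from the disclosure: a session column is one that appears in both the attribute table and the time-series table, and a time column is one whose values strictly increase within every session. The function name `estimate_columns` and the sample tables are hypothetical.

```python
def estimate_columns(attr_rows, ts_rows):
    """Return (session_candidates, time_candidates) for two tables of row dicts."""
    attr_cols = set(attr_rows[0])
    ts_cols = list(ts_rows[0])
    # Assumed heuristic 1: a session column appears in both tables.
    session_candidates = [c for c in ts_cols if c in attr_cols]

    time_candidates = []
    for col in ts_cols:
        if col in session_candidates or not session_candidates:
            continue
        increasing = True
        for s_col in session_candidates:
            groups = {}
            for row in ts_rows:
                groups.setdefault(row[s_col], []).append(row[col])
            # Assumed heuristic 2: values strictly increase inside every session.
            for values in groups.values():
                if any(b <= a for a, b in zip(values, values[1:])):
                    increasing = False
        if increasing:
            time_candidates.append(col)
    return session_candidates, time_candidates


# Hypothetical hospital vital log.
attr_rows = [
    {"patient_id": "p1", "gender": "F"},
    {"patient_id": "p2", "gender": "M"},
]
ts_rows = [
    {"patient_id": "p1", "timestamp": 0, "heart_rate": 72},
    {"patient_id": "p1", "timestamp": 5, "heart_rate": 70},
    {"patient_id": "p2", "timestamp": 0, "heart_rate": 80},
    {"patient_id": "p2", "timestamp": 3, "heart_rate": 80},
]
session_cols, time_cols = estimate_columns(attr_rows, ts_rows)
```

Because such heuristics can be wrong, presenting the candidates to the user for confirmation, as the task setting unit 102 does, is a natural complement.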

The output format determination unit 122 determines the output format based on the task settings supplied from the task setting unit 102 of the UI control unit 61, namely the column indicating time, the column indicating the session unit, and the column to be predicted in the flow data, and outputs the determined format to the generation source selection unit 123.

At this time, if the prediction frequency and prediction time of the prediction target column are also supplied as task settings, the output format determination unit 122 determines an output format that also takes the prediction frequency and prediction time of the prediction target column into consideration.

Note that the details of determining the output format will be described later with reference to FIGS. 7 to 10.

The generation source selection unit 123 selectively extracts, from the flow data, the series data from which feature amounts are to be generated, according to the output format supplied from the output format determination unit 122, and outputs the result to the intra-session feature amount generation unit 124.

Note that the process of selectively extracting series data from which feature quantities are generated will be described in detail later with reference to FIG. 14.

The intra-session feature amount generation unit 124 generates intra-session feature amount data based on the series data, supplied from the generation source selection unit 123, that is required for feature amount generation out of the flow data, and outputs the intra-session feature amount data to the feature amount selection unit 125.

More specifically, the intra-session feature generation unit 124 includes a metadata extraction unit 124a and an estimation model 124b.

The metadata extraction unit 124a extracts metadata consisting of the number of time-series data in the flow data, the sequence length (number of samples per series), statistical values of each variable (mean, variance, etc.), and the like, and outputs the metadata to the estimation model 124b.
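
The metadata items listed above can be computed with a few lines of code. The following is a minimal sketch under assumptions: the function name `extract_metadata`, the dictionary layout, and the choice of population variance are illustrative and are not specified in the disclosure.

```python
from statistics import mean, pvariance


def extract_metadata(series_by_name):
    """Summarize a dict mapping series name -> list of numeric samples."""
    return {
        # Number of time-series data in the flow data.
        "num_series": len(series_by_name),
        "series": {
            name: {
                "length": len(values),           # number of samples in the series
                "mean": mean(values),            # statistical value: mean
                "variance": pvariance(values),   # statistical value: population variance
            }
            for name, values in series_by_name.items()
        },
    }


# Hypothetical input: ball speeds (km/h) of one three-pitch turn at bat.
metadata = extract_metadata({"ball_speed": [140, 150, 120]})
```

A fixed-size summary of this kind is what allows a pre-trained model to take flow data of arbitrary shape as input.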

The estimation model 124b is a model trained in advance on pairs of metadata and methods for generating the feature amounts used to predict a prediction target. Based on the metadata, it estimates the method for generating the feature amounts required to predict the prediction target.

The intra-session feature amount generation unit 124 generates intra-session feature amount data from the generation-source series data selected by the generation source selection unit 123, using the feature amount generation method that the estimation model 124b estimates from the metadata.

Note that the method for generating intra-session feature data will be described in detail later with reference to FIG. 15.

The feature amount selection unit 125 calculates, for each of the feature amounts constituting the intra-session feature amount data and the inter-session feature amount data, an effectiveness score for predicting the prediction target, selects the feature amounts whose score is higher than a predetermined effectiveness score, excludes the others, and thereby reconstructs the intra-session feature amount data and the inter-session feature amount data.

More specifically, the feature selection unit 125 includes an intra-session feature selection unit 141, an inter-session feature selection unit 142, and an effectiveness score calculation unit 143.

The intra-session feature quantity selection unit 141 controls the effectiveness score calculation unit 143 to calculate an effectiveness score for each of the feature quantities constituting the intra-session feature data, and selects a feature quantity higher than a predetermined effectiveness score. By selecting and excluding feature values lower than a predetermined effectiveness score, the intra-session feature data is reconfigured and output to the inter-session feature generation unit 126 and the combining unit 127.

The inter-session feature amount selection unit 142 controls the effectiveness score calculation unit 143 to calculate an effectiveness score for each of the feature amounts constituting the inter-session feature amount data supplied from the inter-session feature amount generation unit 126. By selecting the feature amounts higher than a predetermined effectiveness score and excluding those lower than it, the unit reconstructs the inter-session feature amount data and outputs it to the combining unit 127.

The effectiveness score calculation unit 143 calculates, for example, the mutual information with the prediction target as the effectiveness score for each of the feature amounts constituting the intra-session feature amount data and the inter-session feature amount data, and outputs it to the intra-session feature amount selection unit 141, the inter-session feature amount selection unit 142, and the loop determination unit 129.
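
As an illustration of scoring by mutual information and selecting by threshold, the following sketch computes the mutual information between discrete sequences using the standard definition. The function names, the sample feature names, and the threshold of 0.5 bits are assumptions made here; continuous-valued feature amounts would first need to be discretized or handled with a different estimator, which this sketch does not cover.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def select_features(features, target, threshold):
    """Keep the features whose score against the target exceeds the threshold."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    kept = {name: features[name] for name, score in scores.items() if score > threshold}
    return kept, scores


# Hypothetical per-session features and prediction target.
target = ["hit", "out", "out", "hit"]
features = {
    "fast_last_pitch": [1, 0, 0, 1],   # fully determines the target
    "pitch_count_odd": [1, 1, 0, 0],   # statistically independent of the target
}
kept, scores = select_features(features, target, threshold=0.5)
```

In this toy data, the first feature scores 1 bit and is kept, while the second scores 0 and is excluded.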

Furthermore, the effectiveness score calculation unit 143 may calculate the accuracy of the machine learning model generated using the intra-session feature data and the inter-session feature data as the effectiveness score. However, in this case, the machine learning model used is a machine learning model determined by a simpler machine learning algorithm or hyperparameter than the machine learning model generated by the machine learning model generation unit 63.

In this case, the intra-session feature amount selection unit 141 and the inter-session feature amount selection unit 142 reconstruct the intra-session feature amount data and the inter-session feature amount data by selecting, from the feature amounts constituting each, a subset for which the effectiveness score calculated from the accuracy or the like of the generated machine learning model does not fall below a predetermined value.

The inter-session feature amount generation unit 126 generates inter-session feature amount data based on the reconstructed intra-session feature amount data output from the feature amount selection unit 125, which consists of feature amounts with scores higher than the predetermined effectiveness score, and outputs the inter-session feature amount data to the feature amount selection unit 125.

Note that the method for generating inter-session feature data will be described in detail later with reference to FIG. 16.

The combining unit 127 combines the reconstructed intra-session feature amount data and inter-session feature amount data supplied from the feature amount selection unit 125, each consisting of feature amounts with scores higher than the predetermined effectiveness score, to form feature amount data, and stores it in the feature amount data storage 128.

The feature data storage 128 stores the feature data supplied from the combining unit 127, and also supplies the stored feature data to the loop determination unit 129 as needed.

The loop determination unit 129 calculates the overall effectiveness score of the feature amount data for predicting the prediction target, for example as the overall average, based on the effectiveness scores of the feature amounts constituting the feature amount data stored in the feature amount data storage 128, which combines the intra-session feature amount data and the inter-session feature amount data.

If the overall effectiveness score of the feature amount data is lower than a predetermined value, the loop determination unit 129 instructs the generation source selection unit 123 to repeat the process so as to extract more feature amounts than the current number from the same flow data.

Then, when a predetermined time has elapsed or when the overall effectiveness score is higher than the predetermined value, the loop determination unit 129 outputs the feature amount data stored in the feature amount data storage 128 at that time and information on the overall effectiveness score of the feature amount data to the UI control unit 61 and the machine learning model generation unit 63.
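
The loop control described above can be sketched as follows. This is a simplified illustration, not the disclosed implementation: `generate` is a hypothetical callback standing in for the whole selection and generation pipeline, and the starting feature count, the increment of one per pass, and the sample scores are assumptions.

```python
import time


def run_generation_loop(generate, score_threshold, time_limit_s, start_count=2):
    """Request more features until the mean score or the time budget is reached.

    `generate(n)` is a hypothetical callback returning {feature_name: score}
    for up to n features.
    """
    deadline = time.monotonic() + time_limit_s
    n = start_count
    while True:
        scores = generate(n)
        # Overall effectiveness score, taken here as the average of all scores.
        overall = sum(scores.values()) / len(scores)
        if overall > score_threshold or time.monotonic() >= deadline:
            return scores, overall
        n += 1  # ask for more features than in the previous pass


def demo_generate(n):
    # Hypothetical scores in which later features happen to be more informative.
    pool = [0.1, 0.2, 0.9, 1.0]
    return {f"f{i}": s for i, s in enumerate(pool[:n])}


scores, overall = run_generation_loop(demo_generate, score_threshold=0.5, time_limit_s=1.0)
```

With these sample scores the loop stops on the third pass, once four features have been generated and their average exceeds the threshold.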

The generated feature amount visualization unit 103 visualizes and presents the generated feature amount data and information on the overall effectiveness score of the feature amount data as a UI.

At this time, for example, when the user judges the effectiveness score of the generated feature amount data to be sufficient and instructs generation of a machine learning model based on the selected feature amount data, the machine learning model generation unit 63 may generate a machine learning model based on the supplied feature amount data.

<About task settings>
Task settings are the settings that make it possible to perform, from flow data, tasks such as predicting future values of time-series data, predicting whether a specific event will occur in time-series data, and predicting non-time-series data (data that does not change with time).

More specifically, the task settings specify the column indicating time, the column indicating the session unit, and the column to be predicted in the flow data and, if necessary, also include the prediction frequency and prediction time for the prediction target column.
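
The task settings can be pictured as a small configuration record. The sketch below is an assumption made for illustration: the class name `TaskSettings`, its field names, and the column names (modeled on a baseball pitching log) are hypothetical, and the value format of the two optional fields is left open because the text does not define it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSettings:
    time_column: str       # column indicating time
    session_column: str    # column indicating the session unit
    target_column: str     # column to be predicted
    # Optional settings; their value format is not defined in the text.
    prediction_frequency: Optional[str] = None
    prediction_time: Optional[str] = None


# Hypothetical settings for predicting the result of each turn at bat.
settings = TaskSettings(
    time_column="pitch_id",
    session_column="turn_at_bat_id",
    target_column="result",
)
```

Only the three required fields must be supplied, which reflects the aim of keeping user input to a minimum.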

For example, task settings in the case of flow data as shown in FIG. 5 will be explained.

FIG. 5 shows an example of flow data related to a pitching log of a predetermined baseball batter. Flow data FD in FIG. 5 is composed of attribute data AD and time series data TD.

The attribute data AD is composed of three data columns, which from the left in the figure are a pitcher ID column, a turn at bat ID column, and a result column.

The pitcher ID column is a column in which IDs that identify pitchers who have pitched to a predetermined batter are registered, and in the figure, pitcher IDs = A, B, and A are registered from the top.

The turn-at-bat ID column is a column in which IDs that identify the turn-at-bat of a predetermined batter are registered, and in the figure, turn-at-bat IDs=0, 1, and 2 are registered from the top.

The result column is a column in which the result of the predetermined batter's turn at bat identified by the turn-at-bat ID, against pitches by the pitcher identified by the pitcher ID, is registered. In the figure, "hit", "out", and "out" are registered from the top.

As a result, it is registered that the predetermined batter made a hit in the turn at bat identified by turn-at-bat ID=0 in response to pitches by the pitcher with pitcher ID=A.

Additionally, it is registered that the predetermined batter was out in the turn at bat identified by turn-at-bat ID=1 in response to pitches by the pitcher with pitcher ID=B.

Furthermore, it is registered that the predetermined batter was out in the turn at bat identified by turn-at-bat ID=2 in response to pitches by the pitcher with pitcher ID=A.

The time-series data TD is composed of three data columns, from the left in the figure: a turn ID column, a pitch ID column, and a pitch speed column.

The at-bat ID column is a column in which IDs that identify the predetermined batter's at-bats are registered, and in the figure, at-bat IDs = 0, 0, 0, 1, 1, 2, 2, 2 are registered from the top.

The pitch ID column is a column in which IDs identifying the pitches thrown by a pitcher to the predetermined batter are registered in chronological order, and in the figure, pitch IDs = 0, 1, 2, 0, 1, 0, 1, 2 are registered from the top.

The ball speed column is a column in which the ball speed (km/h) of each pitch thrown by the pitcher identified by the pitcher ID to the predetermined batter in the at-bat identified by the at-bat ID is registered, and in the figure, 140, 150, 120, 120, 110, 90, 130, and 155 are registered from the top.

As a result, it is registered that, in the predetermined batter's at-bat identified by at-bat ID = 0, the ball speed of the first pitch identified by pitch ID = 0 is 140 km/h, the ball speed of the second pitch identified by pitch ID = 1 is 150 km/h, and the ball speed of the third pitch identified by pitch ID = 2 is 120 km/h.

Also, it is registered that, in the predetermined batter's at-bat identified by at-bat ID = 1, the ball speed of the first pitch identified by pitch ID = 0 is 120 km/h, and the ball speed of the second pitch identified by pitch ID = 1 is 110 km/h.

Furthermore, it is registered that, in the predetermined batter's at-bat identified by at-bat ID = 2, the ball speed of the first pitch identified by pitch ID = 0 is 90 km/h, the ball speed of the second pitch identified by pitch ID = 1 is 130 km/h, and the ball speed of the third pitch identified by pitch ID = 2 is 155 km/h.

In this case, the information in the pitch ID column of the time series data TD is registered in chronological order, so it is treated as a time column.

Furthermore, a common turn-at-bat ID column exists as a session column in each of the time-series data TD and the attribute data AD.

Furthermore, the pitcher ID column can also be regarded as a set (session cluster) that groups, at a higher level, the at-bat ID column serving as the session column.

Note that the time column may hold values whose order is known (float, int) or a date/time type (YY:MM:DD hh:mm:ss, etc.).
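As an illustration only, the flow data FD of FIG. 5 can be sketched as two small tables that share the session column. The Python names used here (`attribute_data`, `time_series_data`, `group_by`, and the key names) are hypothetical and do not appear in the disclosure; the values are those described for FIG. 5.

```python
from collections import defaultdict

# Attribute data AD of FIG. 5: one row per at-bat (session).
attribute_data = [
    {"pitcher_id": "A", "at_bat_id": 0, "result": "hit"},
    {"pitcher_id": "B", "at_bat_id": 1, "result": "out"},
    {"pitcher_id": "A", "at_bat_id": 2, "result": "out"},
]

# Time-series data TD of FIG. 5: one row per pitch.
time_series_data = [
    {"at_bat_id": 0, "pitch_id": 0, "ball_speed": 140},
    {"at_bat_id": 0, "pitch_id": 1, "ball_speed": 150},
    {"at_bat_id": 0, "pitch_id": 2, "ball_speed": 120},
    {"at_bat_id": 1, "pitch_id": 0, "ball_speed": 120},
    {"at_bat_id": 1, "pitch_id": 1, "ball_speed": 110},
    {"at_bat_id": 2, "pitch_id": 0, "ball_speed": 90},
    {"at_bat_id": 2, "pitch_id": 1, "ball_speed": 130},
    {"at_bat_id": 2, "pitch_id": 2, "ball_speed": 155},
]


def group_by(rows, key):
    """Group row dictionaries by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Session column: at_bat_id is shared by AD and TD.
# Time column: pitch_id orders the rows inside each session.
speeds_by_session = {
    at_bat_id: [r["ball_speed"] for r in sorted(rows, key=lambda r: r["pitch_id"])]
    for at_bat_id, rows in group_by(time_series_data, "at_bat_id").items()
}

# Session cluster: pitcher_id groups the sessions one level above at_bat_id.
session_clusters = {
    pitcher_id: [r["at_bat_id"] for r in rows]
    for pitcher_id, rows in group_by(attribute_data, "pitcher_id").items()
}

print(speeds_by_session)
print(session_clusters)
```

In this sketch the at-bat ID plays the role of the session column, the pitch ID the role of the time column, and the pitcher ID the role of the session cluster.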

The column estimating section 121 of the data processing section 62 estimates, for example, a time column or a session column as shown in FIG. 5, and supplies the result to the task setting section 102 of the UI control section 61 as a column estimation result.

When the input unit 52 is operated on the basis of a UI and information for setting a column indicating time, a column indicating a session unit, a prediction target column, and the prediction frequency and prediction time of the prediction target column is input, the task setting unit 102 outputs the input information to the output format determining unit 122 of the data processing unit 62.

More specifically, the task setting unit 102 controls the display unit 71 and the audio output unit 72 of the output unit 53 based on the column estimation results to present the flow data to the user.

At this time, the task setting unit 102 presents a UI that prompts the user to set a column indicating a time unit, a column indicating a session unit, and a prediction target column as task settings, and outputs the task settings made through the UI to the output format determining section 122 of the data processing section 62.

More specifically, the task setting unit 102 presents a display image PV consisting of a UI as shown in FIG. 6, for example.

In the UI presented in the display image PV of FIG. 6, the message "Please set the column indicating the time unit, the column indicating the session unit, and the prediction target column." is shown in the upper row, prompting the user to set these three columns.

Further, below that, attribute data AD is displayed on the left side, and time series data TD is displayed on the right side.

Furthermore, FIG. 6 shows an example in which, in response to this prompt, the at-bat ID column indicated by a dotted line is set as the column indicating session units, the pitching column indicated by a dashed line is set as the column indicating time units, and the ball speed column indicated by a solid line is set as the prediction target.

The task setting unit 102 outputs, to the output format determining section 122, information on the column indicating time units, the column indicating session units, and the prediction target column, which are set using frames such as the dotted, dashed, and solid lines shown in FIG. 6.

At this time, in addition to the column indicating the time unit, the column indicating the session unit, and the information on the prediction target column, the prediction frequency and predicted time of the prediction target column may be input as task settings.

<Determining the output format>
The output format determination unit 122 determines the output format based on the information supplied from the task setting unit 102 of the UI control unit 61 for setting the column indicating the time, the column indicating the session unit, the prediction target column, and the prediction frequency and prediction time of the prediction target column.

Examples of the output format include the melt format shown in FIG. 7 and the pivot format shown in FIG. 8.

The melt format in FIG. 7 is composed, from the left, of an id column, a time column, a name column, and a value column. In the melt format shown in FIG. 7, the name column constitutes a session unit, the id column is a session cluster that groups session units, the time column is a sampling time column, and the value column is a column of sampled time series data.

That is, in FIG. 7, there are two upper session clusters, A and B, that group session units, there are two series, x and y, within each session unit, and times t1 and t2 are set for each series.

In this way, when the time settings differ for each session and each series, the melt format shown in FIG. 7 is effective.

In FIG. 7, x(A, t1), x(A, t2), y(A, t1), y(A, t2), x(B, t1), x(B, t2), y(B, t1), and y(B, t2) are registered from the top as the time series data.

On the other hand, if the sampling time sequence is common for all series in session units, a pivot format as shown in FIG. 8 may be used.

That is, the pivot format in FIG. 8 is composed of an id column, a time column, a value x column, and a value y column. In the pivot format shown in FIG. 8, the sampling time column is shared between the two series x and y for each session, and the value x and value y columns are registered in parallel.

In FIG. 8, x(A, t1), x(A, t2), x(B, t1), and x(B, t2) are registered from the top as the value x column, and y(A, t1), y(A, t2), y(B, t1), and y(B, t2) are registered from the top as the value y column.
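The relationship between the melt format of FIG. 7 and the pivot format of FIG. 8 can be sketched as follows. This is a minimal illustration, not the implementation of the output format determination unit 122; the function name `melt_to_pivot`, the key names, and the placeholder strings used as values are assumptions made for the example.

```python
def melt_to_pivot(melt_rows):
    """Convert melt rows (id, time, name, value) into pivot rows, one per (id, time).

    Assumes every series shares the same sampling times within an id, which is
    the condition under which the pivot format is described as usable.
    """
    pivot = {}
    for row in melt_rows:
        key = (row["id"], row["time"])
        # Create the pivot row the first time this (id, time) pair is seen.
        pivot.setdefault(key, {"id": row["id"], "time": row["time"]})
        # Each series name becomes its own value column.
        pivot[key]["value_" + row["name"]] = row["value"]
    return list(pivot.values())


# Melt rows in the order described for FIG. 7 (values are placeholder labels).
melt_rows = [
    {"id": "A", "time": "t1", "name": "x", "value": "x(A,t1)"},
    {"id": "A", "time": "t2", "name": "x", "value": "x(A,t2)"},
    {"id": "A", "time": "t1", "name": "y", "value": "y(A,t1)"},
    {"id": "A", "time": "t2", "name": "y", "value": "y(A,t2)"},
    {"id": "B", "time": "t1", "name": "x", "value": "x(B,t1)"},
    {"id": "B", "time": "t2", "name": "x", "value": "x(B,t2)"},
    {"id": "B", "time": "t1", "name": "y", "value": "y(B,t1)"},
    {"id": "B", "time": "t2", "name": "y", "value": "y(B,t2)"},
]

pivot_rows = melt_to_pivot(melt_rows)
for row in pivot_rows:
    print(row)
```

The melt form keeps one value per row and therefore tolerates series with different sampling times, whereas the pivot form needs a shared time column but places the series side by side.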

More specifically, for example, as shown in the left part of FIG. 9, the at-bat ID column indicated by a dotted line is set as a column indicating session units, and the pitching column indicated by a dashed dotted line is set as a column indicating time units. If the ball speed sequence shown by the solid line is set as the prediction target sequence, the output format determining unit 122 determines, for example, the output format FIS1 as shown on the right side of FIG. 9.

The output format FIS1 on the right side of FIG. 9 is composed of the melt format described with reference to FIG. 7, and includes a pitch ID column, a ball speed column, a previous-pitch ball speed column, and a previous at-bat result column.

In the output format FIS1 in the right part of FIG. 9, 0, 1, 2, 0, 1, 0, 1, 2 are registered from the top in the pitch ID column.

In addition, 140, 150, 120, 120, 110, 90, 130, 155 are registered from the top in the ball speed column, NaN, 140, 150, NaN, 120, NaN, 90, and 130 are registered from the top in the previous-pitch ball speed column, and NaN, NaN, NaN, hit, hit, out, out, out are registered from the top in the previous at-bat result column.

That is, here, since the column to be predicted is "ball speed", the data in the ball speed column is formatted as one row per time (one row per pitch) so that it is arranged as time series data.
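The one-row-per-pitch layout can be reproduced from the FIG. 5 data with the following sketch. The function name `build_fis1` and the key names are hypothetical, `None` stands in for NaN, and the previous at-bat is assumed to be the one whose at-bat ID is smaller by one, which holds for the consecutive IDs of this example but is not stated in the disclosure.

```python
attribute_data = [
    {"pitcher_id": "A", "at_bat_id": 0, "result": "hit"},
    {"pitcher_id": "B", "at_bat_id": 1, "result": "out"},
    {"pitcher_id": "A", "at_bat_id": 2, "result": "out"},
]
time_series_data = [
    {"at_bat_id": 0, "pitch_id": 0, "ball_speed": 140},
    {"at_bat_id": 0, "pitch_id": 1, "ball_speed": 150},
    {"at_bat_id": 0, "pitch_id": 2, "ball_speed": 120},
    {"at_bat_id": 1, "pitch_id": 0, "ball_speed": 120},
    {"at_bat_id": 1, "pitch_id": 1, "ball_speed": 110},
    {"at_bat_id": 2, "pitch_id": 0, "ball_speed": 90},
    {"at_bat_id": 2, "pitch_id": 1, "ball_speed": 130},
    {"at_bat_id": 2, "pitch_id": 2, "ball_speed": 155},
]


def build_fis1(attribute_rows, ts_rows):
    """Build one row per pitch with two lagged columns (None stands in for NaN)."""
    result_by_at_bat = {r["at_bat_id"]: r["result"] for r in attribute_rows}
    last_speed = {}  # at_bat_id -> ball speed of the most recent pitch in that at-bat
    rows = []
    for r in sorted(ts_rows, key=lambda r: (r["at_bat_id"], r["pitch_id"])):
        at_bat = r["at_bat_id"]
        rows.append({
            "at_bat_id": at_bat,
            "pitch_id": r["pitch_id"],
            "ball_speed": r["ball_speed"],
            # Lag within the session: missing for the first pitch of an at-bat.
            "prev_ball_speed": last_speed.get(at_bat),
            # Lag across sessions (assumption: previous at-bat has ID at_bat - 1).
            "prev_at_bat_result": result_by_at_bat.get(at_bat - 1),
        })
        last_speed[at_bat] = r["ball_speed"]
    return rows


fis1 = build_fis1(attribute_data, time_series_data)
for row in fis1:
    print(row)
```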

Further, for example, as shown on the left side of FIG. 10, the turn ID column indicated by a dotted line is set as a column indicating session units, the pitching column indicated by a dashed-dotted line is set as a column indicating time units, When the result column indicated by the solid line is set as the prediction target column, the output format determining unit 122 determines, for example, the output format FIS2 as illustrated on the right side of FIG. 10.

The output format FIS2 on the right side of FIG. 10 is composed of the pivot format described with reference to FIG. 8 and consists, from the left, of a pitcher ID column, an at-bat ID column indicating the session unit, a result column, an average ball speed column for each at-bat, and a previous at-bat result column.

In the output format FIS2 on the right side of FIG. 10, A, B, and A are registered from the top in the pitcher ID column, 0, 1, and 2 are registered from the top in the at-bat ID column, hit, out, and out are registered from the top in the result column, 145, 115, and 110 are registered from the top in the average ball speed column for each at-bat, and NaN, hit, and out are registered from the top in the previous at-bat result column.

That is, here, since the prediction target is the "result", one row corresponds to one session (one row per at-bat ID). The time-series data is represented by the average ball speed for each at-bat, so the format is one in which features that aggregate the time information using statistics are added.

<How to generate features>
Next, a method for generating feature amounts for each series data will be explained.

The feature quantity is configured as a vector whose elements are a plurality of statistical quantities for each series data obtained in the time direction from the time series data.

For example, when a predetermined series of time series data is expressed by a waveform Ldt that changes in the time direction as shown in FIG. 11, a window with a time width w is set, and the values of the waveform Ldt in each window are obtained as partial series X1, X2, X3, ....

Further, the values of the waveform Ldt at future times t11, t12, t13, ... corresponding to each of the partial series X1, X2, X3, ... are acquired as prediction targets y1, y2, y3, ....

Then, the acquired partial series X1, X2, X3, ... are converted into predetermined statistical values f(X1), f(X2), f(X3), ..., and a vector whose elements are the converted statistical values and the prediction targets y1, y2, y3, ... is constructed, whereby a feature quantity for each series data is formed.

More specifically, the column consisting of the feature amount of the predetermined series data and the prediction target column are expressed, for example, as in the following equations (1) and (2).

Fs=(f(X1), f(X2), f(X3),...)
...(1)
Fp=(y1, y2, y3,...)
...(2)

Here, Fs is a feature amount of predetermined series data, and f(X1), f(X2), f(X3), ... are elements each consisting of the statistics of a partial series Xn of the predetermined series data expressed by the waveform Ldt. Further, y1, y2, y3, ... are prediction targets corresponding to the partial series X1, X2, X3, ....

Further, each element f(Xn) constituting the feature amount Fs of the series data corresponding to the partial series Xn is expressed, for example, as in the following equation (3).

f(Xn) = (Ave(Xn), Min(Xn), Max(Xn), Var(Xn), Stde(Xn),...)
...(3)

Here, f(Xn) is each element of the feature amount Fs of the series data for the partial series Xn, Ave(Xn) is the average value of the partial series Xn, Min(Xn) is the minimum value of the partial series Xn, Max(Xn) is the maximum value of the partial series Xn, Var(Xn) is the variance of the partial series Xn, and Stde(Xn) is the standard deviation of the partial series Xn.

Note that statistics other than the above-mentioned average value, minimum value, maximum value, variance, and standard deviation may be used for the partial series Xn.
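Equations (1) to (3) can be illustrated with the following sketch, which slides a window of width w over a series and pairs the statistics of each partial series Xn with a later value yn. The function name `window_features`, the `horizon` parameter, the stride of one sample, and the use of the population variance and standard deviation are assumptions made for the example; the sample values are the ball speeds described for FIG. 18.

```python
import statistics


def window_features(series, w, horizon=1):
    """Return (Fs, Fp): per-window statistics f(Xn) and prediction targets yn.

    Windows of width `w` advance one sample at a time (an assumed stride), and
    each target is the value `horizon` samples after the end of its window.
    """
    fs, fp = [], []
    for start in range(len(series) - w - horizon + 1):
        xn = series[start:start + w]
        fs.append((
            statistics.fmean(xn),      # Ave(Xn)
            min(xn),                   # Min(Xn)
            max(xn),                   # Max(Xn)
            statistics.pvariance(xn),  # Var(Xn), population variance (assumed)
            statistics.pstdev(xn),     # Stde(Xn), population standard deviation
        ))
        fp.append(series[start + w + horizon - 1])
    return fs, fp


# Ball speeds of one at-bat, used here as the waveform Ldt.
ldt = [143.9, 140.2, 130.9, 90.4, 124.3]
Fs, Fp = window_features(ldt, w=3)

for f_xn, yn in zip(Fs, Fp):
    print(f_xn, "->", yn)
```

Each tuple in `Fs` corresponds to one element f(Xn) of equation (1), and `Fp` corresponds to the prediction target column of equation (2).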

In the above, an example has been explained in which each element f(Xn) of the feature amount Fs for the partial series Xn in equation (3) is expressed as a vector with each statistic as an element, but it may instead be expressed as a weighted sum of products using kernel functions (convolution kernels) applied to each statistic. For the convolution kernel, refer to, for example, https://arxiv.org/abs/1910.13051.

<How to set the window>
The windows forming the above-mentioned partial series Xn may be set using various methods.

For example, as shown in the left part of FIG. 12, with the session start time tb as a reference, the partial series Xn may be set in units of a window WB of a predetermined time width while changing the offset offset-fb from the start time.

Further, as shown in the right part of FIG. 12, with the prediction execution time ts as the reference time, the partial series Xn may be set in units of a window WS of a predetermined time width ws while changing the offset offset-fs from the reference time ts.

Furthermore, as shown in FIG. 13, partial series may be set in units of a window WSS whose time width extends from the session start time tb to a time tos offset by a predetermined time from the prediction execution time ts.

Further, as shown in FIG. 13, a partial sequence may be set in units of windows WA in which the entire range from the session start time tb to the session end is set as the time width.

Further, the specific value Ldt(s) when shifted a certain period of time from time tos may be obtained as a partial sequence.
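The window variants described above can be sketched as simple time-range selections. The function names, the half-open interval convention, and in particular the assumption that the window WS lies before the reference time ts are illustrative choices that the disclosure does not specify.

```python
def select(samples, start, end):
    """Values of (time, value) samples whose time satisfies start <= time < end."""
    return [v for t, v in samples if start <= t < end]


def window_wb(samples, tb, offset_fb, width):
    """Window WB: fixed width, placed offset_fb after the session start time tb."""
    start = tb + offset_fb
    return select(samples, start, start + width)


def window_ws(samples, ts, offset_fs, ws):
    """Window WS: width ws, ending offset_fs before the reference time ts (assumed direction)."""
    end = ts - offset_fs
    return select(samples, end - ws, end)


def window_wss(samples, tb, ts, offset):
    """Window WSS: from the session start tb up to tos = ts - offset."""
    return select(samples, tb, ts - offset)


def window_wa(samples):
    """Window WA: the whole session."""
    return [v for _, v in samples]


# Hypothetical session sampled at times 0..9.
samples = [(t, 10 * t) for t in range(10)]

print(window_wb(samples, tb=0, offset_fb=2, width=3))
print(window_ws(samples, ts=8, offset_fs=1, ws=3))
print(window_wss(samples, tb=0, ts=8, offset=2))
print(window_wa(samples))
```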

<About selecting the series from which feature values are generated>
Next, with reference to FIG. 14, selection of series data to be a generation source of a feature amount by the generation source selection unit 123 will be described.

As described above, the information that is the generation source of the feature amount (hereinafter referred to as the generation source feature amount) is generated as vectorized information for each series of data.

However, not all series data are useful information for predicting the prediction target, and there is information that is unnecessary for predicting the prediction target.

Therefore, in the present disclosure, the generation source selection unit 123 determines whether or not each series data extracted from the flow data, including time series data and attribute data, is useful as a generation source of feature amounts, and excludes it as necessary.

More specifically, as shown in FIG. 14, for example, consider a case where series data L1 to L3 exist as time series data that can be used as generation sources for a machine learning model that predicts the prediction target T.

The generation source selection unit 123 determines whether each of the series data L1 to L3 is appropriate as a generation source of a feature amount used for prediction of the prediction target. More specifically, for example, for the series data L1, the generation source selection unit 123 extracts, in time series, generation source feature amounts F(tn) consisting of statistics Fa to Fd as the generation source feature amounts used for prediction of the prediction target T, and generates a feature table TB.

Incidentally, the statistical quantities Fa to Fd mentioned here correspond to Ave(Xn), Min(Xn), Max(Xn), Var(Xn), Stde(Xn), etc. in equation (3) mentioned above, and the generation source feature amount F(tn) corresponds to each element f(Xn) that constitutes the feature amount Fs of the series data.

In FIG. 14, as the generation source feature amounts F(tn) extracted from the series data L1, the generation source feature amounts F(t1) (= (Fa(t1), Fb(t1), Fc(t1), Fd(t1))), F(t2) (= (Fa(t2), Fb(t2), Fc(t2), Fd(t2))), ... are extracted, and the feature table TB is created. Note that in the feature table TB of FIG. 14, detailed description of the prediction target T is omitted.

Next, based on the generation source feature amounts F(t1) (= (Fa(t1), Fb(t1), Fc(t1), Fd(t1))), F(t2) (= (Fa(t2), Fb(t2), Fc(t2), Fd(t2))), ..., the generation source selection unit 123 determines whether the series data L1 is a series that contributes to the prediction of the prediction target T.

First, the generation source selection unit 123 excludes the series data L1 from the generation sources of feature amounts, for example, when there is no time-series change in the series data L1 and no correlation with the prediction target is recognized.

Then, when it is recognized that the series data L1 has a time-series change and is correlated with the prediction target, the generation source selection unit 123 inputs the generation source feature amounts F(t1) (= (Fa(t1), Fb(t1), Fc(t1), Fd(t1))), F(t2) (= (Fa(t2), Fb(t2), Fc(t2), Fd(t2))), ... into the prediction model PM to predict the prediction target T, and obtains the prediction result T'.

Note that the prediction model PM is a relatively simple and lightweight prediction model, and is a model for easily predicting the prediction target T based on the generation source feature of a predetermined series.

The generation source selection unit 123 calculates the prediction accuracy PA from a comparison between the prediction target T and the prediction result T', and when the prediction accuracy PA is lower than a predetermined threshold value, excludes the series data L1 from the generation sources used for predicting the prediction target.

The generation source selection unit 123 similarly determines the prediction accuracy PA not only for the sequence data L1 but also for each of the sequence data L2 and L3, and excludes sequences lower than a predetermined prediction accuracy from the generation sources.
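The selection procedure can be sketched as follows. The disclosure only states that the prediction model PM is simple and lightweight, so a one-variable least-squares fit scored by its coefficient of determination is substituted here as a stand-in; the function names, the threshold of 0.5, and the sample values are all hypothetical.

```python
import statistics


def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit, standing in for the prediction accuracy PA."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # For a simple linear fit, R^2 equals the squared correlation coefficient.
    return (sxy * sxy) / (sxx * syy)


def select_sources(feature_by_series, target, threshold=0.5):
    """Keep the series whose generation source feature predicts the target well enough."""
    selected = []
    for name, feature in feature_by_series.items():
        if statistics.pvariance(feature) == 0:
            continue  # no time-series change: excluded before any model is fitted
        if r_squared(feature, target) >= threshold:
            selected.append(name)  # accuracy at or above the threshold: kept
    return selected


# Hypothetical generation source features (one statistic per time step) for L1 to L3.
target_t = [1, 2, 3, 4, 5]
features = {
    "L1": [2, 4, 6, 8, 10],  # changes over time and tracks the target
    "L2": [7, 7, 7, 7, 7],   # constant
    "L3": [3, 1, 4, 1, 5],   # changes over time but predicts the target poorly
}

print(select_sources(features, target_t))
```

With these values only L1 remains as a generation source: L2 is dropped because it does not change, and L3 because its accuracy score falls below the threshold.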

<How to generate intra-session feature data>
Next, a method of generating intra-session feature data by the intra-session feature generation unit 124 will be described.

As shown in FIG. 15, the intra-session feature generation unit 124 controls the metadata extraction unit 124a to extract, based on the flow data, for example, the number of series, the series length, the variance of each series, and the number of attribute data in the flow data as metadata of the flow data.

More specifically, the metadata extraction unit 124a extracts metadata from the series data selected as the generation source by the generation source selection unit 123 from among the flow data.
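A minimal sketch of such metadata extraction is shown below. The function name `extract_metadata`, the dictionary layout, and the use of the population variance are assumptions for illustration; the ball speed values are those described for FIG. 5.

```python
import statistics


def extract_metadata(series_data, attribute_columns):
    """Summarize the selected series and attribute columns as flow data metadata."""
    return {
        "num_series": len(series_data),
        "series_length": {name: len(values) for name, values in series_data.items()},
        "series_variance": {
            name: statistics.pvariance(values) for name, values in series_data.items()
        },
        "num_attributes": len(attribute_columns),
    }


# Series selected as generation sources, and the attribute columns of the flow data.
selected_series = {"ball_speed": [140, 150, 120, 120, 110, 90, 130, 155]}
attribute_columns = ["pitcher_id", "result"]

metadata = extract_metadata(selected_series, attribute_columns)
print(metadata)
```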

Incidentally, the metadata of the flow data may be a machine learning model or algorithm generated by the machine learning model generation unit 63 based on feature data generated based on the flow data.

The in-session feature generation unit 124 acquires and pools, as paired information, various metadata and the distribution of the feature generation methods that finally proved effective, and learns from these pairs to generate an estimation model 124b that estimates, based on metadata, a method of generating effective feature quantities.

Therefore, the intra-session feature amount generation unit 124 controls this estimation model 124b to estimate the effective feature amount generation method based on the metadata of the extracted flow data.
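The idea of estimating a generation method from pooled pairs can be illustrated with a deliberately simple stand-in. The disclosure does not specify the form of the estimation model 124b, so a nearest-neighbour lookup is substituted here; the function name `estimate_method`, the metadata vector layout, and all pooled values are hypothetical.

```python
import math


def estimate_method(pool, metadata_vector):
    """Return the generation method paired with the closest pooled metadata vector.

    A 1-nearest-neighbour lookup used purely as a stand-in for the learned
    estimation model; a practical version would at least rescale the features.
    """
    _, method = min(pool, key=lambda pair: math.dist(pair[0], metadata_vector))
    return method


# Hypothetical pool of (metadata vector, effective generation method) pairs.
# Metadata vector layout (assumed): (number of series, series length, number of attributes).
pool = [
    ((3.0, 100.0, 2.0), {"WB": 50, "WS": 20, "WSS": 20, "WA": 10}),
    ((10.0, 5000.0, 1.0), {"WB": 10, "WS": 60, "WSS": 20, "WA": 10}),
]

estimated = estimate_method(pool, (4.0, 120.0, 2.0))
print(estimated)
```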

That is, the estimation model 124b estimates a method for generating effective feature amounts based on the metadata of flow data composed of series data that, among the series data extracted in the output format determined from the column indicating the time set by the user, the column indicating the session unit, and the prediction target column, were judged by the generation source selection unit 123 to have a prediction accuracy higher than a predetermined accuracy threshold.

As a result, a method is estimated for generating effective features using those series data that, among the series data constituting the flow data, reflect the column indicating the time set by the user, the column indicating the session unit, and the prediction target column, and have high prediction accuracy for the prediction target.

As a result, it is possible to generate effective features that reflect the column indicating the time set by the user, the column indicating the session unit, and the prediction target column, and are optimal for generating a machine learning model with high prediction accuracy.

Information on how to generate effective features is information that specifies the generation method (= calculation method), including, for example, how to use the series data used for effective features, how to set windows, and how to set the proportions and weights of each value in the elements of the feature.

More specifically, the information specifying how to use the series data used to generate effective features is, for example, information specifying that categorical series data and numerical series data are used in a predetermined ratio such as 40:60.

Further, the information specifying the window setting method is, for example, information specifying that the information obtained in the window WB and window WS in FIG. 12 and the window WSS and window WA in FIG. 13 is used in a ratio of, for example, 50:20:20:10.

Furthermore, the information specifying the ratio of each value and the setting method of the weight in each element of the feature amount is, for example, information specifying the proportions and weights assigned to each of Ave(Xn), Min(Xn), Max(Xn), Var(Xn), and Stde(Xn).

Then, the in-session feature amount generation unit 124 generates effective feature amounts from each series data using the estimated generation method, generates in-session feature amount data from the generated effective feature amounts, and outputs it.
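One possible way to hold this generation method information is sketched below. The class `GenerationMethod`, the function `allocate`, the uniform statistic weights, and the reading of each ratio as a share of a fixed feature budget are assumptions made for illustration; only the 40:60 and 50:20:20:10 ratios come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationMethod:
    """Container for the three kinds of generation method information (hypothetical layout)."""
    series_ratio: dict = field(
        default_factory=lambda: {"categorical": 40, "numerical": 60})
    window_ratio: dict = field(
        default_factory=lambda: {"WB": 50, "WS": 20, "WSS": 20, "WA": 10})
    statistic_weights: dict = field(
        default_factory=lambda: {"Ave": 1.0, "Min": 1.0, "Max": 1.0, "Var": 1.0, "Stde": 1.0})


def allocate(ratio, budget):
    """Split a feature budget according to a ratio (integer division drops any remainder)."""
    total = sum(ratio.values())
    return {name: share * budget // total for name, share in ratio.items()}


method = GenerationMethod()
print(allocate(method.series_ratio, 10))
print(allocate(method.window_ratio, 10))
```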

<Inter-session feature data>
Next, the inter-session feature amount data generated by the inter-session feature amount generation unit 126 will be explained.

As described above, the inter-session feature amount generation unit 126 uses series data whose prediction accuracy PA is higher than a predetermined value based on the prediction model PM among the flow data, and is estimated from the metadata of the flow data. Inter-session feature data is generated using feature values obtained from the time context of intra-session feature data, which is comprised of features generated by the effective feature generation method.

That is, for example, when the flow data is a baseball pitching log, the feature amounts from x at-bats earlier for the same batter, the same pitcher, or the same batter and pitcher, or all past feature amounts, can be treated as inter-session features.

When an at-bat ID column is set as the session unit, the at-bat IDs are so-called integer-type data that has an order, so they are treated as time-series data that carries the context between sessions and can be used as inter-session features.

In addition, when a result column such as a hit or an out is set as a prediction target, information such as a hit or an out in the previous at-bat is so-called string-type data in which no order is recognized, but the inter-session feature amount can be set so that the order is specified based on the value of the time column.

Furthermore, by using, as a set, the information of the pitcher ID column that clusters the at-bat ID column serving as the session unit, the context may be calculated in units of classes within the grouped sessions. For example, if the pitcher ID column is a set that clusters at-bat IDs in session units, the value of the prediction target result column, such as a hit or an out, or the average ball speed in the previous at-bat against "the same pitcher" can be used as an inter-session feature amount.

More specifically, consider the case where, as shown in the left part of FIG. 16, the at-bat ID column is set as the session unit, the pitch column is set as the time column, and the result column is set as the prediction target, and, as shown in the center part of FIG. 16, intra-session feature data is generated that consists, from the left, of a pitcher ID column that clusters the at-bat ID column serving as the session unit, an at-bat ID column serving as the session unit, a result column, and an average ball speed column for each at-bat.

In this case, since the at-bat ID identifies each session, the feature amount of the previous at-bat is an inter-session feature amount. Therefore, for the per-at-bat ball speed average column of the intra-session feature amounts shown in the center part of FIG. 16, a pitcher's previous ball speed average column, corresponding to the pitcher ID column that clusters the at-bat ID column for each session, is added in the inter-session feature amounts as shown in the right part of FIG. 16.

In addition, in the pitcher's previous ball speed average column in the inter-session features in the right part of FIG. 16, the values of 137 km/h and 115 km/h for at-bat IDs 0 and 1 are recorded as the previous ball speed average values of the same pitcher for at-bat IDs 2 and 3.

Furthermore, since no previous ball speed average value exists for the pitchers with pitcher IDs A and B in the at-bats with at-bat IDs 0 and 1, "NaN" is set.
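The addition of the pitcher's previous ball speed average column can be sketched as follows. The function name and key names are hypothetical, `None` stands in for NaN, and the pitcher assignment A, B, A, B as well as the averages for at-bat IDs 2 and 3 are made-up values; only the averages 137 and 115 for at-bat IDs 0 and 1 come from the description.

```python
def add_pitcher_previous_average(intra_rows):
    """Append the same pitcher's average ball speed from that pitcher's previous at-bat."""
    last_avg = {}  # pitcher_id -> average ball speed of that pitcher's most recent at-bat
    out = []
    for row in sorted(intra_rows, key=lambda r: r["at_bat_id"]):
        # None (standing in for NaN) when the pitcher has no earlier at-bat.
        out.append({**row, "pitcher_prev_avg_speed": last_avg.get(row["pitcher_id"])})
        last_avg[row["pitcher_id"]] = row["avg_ball_speed"]
    return out


# Intra-session feature data, one row per at-bat (session).
intra_session = [
    {"pitcher_id": "A", "at_bat_id": 0, "avg_ball_speed": 137},
    {"pitcher_id": "B", "at_bat_id": 1, "avg_ball_speed": 115},
    {"pitcher_id": "A", "at_bat_id": 2, "avg_ball_speed": 125},  # hypothetical value
    {"pitcher_id": "B", "at_bat_id": 3, "avg_ball_speed": 130},  # hypothetical value
]

inter_session = add_pitcher_previous_average(intra_session)
for row in inter_session:
    print(row)
```

Because the lookup is keyed by pitcher ID rather than by at-bat ID, the lag follows the session cluster: the previous at-bat of the same pitcher, not simply the preceding session.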

Furthermore, by using a session set that clusters session units, it is also possible to create features such as the previous turn at bat in the same session set.

<Example of flow data of this disclosure>
Next, with reference to FIG. 17, examples will be explained of the session (unit) ID, the time unit, the attribute data, the time series data, the intra-session feature amount, the inter-session feature amount, and the session set (unit) ID used when session units are clustered, for cases where the flow data of the present disclosure is a hospital vital log, a factory robot log, or a baseball pitching log.

That is, when the flow data is a hospital vital log, an example of the session (unit) ID is the patient ID, an example of the time unit is the date and time, an example of the attribute data is the patient's gender, an example of the time-series data is a heartbeat signal, an example of an intra-session feature is the patient's average heartbeat, an example of an inter-session feature is the age of patients by hospital, and an example of the session set (unit) ID is the hospital ID.

Further, when the flow data is a factory robot log, an example of the session (unit) ID is the operation ID, an example of the time unit is the date and time, an example of the attribute data is the installation location of the robot, an example of the time-series data is a torque sensor signal, an example of an intra-session feature is the average number of stops of a robot on that day, an example of an inter-session feature is the total number of stops for each robot, and an example of the session set (unit) ID is the robot ID.

Further, when the flow data is a baseball pitching log, an example of the session (unit) ID is the at-bat ID, an example of the time unit is the number of pitches in the at-bat, an example of the attribute data is whether the pitcher is left-handed or right-handed, an example of the time-series data is the ball speed, an example of an intra-session feature is the average ball speed within an at-bat, an example of an inter-session feature is the results of the past three at-bats against the same pitcher, and an example of the session set (unit) ID is the pitcher ID.

In addition, for flow data, examples of session (unit) IDs, examples of time units, examples of attribute data, examples of time series data, examples of intra-session features, examples of inter-session features, and clustering of session units. Examples of session set (unit) IDs in this case are not limited to those shown in FIG. 17.

<Example of presentation of feature data>
Next, with reference to FIG. 18, a presentation example in which flow data and feature data are visualized and presented by the generated feature visualization unit 103 will be described.

FIG. 18 shows an example of presentation when the flow data is a baseball pitching log.

In the feature data presentation example in FIG. 18, the feature data table is displayed in the upper row, and a graph showing detailed data for a specified part of the in-session feature data in the feature data table is displayed in the lower row. In addition, a field for displaying the effectiveness score of the entire feature data is provided at the upper right of the feature data table, and FIG. 18 shows, for example, that the effectiveness score for the prediction of the prediction target is 85 points out of 100 points.

The feature data table includes, from the left, a data ID column, an at-bat ID column as the session unit, a pitcher ID column as attribute data, a result column, a pitch ID column as the time column, a ball speed column as the prediction target, a previous-pitch ball speed column, a most recent three-pitch ball speed average column, and a previous-pitch pitch type column as intra-session feature data, and a pitcher's previous ball speed average column as inter-session feature data.

In FIG. 18, in the data ID column, 1, 2, 3, 4, 5, and 6 are displayed in order from the top.

In addition, in the at-bat ID column, which is the column for each session, 1, 1, 1, 1, 1, 2 are displayed in order from the top, showing that the data with data IDs = 1 to 5 belong to the at-bat with at-bat ID = 1 and that the data with data ID = 6 belongs to the at-bat with at-bat ID = 2.

Furthermore, in the pitcher ID column as attribute data, A, A, A, A, A, B are displayed in order from the top, showing that the data with data IDs = 1 to 5 belong to pitcher ID = A and that the data with data ID = 6 belongs to pitcher ID = B.

In addition, in the result column, "hit", "hit", "hit", "hit", "hit", and "out" are displayed from the top, showing that the result for data IDs = 1 to 5 is a hit and that the result for data ID = 6 is an out.

In the pitch ID column as the time column, 1, 2, 3, 4, 5, 1 are displayed in order from the top, showing that the data with data IDs = 1 to 5 are the data of the first to fifth pitches thrown by the pitcher with pitcher ID = A in the at-bat with at-bat ID = 1, and that the batter made a hit on the pitch with pitch ID = 5.

It is also shown that the data with data ID = 6 is the data for the pitch with pitch ID = 1.

In the ball speed column that is the prediction target, 143.9, 140.2, 130.9, 90.4, 124.3, 150.2 are displayed in order from the top, showing that, in the at-bat with at-bat ID = 1, the ball speed of the first pitch with pitch ID = 1 is 143.9 km/h, the ball speed of the second pitch with pitch ID = 2 is 140.2 km/h, the ball speed of the third pitch with pitch ID = 3 is 130.9 km/h, the ball speed of the fourth pitch with pitch ID = 4 is 90.4 km/h, and the ball speed of the fifth pitch with pitch ID = 5 is 124.3 km/h, and that the ball speed of the first pitch with pitch ID = 1 in the at-bat with at-bat ID = 2 is 150.2 km/h.

In the previous-pitch ball speed column as intra-session feature data, NaN, 143.9, 140.2, 130.9, 90.4, NaN are displayed from the top, showing the ball speed of the previous pitch for each of data IDs = 1 to 6.

In the most recent three-pitch ball speed average column, NaN, NaN, NaN, 138.3, 120.5, NaN are displayed from the top, showing the average ball speed of the most recent three pitches for each of data IDs = 1 to 6.
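The two intra-session columns described above can be reproduced with the following sketch. The function and key names are hypothetical, `None` stands in for NaN, and rounding the average to one decimal place is an assumption made to match the displayed values.

```python
def intra_session_features(rows):
    """Add the previous pitch's speed and the average of the three preceding pitches.

    Rows are assumed to be ordered by at-bat and then by pitch. Both features
    restart at every at-bat and never include the current pitch.
    """
    history = {}  # at_bat_id -> speeds of the pitches already seen in that at-bat
    out = []
    for row in rows:
        past = history.setdefault(row["at_bat_id"], [])
        prev_speed = past[-1] if past else None
        avg_last3 = round(sum(past[-3:]) / 3, 1) if len(past) >= 3 else None
        out.append({**row, "prev_speed": prev_speed, "avg_last3": avg_last3})
        past.append(row["ball_speed"])
    return out


# Ball speeds described for FIG. 18 (data IDs 1 to 6).
pitches = [
    {"data_id": 1, "at_bat_id": 1, "pitch_id": 1, "ball_speed": 143.9},
    {"data_id": 2, "at_bat_id": 1, "pitch_id": 2, "ball_speed": 140.2},
    {"data_id": 3, "at_bat_id": 1, "pitch_id": 3, "ball_speed": 130.9},
    {"data_id": 4, "at_bat_id": 1, "pitch_id": 4, "ball_speed": 90.4},
    {"data_id": 5, "at_bat_id": 1, "pitch_id": 5, "ball_speed": 124.3},
    {"data_id": 6, "at_bat_id": 2, "pitch_id": 1, "ball_speed": 150.2},
]

table = intra_session_features(pitches)
for row in table:
    print(row["data_id"], row["prev_speed"], row["avg_last3"])
```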

In the pitch type row before the first pitch, from the top it is written as NaN, straight, slider, changeup, slow ball, NaN, and in the turn at bat with turn ID = 1, the pitcher with pitcher ID = A, pitcher ID = It is shown that the types of pitches pitched one pitch before in pitches 2 to 5 are a straight ball, a slider, a changeup, and a slow ball, respectively.

In the pitcher's previous average ball speed column serving as inter-session feature data, 120.4, 120.4, 120.4, 120.4, 144.2 are displayed in order from the top. The average ball speed in the at-bat preceding the at-bat with at-bat ID = 1 is shown as 120.4 km/h, and the average ball speed in the at-bat preceding the at-bat with at-bat ID = 2 is shown as 144.2 km/h.

Furthermore, the lower part shows an example of a graph presenting detailed data when the most recent three-pitch average ball speed for pitch ID = 4, which is data ID = 4 in the feature data table, is specified.

In the lower graph, points indicating that the ball speeds for pitch ID = 1 to 5 are 143.9 km/h, 140.2 km/h, 130.9 km/h, 90.4 km/h, and 124.3 km/h are plotted, and the plotted points are connected by straight lines.

Furthermore, among these, it is indicated that the ball speeds of the three pitches immediately preceding pitch ID = 4 are 143.9 km/h, 140.2 km/h, and 130.9 km/h, respectively, and that their average is 138.3 km/h.

In the example shown in FIG. 18, for the ball speed to be predicted, the speed of the previous pitch, the average speed of the most recent three pitches, and the pitch type one pitch before are presented as intra-session feature data, and the pitcher's average ball speed in the previous at-bat is presented as inter-session feature data.

With the presentation shown in FIG. 18, the user can recognize that, for generating a machine learning model that predicts the ball speed, the speed of the previous pitch, the average speed of the most recent three pitches, and the pitch type one pitch before have been proposed as intra-session feature data, and that the pitcher's average ball speed in the previous at-bat has been proposed as inter-session feature data.

Furthermore, by presenting the effectiveness score, it becomes possible to recognize to some extent the accuracy expected in prediction using a machine learning model generated using feature data.

As a result, the feature data required to generate a machine learning model can be generated simply by inputting flow data and specifying, for the flow data, a column indicating the time, a column indicating the session unit, and a prediction target column.

Note that if, on referring to the feature data presented in FIG. 18, the effectiveness score is low and it is judged that the presented feature data is not sufficient for generating a machine learning model, the column indicating the time and the column indicating the session unit specified for the flow data may be changed, for example, and the feature data may be generated again. Other flow data may also be used.

<Feature amount data generation process>
Next, with reference to the flowchart in FIG. 19, the feature data generation process realized by the functions of the UI control unit 61 and data processing unit 62 in FIG. 4 will be described.

In step S31, the flow data input unit 101 receives input of flow data and outputs it to the generated feature quantity visualization unit 103 and the data processing unit 62.

In step S32, the column estimation unit 121 of the data processing unit 62 analyzes the flow data, estimates the columns that make up the flow data, and outputs the estimation result to the UI control unit 61.

In step S33, upon obtaining the estimation result for the columns of the flow data, the task setting unit 102 generates and presents a UI, such as the display image PV described with reference to FIG. 6, that shows the estimation result and prompts the user to input a session unit column, a time unit column, and a prediction target as task settings.

Then, the task setting unit 102 receives input from the user and outputs the session unit column, time unit column, and prediction target information input as the task settings to the data processing unit 62.

At this time, the task setting unit 102 further presents, on the UI, information prompting the user to input the prediction frequency and prediction time for the prediction target column as task settings, accepts the input prediction frequency and prediction time, and outputs them to the data processing unit 62.

In step S34, the output format determining unit 122 determines the output format of the series data to be read from the flow data based on the session unit column, time unit column, and prediction target information supplied as the task settings, and outputs it to the generation source selection unit 123.

In step S35, the generation source selection unit 123 extracts series data from the flow data according to the output format, executes the generation source selection process to select, from the extracted series data, series data that is highly effective in predicting the prediction target, and outputs the selected series data to the intra-session feature generation unit 124.

Note that details of the generation source selection process will be described later with reference to the flowchart in FIG. 20.

In step S36, the intra-session feature generation unit 124 executes an intra-session feature generation process, uses the selected series data to generate intra-session feature data, and outputs it to the feature selection unit 125.

Note that details of the intra-session feature amount generation process will be described later with reference to the flowchart of FIG. 21.

In step S37, the intra-session feature selection unit 141 of the feature selection unit 125 controls the effectiveness score calculation unit 143 to calculate, for each feature constituting the supplied intra-session feature data, an effectiveness score for predicting the prediction target, acquires the calculated effectiveness scores, and also outputs them to the loop determination unit 129.

In step S38, the intra-session feature selection unit 141 selects, as effective features, those features constituting the intra-session feature data whose effectiveness scores are higher than a predetermined score threshold, excludes the other features, reconstructs intra-session feature data consisting of the effective features, and outputs it to the inter-session feature generation unit 126 and the combining unit 127.
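The scoring and selection of steps S37 and S38 can be sketched as follows, assuming that the effectiveness score is the mutual information between a discrete feature and the prediction target, which is one of the options named in configuration <11>. The function names, the toy columns, and the threshold value are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def select_effective(features, target, score_threshold):
    """Score every feature column, then keep only those above the threshold."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    kept = {name: features[name] for name, s in scores.items() if s > score_threshold}
    return kept, scores

# Hypothetical toy data: one feature determines the target, the other is independent of it.
target = ["hit", "out", "hit", "out"]
features = {
    "informative": ["a", "b", "a", "b"],
    "uninformative": ["a", "a", "b", "b"],
}
kept, scores = select_effective(features, target, score_threshold=0.1)
```

In this toy case the informative feature scores ln 2 and is kept, while the independent feature scores 0 and is excluded. The same selection would apply unchanged to the inter-session features in steps S40 and S41.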

In step S39, upon acquiring the intra-session feature data supplied from the feature selection unit 125, the inter-session feature generation unit 126 stores it, generates inter-session feature data using other intra-session feature data, and outputs the generated inter-session feature data to the feature selection unit 125.

In step S40, the inter-session feature selection unit 142 of the feature selection unit 125 controls the effectiveness score calculation unit 143 to calculate, for each feature constituting the supplied inter-session feature data, an effectiveness score for predicting the prediction target, acquires the calculated effectiveness scores, and also outputs them to the loop determination unit 129.

In step S41, the inter-session feature selection unit 142 selects, as effective features, those features constituting the inter-session feature data whose effectiveness scores are higher than a predetermined score threshold, excludes the other features, reconstructs inter-session feature data consisting of the effective features, and outputs it to the combining unit 127.

In step S42, the combining unit 127 combines the intra-session feature data and the inter-session feature data to generate feature data, and stores the generated feature data in the feature data storage 128.

In step S43, the loop determination unit 129 calculates the overall effectiveness score of the feature data stored in the feature data storage 128 based on the effectiveness scores of the individual features of the corresponding intra-session feature data and inter-session feature data, and determines whether the overall effectiveness score is greater than or equal to a predetermined value or whether the elapsed time from the start of the process has exceeded a predetermined time.

If it is determined in step S43 that the overall effectiveness score of the feature data is smaller than the predetermined value and that the elapsed time from the start of the process has not exceeded the predetermined time, the process proceeds to step S44.

In step S44, the loop determination unit 129 controls the generation source selection unit 123 and the feature selection unit 125 so that the accuracy threshold used in the generation source selection process and the score threshold set for the effectiveness score are each reduced by a predetermined value, and the process returns to step S35 to execute the feature data generation process again.

That is, if the overall effectiveness score of the feature data is smaller than the predetermined value in step S43 and the elapsed time from the start of the process has not exceeded the predetermined time, effective series data and features may exist among those that were excluded. Therefore, the accuracy threshold and the score threshold are each reduced by a predetermined value, and the feature data is generated again.

However, in this case, the feature data generated up to that point remains stored in the feature data storage 128 and remains valid thereafter. In the subsequent processing, features that have already been generated as feature data are treated as already generated, and the series data and features excluded in the previous processing are restored so that feature data can be generated again. For example, in calculating the effectiveness score in the feature selection unit 125, a machine learning model may be created using the union of the features stored in the feature data storage 128 and the newly generated features, and the resulting improvement in accuracy may be newly calculated as the effectiveness score.

If it is determined in step S43 that the overall effectiveness score of the feature data is equal to or greater than the predetermined value, or that the elapsed time from the start of the process has exceeded the predetermined time, the process proceeds to step S45.

In step S45, the loop determination unit 129 reads out, from among the feature data stored in the feature data storage 128, the feature data having the highest overall effectiveness score, outputs it to the UI control unit 61 for presentation to the user, and also outputs it to the machine learning model generation unit 63.

In response, the generated feature visualization unit 103 of the UI control unit 61 generates a UI based on the flow data and feature data and presents it to the user.

In addition, if it is determined in step S43 that the predetermined time has elapsed from the start of the process while the overall effectiveness score of the feature data remains smaller than the predetermined value, the effectiveness score of the feature data is insufficient, and the prediction accuracy of a machine learning model generated from that feature data may also be insufficient. In this case, the generated feature visualization unit 103 may present the current effectiveness score together with an indication that the prediction accuracy may be insufficient with the current feature data.
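The control flow of steps S35 to S45 can be summarized by the following sketch. It is only an illustration under stated assumptions: `generate_once` is a hypothetical callable standing in for steps S35 to S42 and for the calculation of the overall effectiveness score, the list `storage` plays the role of the feature data storage 128, and the threshold decrement `step` is an arbitrary value.

```python
import time

def run_generation_loop(generate_once, accuracy_threshold, score_threshold,
                        target_score, time_limit_s, step=0.1):
    """Repeat feature generation, relaxing both thresholds, until the overall
    effectiveness score is high enough or the time limit is exceeded."""
    storage = []  # every generated result is kept, as in feature data storage 128
    start = time.monotonic()
    while True:
        # Stand-in for S35-S42 plus the overall score computed in S43
        feature_data, overall = generate_once(accuracy_threshold, score_threshold)
        storage.append((overall, feature_data))
        timed_out = time.monotonic() - start > time_limit_s
        if overall >= target_score or timed_out:  # S43
            break
        accuracy_threshold -= step                # S44: relax both thresholds
        score_threshold -= step
    # S45: output the stored feature data with the highest overall score
    best_score, best_data = max(storage, key=lambda item: item[0])
    return best_data, best_score, len(storage)

def fake_generate(acc_thr, score_thr):
    # Hypothetical stand-in: lower thresholds admit more features and raise the score.
    return [f"features@{score_thr:.1f}"], round(1.0 - score_thr, 2)

best_data, best_score, rounds = run_generation_loop(
    fake_generate, accuracy_threshold=0.8, score_threshold=0.5,
    target_score=0.7, time_limit_s=60,
)
```

Keeping every intermediate result in `storage` mirrors the point that previously generated feature data stays valid while the thresholds are relaxed.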

<Generation source selection process>
Next, generation source selection processing by the generation source selection unit 123 will be described with reference to the flowchart of FIG. 20.

In step S71, the generation source selection unit 123 excludes, from among the series data extracted from the flow data based on the output format, time-series data that does not change over time, treating it as series data irrelevant to the prediction of the prediction target.

In step S72, the generation source selection unit 123 acquires a partial sequence for each sequence data and creates a feature amount table consisting of predetermined statistics.

In step S73, the generation source selection unit 123 generates a prediction model that predicts the prediction target based on the feature table for each series of data.

In step S74, the generation source selection unit 123 calculates the prediction accuracy of the prediction result based on the prediction model for each series of data.

In step S75, the generation source selection unit 123 selects the series data for which the prediction accuracy of the prediction result based on the prediction model is higher than a predetermined accuracy threshold as the generation source of the intra-session features, and outputs it to the intra-session feature generation unit 124.

In other words, through the above processing, among the series data extracted from the flow data based on the output format, the series data that is highly effective for predicting the prediction target is selected as the series data from which the intra-session feature values are generated. Then, it becomes possible to output it to the intra-session feature amount generation unit 124.

As a result, it is possible to improve the prediction accuracy of a machine learning model generated by machine learning based on feature data consisting of intra-session feature data and inter-session feature data.
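Steps S71 to S75 can be sketched as follows. The text does not specify the statistics, the prediction model, or the accuracy measure, so this illustration substitutes simple stand-ins: the mean of each preceding partial sequence as the only statistic, a one-variable least-squares fit as the prediction model, and the coefficient of determination R^2 as the prediction accuracy. All names and data are hypothetical.

```python
from statistics import mean

def window_means(series, width):
    """S72: mean of each length-`width` partial sequence ending just before index i."""
    return [mean(series[i - width:i]) for i in range(width, len(series))]

def r_squared(xs, ys):
    """S73-S74: R^2 of a one-variable least-squares fit of ys on xs
    (0.0 when either side is constant)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def select_sources(series_by_name, target, width=2, accuracy_threshold=0.5):
    selected = {}
    for name, series in series_by_name.items():
        if len(set(series)) <= 1:            # S71: drop series that never change
            continue
        feats = window_means(series, width)  # S72: feature table per series
        acc = r_squared(feats, target[width:])
        if acc > accuracy_threshold:         # S75: keep sufficiently accurate sources
            selected[name] = acc
    return selected

target = [1, 2, 3, 4, 5, 6]
series_by_name = {
    "trend": [1, 2, 3, 4, 5, 6],
    "constant": [7, 7, 7, 7, 7, 7],
    "alternating": [3, 1, 3, 1, 3, 1],
}
selected = select_sources(series_by_name, target)
```

Here only the series named "trend" survives: the constant series is removed in the first step, and the alternating series yields a constant statistic that carries no information about the target.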

<Intra-session feature data generation process>
Next, with reference to the flowchart of FIG. 21, the intra-session feature quantity data generation process by the intra-session feature quantity generation unit 124 will be described.

In step S91, the intra-session feature generation unit 124 controls the metadata extraction unit 124a to extract and generate metadata from the flow data.

In step S92, the intra-session feature generation unit 124 uses the estimation model 124b to estimate a method for creating an effective feature from the metadata.

In step S93, the intra-session feature generation unit 124 generates features from the series data supplied from the generation source selection unit 123 as the generation source of intra-session features, using the method of creating effective features estimated by the estimation model 124b, generates intra-session feature data based on the generated features, and outputs it to the feature selection unit 125.

Through the above processing, the series data that the generation source selection unit 123 selected, from among the series data extracted from the flow data, as effective for predicting the prediction target is used as the generation source of the features constituting the intra-session feature data, and effective features are then generated using the creation method estimated from the metadata generated from the flow data.
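The three steps S91 to S93 (extract metadata, estimate a creation method, generate features) can be pictured with the sketch below. It rests on several assumptions: the metadata fields are invented for illustration, and a small hand-written rule table stands in for the learned estimation model 124b, whose actual inputs and outputs are not reproduced here.

```python
from statistics import mean

def extract_metadata(series):
    # S91: hypothetical metadata fields chosen only for this illustration
    numeric = all(isinstance(v, (int, float)) for v in series)
    return {"numeric": numeric, "length": len(series), "n_unique": len(set(series))}

def estimate_methods(metadata):
    # S92: hand-written rules standing in for the learned estimation model 124b
    if metadata["numeric"] and metadata["length"] >= 4:
        return ["lag1", "mean3"]
    return ["lag1"]

# S93: feature creation methods, keyed by the names the estimator returns
GENERATORS = {
    "lag1": lambda s: [None] + list(s[:-1]),
    "mean3": lambda s: [None if i < 3 else round(mean(s[i - 3:i]), 1)
                        for i in range(len(s))],
}

def generate_features(series):
    methods = estimate_methods(extract_metadata(series))
    return {m: GENERATORS[m](series) for m in methods}

speeds = [143.9, 140.2, 130.9, 90.4, 124.3]
pitch_types = ["straight", "slider", "changeup", "slow ball"]
speed_features = generate_features(speeds)
type_features = generate_features(pitch_types)
```

With these stand-in rules, a numeric series such as ball speed receives both a one-step lag and a three-step preceding average, while a categorical series such as pitch type receives only the lag.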

Furthermore, as described above, the feature selection unit 125 calculates an effectiveness score for each of the features constituting the intra-session feature data, selects only the features whose effectiveness scores are higher than a predetermined score threshold, and reconstructs the intra-session feature data.

Furthermore, inter-session feature data is generated based on this intra-session feature data, and among the features that make up this inter-session feature data, those whose effectiveness scores are higher than a predetermined score threshold are selected. Then, the inter-session feature data is reconstructed.

That is, intra-session feature data consisting of features with high effectiveness scores for predicting the prediction target is generated, inter-session feature data is generated based on that intra-session feature data, and features are then further selected based on their effectiveness scores to reconstruct the inter-session feature data.

The intra-session feature data and inter-session feature data generated in this way are then combined to generate feature data, which makes it possible to generate feature data that is highly effective in predicting the prediction target.

In addition, when the overall effectiveness score of the generated feature data is lower than a predetermined threshold and is deemed insufficient for predicting the prediction target, and as long as the set processing time has not been exceeded, more effective series data and features may exist among those that were excluded. Therefore, the accuracy threshold and the score threshold are each reduced by a predetermined value, and the feature data is generated again.

As a result, it becomes possible to generate a larger amount of highly effective feature data used to generate a machine learning model that predicts a prediction target.

<Modified example>
By clustering sessions, for example, a superset of the set sessions may be created.

For example, when setting sessions, the sessions may be clustered in advance, supersets of the sessions may be set, and a session may be set for each superset.

For example, the superset of sessions may be determined by decomposing the time-series data into shapelets, discretizing the set of characteristic partial waveforms, treating the discretized partial waveforms as words and the time-series data or sessions as sentences, and computing TF-IDF (Term Frequency-Inverse Document Frequency) values.

That is, for example, consider a case where sessions FW1 to FW3 as shown in the upper part of FIG. 22 exist.

Here, session FW1 is regarded as a set consisting of characteristic partial waveforms PW1-1, PW2-1, and PW3-1, session FW2 as a set consisting of characteristic partial waveforms PW1-11, PW3-11, and PW3-12, and session FW3 as a set consisting of characteristic partial waveforms PW2-21 and PW1-21. Each partial waveform is discretized, and TF-IDF is computed over the partial waveforms.

In the lower part of FIG. 22, the TF-IDF values of (PW1, PW2, PW3) are shown as (0, 0.1353, 0.1353) for session FW1, (0, 0, 0.2706) for session FW2, and (0, 0.2050, 0) for session FW3.

Then, by clustering based on a vector based on the TF-IDF value for each session, sessions with a high degree of similarity may be placed in the same class, and a superset may be set.
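As an illustrative sketch, the TF-IDF vectors for sessions FW1 to FW3 can be computed as below, treating each discretized partial waveform as a word. The weighting used here (relative term frequency multiplied by the natural logarithm of the inverse document frequency) is an assumption, since the exact variant is not stated. It reproduces the zero pattern of FIG. 22 and gives approximately (0, 0.1352, 0.1352), (0, 0, 0.2703), and (0, 0.2027, 0), which are close to but not identical with the values in the figure. The cosine similarity function is likewise only one possible basis for the subsequent clustering.

```python
import math
from collections import Counter

# Sessions as "sentences" of discretized partial waveforms ("words")
sessions = {
    "FW1": ["PW1", "PW2", "PW3"],
    "FW2": ["PW1", "PW3", "PW3"],
    "FW3": ["PW2", "PW1"],
}
vocab = ["PW1", "PW2", "PW3"]

def tf_idf(sessions, vocab):
    """TF-IDF vector per session: (count / session length) * ln(N / document frequency)."""
    n = len(sessions)
    df = {w: sum(w in words for words in sessions.values()) for w in vocab}
    vectors = {}
    for name, words in sessions.items():
        counts = Counter(words)
        vectors[name] = [
            (counts[w] / len(words)) * math.log(n / df[w]) for w in vocab
        ]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = tf_idf(sessions, vocab)
sim_12 = cosine(vectors["FW1"], vectors["FW2"])
sim_23 = cosine(vectors["FW2"], vectors["FW3"])
```

PW1 appears in every session, so its weight is zero everywhere and it does not influence the similarity between sessions.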

In addition, when the at-bat ID is set as the session and the data has one row per session as shown in the left part of FIG. 23, metadata may be extracted from the flow data, and, based on the extracted metadata, the at-bat IDs (sessions) may be clustered using statistics of the attribute data for each at-bat ID, such as the frequency of pitcher IDs, to group the sessions and create a session superset column (the new cluster ID column in the figure).

The right part of FIG. 23 shows an example in which the at-bats, identified by the at-bat ID that is the session, are clustered by opposing pitcher using the pitcher ID extracted as metadata of the flow data, and are assigned cluster IDs A, B, and A from top to bottom. That is, here, the cluster ID corresponds to the pitcher ID.
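A minimal sketch of this grouping is given below. It replaces general clustering with the simplest case in which each at-bat is assigned to the pitcher ID that occurs most frequently in it, so that the cluster ID coincides with the pitcher ID. The rows are hypothetical flow data, not values taken from FIG. 23.

```python
from collections import Counter, defaultdict

# Hypothetical flow-data rows: (at_bat_id, pitcher_id), one row per pitch
rows = [
    (1, "A"), (1, "A"), (1, "A"),
    (2, "B"), (2, "B"),
    (3, "A"),
]

def cluster_ids(rows):
    """Assign each session (at-bat) the most frequent pitcher ID as its cluster ID."""
    per_session = defaultdict(Counter)
    for at_bat_id, pitcher_id in rows:
        per_session[at_bat_id][pitcher_id] += 1
    return {s: c.most_common(1)[0][0] for s, c in sorted(per_session.items())}

clusters = cluster_ids(rows)
```

The resulting mapping can be appended to the one-row-per-session table as the new cluster ID column.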

<<3. Example of execution using software >>
Incidentally, the series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the programs constituting the software are installed from a recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose computer that can execute various functions by installing various programs.

FIG. 24 shows an example of the configuration of a general-purpose computer. This computer has a built-in CPU (Central Processing Unit) 1001. An input/output interface 1005 is connected to the CPU 1001 via a bus 1004. A ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are connected to the bus 1004 .

Connected to the input/output interface 1005 are an input unit 1006 consisting of input devices such as a keyboard and mouse with which the user inputs operation commands, an output unit 1007 that outputs processing operation screens and images of processing results to a display device, a storage unit 1008 consisting of a hard disk drive or the like that stores programs and various data, and a communication unit 1009 consisting of a LAN (Local Area Network) adapter or the like that executes communication processing via a network typified by the Internet. Also connected is a drive 1010 that reads and writes data to and from a removable storage medium 1011 such as a magnetic disk (including a flexible disk), an optical disk (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disk (including an MD (Mini Disc)), or a semiconductor memory.

The CPU 1001 executes various processes according to a program stored in the ROM 1002, or a program that is read from a removable storage medium 1011 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003. The RAM 1003 also stores, as appropriate, data necessary for the CPU 1001 to execute various processes.

In the computer configured as described above, the above-described series of processes is performed by the CPU 1001, for example, loading a program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executing it.

A program executed by the computer (CPU 1001) can be provided by being recorded on a removable storage medium 1011 such as a package medium, for example. Additionally, programs may be provided via wired or wireless transmission media, such as local area networks, the Internet, and digital satellite broadcasts.

In the computer, a program can be installed in the storage unit 1008 via the input/output interface 1005 by attaching the removable storage medium 1011 to the drive 1010. Further, the program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. In addition, the program can be installed in the ROM 1002 or the storage unit 1008 in advance.

Note that the program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or a program in which processing is performed in parallel or at necessary timing, such as when a call is made.

Note that the CPU 1001 in FIG. 24 realizes the functions of the control unit 51 of the information processing device 31 in FIG. 2.

Furthermore, in this specification, a system refers to a collection of multiple components (devices, modules (components), etc.), regardless of whether all the components are located in the same casing. Therefore, multiple devices housed in separate casings and connected via a network, and a single device with multiple modules housed in one casing, are both systems.

Note that the embodiments of the present disclosure are not limited to the embodiments described above, and various changes can be made without departing from the gist of the present disclosure.

For example, the present disclosure can take a cloud computing configuration in which one function is shared and jointly processed by multiple devices via a network.

Furthermore, each step described in the above flowchart can be executed by one device or can be shared and executed by multiple devices.

Further, when one step includes multiple processes, the multiple processes included in that one step can be executed by one device or can be shared and executed by multiple devices.

Note that the present disclosure can also take the following configuration.
<1> A metadata generation unit that generates metadata of flow data including at least time-series data;
an estimating unit that estimates a method of generating feature amounts from series data forming the flow data based on the metadata;
An information processing apparatus comprising: a feature amount generating section that generates a feature amount from the series data using a generation method estimated by the estimating section.
<2> Further including a setting unit that accepts settings of a session unit, a time unit, and a prediction target in the flow data,
The metadata generation unit generates the metadata from the series data extracted from the flow data according to the settings of the session unit, the time unit, and the prediction target. The information processing device according to <1>.
<3> Further including a column estimation unit that estimates columns constituting the flow data,
The setting unit generates and presents a UI (User Interface) image that presents the columns estimated by the column estimation unit and prompts setting of the session unit, the time unit, and the prediction target in the flow data in units of those columns, and receives the settings of the session unit, the time unit, and the prediction target based on the UI image. The information processing device according to <2>.
<4> Further including an output format determining unit that determines an output format of series data extracted from the flow data based on the session unit, the time unit, and the prediction target in the flow data set by the setting unit,
The metadata generation unit generates the metadata from the series data extracted from the flow data based on the output format determined according to the settings of the session unit, the time unit, and the prediction target. The information processing device according to <2>.
<5> Further including a selection unit that determines, for each of the series data extracted from the flow data based on the output format, the prediction accuracy related to the prediction of the prediction target, and selects series data whose prediction accuracy is higher than a predetermined accuracy threshold,
The metadata generation unit generates the metadata from the series data selected by the selection unit from among the series data extracted from the flow data, based on the output format. Information processing device.
<6> The selection unit calculates a feature amount for each partial sequence of each of the series data extracted from the flow data based on the output format, predicts the prediction target by inputting the feature amount for each partial sequence into a prediction model that predicts the prediction target, calculates the prediction accuracy related to the prediction of the prediction target for each of the series data from a comparison between the prediction target and the prediction result of the prediction model, and selects the series data whose prediction accuracy is higher than the predetermined accuracy threshold. The information processing device according to <5>.
<7> The feature generation unit generates features from the series data using the generation method estimated by the estimation unit, and generates an intra-session feature amount for each session based on the generated features. The information processing device according to <2>.
<8> Further including: an effectiveness score calculation unit that calculates an effectiveness score for the prediction of the prediction target for each of the feature amounts forming the intra-session feature amount; and
an intra-session feature selection unit that selects, from among the feature amounts constituting the intra-session feature amount, feature amounts whose effectiveness scores are higher than a predetermined score threshold, and reconstructs the intra-session feature amount. The information processing device according to <7>.
<9> The information processing device according to <8>, further including an inter-session feature generating unit that generates an inter-session feature including the inter-session feature based on the intra-session feature.
<10> The effectiveness score calculation unit calculates the effectiveness score for the prediction of the prediction target for each of the feature amounts forming the inter-session feature amount,
further including an inter-session feature selection unit that selects, from among the feature amounts constituting the inter-session feature amount, feature amounts whose effectiveness scores are higher than a predetermined score threshold, and reconstructs the inter-session feature amount. The information processing device according to <9>.
<11> The effectiveness score calculation unit calculates, as the effectiveness score, the mutual information between each of the intra-session feature amounts and the inter-session feature amounts and the prediction target. The information processing device according to <10>.
<12> The effectiveness score calculation unit calculates, as the effectiveness score, the prediction accuracy for predicting the prediction target using a machine learning model that is simply generated based on the feature amounts constituting the intra-session feature amount and the inter-session feature amount,
The intra-session feature selection unit selects a subset of the feature amounts for which the effectiveness score does not become lower than a predetermined score threshold, and reconstructs the intra-session feature amount,
The inter-session feature selection unit selects a subset of the feature amounts for which the effectiveness score does not become lower than a predetermined score threshold, and reconstructs the inter-session feature amount. The information processing device according to <10>.
<13> Further including: a combining unit that combines the reconstructed intra-session feature amount and the reconstructed inter-session feature amount; and
a determination unit that calculates an overall effectiveness score of the feature amounts combined by the combining unit, based on the effectiveness scores of the respective feature amounts of the reconstructed intra-session feature amount and the reconstructed inter-session feature amount, and determines whether the overall effectiveness score is smaller than a predetermined threshold,
When the overall effectiveness score is smaller than the predetermined threshold, the determination unit reduces the score threshold by a predetermined value and causes the intra-session feature selection unit and the inter-session feature selection unit to execute their processing again. The information processing device according to <10>.
<14> The estimating unit calculates the metadata of the flow data and a distribution of a method of creating features used for learning a predetermined machine learning model, which is generated from the series data extracted from the flow data. The estimation model is a pair of information, and is an estimation model generated by learning based on the pair of information, and estimates a method of generating the feature amount based on the metadata. information processing equipment.
<15> In addition to the time-series data that changes over time, the flow data further includes attribute data consisting of data that does not change over time. The information processing device according to any one of the above.
<16> An information processing method comprising the steps of: generating metadata of flow data including at least time-series data;
estimating, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and
generating feature quantities from the series data using the estimated generation method.
<17> A program that causes a computer to function as: a metadata generation unit that generates metadata of flow data including at least time-series data;
an estimation unit that estimates, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and
a feature quantity generation unit that generates feature quantities from the series data using the generation method estimated by the estimation unit.
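
The three steps recited in <16> and <17> (metadata generation, estimation of a generation method, feature generation) can be illustrated with a minimal Python sketch. This is not the disclosed implementation: the metadata fields, the `METHODS` catalogue, and the rule-based `estimate_methods` function are hypothetical stand-ins chosen for illustration, and in particular the rules take the place of the learned estimation model described in the disclosure.

```python
from statistics import mean, pstdev


def generate_metadata(series):
    """Summarise one series, given as a list of (timestamp, value) pairs.

    The chosen fields are illustrative assumptions, not those of the disclosure.
    """
    times = [t for t, _ in series]
    values = [v for _, v in series]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "length": len(values),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "n_unique": len(set(values)),
    }


# Hypothetical catalogue of feature generation methods.
METHODS = {
    "mean": mean,
    "std": pstdev,
    "min": min,
    "max": max,
    "last": lambda xs: xs[-1],
    "n_unique": lambda xs: len(set(xs)),
}


def estimate_methods(metadata):
    """Rule-based stand-in (assumption) for the learned estimation model."""
    if metadata["n_unique"] <= 3:
        # Few distinct values: treat the series as categorical-like.
        return ["last", "n_unique"]
    if metadata["length"] < 5:
        # Very short series: keep only simple summaries.
        return ["mean", "last"]
    return ["mean", "std", "min", "max"]


def generate_features(series, method_names):
    """Apply the estimated generation methods to the series values."""
    values = [v for _, v in series]
    return {name: METHODS[name](values) for name in method_names}


def run_pipeline(series):
    """Metadata generation, method estimation, then feature generation."""
    metadata = generate_metadata(series)
    method_names = estimate_methods(metadata)
    return generate_features(series, method_names)
```

Calling `run_pipeline` on a numeric series with many distinct values yields summary statistics, whereas a series with only a few distinct values yields the categorical-style features. The point of the sketch is that the choice of generation methods is driven by the metadata rather than fixed in advance.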

31 Information processing device, 61 UI control unit, 62 Data processing unit, 63 Machine learning model generation unit, 101 Flow data input unit, 102 Task setting unit, 103 Generated feature visualization unit, 121 Column estimation unit, 122 Output format determination unit, 123 Generation source selection unit, 124 Intra-session feature generation unit, 124a Metadata extraction unit, 124b Estimation model, 125 Feature selection unit, 126 Inter-session feature generation unit, 127 Combining unit, 128 Feature data storage, 129 Loop judgment unit

Claims

An information processing device comprising:
a metadata generation unit that generates metadata of flow data including at least time-series data;
an estimation unit that estimates, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and
a feature quantity generation unit that generates feature quantities from the series data using the generation method estimated by the estimation unit.
The information processing device according to claim 1, further comprising a setting unit that accepts settings of a session unit, a time unit, and a prediction target in the flow data,
wherein the metadata generation unit generates the metadata from the series data extracted from the flow data according to the settings of the session unit, the time unit, and the prediction target.
The information processing device according to claim 2, further comprising a column estimation unit that estimates columns constituting the flow data,
wherein the setting unit generates and presents a user interface (UI) image that presents the columns estimated by the column estimation unit and prompts setting, in units of columns, of the session unit, the time unit, and the prediction target in the flow data, and accepts the settings of the session unit, the time unit, and the prediction target column based on the UI image.
The information processing device according to claim 2, further comprising an output format determination unit that determines an output format of the series data extracted from the flow data, based on the session unit, the time unit, and the prediction target in the flow data set by the setting unit,
wherein the metadata generation unit generates the metadata from the series data extracted from the flow data based on the output format determined according to the settings of the session unit, the time unit, and the prediction target.
The information processing device, further comprising a selection unit that obtains, for each piece of series data extracted from the flow data based on the output format, a prediction accuracy for the prediction of the prediction target, and selects the series data whose prediction accuracy is higher than a predetermined accuracy threshold,
wherein the metadata generation unit generates the metadata from the series data selected by the selection unit from among the series data extracted from the flow data based on the output format.
The information processing device according to claim 5, wherein the selection unit calculates, for each piece of series data extracted from the flow data based on the output format, a feature quantity for each subsequence, predicts the prediction target by inputting the feature quantity for each subsequence into a prediction model that predicts the prediction target, calculates the prediction accuracy for each piece of series data by comparing the prediction target with the prediction result of the prediction model, and selects the series data whose prediction accuracy is higher than the predetermined accuracy threshold.
The information processing device according to claim 2, wherein the feature quantity generation unit generates feature quantities from the series data using the generation method estimated by the estimation unit, and generates, for each session, an intra-session feature quantity based on the generated feature quantities.
The information processing device according to claim 7, further comprising:
an effectiveness score calculation unit that calculates, for each of the feature quantities constituting the intra-session feature quantity, an effectiveness score for the prediction of the prediction target; and
an intra-session feature quantity selection unit that selects, from among the feature quantities constituting the intra-session feature quantity, the feature quantities whose effectiveness score is higher than a predetermined score threshold, and reconstructs the intra-session feature quantity.
The information processing device according to claim 8, further comprising an inter-session feature quantity generation unit that generates, based on the intra-session feature quantity, an inter-session feature quantity consisting of feature quantities between sessions.
The information processing device according to claim 9, wherein the effectiveness score calculation unit calculates, for each of the feature quantities constituting the inter-session feature quantity, the effectiveness score for the prediction of the prediction target,
the information processing device further comprising an inter-session feature quantity selection unit that selects, from among the feature quantities constituting the inter-session feature quantity, the feature quantities whose effectiveness score is higher than a predetermined score threshold, and reconstructs the inter-session feature quantity.
The information processing device, wherein the effectiveness score calculation unit calculates, as the effectiveness score, the mutual information between the prediction target and each of the feature quantities constituting the intra-session feature quantity and the inter-session feature quantity.
The information processing device according to claim 10, wherein
the effectiveness score calculation unit calculates, as the effectiveness score, the prediction accuracy obtained when the prediction target is predicted by a machine learning model generated in a simplified manner from the feature quantities constituting the intra-session feature quantity and the inter-session feature quantity,
the intra-session feature quantity selection unit selects a subset of the feature quantities for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the intra-session feature quantity, and
the inter-session feature quantity selection unit selects a subset of the feature quantities for which the effectiveness score does not fall below a predetermined score threshold, and reconstructs the inter-session feature quantity.
The information processing device according to claim 10, further comprising:
a combining unit that combines the reconstructed intra-session feature quantity and the reconstructed inter-session feature quantity; and
a determination unit that calculates an overall effectiveness score of the feature quantities combined by the combining unit, based on the effectiveness scores of the respective feature quantities constituting the reconstructed intra-session feature quantity and the reconstructed inter-session feature quantity, and determines whether the overall effectiveness score is smaller than a predetermined threshold,
wherein, when the overall effectiveness score is smaller than the predetermined threshold, the determination unit reduces the score threshold by a predetermined value and causes the intra-session feature quantity selection unit and the inter-session feature quantity selection unit to execute their processing again.
The information processing device according to claim 1, wherein the estimation unit is an estimation model generated by learning based on paired information consisting of the metadata of the flow data and a distribution of the methods used to create the feature quantities that are generated from the series data extracted from the flow data and used to train a predetermined machine learning model, and estimates the method of generating the feature quantities based on the metadata.
The information processing apparatus according to claim 1, wherein the flow data further includes attribute data consisting of data that does not change with the passage of time, in addition to the time series data that changes with the passage of time.
An information processing method comprising the steps of:
generating metadata of flow data including at least time-series data;
estimating, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and
generating feature quantities from the series data using the estimated generation method.
A program that causes a computer to function as:
a metadata generation unit that generates metadata of flow data including at least time-series data;
an estimation unit that estimates, based on the metadata, a method of generating feature quantities from series data constituting the flow data; and
a feature quantity generation unit that generates feature quantities from the series data using the generation method estimated by the estimation unit.
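
The re-execution control recited above for the determination unit (lowering the score threshold and repeating the intra-session and inter-session selection when the overall effectiveness score is too small) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function names are hypothetical, the overall effectiveness score is assumed to be the sum of the selected per-feature scores (the text does not specify how it is aggregated), and the loop additionally stops once every feature has been selected so that it always terminates.

```python
def select_features(scores, score_threshold):
    """Keep the features whose effectiveness score exceeds the threshold."""
    return {name: s for name, s in scores.items() if s > score_threshold}


def select_with_relaxation(intra_scores, inter_scores,
                           score_threshold, overall_threshold, step):
    """Select and combine features, relaxing the score threshold as needed.

    intra_scores / inter_scores map feature names to effectiveness scores.
    Returns the combined selection and the score threshold finally used.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    total = len(intra_scores) + len(inter_scores)
    while True:
        intra = select_features(intra_scores, score_threshold)
        inter = select_features(inter_scores, score_threshold)
        # Combine both selections; prefixes avoid name clashes.
        combined = {f"intra:{k}": v for k, v in intra.items()}
        combined.update({f"inter:{k}": v for k, v in inter.items()})
        # Assumption: the overall score is the sum of per-feature scores.
        overall = sum(combined.values())
        # Stop when the overall score is sufficient, or when nothing more
        # can be gained because every feature is already selected.
        if overall >= overall_threshold or len(combined) == total:
            return combined, score_threshold
        # Otherwise lower the score threshold by a fixed value and retry.
        score_threshold -= step
```

With a sum as the aggregate, lowering the score threshold can only add features and therefore cannot decrease the overall score, which is why repeating the selection with a relaxed threshold moves the result toward the required overall level.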